* [Tarantool-patches] [PATCH 0/4] Split vote @ 2022-01-15 0:48 Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches ` (4 more replies) 0 siblings, 5 replies; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw) To: tarantool-patches, sergepetrenko Split vote handling in Raft, its usage in storage, and a bug fix found while working on this in the first commit. Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote Issue: https://github.com/tarantool/tarantool/issues/5285 Vladislav Shpilevoy (4): raft: fix crash on election_timeout reconfig raft: track all votes, even not own raft: introduce split vote detection election: activate raft split vote handling .../unreleased/election-timeout-cfg-crash.md | 5 + src/box/raft.c | 4 +- src/lib/raft/raft.c | 142 ++++++++- src/lib/raft/raft.h | 30 +- .../election_split_vote_test.lua | 92 ++++++ test/unit/raft.c | 299 +++++++++++++++++- test/unit/raft.result | 64 +++- test/unit/raft_test_utils.c | 12 + test/unit/raft_test_utils.h | 5 + 9 files changed, 622 insertions(+), 31 deletions(-) create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md create mode 100644 test/replication-luatest/election_split_vote_test.lua -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches ` (3 subsequent siblings) 4 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw) To: tarantool-patches, sergepetrenko It used to crash if done during election on a node voted for anybody, it is a candidate, it doesn't know a leader yet, but has a WAL write in progress. Thus it could only happen if the term was bumped by a message from a non-leader node and wasn't flushed to the disk yet. The patch makes the reconfig check if there is a WAL write in progress. Then don't do anything. Could also check for volatile vote instead of persistent, but it would create the same problem for the case when started writing vote for self and didn't finish yet. Reconfig would crash. --- .../unreleased/election-timeout-cfg-crash.md | 5 ++ src/lib/raft/raft.c | 3 +- test/unit/raft.c | 75 ++++++++++++++++++- test/unit/raft.result | 16 +++- 4 files changed, 96 insertions(+), 3 deletions(-) create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md diff --git a/changelogs/unreleased/election-timeout-cfg-crash.md b/changelogs/unreleased/election-timeout-cfg-crash.md new file mode 100644 index 000000000..e870e3665 --- /dev/null +++ b/changelogs/unreleased/election-timeout-cfg-crash.md @@ -0,0 +1,5 @@ +## bugfix/raft + +* Reconfiguration of `box.cfg.election_timeout` could lead to a crash or + undefined behaviour if done during an ongoing election with a special WAL + write in progress. diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 0e6dc6155..1ae8fe87f 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -935,7 +935,8 @@ raft_cfg_election_timeout(struct raft *raft, double timeout) return; raft->election_timeout = timeout; - if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate) { + if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate && + !raft->is_write_in_progress) { assert(raft_ev_is_active(&raft->timer)); struct ev_loop *loop = raft_loop(); double timeout = raft_ev_timer_remaining(loop, &raft->timer) - diff --git a/test/unit/raft.c b/test/unit/raft.c index 8f4d2dd9a..520b94466 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1296,7 +1296,7 @@ raft_test_enable_disable(void) static void raft_test_too_long_wal_write(void) { - raft_start_test(8); + raft_start_test(22); struct raft_node node; raft_node_create(&node); @@ -1362,6 +1362,79 @@ raft_test_too_long_wal_write(void) ok(raft_time() - ts == node.cfg_death_timeout, "timer works again"); is(node.raft.state, RAFT_STATE_CANDIDATE, "became candidate"); + /* + * During WAL write it is possible to reconfigure election timeout. + * The dangerous case is when the timer is active already. It happens + * when the node voted and is a candidate, but leader is unknown. + */ + raft_node_destroy(&node); + raft_node_create(&node); + + raft_node_cfg_election_timeout(&node, 100); + raft_run_next_event(); + is(node.raft.term, 2, "term is bumped"); + + /* Bump term again but it is not written to WAL yet. */ + raft_node_block(&node); + is(raft_node_send_vote_response(&node, + 3 /* Term. */, + 3 /* Vote. */, + 2 /* Source. */ + ), 0, "2 votes for 3 in a new term"); + raft_run_next_event(); + is(node.raft.term, 2, "term is old"); + is(node.raft.vote, 1, "vote is used for self"); + is(node.raft.volatile_term, 3, "volatile term is new"); + is(node.raft.volatile_vote, 0, "volatile vote is unused"); + + raft_node_cfg_election_timeout(&node, 50); + raft_node_unblock(&node); + ts = raft_time(); + raft_run_next_event(); + ts = raft_time() - ts; + /* 50 + <= 10% random delay. */ + ok(ts >= 50 && ts <= 55, "new election timeout works"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 4 /* Term. */, + 1 /* Vote. */, + 4 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 4}" /* Vclock. */ + ), "new term is started with vote for self"); + + /* + * Similar case when a vote is being written but not finished yet. + */ + raft_node_destroy(&node); + raft_node_create(&node); + + raft_node_cfg_election_timeout(&node, 100); + raft_node_block(&node); + raft_run_next_event(); + is(node.raft.term, 1, "term is old"); + is(node.raft.vote, 0, "vote is unused"); + is(node.raft.volatile_term, 2, "volatile term is new"); + is(node.raft.volatile_vote, 1, "volatile vote is self"); + + raft_node_cfg_election_timeout(&node, 50); + raft_node_unblock(&node); + ts = raft_time(); + raft_run_next_event(); + ts = raft_time() - ts; + /* 50 + <= 10% random delay. */ + ok(ts >= 50 && ts <= 55, "new election timeout works"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "new term is started with vote for self"); + raft_node_destroy(&node); raft_finish_test(); } diff --git a/test/unit/raft.result b/test/unit/raft.result index 14689ea81..764d82969 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -220,7 +220,7 @@ ok 11 - subtests ok 12 - subtests *** raft_test_enable_disable: done *** *** raft_test_too_long_wal_write *** - 1..8 + 1..22 ok 1 - vote for 2 ok 2 - vote is volatile ok 3 - message from leader @@ -229,6 +229,20 @@ ok 12 - subtests ok 6 - wal write is finished ok 7 - timer works again ok 8 - became candidate + ok 9 - term is bumped + ok 10 - 2 votes for 3 in a new term + ok 11 - term is old + ok 12 - vote is used for self + ok 13 - volatile term is new + ok 14 - volatile vote is unused + ok 15 - new election timeout works + ok 16 - new term is started with vote for self + ok 17 - term is old + ok 18 - vote is unused + ok 19 - volatile term is new + ok 20 - volatile vote is self + ok 21 - new election timeout works + ok 22 - new term is started with vote for self ok 13 - subtests *** raft_test_too_long_wal_write: done *** *** raft_test_promote_restore *** -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches @ 2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches 0 siblings, 0 replies; 16+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:12 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches 15.01.2022 03:48, Vladislav Shpilevoy пишет: > It used to crash if done during election on a node voted for > anybody, it is a candidate, it doesn't know a leader yet, but has > a WAL write in progress. > > Thus it could only happen if the term was bumped by a message from > a non-leader node and wasn't flushed to the disk yet. > > The patch makes the reconfig check if there is a WAL write in > progress. Then don't do anything. > > Could also check for volatile vote instead of persistent, but it > would create the same problem for the case when started writing > vote for self and didn't finish yet. Reconfig would crash. > --- > .../unreleased/election-timeout-cfg-crash.md | 5 ++ > src/lib/raft/raft.c | 3 +- > test/unit/raft.c | 75 ++++++++++++++++++- > test/unit/raft.result | 16 +++- > 4 files changed, 96 insertions(+), 3 deletions(-) > create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md > > Good catch! LGTM. -- Serge Petrenko ^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches ` (2 subsequent siblings) 4 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw) To: tarantool-patches, sergepetrenko To detect split vote a node needs to see that number of free votes is not enough for anyone to win even if it gets them all. Hence every node needs to count votes for all other nodes. The patch makes raft store votes in a bit more complicated manner than a simple bitmap for just the current instance. Part of #5285 --- src/lib/raft/raft.c | 41 +++++++++++++++++++++++++++++++++-------- src/lib/raft/raft.h | 24 +++++++++++++++++------- test/unit/raft.c | 17 ++++++++++------- test/unit/raft.result | 7 ++++--- 4 files changed, 64 insertions(+), 25 deletions(-) diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 1ae8fe87f..289d53fd5 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -152,6 +152,16 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) return cmp == 0 || cmp == 1; } +static inline void +raft_add_vote(struct raft *raft, int src, int dst) +{ + struct raft_vote *v = &raft->votes[src]; + if (v->did_vote) + return; + v->did_vote = true; + ++raft->votes[dst].count; +} + /** Schedule broadcast of the complete Raft state to all the followers. */ static void raft_schedule_broadcast(struct raft *raft); @@ -279,6 +289,12 @@ void raft_process_recovery(struct raft *raft, const struct raft_msg *req) { say_verbose("RAFT: recover %s", raft_msg_to_string(req)); + /* + * Instance ID is not unknown until recovery ends. Because apparently it + * can change during join. In Raft it is set only one time when recovery + * ends for good. + */ + assert(raft->self == 0); if (req->term != 0) { raft->term = req->term; raft->volatile_term = req->term; @@ -335,6 +351,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * persisted long time ago and still broadcasted. Or a vote response. */ if (req->vote != 0) { + raft_add_vote(raft, source, req->vote); + switch (raft->state) { case RAFT_STATE_FOLLOWER: case RAFT_STATE_LEADER: @@ -395,11 +413,10 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * and now was answered by some other instance. */ assert(raft->volatile_vote == raft->self); - bool was_set = bit_set(&raft->vote_mask, source); - raft->vote_count += !was_set; - if (raft->vote_count < raft->election_quorum) { + int vote_count = raft_vote_count(raft); + if (vote_count < raft->election_quorum) { say_info("RAFT: accepted vote for self, vote " - "count is %d/%d", raft->vote_count, + "count is %d/%d", vote_count, raft->election_quorum); break; } @@ -640,13 +657,11 @@ raft_sm_become_candidate(struct raft *raft) assert(raft->state == RAFT_STATE_FOLLOWER); assert(raft->leader == 0); assert(raft->vote == raft->self); + assert(raft_vote_count(raft) >= 1); assert(raft->is_candidate); assert(!raft->is_write_in_progress); assert(raft->election_quorum > 1); raft->state = RAFT_STATE_CANDIDATE; - raft->vote_count = 1; - raft->vote_mask = 0; - bit_set(&raft->vote_mask, raft->self); raft_sm_wait_election_end(raft); /* State is visible and it is changed - broadcast. */ raft_schedule_broadcast(raft); @@ -663,6 +678,7 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->volatile_vote = 0; raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; + memset(raft->votes, 0, sizeof(raft->votes)); /* * The instance could be promoted for the previous term. But promotion * has no effect on following terms. @@ -685,6 +701,8 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote) assert(raft->leader == 0); assert(raft->state == RAFT_STATE_FOLLOWER); raft->volatile_vote = new_vote; + assert(!raft->votes[raft->self].did_vote); + raft_add_vote(raft, raft->self, raft->self); raft_sm_pause_and_dump(raft); /* Nothing visible is changed - no broadcast. */ } @@ -956,7 +974,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum) assert(election_quorum > 0); raft->election_quorum = election_quorum; if (raft->state == RAFT_STATE_CANDIDATE && - raft->vote_count >= raft->election_quorum) + raft_vote_count(raft) >= raft->election_quorum) raft_sm_become_leader(raft); } @@ -984,6 +1002,13 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id) assert(raft->self == 0); assert(instance_id != 0); raft->self = instance_id; + /* + * Couldn't do that reliably during recovery. Instance ID can change + * more than once during join. Here instance ID is configured when it is + * known forever and is safe to use. + */ + if (raft->vote != 0) + raft_add_vote(raft, instance_id, raft->vote); } void diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 53e2c58a8..4fbfaa91a 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -138,6 +138,14 @@ struct raft_vtab { raft_schedule_async_f schedule_async; }; +/** Vote descriptor of a single node. */ +struct raft_vote { + /** Whether the node voted for anybody. */ + bool did_vote; + /** How many votes the node got in the current term. */ + int count; +}; + struct raft { /** Instance ID of this node. */ uint32_t self; @@ -189,13 +197,8 @@ struct raft { */ uint64_t term; uint32_t vote; - /** - * Bit 1 on position N means that a vote from instance with ID = N was - * obtained. - */ - vclock_map_t vote_mask; - /** Number of votes for this instance. Valid only in candidate state. */ - int vote_count; + /** Statistics which node voted for who. */ + struct raft_vote votes[VCLOCK_MAX]; /** Number of votes necessary for successful election. */ int election_quorum; /** @@ -243,6 +246,13 @@ raft_is_enabled(const struct raft *raft) return raft->is_enabled; } +/** Number of votes for self. */ +static inline int +raft_vote_count(const struct raft *raft) +{ + return raft->votes[raft->self].count; +} + /** Process a raft entry stored in WAL/snapshot. */ void raft_process_recovery(struct raft *raft, const struct raft_msg *req); diff --git a/test/unit/raft.c b/test/unit/raft.c index 520b94466..9f18f4d8e 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -70,7 +70,7 @@ raft_test_leader_election(void) 1 /* Volatile vote. */, "{0: 1}" /* Vclock. */ ), "elections with a new term"); - is(node.raft.vote_count, 1, "single vote for self"); + is(raft_vote_count(&node.raft), 1, "single vote for self"); ok(node.update_count > 0, "trigger worked"); node.update_count = 0; @@ -100,7 +100,7 @@ raft_test_leader_election(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "vote response from 2"); - is(node.raft.vote_count, 2, "2 votes - 1 self and 1 foreign"); + is(raft_vote_count(&node.raft), 2, "2 votes - 1 self and 1 foreign"); ok(!node.has_work, "no work to do - not enough votes yet"); raft_run_for(node.cfg_election_timeout / 2); @@ -115,7 +115,7 @@ raft_test_leader_election(void) 1 /* Vote. */, 3 /* Source. */ ), 0, "vote response from 3"); - is(node.raft.vote_count, 3, "2 votes - 1 self and 2 foreign"); + is(raft_vote_count(&node.raft), 3, "2 votes - 1 self and 2 foreign"); is(node.raft.state, RAFT_STATE_LEADER, "became leader"); ok(node.update_count > 0, "trigger worked"); node.update_count = 0; @@ -142,7 +142,7 @@ raft_test_leader_election(void) static void raft_test_recovery(void) { - raft_start_test(12); + raft_start_test(13); struct raft_msg msg; struct raft_node node; raft_node_create(&node); @@ -224,6 +224,8 @@ raft_test_recovery(void) "{0: 1}" /* Vclock. */ ), "restart always as a follower"); + is(raft_vote_count(&node.raft), 1, "vote count is restored correctly"); + raft_checkpoint_remote(&node.raft, &msg); ok(raft_msg_check(&msg, RAFT_STATE_FOLLOWER /* State. */, @@ -383,7 +385,7 @@ raft_test_vote_skip(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "message is accepted"); - is(node.raft.vote_count, 1, "but ignored - too old term"); + is(raft_vote_count(&node.raft), 1, "but ignored - too old term"); /* Competing vote requests are skipped. */ @@ -392,7 +394,8 @@ raft_test_vote_skip(void) 3 /* Vote. */, 2 /* Source. */ ), 0, "message is accepted"); - is(node.raft.vote_count, 1, "but ignored - vote not for this node"); + is(raft_vote_count(&node.raft), 1, + "but ignored - vote not for this node"); is(node.raft.state, RAFT_STATE_CANDIDATE, "this node does not give up"); /* Vote requests are ignored when node is disabled. */ @@ -1071,7 +1074,7 @@ raft_test_election_quorum(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "send vote response from second node"); - is(node.raft.vote_count, 2, "vote is accepted"); + is(raft_vote_count(&node.raft), 2, "vote is accepted"); is(node.raft.state, RAFT_STATE_CANDIDATE, "but still candidate"); raft_node_cfg_election_quorum(&node, 2); diff --git a/test/unit/raft.result b/test/unit/raft.result index 764d82969..3fcdf8180 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -29,7 +29,7 @@ ok 1 - subtests *** raft_test_leader_election: done *** *** raft_test_recovery *** - 1..12 + 1..13 ok 1 - became candidate ok 2 - remote checkpoint of a candidate ok 3 - local checkpoint of a candidate @@ -40,8 +40,9 @@ ok 1 - subtests ok 8 - remote checkpoint of a leader ok 9 - local checkpoint of a leader ok 10 - restart always as a follower - ok 11 - remote checkpoint of a leader - ok 12 - local checkpoint of a leader + ok 11 - vote count is restored correctly + ok 12 - remote checkpoint of a leader + ok 13 - local checkpoint of a leader ok 2 - subtests *** raft_test_recovery: done *** *** raft_test_bad_msg *** -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches @ 2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-21 0:42 UTC (permalink / raw) To: tarantool-patches, sergepetrenko I see there is a crash in CI on this commit. I found and fixed the reason. See diff and full new commit below. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 8182e73e3..01b7d469f 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -700,8 +700,8 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote) assert(raft->volatile_vote == 0); assert(raft->leader == 0); assert(raft->state == RAFT_STATE_FOLLOWER); - raft->volatile_vote = new_vote; assert(!raft->votes[raft->self].did_vote); + raft->volatile_vote = new_vote; raft_add_vote(raft, raft->self, raft->self); raft_sm_pause_and_dump(raft); /* Nothing visible is changed - no broadcast. */ @@ -1012,8 +1012,8 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id) * more than once during join. Here instance ID is configured when it is * known forever and is safe to use. */ - if (raft->vote != 0) - raft_add_vote(raft, instance_id, raft->vote); + if (raft->volatile_vote != 0) + raft_add_vote(raft, instance_id, raft->volatile_vote); } void diff --git a/test/unit/raft.c b/test/unit/raft.c index 9f18f4d8e..1a3c81c64 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1497,10 +1497,75 @@ raft_test_promote_restore(void) raft_finish_test(); } +static void +raft_test_bump_term_before_cfg() +{ + raft_start_test(6); + struct raft_node node; + raft_node_create(&node); + /* + * Term bump is started between recovery and instance ID configuration + * but WAL write is not finished yet. + */ + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "new term is started with vote for self"); + + raft_node_stop(&node); + raft_node_recover(&node); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + NULL /* Vclock. */ + ), "recovered"); + is(node.raft.self, 0, "instance id is unknown"); + + raft_node_block(&node); + is(raft_node_send_follower(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "bump term externally"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 0 /* Volatile vote. */, + NULL /* Vclock. */ + ), "term write is in progress"); + + raft_node_cfg(&node); + raft_node_unblock(&node); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 3}" /* Vclock. */ + ), "started new term"); + + raft_node_destroy(&node); + raft_finish_test(); +} + static int main_f(va_list ap) { - raft_start_test(14); + raft_start_test(15); (void) ap; fakeev_init(); @@ -1519,6 +1584,7 @@ main_f(va_list ap) raft_test_enable_disable(); raft_test_too_long_wal_write(); raft_test_promote_restore(); + raft_test_bump_term_before_cfg(); fakeev_free(); diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c index b5754cf78..daf43797f 100644 --- a/test/unit/raft_test_utils.c +++ b/test/unit/raft_test_utils.c @@ -224,13 +224,18 @@ raft_node_check_full_state(const struct raft_node *node, enum raft_state state, { assert(raft_node_is_started(node)); const struct raft *raft = &node->raft; - struct vclock v; - raft_vclock_from_string(&v, vclock); + if (vclock != NULL) { + struct vclock v; + raft_vclock_from_string(&v, vclock); + if (vclock_compare(&v, raft->vclock) != 0) + return false; + } else if (raft->vclock != NULL) { + return false; + } return raft->state == state && raft->leader == leader && raft->term == term && raft->vote == vote && raft->volatile_term == volatile_term && - raft->volatile_vote == volatile_vote && - vclock_compare(&v, raft->vclock) == 0; + raft->volatile_vote == volatile_vote; } bool @@ -342,6 +347,13 @@ raft_node_stop(struct raft_node *node) void raft_node_start(struct raft_node *node) +{ + raft_node_recover(node); + raft_node_cfg(node); +} + +void +raft_node_recover(struct raft_node *node) { assert(!raft_node_is_started(node)); @@ -358,6 +370,13 @@ raft_node_start(struct raft_node *node) for (int i = 0; i < node->journal.size; ++i) raft_process_recovery(&node->raft, &node->journal.rows[i]); + raft_run_async_work(); +} + +void +raft_node_cfg(struct raft_node *node) +{ + assert(raft_node_is_started(node)); raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled); raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate); diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h index c68dc3b22..8d400923e 100644 --- a/test/unit/raft_test_utils.h +++ b/test/unit/raft_test_utils.h @@ -200,6 +200,14 @@ raft_node_stop(struct raft_node *node); void raft_node_start(struct raft_node *node); +/** Start the node but not configure it yet. Only recover. */ +void +raft_node_recover(struct raft_node *node); + +/** Apply the entire Raft config to a started. */ +void +raft_node_cfg(struct raft_node *node); + /** Block async work execution. */ void raft_node_block(struct raft_node *node); ==================== Full new commit: ================================================== raft: track all votes, even not own To detect split vote a node needs to see that number of free votes is not enough for anyone to win even if it gets them all. Hence every node needs to count votes for all other nodes. The patch makes raft store votes in a bit more complicated manner than a simple bitmap for just the current instance. Part of #5285 diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index d6a849503..01b7d469f 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -152,6 +152,16 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) return cmp == 0 || cmp == 1; } +static inline void +raft_add_vote(struct raft *raft, int src, int dst) +{ + struct raft_vote *v = &raft->votes[src]; + if (v->did_vote) + return; + v->did_vote = true; + ++raft->votes[dst].count; +} + /** Schedule broadcast of the complete Raft state to all the followers. */ static void raft_schedule_broadcast(struct raft *raft); @@ -279,6 +289,12 @@ void raft_process_recovery(struct raft *raft, const struct raft_msg *req) { say_verbose("RAFT: recover %s", raft_msg_to_string(req)); + /* + * Instance ID is not unknown until recovery ends. Because apparently it + * can change during join. In Raft it is set only one time when recovery + * ends for good. + */ + assert(raft->self == 0); if (req->term != 0) { raft->term = req->term; raft->volatile_term = req->term; @@ -335,6 +351,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * persisted long time ago and still broadcasted. Or a vote response. */ if (req->vote != 0) { + raft_add_vote(raft, source, req->vote); + switch (raft->state) { case RAFT_STATE_FOLLOWER: case RAFT_STATE_LEADER: @@ -395,11 +413,10 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * and now was answered by some other instance. */ assert(raft->volatile_vote == raft->self); - bool was_set = bit_set(&raft->vote_mask, source); - raft->vote_count += !was_set; - if (raft->vote_count < raft->election_quorum) { + int vote_count = raft_vote_count(raft); + if (vote_count < raft->election_quorum) { say_info("RAFT: accepted vote for self, vote " - "count is %d/%d", raft->vote_count, + "count is %d/%d", vote_count, raft->election_quorum); break; } @@ -640,13 +657,11 @@ raft_sm_become_candidate(struct raft *raft) assert(raft->state == RAFT_STATE_FOLLOWER); assert(raft->leader == 0); assert(raft->vote == raft->self); + assert(raft_vote_count(raft) >= 1); assert(raft->is_candidate); assert(!raft->is_write_in_progress); assert(raft->election_quorum > 1); raft->state = RAFT_STATE_CANDIDATE; - raft->vote_count = 1; - raft->vote_mask = 0; - bit_set(&raft->vote_mask, raft->self); raft_sm_wait_election_end(raft); /* State is visible and it is changed - broadcast. */ raft_schedule_broadcast(raft); @@ -663,6 +678,7 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->volatile_vote = 0; raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; + memset(raft->votes, 0, sizeof(raft->votes)); /* * The instance could be promoted for the previous term. But promotion * has no effect on following terms. @@ -684,7 +700,9 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote) assert(raft->volatile_vote == 0); assert(raft->leader == 0); assert(raft->state == RAFT_STATE_FOLLOWER); + assert(!raft->votes[raft->self].did_vote); raft->volatile_vote = new_vote; + raft_add_vote(raft, raft->self, raft->self); raft_sm_pause_and_dump(raft); /* Nothing visible is changed - no broadcast. */ } @@ -957,7 +975,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum) assert(election_quorum > 0); raft->election_quorum = election_quorum; if (raft->state == RAFT_STATE_CANDIDATE && - raft->vote_count >= raft->election_quorum) + raft_vote_count(raft) >= raft->election_quorum) raft_sm_become_leader(raft); } @@ -989,6 +1007,13 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id) assert(raft->self == 0); assert(instance_id != 0); raft->self = instance_id; + /* + * Couldn't do that reliably during recovery. Instance ID can change + * more than once during join. Here instance ID is configured when it is + * known forever and is safe to use. + */ + if (raft->volatile_vote != 0) + raft_add_vote(raft, instance_id, raft->volatile_vote); } void diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index e844c2d07..1d064bef0 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -138,6 +138,14 @@ struct raft_vtab { raft_schedule_async_f schedule_async; }; +/** Vote descriptor of a single node. */ +struct raft_vote { + /** Whether the node voted for anybody. */ + bool did_vote; + /** How many votes the node got in the current term. */ + int count; +}; + struct raft { /** Instance ID of this node. */ uint32_t self; @@ -189,13 +197,8 @@ struct raft { */ uint64_t term; uint32_t vote; - /** - * Bit 1 on position N means that a vote from instance with ID = N was - * obtained. - */ - vclock_map_t vote_mask; - /** Number of votes for this instance. Valid only in candidate state. */ - int vote_count; + /** Statistics which node voted for who. */ + struct raft_vote votes[VCLOCK_MAX]; /** Number of votes necessary for successful election. */ int election_quorum; /** @@ -243,6 +246,13 @@ raft_is_enabled(const struct raft *raft) return raft->is_enabled; } +/** Number of votes for self. */ +static inline int +raft_vote_count(const struct raft *raft) +{ + return raft->votes[raft->self].count; +} + /** Process a raft entry stored in WAL/snapshot. */ void raft_process_recovery(struct raft *raft, const struct raft_msg *req); diff --git a/test/unit/raft.c b/test/unit/raft.c index 520b94466..1a3c81c64 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -70,7 +70,7 @@ raft_test_leader_election(void) 1 /* Volatile vote. */, "{0: 1}" /* Vclock. */ ), "elections with a new term"); - is(node.raft.vote_count, 1, "single vote for self"); + is(raft_vote_count(&node.raft), 1, "single vote for self"); ok(node.update_count > 0, "trigger worked"); node.update_count = 0; @@ -100,7 +100,7 @@ raft_test_leader_election(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "vote response from 2"); - is(node.raft.vote_count, 2, "2 votes - 1 self and 1 foreign"); + is(raft_vote_count(&node.raft), 2, "2 votes - 1 self and 1 foreign"); ok(!node.has_work, "no work to do - not enough votes yet"); raft_run_for(node.cfg_election_timeout / 2); @@ -115,7 +115,7 @@ raft_test_leader_election(void) 1 /* Vote. */, 3 /* Source. */ ), 0, "vote response from 3"); - is(node.raft.vote_count, 3, "2 votes - 1 self and 2 foreign"); + is(raft_vote_count(&node.raft), 3, "2 votes - 1 self and 2 foreign"); is(node.raft.state, RAFT_STATE_LEADER, "became leader"); ok(node.update_count > 0, "trigger worked"); node.update_count = 0; @@ -142,7 +142,7 @@ raft_test_leader_election(void) static void raft_test_recovery(void) { - raft_start_test(12); + raft_start_test(13); struct raft_msg msg; struct raft_node node; raft_node_create(&node); @@ -224,6 +224,8 @@ raft_test_recovery(void) "{0: 1}" /* Vclock. */ ), "restart always as a follower"); + is(raft_vote_count(&node.raft), 1, "vote count is restored correctly"); + raft_checkpoint_remote(&node.raft, &msg); ok(raft_msg_check(&msg, RAFT_STATE_FOLLOWER /* State. */, @@ -383,7 +385,7 @@ raft_test_vote_skip(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "message is accepted"); - is(node.raft.vote_count, 1, "but ignored - too old term"); + is(raft_vote_count(&node.raft), 1, "but ignored - too old term"); /* Competing vote requests are skipped. */ @@ -392,7 +394,8 @@ raft_test_vote_skip(void) 3 /* Vote. */, 2 /* Source. */ ), 0, "message is accepted"); - is(node.raft.vote_count, 1, "but ignored - vote not for this node"); + is(raft_vote_count(&node.raft), 1, + "but ignored - vote not for this node"); is(node.raft.state, RAFT_STATE_CANDIDATE, "this node does not give up"); /* Vote requests are ignored when node is disabled. */ @@ -1071,7 +1074,7 @@ raft_test_election_quorum(void) 1 /* Vote. */, 2 /* Source. */ ), 0, "send vote response from second node"); - is(node.raft.vote_count, 2, "vote is accepted"); + is(raft_vote_count(&node.raft), 2, "vote is accepted"); is(node.raft.state, RAFT_STATE_CANDIDATE, "but still candidate"); raft_node_cfg_election_quorum(&node, 2); @@ -1494,10 +1497,75 @@ raft_test_promote_restore(void) raft_finish_test(); } +static void +raft_test_bump_term_before_cfg() +{ + raft_start_test(6); + struct raft_node node; + raft_node_create(&node); + /* + * Term bump is started between recovery and instance ID configuration + * but WAL write is not finished yet. + */ + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "new term is started with vote for self"); + + raft_node_stop(&node); + raft_node_recover(&node); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + NULL /* Vclock. */ + ), "recovered"); + is(node.raft.self, 0, "instance id is unknown"); + + raft_node_block(&node); + is(raft_node_send_follower(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "bump term externally"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 0 /* Volatile vote. */, + NULL /* Vclock. */ + ), "term write is in progress"); + + raft_node_cfg(&node); + raft_node_unblock(&node); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 3}" /* Vclock. */ + ), "started new term"); + + raft_node_destroy(&node); + raft_finish_test(); +} + static int main_f(va_list ap) { - raft_start_test(14); + raft_start_test(15); (void) ap; fakeev_init(); @@ -1516,6 +1584,7 @@ main_f(va_list ap) raft_test_enable_disable(); raft_test_too_long_wal_write(); raft_test_promote_restore(); + raft_test_bump_term_before_cfg(); fakeev_free(); diff --git a/test/unit/raft.result b/test/unit/raft.result index 764d82969..5e63e523d 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -1,5 +1,5 @@ *** main_f *** -1..14 +1..15 *** raft_test_leader_election *** 1..24 ok 1 - 1 pending message at start @@ -29,7 +29,7 @@ ok 1 - subtests *** raft_test_leader_election: done *** *** raft_test_recovery *** - 1..12 + 1..13 ok 1 - became candidate ok 2 - remote checkpoint of a candidate ok 3 - local checkpoint of a candidate @@ -40,8 +40,9 @@ ok 1 - subtests ok 8 - remote checkpoint of a leader ok 9 - local checkpoint of a leader ok 10 - restart always as a follower - ok 11 - remote checkpoint of a leader - ok 12 - local checkpoint of a leader + ok 11 - vote count is restored correctly + ok 12 - remote checkpoint of a leader + ok 13 - local checkpoint of a leader ok 2 - subtests *** raft_test_recovery: done *** *** raft_test_bad_msg *** @@ -261,4 +262,14 @@ ok 13 - subtests ok 12 - not a candidate ok 14 - subtests *** raft_test_promote_restore: done *** + *** raft_test_bump_term_before_cfg *** + 1..6 + ok 1 - new term is started with vote for self + ok 2 - recovered + ok 3 - instance id is unknown + ok 4 - bump term externally + ok 5 - term write is in progress + ok 6 - started new term +ok 15 - subtests + *** raft_test_bump_term_before_cfg: done *** *** main_f: done *** diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c index b5754cf78..daf43797f 100644 --- a/test/unit/raft_test_utils.c +++ b/test/unit/raft_test_utils.c @@ -224,13 +224,18 @@ raft_node_check_full_state(const struct raft_node *node, enum raft_state state, { assert(raft_node_is_started(node)); const struct raft *raft = &node->raft; - struct vclock v; - raft_vclock_from_string(&v, vclock); + if (vclock != NULL) { + struct vclock v; + raft_vclock_from_string(&v, vclock); + if (vclock_compare(&v, raft->vclock) != 0) + return false; + } else if (raft->vclock != NULL) { + return false; + } return raft->state == state && raft->leader == leader && raft->term == term && raft->vote == vote && raft->volatile_term == volatile_term && - raft->volatile_vote == volatile_vote && - vclock_compare(&v, raft->vclock) == 0; + raft->volatile_vote == volatile_vote; } bool @@ -342,6 +347,13 @@ raft_node_stop(struct raft_node *node) void raft_node_start(struct raft_node *node) +{ + raft_node_recover(node); + raft_node_cfg(node); +} + +void +raft_node_recover(struct raft_node *node) { assert(!raft_node_is_started(node)); @@ -358,6 +370,13 @@ raft_node_start(struct raft_node *node) for (int i = 0; i < node->journal.size; ++i) raft_process_recovery(&node->raft, &node->journal.rows[i]); + raft_run_async_work(); +} + +void +raft_node_cfg(struct raft_node *node) +{ + assert(raft_node_is_started(node)); raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled); raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate); diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h index c68dc3b22..8d400923e 100644 --- a/test/unit/raft_test_utils.h +++ b/test/unit/raft_test_utils.h @@ -200,6 +200,14 @@ raft_node_stop(struct raft_node *node); void raft_node_start(struct raft_node *node); +/** Start the node but not configure it yet. Only recover. */ +void +raft_node_recover(struct raft_node *node); + +/** Apply the entire Raft config to a started. */ +void +raft_node_cfg(struct raft_node *node); + /** Block async work execution. */ void raft_node_block(struct raft_node *node); ^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches 2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches 4 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw) To: tarantool-patches, sergepetrenko Split vote is a situation when during election nobody can win and will not win in this term for sure. Not a single node could get enough votes. For example, with 4 nodes one could get 2 votes and other also 2 votes. Nobody will get quorum 3 in this term. The patch makes raft able to notice that situation and speed up the term bump. It is not bumped immediately though, because nodes might do that simultaneously and will get the split vote again after voting for self. There is a random delay. But it is just max 10% of election timeout, so it should speed up split vote resolution on 90% at least. Part of #5285 --- src/lib/raft/raft.c | 104 +++++++++++++++++- src/lib/raft/raft.h | 6 ++ test/unit/raft.c | 207 +++++++++++++++++++++++++++++++++++- test/unit/raft.result | 41 ++++++- test/unit/raft_test_utils.c | 12 +++ test/unit/raft_test_utils.h | 5 + 6 files changed, 370 insertions(+), 5 deletions(-) diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 289d53fd5..5dcbc7821 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) return cmp == 0 || cmp == 1; } -static inline void +static inline bool raft_add_vote(struct raft *raft, int src, int dst) { struct raft_vote *v = &raft->votes[src]; if (v->did_vote) - return; + return false; v->did_vote = true; ++raft->votes[dst].count; + return true; +} + +static bool +raft_has_split_vote(const struct raft *raft) +{ + int max_vote = 0; + int vote_vac = raft->cluster_size; + int quorum = raft->election_quorum; + for (int i = 0; i < VCLOCK_MAX; ++i) { + int count = raft->votes[i].count; + vote_vac -= count; + if (count > max_vote) + max_vote = count; + } + return max_vote < quorum && max_vote + vote_vac < quorum; +} + +static int +raft_scores_snprintf(const struct raft *raft, char *buf, int size) +{ + int total = 0; + bool is_empty = true; + SNPRINT(total, snprintf, buf, size, "{"); + for (int i = 0; i < VCLOCK_MAX; ++i) { + int count = raft->votes[i].count; + if (count == 0) + continue; + if (!is_empty) + SNPRINT(total, snprintf, buf, size, ", "); + is_empty = false; + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count); + } + SNPRINT(total, snprintf, buf, size, "}"); + return total; +} + +static const char * +raft_scores_str(const struct raft *raft) +{ + char *buf = tt_static_buf(); + int rc = raft_scores_snprintf(raft, buf, TT_STATIC_BUF_LEN); + assert(rc >= 0); + (void)rc; + return buf; } /** Schedule broadcast of the complete Raft state to all the followers. */ static void raft_schedule_broadcast(struct raft *raft); +/** If there is split vote, the node might reduce the next term delay. */ +static void +raft_check_split_vote(struct raft *raft); + /** Raft state machine methods. 'sm' stands for State Machine. */ /** @@ -351,7 +400,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * persisted long time ago and still broadcasted. Or a vote response. */ if (req->vote != 0) { - raft_add_vote(raft, source, req->vote); + if (raft_add_vote(raft, source, req->vote)) + raft_check_split_vote(raft); switch (raft->state) { case RAFT_STATE_FOLLOWER: @@ -1019,6 +1069,13 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock) raft->vclock = vclock; } +void +raft_cfg_cluster_size(struct raft *raft, int size) +{ + raft->cluster_size = size; + raft_check_split_vote(raft); +} + void raft_new_term(struct raft *raft) { @@ -1043,6 +1100,46 @@ raft_schedule_broadcast(struct raft *raft) raft_schedule_async(raft); } +static void +raft_check_split_vote(struct raft *raft) +{ + /* When leader is known, there is no election. Thus no vote to split. */ + if (raft->leader != 0) + return; + /* Not a candidate = can't trigger term bump anyway. */ + if (!raft->is_candidate) + return; + /* + * WAL write in progress means the state is changing. All is rechecked + * when it is done. + */ + if (raft->is_write_in_progress) + return; + if (!raft_has_split_vote(raft)) + return; + assert(raft_ev_is_active(&raft->timer)); + if (raft->timer.at < raft->election_timeout) + return; + + assert(raft->state == RAFT_STATE_FOLLOWER || + raft->state == RAFT_STATE_CANDIDATE); + struct ev_loop *loop = raft_loop(); + struct ev_timer *timer = &raft->timer; + double delay = raft_new_random_election_shift(raft); + /* + * Could be too late to speed up anything - probably the term is almost + * over anyway. + */ + double remaining = raft_ev_timer_remaining(loop, timer); + if (delay >= remaining) + delay = remaining; + say_info("RAFT: split vote is discovered - %s, new term in %lf sec", + raft_scores_str(raft), delay); + raft_ev_timer_stop(loop, timer); + raft_ev_timer_set(timer, delay, delay); + raft_ev_timer_start(loop, timer); +} + void raft_create(struct raft *raft, const struct raft_vtab *vtab) { @@ -1053,6 +1150,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab) .election_quorum = 1, .election_timeout = 5, .death_timeout = 5, + .cluster_size = VCLOCK_MAX, .vtab = vtab, }; raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 4fbfaa91a..33597d153 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -219,6 +219,8 @@ struct raft { * elections can be started. */ double death_timeout; + /** Number of instances registered in the cluster. */ + int cluster_size; /** Virtual table to perform application-specific actions. */ const struct raft_vtab *vtab; /** @@ -333,6 +335,10 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id); void raft_cfg_vclock(struct raft *raft, const struct vclock *vclock); +/** Configure number of registered instances in the cluster. */ +void +raft_cfg_cluster_size(struct raft *raft, int size); + /** * Bump the term. When it is persisted, the node checks if there is a leader, * and if there is not, a new election is started. That said, this function can diff --git a/test/unit/raft.c b/test/unit/raft.c index 9f18f4d8e..9b462f755 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1497,10 +1497,214 @@ raft_test_promote_restore(void) raft_finish_test(); } +static void +raft_test_split_vote(void) +{ + raft_start_test(35); + struct raft_node node; + raft_node_create(&node); + + /* Normal split vote. */ + + raft_node_cfg_cluster_size(&node, 4); + raft_node_cfg_election_quorum(&node, 3); + raft_run_next_event(); + + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "elections with a new term"); + + /* Make so node 1 has votes 1 and 2. Node 3 has votes 3 and 4. */ + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 1 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + ok(node.raft.timer.at >= node.raft.election_timeout, "term timeout >= " + "election timeout normally"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 4 /* Source. */ + ), 0, "vote response for 3 from 4"); + + ok(node.raft.timer.at < node.raft.election_timeout, "split vote " + "reduced the term timeout"); + + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "a new term"); + + ok(node.raft.timer.at >= node.raft.election_timeout, "timeout is " + "normal again"); + + /* + * Cluster size change can make split vote. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + raft_run_next_event(); + is(node.raft.state, RAFT_STATE_CANDIDATE, "is candidate"); + is(node.raft.vote, 1, "voted for self"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + ok(node.raft.timer.at >= node.raft.election_timeout, "the vote is not " + "split yet"); + + raft_node_cfg_cluster_size(&node, 2); + ok(node.raft.timer.at < node.raft.election_timeout, "cluster size " + "change makes split vote"); + + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "a new term"); + + /* + * Split vote can be even when the leader is already known - then + * nothing to do. Votes are just left from the beginning of the term and + * then probably cluster size reduced a bit. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + /* There is also a vote from 3 for 2. But it wasn't delivered to 1. */ + is(raft_node_send_leader(&node, + 2 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + is(node.raft.leader, 2, "other node's leadership is accepted"); + is(raft_vote_count(&node.raft), 1, "the only own vote was from self"); + + raft_node_cfg_cluster_size(&node, 2); + ok(node.raft.timer.at >= node.raft.death_timeout, "cluster change does " + "not affect the leader's death waiting"); + + /* + * Non-candidate should ignore split vote. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 3); + raft_node_cfg_is_candidate(&node, false); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + ok(!raft_ev_is_active(&node.raft.timer), "voter couldn't schedule " + "new term"); + + /* + * Split vote can get worse, but it shouldn't lead to new term delay + * restart. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 3); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + double delay = node.raft.timer.at; + ok(delay < node.raft.election_timeout, "split vote is already " + "inevitable"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + is(delay, node.raft.timer.at, "split vote got worse, but delay didn't " + "change"); + + /* + * Handle split vote when WAL write is in progress. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 2); + raft_node_cfg_election_quorum(&node, 2); + + raft_node_block(&node); + raft_run_next_event(); + is(node.raft.term, 1, "old term"); + is(node.raft.vote, 0, "unused vote"); + is(node.raft.volatile_term, 2, "new volatile term"); + is(node.raft.volatile_vote, 1, "new volatile vote"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + raft_node_unblock(&node); + raft_run_next_event(); + is(node.raft.term, 3, "bump term"); + is(node.raft.vote, 1, "vote for self"); + + raft_node_destroy(&node); + raft_finish_test(); +} + static int main_f(va_list ap) { - raft_start_test(14); + raft_start_test(15); (void) ap; fakeev_init(); @@ -1519,6 +1723,7 @@ main_f(va_list ap) raft_test_enable_disable(); raft_test_too_long_wal_write(); raft_test_promote_restore(); + raft_test_split_vote(); fakeev_free(); diff --git a/test/unit/raft.result b/test/unit/raft.result index 3fcdf8180..d12aae710 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -1,5 +1,5 @@ *** main_f *** -1..14 +1..15 *** raft_test_leader_election *** 1..24 ok 1 - 1 pending message at start @@ -262,4 +262,43 @@ ok 13 - subtests ok 12 - not a candidate ok 14 - subtests *** raft_test_promote_restore: done *** + *** raft_test_split_vote *** + 1..35 + ok 1 - elections with a new term + ok 2 - vote response for 1 from 2 + ok 3 - vote response for 3 from 3 + ok 4 - term timeout >= election timeout normally + ok 5 - vote response for 3 from 4 + ok 6 - split vote reduced the term timeout + ok 7 - a new term + ok 8 - timeout is normal again + ok 9 - is candidate + ok 10 - voted for self + ok 11 - vote response for 2 from 2 + ok 12 - the vote is not split yet + ok 13 - cluster size change makes split vote + ok 14 - a new term + ok 15 - vote response for 2 from 2 + ok 16 - message is accepted + ok 17 - other node's leadership is accepted + ok 18 - the only own vote was from self + ok 19 - cluster change does not affect the leader's death waiting + ok 20 - vote response for 2 from 2 + ok 21 - vote response for 3 from 3 + ok 22 - voter couldn't schedule new term + ok 23 - bump term + ok 24 - vote for self + ok 25 - vote response for 2 from 2 + ok 26 - split vote is already inevitable + ok 27 - vote response for 3 from 3 + ok 28 - split vote got worse, but delay didn't change + ok 29 - old term + ok 30 - unused vote + ok 31 - new volatile term + ok 32 - new volatile vote + ok 33 - vote response for 2 from 2 + ok 34 - bump term + ok 35 - vote for self +ok 15 - subtests + *** raft_test_split_vote: done *** *** main_f: done *** diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c index b5754cf78..1d7dd01ba 100644 --- a/test/unit/raft_test_utils.c +++ b/test/unit/raft_test_utils.c @@ -194,6 +194,7 @@ raft_node_create(struct raft_node *node) node->cfg_election_quorum = 3; node->cfg_death_timeout = 5; node->cfg_instance_id = 1; + node->cfg_cluster_size = VCLOCK_MAX; node->cfg_vclock = &node->journal.vclock; raft_journal_create(&node->journal, node->cfg_instance_id); raft_node_start(node); @@ -365,6 +366,7 @@ raft_node_start(struct raft_node *node) raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum); raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout); raft_cfg_instance_id(&node->raft, node->cfg_instance_id); + raft_cfg_cluster_size(&node->raft, node->cfg_cluster_size); raft_cfg_vclock(&node->raft, node->cfg_vclock); raft_run_async_work(); } @@ -423,6 +425,16 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value) } } +void +raft_node_cfg_cluster_size(struct raft_node *node, int value) +{ + node->cfg_cluster_size = value; + if (raft_node_is_started(node)) { + raft_cfg_cluster_size(&node->raft, value); + raft_run_async_work(); + } +} + void raft_node_cfg_election_timeout(struct raft_node *node, double value) { diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h index c68dc3b22..2138a829e 100644 --- a/test/unit/raft_test_utils.h +++ b/test/unit/raft_test_utils.h @@ -32,6 +32,7 @@ #include "fakesys/fakeev.h" #include "fiber.h" #include "raft/raft.h" +#include "raft/raft_ev.h" #include "unit.h" /** WAL simulation. It stores a list of rows which raft wanted to persist. */ @@ -105,6 +106,7 @@ struct raft_node { int cfg_election_quorum; double cfg_death_timeout; uint32_t cfg_instance_id; + int cfg_cluster_size; struct vclock *cfg_vclock; }; @@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value); void raft_node_cfg_is_candidate(struct raft_node *node, bool value); +void +raft_node_cfg_cluster_size(struct raft_node *node, int value); + void raft_node_cfg_election_timeout(struct raft_node *node, double value); -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches @ 2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 16+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:20 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches Thanks for the patch! I don't think this optimisation is "too much of a hassle". It's quite nice, and looks like a bunch of SLOC in the patch are used up by verbose printing (I mean raft_scores_snprint). In other words, I like the idea and I think we should have that on board. (Just like pre-voting) Please find my comments below. > diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c > index 289d53fd5..5dcbc7821 100644 > --- a/src/lib/raft/raft.c > +++ b/src/lib/raft/raft.c > @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) > return cmp == 0 || cmp == 1; > } > > -static inline void > +static inline bool > raft_add_vote(struct raft *raft, int src, int dst) > { > struct raft_vote *v = &raft->votes[src]; > if (v->did_vote) > - return; > + return false; > v->did_vote = true; > ++raft->votes[dst].count; > + return true; > +} > + You may check split_vote right in raft_add_vote: simply track number of votes given in this term and max votes given for one instance. This way you won't have to run over all 32 nodes each time a vote is casted. > +static bool > +raft_has_split_vote(const struct raft *raft) > +{ > + int max_vote = 0; > + int vote_vac = raft->cluster_size; > + int quorum = raft->election_quorum; > + for (int i = 0; i < VCLOCK_MAX; ++i) { > + int count = raft->votes[i].count; > + vote_vac -= count; > + if (count > max_vote) > + max_vote = count; > + } > + return max_vote < quorum && max_vote + vote_vac < quorum; This is equal to `return max_vote + vote_vac < quorum` > +} > + > +static int > +raft_scores_snprintf(const struct raft *raft, char *buf, int size) > +{ > + int total = 0; > + bool is_empty = true; > + SNPRINT(total, snprintf, buf, size, "{"); > + for (int i = 0; i < VCLOCK_MAX; ++i) { > + int count = raft->votes[i].count; > + if (count == 0) > + continue; > + if (!is_empty) > + SNPRINT(total, snprintf, buf, size, ", "); > + is_empty = false; Nit: you may move is_empty = false into the 'else' branch. > + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count); > + } > + SNPRINT(total, snprintf, buf, size, "}"); > + return total; > +} > + ... > > +static void > +raft_check_split_vote(struct raft *raft) > +{ > + /* When leader is known, there is no election. Thus no vote to split. */ > + if (raft->leader != 0) > + return; > + /* Not a candidate = can't trigger term bump anyway. */ > + if (!raft->is_candidate) > + return; > + /* > + * WAL write in progress means the state is changing. All is rechecked > + * when it is done. > + */ > + if (raft->is_write_in_progress) > + return; > + if (!raft_has_split_vote(raft)) > + return; > + assert(raft_ev_is_active(&raft->timer)); > + if (raft->timer.at < raft->election_timeout) > + return; I don't understand that. timer.at should point at current time, shouldn't it? > + > + assert(raft->state == RAFT_STATE_FOLLOWER || > + raft->state == RAFT_STATE_CANDIDATE); > + struct ev_loop *loop = raft_loop(); > + struct ev_timer *timer = &raft->timer; > + double delay = raft_new_random_election_shift(raft); > + /* > + * Could be too late to speed up anything - probably the term is almost > + * over anyway. > + */ > + double remaining = raft_ev_timer_remaining(loop, timer); > + if (delay >= remaining) > + delay = remaining; > + say_info("RAFT: split vote is discovered - %s, new term in %lf sec", > + raft_scores_str(raft), delay); > + raft_ev_timer_stop(loop, timer); > + raft_ev_timer_set(timer, delay, delay); > + raft_ev_timer_start(loop, timer); > +} > + > void > raft_create(struct raft *raft, const struct raft_vtab *vtab) > { > @@ -1053,6 +1150,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab) > .election_quorum = 1, > .election_timeout = 5, > .death_timeout = 5, > + .cluster_size = VCLOCK_MAX, > .vtab = vtab, > }; > raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, ... > -- Serge Petrenko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection 2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches @ 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches 0 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 0:44 UTC (permalink / raw) To: Serge Petrenko, tarantool-patches Hi! Thanks for the review! >> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c >> index 289d53fd5..5dcbc7821 100644 >> --- a/src/lib/raft/raft.c >> +++ b/src/lib/raft/raft.c >> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) >> return cmp == 0 || cmp == 1; >> } >> -static inline void >> +static inline bool >> raft_add_vote(struct raft *raft, int src, int dst) >> { >> struct raft_vote *v = &raft->votes[src]; >> if (v->did_vote) >> - return; >> + return false; >> v->did_vote = true; >> ++raft->votes[dst].count; >> + return true; >> +} >> + > > You may check split_vote right in raft_add_vote: > simply track number of votes given in this term and > max votes given for one instance. > > This way you won't have to run over all 32 nodes each time a vote > is casted. I did the fullscan intentionally. Otherwise I need to introduce 2 new members to struct raft, keep them up to date, clear on term bump. Too easy to miss something and introduce a bug. While in the current version all the split-vote-specific details are in a single function except for 'raft.votes' member. This thing I couldn't get rid of. As for perf, a couple of ifs or a loop over 32 structs - both would take order of nanoseconds anyway. Here I wouldn't bother. Simplicity matters most. I did the proposal to see how it looks but then discarded as more complex than necessary. However if you think it is still worth doing, tell me and I will re-apply the diff. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index c4f5fb059..fcce3126a 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -159,23 +159,20 @@ raft_add_vote(struct raft *raft, int src, int dst) if (v->did_vote) return false; v->did_vote = true; - ++raft->votes[dst].count; + ++raft->voted_count; + int count = ++raft->votes[dst].count; + if (count > raft->max_vote_count) + raft->max_vote_count = count; return true; } static bool raft_has_split_vote(const struct raft *raft) { - int max_vote = 0; - int vote_vac = raft->cluster_size; - int quorum = raft->election_quorum; - for (int i = 0; i < VCLOCK_MAX; ++i) { - int count = raft->votes[i].count; - vote_vac -= count; - if (count > max_vote) - max_vote = count; - } - return max_vote + vote_vac < quorum; + int vote_vac = raft->cluster_size - raft->voted_count; + if (vote_vac < 0) + vote_vac = 0; + return raft->max_vote_count + vote_vac < raft->election_quorum; } static int @@ -730,6 +727,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; memset(raft->votes, 0, sizeof(raft->votes)); + raft->max_vote_count = 0; + raft->voted_count = 0; /* * The instance could be promoted for the previous term. But promotion * has no effect on following terms. diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 817148792..7115f658f 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -199,6 +199,10 @@ struct raft { uint32_t vote; /** Statistics which node voted for who. */ struct raft_vote votes[VCLOCK_MAX]; + /** How many nodes voted in the current term. */ + int voted_count; + /** Max vote count given to any node in the current term. */ + int max_vote_count; /** Number of votes necessary for successful election. */ int election_quorum; /** ==================== >> +static bool >> +raft_has_split_vote(const struct raft *raft) >> +{ >> + int max_vote = 0; >> + int vote_vac = raft->cluster_size; >> + int quorum = raft->election_quorum; >> + for (int i = 0; i < VCLOCK_MAX; ++i) { >> + int count = raft->votes[i].count; >> + vote_vac -= count; >> + if (count > max_vote) >> + max_vote = count; >> + } >> + return max_vote < quorum && max_vote + vote_vac < quorum; > > This is equal to `return max_vote + vote_vac < quorum` Ouch, that was stupid indeed. I honestly did multiple self-reviews and still missed it. ==================== @@ -175,7 +175,7 @@ raft_has_split_vote(const struct raft *raft) if (count > max_vote) max_vote = count; } - return max_vote < quorum && max_vote + vote_vac < quorum; + return max_vote + vote_vac < quorum; } ==================== >> +} >> + >> +static int >> +raft_scores_snprintf(const struct raft *raft, char *buf, int size) >> +{ >> + int total = 0; >> + bool is_empty = true; >> + SNPRINT(total, snprintf, buf, size, "{"); >> + for (int i = 0; i < VCLOCK_MAX; ++i) { >> + int count = raft->votes[i].count; >> + if (count == 0) >> + continue; >> + if (!is_empty) >> + SNPRINT(total, snprintf, buf, size, ", "); >> + is_empty = false; > > Nit: you may move is_empty = false into the 'else' branch. I put it here to write 1 line less. But I don't mind. ==================== @@ -190,7 +190,8 @@ raft_scores_snprintf(const struct raft *raft, char *buf, int size) continue; if (!is_empty) SNPRINT(total, snprintf, buf, size, ", "); - is_empty = false; + else + is_empty = false; SNPRINT(total, snprintf, buf, size, "%d: %d", i, count); } ==================== >> + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count); >> + } >> + SNPRINT(total, snprintf, buf, size, "}"); >> + return total; >> +} >> + > > ... > >> +static void >> +raft_check_split_vote(struct raft *raft) >> +{ >> + /* When leader is known, there is no election. Thus no vote to split. */ >> + if (raft->leader != 0) >> + return; >> + /* Not a candidate = can't trigger term bump anyway. */ >> + if (!raft->is_candidate) >> + return; >> + /* >> + * WAL write in progress means the state is changing. All is rechecked >> + * when it is done. >> + */ >> + if (raft->is_write_in_progress) >> + return; >> + if (!raft_has_split_vote(raft)) >> + return; >> + assert(raft_ev_is_active(&raft->timer)); >> + if (raft->timer.at < raft->election_timeout) >> + return; > > I don't understand that. timer.at should point at current time, shouldn't it? Yes, thanks for noticing. This is a bug. And in some other existing places too. I made a separate commit fixing the existing places, see v2. And I fixed this place below. Replaced with ev_timer.repeat. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 630f8c677..c4f5fb059 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -1124,7 +1124,7 @@ raft_check_split_vote(struct raft *raft) if (!raft_has_split_vote(raft)) return; assert(raft_ev_is_active(&raft->timer)); - if (raft->timer.at < raft->election_timeout) + if (raft->timer.repeat < raft->election_timeout) return; assert(raft->state == RAFT_STATE_FOLLOWER || diff --git a/test/unit/raft.c b/test/unit/raft.c index 9b462f755..f3ed453ed 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1532,8 +1532,8 @@ raft_test_split_vote(void) 3 /* Source. */ ), 0, "vote response for 3 from 3"); - ok(node.raft.timer.at >= node.raft.election_timeout, "term timeout >= " - "election timeout normally"); + ok(node.raft.timer.repeat >= node.raft.election_timeout, "term " + "timeout >= election timeout normally"); is(raft_node_send_vote_response(&node, 2 /* Term. */, @@ -1541,7 +1541,7 @@ raft_test_split_vote(void) 4 /* Source. */ ), 0, "vote response for 3 from 4"); - ok(node.raft.timer.at < node.raft.election_timeout, "split vote " + ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote " "reduced the term timeout"); raft_run_next_event(); @@ -1555,7 +1555,7 @@ raft_test_split_vote(void) "{0: 2}" /* Vclock. */ ), "a new term"); - ok(node.raft.timer.at >= node.raft.election_timeout, "timeout is " + ok(node.raft.timer.repeat >= node.raft.election_timeout, "timeout is " "normal again"); /* @@ -1574,11 +1574,11 @@ raft_test_split_vote(void) 2 /* Vote. */, 2 /* Source. */ ), 0, "vote response for 2 from 2"); - ok(node.raft.timer.at >= node.raft.election_timeout, "the vote is not " - "split yet"); + ok(node.raft.timer.repeat >= node.raft.election_timeout, "the vote is " + "not split yet"); raft_node_cfg_cluster_size(&node, 2); - ok(node.raft.timer.at < node.raft.election_timeout, "cluster size " + ok(node.raft.timer.repeat < node.raft.election_timeout, "cluster size " "change makes split vote"); raft_run_next_event(); @@ -1615,8 +1615,8 @@ raft_test_split_vote(void) is(raft_vote_count(&node.raft), 1, "the only own vote was from self"); raft_node_cfg_cluster_size(&node, 2); - ok(node.raft.timer.at >= node.raft.death_timeout, "cluster change does " - "not affect the leader's death waiting"); + ok(node.raft.timer.repeat >= node.raft.death_timeout, "cluster change " + "does not affect the leader's death waiting"); /* * Non-candidate should ignore split vote. @@ -1659,7 +1659,7 @@ raft_test_split_vote(void) 2 /* Source. */ ), 0, "vote response for 2 from 2"); - double delay = node.raft.timer.at; + double delay = node.raft.timer.repeat; ok(delay < node.raft.election_timeout, "split vote is already " "inevitable"); @@ -1669,8 +1669,8 @@ raft_test_split_vote(void) 3 /* Source. */ ), 0, "vote response for 3 from 3"); - is(delay, node.raft.timer.at, "split vote got worse, but delay didn't " - "change"); + is(delay, node.raft.timer.repeat, "split vote got worse, but delay " + "didn't change"); /* * Handle split vote when WAL write is in progress. ==================== Additionally, I disabled split vote detection for cluster < quorum. It would only increase rate of term bumps and wouldn't solve the problem. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index c4f5fb059..098f60ed9 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -169,6 +169,12 @@ raft_has_split_vote(const struct raft *raft) int max_vote = 0; int vote_vac = raft->cluster_size; int quorum = raft->election_quorum; + /* + * Quorum > cluster is either a misconfiguration or some instances + * didn't register yet. Anyway, speeding the elections up won't help. + */ + if (vote_vac < quorum) + return false; for (int i = 0; i < VCLOCK_MAX; ++i) { int count = raft->votes[i].count; vote_vac -= count; diff --git a/test/unit/raft.c b/test/unit/raft.c index f3ed453ed..e88d12a5c 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1500,7 +1500,7 @@ raft_test_promote_restore(void) static void raft_test_split_vote(void) { - raft_start_test(35); + raft_start_test(39); struct raft_node node; raft_node_create(&node); @@ -1697,6 +1697,27 @@ raft_test_split_vote(void) is(node.raft.term, 3, "bump term"); is(node.raft.vote, 1, "vote for self"); + /* + * Split vote check is disabled when cluster size < quorum. Makes no + * sense to speed the elections up. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 1); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " + "is not checked for cluster < quorum"); + raft_node_destroy(&node); raft_finish_test(); } diff --git a/test/unit/raft.result b/test/unit/raft.result index d12aae710..465bfb97d 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -263,7 +263,7 @@ ok 13 - subtests ok 14 - subtests *** raft_test_promote_restore: done *** *** raft_test_split_vote *** - 1..35 + 1..39 ok 1 - elections with a new term ok 2 - vote response for 1 from 2 ok 3 - vote response for 3 from 3 @@ -299,6 +299,10 @@ ok 14 - subtests ok 33 - vote response for 2 from 2 ok 34 - bump term ok 35 - vote for self + ok 36 - bump term + ok 37 - vote for self + ok 38 - vote response for 2 from 2 + ok 39 - split vote is not checked for cluster < quorum ok 15 - subtests *** raft_test_split_vote: done *** *** main_f: done *** ==================== Then I fixed a similar issue with voted node count > cluster size. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 098f60ed9..50e0dfe87 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -181,6 +181,13 @@ raft_has_split_vote(const struct raft *raft) if (count > max_vote) max_vote = count; } + /* + * More nodes voted than there are nodes configured. The reason is the + * the same as with quorum > cluster. The action is also the same - + * faster term bumps won't help. + */ + if (vote_vac < 0) + return false; return max_vote + vote_vac < quorum; } diff --git a/test/unit/raft.c b/test/unit/raft.c index e88d12a5c..3d5e2c777 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1500,7 +1500,7 @@ raft_test_promote_restore(void) static void raft_test_split_vote(void) { - raft_start_test(39); + raft_start_test(46); struct raft_node node; raft_node_create(&node); @@ -1718,6 +1718,43 @@ raft_test_split_vote(void) ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " "is not checked for cluster < quorum"); + /* + * Split vote check is disabled when vote count > cluster size. The + * reason is the same as with quorum > cluster size - something is odd, + * more term bumps won't help. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 2 from 3"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 4 /* Source. */ + ), 0, "vote response for 3 from 4"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 4 /* Vote. */, + 5 /* Source. */ + ), 0, "vote response for 4 from 5"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " + "is not checked when vote count > cluster size"); + raft_node_destroy(&node); raft_finish_test(); } diff --git a/test/unit/raft.result b/test/unit/raft.result index 465bfb97d..cce1891c4 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -263,7 +263,7 @@ ok 13 - subtests ok 14 - subtests *** raft_test_promote_restore: done *** *** raft_test_split_vote *** - 1..39 + 1..46 ok 1 - elections with a new term ok 2 - vote response for 1 from 2 ok 3 - vote response for 3 from 3 @@ -303,6 +303,13 @@ ok 14 - subtests ok 37 - vote for self ok 38 - vote response for 2 from 2 ok 39 - split vote is not checked for cluster < quorum + ok 40 - bump term + ok 41 - vote for self + ok 42 - vote response for 2 from 2 + ok 43 - vote response for 2 from 3 + ok 44 - vote response for 3 from 4 + ok 45 - vote response for 4 from 5 + ok 46 - split vote is not checked when vote count > cluster size ok 15 - subtests *** raft_test_split_vote: done *** *** main_f: done *** ==================== Then I realized one of my tests wasn't accurate enough - split vote during WAL write wasn't properly checked. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index f606f9e79..cd7c91c8b 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -835,6 +835,11 @@ raft_sm_wait_election_end(struct raft *raft) raft_new_random_election_shift(raft); raft_ev_timer_set(&raft->timer, election_timeout, election_timeout); raft_ev_timer_start(raft_loop(), &raft->timer); + /* + * Could start the waiting after a WAL write during which the split vote + * could happen. + */ + raft_check_split_vote(raft); } static void diff --git a/test/unit/raft.c b/test/unit/raft.c index 3d5e2c777..4aa3f4959 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1500,7 +1500,7 @@ raft_test_promote_restore(void) static void raft_test_split_vote(void) { - raft_start_test(46); + raft_start_test(51); struct raft_node node; raft_node_create(&node); @@ -1693,6 +1693,13 @@ raft_test_split_vote(void) ), 0, "vote response for 2 from 2"); raft_node_unblock(&node); + is(node.raft.term, 2, "new term"); + is(node.raft.vote, 1, "voted for self"); + is(node.raft.volatile_term, 2, "volatile term"); + is(node.raft.volatile_vote, 1, "volatile vote"); + ok(node.raft.timer.repeat < node.raft.election_timeout, "found split " + "vote after WAL write"); + raft_run_next_event(); is(node.raft.term, 3, "bump term"); is(node.raft.vote, 1, "vote for self"); diff --git a/test/unit/raft.result b/test/unit/raft.result index cce1891c4..3cebf2513 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -263,7 +263,7 @@ ok 13 - subtests ok 14 - subtests *** raft_test_promote_restore: done *** *** raft_test_split_vote *** - 1..46 + 1..51 ok 1 - elections with a new term ok 2 - vote response for 1 from 2 ok 3 - vote response for 3 from 3 @@ -297,19 +297,24 @@ ok 14 - subtests ok 31 - new volatile term ok 32 - new volatile vote ok 33 - vote response for 2 from 2 - ok 34 - bump term - ok 35 - vote for self - ok 36 - bump term - ok 37 - vote for self - ok 38 - vote response for 2 from 2 - ok 39 - split vote is not checked for cluster < quorum - ok 40 - bump term - ok 41 - vote for self - ok 42 - vote response for 2 from 2 - ok 43 - vote response for 2 from 3 - ok 44 - vote response for 3 from 4 - ok 45 - vote response for 4 from 5 - ok 46 - split vote is not checked when vote count > cluster size + ok 34 - new term + ok 35 - voted for self + ok 36 - volatile term + ok 37 - volatile vote + ok 38 - found split vote after WAL write + ok 39 - bump term + ok 40 - vote for self + ok 41 - bump term + ok 42 - vote for self + ok 43 - vote response for 2 from 2 + ok 44 - split vote is not checked for cluster < quorum + ok 45 - bump term + ok 46 - vote for self + ok 47 - vote response for 2 from 2 + ok 48 - vote response for 2 from 3 + ok 49 - vote response for 3 from 4 + ok 50 - vote response for 4 from 5 + ok 51 - split vote is not checked when vote count > cluster size ok 15 - subtests *** raft_test_split_vote: done *** *** main_f: done *** ==================== Then I realized I didn't check split vote on quorum change. When it is increased, it can trigger split vote. ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index cd7c91c8b..5dcc5beaf 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -1046,6 +1046,8 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum) if (raft->state == RAFT_STATE_CANDIDATE && raft_vote_count(raft) >= raft->election_quorum) raft_sm_become_leader(raft); + else + raft_check_split_vote(raft); } void diff --git a/test/unit/raft.c b/test/unit/raft.c index 4aa3f4959..e264e864e 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1500,7 +1500,7 @@ raft_test_promote_restore(void) static void raft_test_split_vote(void) { - raft_start_test(51); + raft_start_test(58); struct raft_node node; raft_node_create(&node); @@ -1762,6 +1762,34 @@ raft_test_split_vote(void) ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " "is not checked when vote count > cluster size"); + /* + * Split vote can happen if quorum was suddenly increased. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "not split " + "vote yet"); + + raft_node_cfg_election_quorum(&node, 3); + ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote " + "after quorum increase"); + + raft_run_next_event(); + is(node.raft.term, 3, "bump term"); + is(node.raft.vote, 1, "vote for self"); + raft_node_destroy(&node); raft_finish_test(); } ==================== ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches 2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 16+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2022-01-20 10:21 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches 20.01.2022 03:44, Vladislav Shpilevoy пишет: > Hi! Thanks for the review! > >>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c >>> index 289d53fd5..5dcbc7821 100644 >>> --- a/src/lib/raft/raft.c >>> +++ b/src/lib/raft/raft.c >>> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) >>> return cmp == 0 || cmp == 1; >>> } >>> -static inline void >>> +static inline bool >>> raft_add_vote(struct raft *raft, int src, int dst) >>> { >>> struct raft_vote *v = &raft->votes[src]; >>> if (v->did_vote) >>> - return; >>> + return false; >>> v->did_vote = true; >>> ++raft->votes[dst].count; >>> + return true; >>> +} >>> + >> You may check split_vote right in raft_add_vote: >> simply track number of votes given in this term and >> max votes given for one instance. >> >> This way you won't have to run over all 32 nodes each time a vote >> is casted. > I did the fullscan intentionally. Otherwise I need to introduce 2 > new members to struct raft, keep them up to date, clear on term > bump. Too easy to miss something and introduce a bug. While in > the current version all the split-vote-specific details are in a > single function except for 'raft.votes' member. This thing I couldn't > get rid of. > > As for perf, a couple of ifs or a loop over 32 structs - both would > take order of nanoseconds anyway. Here I wouldn't bother. Simplicity > matters most. > > I did the proposal to see how it looks but then discarded as more > complex than necessary. However if you think it is still worth doing, > tell me and I will re-apply the diff. Thanks for trying this out! TBH, this diff is exactly what I wanted and it still looks better (not only preformance-wise, but simpler as well) in my opinion. I understand your point about having to update 2 extra members every now and then, so feel free to choose any option you like. > > ==================== > diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c > index c4f5fb059..fcce3126a 100644 > --- a/src/lib/raft/raft.c > +++ b/src/lib/raft/raft.c > @@ -159,23 +159,20 @@ raft_add_vote(struct raft *raft, int src, int dst) > if (v->did_vote) > return false; > v->did_vote = true; > - ++raft->votes[dst].count; > + ++raft->voted_count; > + int count = ++raft->votes[dst].count; > + if (count > raft->max_vote_count) > + raft->max_vote_count = count; > return true; > } > > static bool > raft_has_split_vote(const struct raft *raft) > { > - int max_vote = 0; > - int vote_vac = raft->cluster_size; > - int quorum = raft->election_quorum; > - for (int i = 0; i < VCLOCK_MAX; ++i) { > - int count = raft->votes[i].count; > - vote_vac -= count; > - if (count > max_vote) > - max_vote = count; > - } > - return max_vote + vote_vac < quorum; > + int vote_vac = raft->cluster_size - raft->voted_count; > + if (vote_vac < 0) > + vote_vac = 0; > + return raft->max_vote_count + vote_vac < raft->election_quorum; > } > > static int > @@ -730,6 +727,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) > raft->leader = 0; > raft->state = RAFT_STATE_FOLLOWER; > memset(raft->votes, 0, sizeof(raft->votes)); > + raft->max_vote_count = 0; > + raft->voted_count = 0; > /* > * The instance could be promoted for the previous term. But promotion > * has no effect on following terms. > diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h > index 817148792..7115f658f 100644 > --- a/src/lib/raft/raft.h > +++ b/src/lib/raft/raft.h > @@ -199,6 +199,10 @@ struct raft { > uint32_t vote; > /** Statistics which node voted for who. */ > struct raft_vote votes[VCLOCK_MAX]; > + /** How many nodes voted in the current term. */ > + int voted_count; > + /** Max vote count given to any node in the current term. */ > + int max_vote_count; > /** Number of votes necessary for successful election. */ > int election_quorum; > /** > ==================== > >>> +static bool >>> +raft_has_split_vote(const struct raft *raft) >>> +{ >>> + int max_vote = 0; >>> + int vote_vac = raft->cluster_size; >>> + int quorum = raft->election_quorum; >>> + for (int i = 0; i < VCLOCK_MAX; ++i) { >>> + int count = raft->votes[i].count; >>> + vote_vac -= count; >>> + if (count > max_vote) >>> + max_vote = count; >>> + } >>> + return max_vote < quorum && max_vote + vote_vac < quorum; >> This is equal to `return max_vote + vote_vac < quorum` > Ouch, that was stupid indeed. I honestly did multiple self-reviews > and still missed it. Don't bother. Could happen to anyone. > ==================== > @@ -175,7 +175,7 @@ raft_has_split_vote(const struct raft *raft) > if (count > max_vote) > max_vote = count; > } > - return max_vote < quorum && max_vote + vote_vac < quorum; > + return max_vote + vote_vac < quorum; > } > ==================== > ... -- Serge Petrenko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection 2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches @ 2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 23:02 UTC (permalink / raw) To: Serge Petrenko, tarantool-patches >>>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c >>>> index 289d53fd5..5dcbc7821 100644 >>>> --- a/src/lib/raft/raft.c >>>> +++ b/src/lib/raft/raft.c >>>> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) >>>> return cmp == 0 || cmp == 1; >>>> } >>>> -static inline void >>>> +static inline bool >>>> raft_add_vote(struct raft *raft, int src, int dst) >>>> { >>>> struct raft_vote *v = &raft->votes[src]; >>>> if (v->did_vote) >>>> - return; >>>> + return false; >>>> v->did_vote = true; >>>> ++raft->votes[dst].count; >>>> + return true; >>>> +} >>>> + >>> You may check split_vote right in raft_add_vote: >>> simply track number of votes given in this term and >>> max votes given for one instance. >>> >>> This way you won't have to run over all 32 nodes each time a vote >>> is casted. >> I did the fullscan intentionally. Otherwise I need to introduce 2 >> new members to struct raft, keep them up to date, clear on term >> bump. Too easy to miss something and introduce a bug. While in >> the current version all the split-vote-specific details are in a >> single function except for 'raft.votes' member. This thing I couldn't >> get rid of. >> >> As for perf, a couple of ifs or a loop over 32 structs - both would >> take order of nanoseconds anyway. Here I wouldn't bother. Simplicity >> matters most. >> >> I did the proposal to see how it looks but then discarded as more >> complex than necessary. However if you think it is still worth doing, >> tell me and I will re-apply the diff. > > Thanks for trying this out! > > TBH, this diff is exactly what I wanted and it still looks better > (not only preformance-wise, but simpler as well) in my opinion. > > I understand your point about having to update 2 extra members every > now and then, so feel free to choose any option you like. I applied this diff: ==================== diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 5dcc5beaf..90ed01ca4 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -159,36 +159,29 @@ raft_add_vote(struct raft *raft, int src, int dst) if (v->did_vote) return false; v->did_vote = true; - ++raft->votes[dst].count; + ++raft->voted_count; + int count = ++raft->votes[dst].count; + if (count > raft->max_vote) + raft->max_vote = count; return true; } static bool raft_has_split_vote(const struct raft *raft) { - int max_vote = 0; int vote_vac = raft->cluster_size; int quorum = raft->election_quorum; /* * Quorum > cluster is either a misconfiguration or some instances * didn't register yet. Anyway, speeding the elections up won't help. + * The same when more nodes voted than there are nodes configured. */ if (vote_vac < quorum) return false; - for (int i = 0; i < VCLOCK_MAX; ++i) { - int count = raft->votes[i].count; - vote_vac -= count; - if (count > max_vote) - max_vote = count; - } - /* - * More nodes voted than there are nodes configured. The reason is the - * the same as with quorum > cluster. The action is also the same - - * faster term bumps won't help. - */ + vote_vac -= raft->voted_count; if (vote_vac < 0) return false; - return max_vote + vote_vac < quorum; + return raft->max_vote + vote_vac < quorum; } static int @@ -743,6 +736,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; memset(raft->votes, 0, sizeof(raft->votes)); + raft->voted_count = 0; + raft->max_vote = 0; /* * The instance could be promoted for the previous term. But promotion * has no effect on following terms. diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 817148792..05e373254 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -199,6 +199,10 @@ struct raft { uint32_t vote; /** Statistics which node voted for who. */ struct raft_vote votes[VCLOCK_MAX]; + /** How many nodes voted in the current term. */ + int voted_count; + /** Max vote count given to any node in the current term. */ + int max_vote; /** Number of votes necessary for successful election. */ int election_quorum; /** ^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches ` (2 preceding siblings ...) 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches 2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches 4 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw) To: tarantool-patches, sergepetrenko Raft needs to know cluster size in order to detect and handle split vote. The patch uses registered server count as cluster size. It is not documented nor has a changelog file because this is an optimization. Can't be observed except in logs or with a watch. Closes #5285 --- src/box/raft.c | 4 +- .../election_split_vote_test.lua | 92 +++++++++++++++++++ 2 files changed, 95 insertions(+), 1 deletion(-) create mode 100644 test/replication-luatest/election_split_vote_test.lua diff --git a/src/box/raft.c b/src/box/raft.c index 1e360dc88..1908b71b6 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -229,7 +229,9 @@ box_raft_update_election_quorum(void) * be lost. */ int quorum = MIN(replication_synchro_quorum, max); - raft_cfg_election_quorum(box_raft(), quorum); + struct raft *raft = box_raft(); + raft_cfg_election_quorum(raft, quorum); + raft_cfg_cluster_size(raft, replicaset.registered_count); } void diff --git a/test/replication-luatest/election_split_vote_test.lua b/test/replication-luatest/election_split_vote_test.lua new file mode 100644 index 000000000..f31bfd7f3 --- /dev/null +++ b/test/replication-luatest/election_split_vote_test.lua @@ -0,0 +1,92 @@ +local t = require('luatest') +local cluster = require('test.luatest_helpers.cluster') +local helpers = require('test.luatest_helpers') +local wait_timeout = 120 + +-- +-- gh-5285: split vote is when in the current term there can't be winner of the +-- leader role. Number of unused votes is not enough for anyone to get the +-- quorum. It can be detected to speed up the term bump. +-- +local g = t.group('split-vote') + +g.before_each(function() + g.cluster = cluster:new({}) + local node1_uri = helpers.instance_uri('node1') + local node2_uri = helpers.instance_uri('node2') + local replication = {node1_uri, node2_uri} + local box_cfg = { + listen = node1_uri, + replication = replication, + -- To speed up new term when try to elect a first leader. + replication_timeout = 0.1, + replication_synchro_quorum = 2, + election_timeout = 1000000, + } + g.node1 = g.cluster:build_server({alias = 'node1', box_cfg = box_cfg}) + + box_cfg.listen = node2_uri + g.node2 = g.cluster:build_server({alias = 'node2', box_cfg = box_cfg}) + + g.cluster:add_server(g.node1) + g.cluster:add_server(g.node2) + g.cluster:start() +end) + +g.after_each(function() + g.cluster:drop() +end) + +g.test_split_vote = function(g) + -- Stop the replication so as the nodes can't request votes from each other. + local node1_repl = g.node1:exec(function() + local repl = box.cfg.replication + box.cfg{replication = {}} + return repl + end) + local node2_repl = g.node2:exec(function() + local repl = box.cfg.replication + box.cfg{replication = {}} + return repl + end) + + -- Both vote for self but don't see the split-vote yet. + g.node1:exec(function() + box.cfg{election_mode = 'candidate'} + end) + g.node2:exec(function() + box.cfg{election_mode = 'candidate'} + end) + + -- Wait for the votes to actually happen. + t.helpers.retrying({timeout = wait_timeout}, function() + local func = function() + return box.info.election.vote == box.info.id + end + assert(g.node1:exec(func)) + assert(g.node2:exec(func)) + end) + + -- Now let the nodes notice the split vote. + g.node1:exec(function(repl) + box.cfg{replication = repl} + end, {node1_repl}) + g.node2:exec(function(repl) + box.cfg{replication = repl} + end, {node2_repl}) + + t.helpers.retrying({timeout = wait_timeout}, function() + local msg = 'split vote is discovered' + assert(g.node1:grep_log(msg) or g.node2:grep_log(msg)) + end) + + -- Ensure a leader is eventually elected. Nothing is broken for good. + g.node1:exec(function() + box.cfg{election_timeout = 1} + end) + g.node2:exec(function() + box.cfg{election_timeout = 1} + end) + g.node1:wait_election_leader_found() + g.node2:wait_election_leader_found() +end -- 2.24.3 (Apple Git-128) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches @ 2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 1 reply; 16+ messages in thread From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:21 UTC (permalink / raw) To: Vladislav Shpilevoy, tarantool-patches 15.01.2022 03:48, Vladislav Shpilevoy пишет: > Raft needs to know cluster size in order to detect and handle > split vote. The patch uses registered server count as cluster > size. > > It is not documented nor has a changelog file because this is an > optimization. Can't be observed except in logs or with a watch. > > Closes #5285 Thanks for working on this! Please, find one comment below. > --- > src/box/raft.c | 4 +- > .../election_split_vote_test.lua | 92 +++++++++++++++++++ > 2 files changed, 95 insertions(+), 1 deletion(-) > create mode 100644 test/replication-luatest/election_split_vote_test.lua > > diff --git a/src/box/raft.c b/src/box/raft.c > index 1e360dc88..1908b71b6 100644 > --- a/src/box/raft.c > +++ b/src/box/raft.c > @@ -229,7 +229,9 @@ box_raft_update_election_quorum(void) > * be lost. > */ > int quorum = MIN(replication_synchro_quorum, max); > - raft_cfg_election_quorum(box_raft(), quorum); > + struct raft *raft = box_raft(); > + raft_cfg_election_quorum(raft, quorum); > + raft_cfg_cluster_size(raft, replicaset.registered_count); Better use `max` instead of `replicaset.registered_count` here. I don't think something could go wrong with setting cluster size to zero initially, but let's better avoid that. > } > > void ... -- Serge Petrenko ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling 2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches @ 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 0 siblings, 0 replies; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 0:44 UTC (permalink / raw) To: Serge Petrenko, tarantool-patches Thanks for the review! >> diff --git a/src/box/raft.c b/src/box/raft.c >> index 1e360dc88..1908b71b6 100644 >> --- a/src/box/raft.c >> +++ b/src/box/raft.c >> @@ -229,7 +229,9 @@ box_raft_update_election_quorum(void) >> * be lost. >> */ >> int quorum = MIN(replication_synchro_quorum, max); >> - raft_cfg_election_quorum(box_raft(), quorum); >> + struct raft *raft = box_raft(); >> + raft_cfg_election_quorum(raft, quorum); >> + raft_cfg_cluster_size(raft, replicaset.registered_count); > > Better use `max` instead of `replicaset.registered_count` here. > I don't think something could go wrong with setting cluster size to zero > initially, but let's better avoid that. Done: ==================== diff --git a/src/box/raft.c b/src/box/raft.c index 1908b71b6..be6009cc1 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -231,7 +231,7 @@ box_raft_update_election_quorum(void) int quorum = MIN(replication_synchro_quorum, max); struct raft *raft = box_raft(); raft_cfg_election_quorum(raft, quorum); - raft_cfg_cluster_size(raft, replicaset.registered_count); + raft_cfg_cluster_size(raft, max); } void ==================== ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches ` (3 preceding siblings ...) 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches @ 2022-01-16 14:10 ` Konstantin Osipov via Tarantool-patches 2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches 4 siblings, 1 reply; 16+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2022-01-16 14:10 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [22/01/15 03:52]: > Split vote handling in Raft, its usage in storage, and a bug fix found while > working on this in the first commit. > > Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote > Issue: https://github.com/tarantool/tarantool/issues/5285 While I don't mind this patch it is an unnecessary complication on top of the spec. To prevent the deteriorating effects on liveness from split vote one needs pre-voting, not split-vote detection. > Vladislav Shpilevoy (4): > raft: fix crash on election_timeout reconfig > raft: track all votes, even not own > raft: introduce split vote detection > election: activate raft split vote handling -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote 2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches @ 2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches 0 siblings, 1 reply; 16+ messages in thread From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-17 22:57 UTC (permalink / raw) To: Konstantin Osipov, tarantool-patches, sergepetrenko On 16.01.2022 15:10, Konstantin Osipov wrote: > * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [22/01/15 03:52]: >> Split vote handling in Raft, its usage in storage, and a bug fix found while >> working on this in the first commit. >> >> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote >> Issue: https://github.com/tarantool/tarantool/issues/5285 > > While I don't mind this patch it is an unnecessary complication on > top of the spec. To prevent the deteriorating effects on liveness > from split vote one needs pre-voting, not split-vote detection. I agree. I also don't like that patchset too much (not counting the first commit, which fixes a crash). The pros are: - can save some seconds in case of a split-vote; - potentially can expose all the vote counts to the monitoring, not just own count. Cons is - too much hassle in the code just for that. Pre-voting on the other hand solves a different problem. Only slightly related. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote 2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches @ 2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches 0 siblings, 0 replies; 16+ messages in thread From: Konstantin Osipov via Tarantool-patches @ 2022-01-18 7:18 UTC (permalink / raw) To: Vladislav Shpilevoy; +Cc: tarantool-patches * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [22/01/18 02:16]: > >> Split vote handling in Raft, its usage in storage, and a bug fix found while > >> working on this in the first commit. > >> > >> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote > >> Issue: https://github.com/tarantool/tarantool/issues/5285 > > > > While I don't mind this patch it is an unnecessary complication on > > top of the spec. To prevent the deteriorating effects on liveness > > from split vote one needs pre-voting, not split-vote detection. > > I agree. I also don't like that patchset too much (not counting the first > commit, which fixes a crash). The pros are: > > - can save some seconds in case of a split-vote; > > - potentially can expose all the vote counts to the monitoring, not > just own count. > > Cons is - too much hassle in the code just for that. > > Pre-voting on the other hand solves a different problem. Only slightly > related. Well, I don't doubt this hassle is not too harmful to the logic of the protocol, but it's not solving all the liveness issues either. -- Konstantin Osipov, Moscow, Russia ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-01-21 0:42 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches 2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches 2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches 2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches 2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches 2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox