* [Tarantool-patches] [PATCH 0/4] Split vote
@ 2022-01-15 0:48 Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
` (4 more replies)
0 siblings, 5 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Split vote handling in Raft, its usage in storage, and a bug fix found while
working on this in the first commit.
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
Issue: https://github.com/tarantool/tarantool/issues/5285
Vladislav Shpilevoy (4):
raft: fix crash on election_timeout reconfig
raft: track all votes, even not own
raft: introduce split vote detection
election: activate raft split vote handling
.../unreleased/election-timeout-cfg-crash.md | 5 +
src/box/raft.c | 4 +-
src/lib/raft/raft.c | 142 ++++++++-
src/lib/raft/raft.h | 30 +-
.../election_split_vote_test.lua | 92 ++++++
test/unit/raft.c | 299 +++++++++++++++++-
test/unit/raft.result | 64 +++-
test/unit/raft_test_utils.c | 12 +
test/unit/raft_test_utils.h | 5 +
9 files changed, 622 insertions(+), 31 deletions(-)
create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md
create mode 100644 test/replication-luatest/election_split_vote_test.lua
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
` (3 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
It used to crash if done during election on a node voted for
anybody, it is a candidate, it doesn't know a leader yet, but has
a WAL write in progress.
Thus it could only happen if the term was bumped by a message from
a non-leader node and wasn't flushed to the disk yet.
The patch makes the reconfig check if there is a WAL write in
progress. Then don't do anything.
Could also check for volatile vote instead of persistent, but it
would create the same problem for the case when started writing
vote for self and didn't finish yet. Reconfig would crash.
---
.../unreleased/election-timeout-cfg-crash.md | 5 ++
src/lib/raft/raft.c | 3 +-
test/unit/raft.c | 75 ++++++++++++++++++-
test/unit/raft.result | 16 +++-
4 files changed, 96 insertions(+), 3 deletions(-)
create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md
diff --git a/changelogs/unreleased/election-timeout-cfg-crash.md b/changelogs/unreleased/election-timeout-cfg-crash.md
new file mode 100644
index 000000000..e870e3665
--- /dev/null
+++ b/changelogs/unreleased/election-timeout-cfg-crash.md
@@ -0,0 +1,5 @@
+## bugfix/raft
+
+* Reconfiguration of `box.cfg.election_timeout` could lead to a crash or
+ undefined behaviour if done during an ongoing election with a special WAL
+ write in progress.
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 0e6dc6155..1ae8fe87f 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -935,7 +935,8 @@ raft_cfg_election_timeout(struct raft *raft, double timeout)
return;
raft->election_timeout = timeout;
- if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate) {
+ if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate &&
+ !raft->is_write_in_progress) {
assert(raft_ev_is_active(&raft->timer));
struct ev_loop *loop = raft_loop();
double timeout = raft_ev_timer_remaining(loop, &raft->timer) -
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 8f4d2dd9a..520b94466 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1296,7 +1296,7 @@ raft_test_enable_disable(void)
static void
raft_test_too_long_wal_write(void)
{
- raft_start_test(8);
+ raft_start_test(22);
struct raft_node node;
raft_node_create(&node);
@@ -1362,6 +1362,79 @@ raft_test_too_long_wal_write(void)
ok(raft_time() - ts == node.cfg_death_timeout, "timer works again");
is(node.raft.state, RAFT_STATE_CANDIDATE, "became candidate");
+ /*
+ * During WAL write it is possible to reconfigure election timeout.
+ * The dangerous case is when the timer is active already. It happens
+ * when the node voted and is a candidate, but leader is unknown.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+
+ raft_node_cfg_election_timeout(&node, 100);
+ raft_run_next_event();
+ is(node.raft.term, 2, "term is bumped");
+
+ /* Bump term again but it is not written to WAL yet. */
+ raft_node_block(&node);
+ is(raft_node_send_vote_response(&node,
+ 3 /* Term. */,
+ 3 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "2 votes for 3 in a new term");
+ raft_run_next_event();
+ is(node.raft.term, 2, "term is old");
+ is(node.raft.vote, 1, "vote is used for self");
+ is(node.raft.volatile_term, 3, "volatile term is new");
+ is(node.raft.volatile_vote, 0, "volatile vote is unused");
+
+ raft_node_cfg_election_timeout(&node, 50);
+ raft_node_unblock(&node);
+ ts = raft_time();
+ raft_run_next_event();
+ ts = raft_time() - ts;
+ /* 50 + <= 10% random delay. */
+ ok(ts >= 50 && ts <= 55, "new election timeout works");
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 4 /* Term. */,
+ 1 /* Vote. */,
+ 4 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 4}" /* Vclock. */
+ ), "new term is started with vote for self");
+
+ /*
+ * Similar case when a vote is being written but not finished yet.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+
+ raft_node_cfg_election_timeout(&node, 100);
+ raft_node_block(&node);
+ raft_run_next_event();
+ is(node.raft.term, 1, "term is old");
+ is(node.raft.vote, 0, "vote is unused");
+ is(node.raft.volatile_term, 2, "volatile term is new");
+ is(node.raft.volatile_vote, 1, "volatile vote is self");
+
+ raft_node_cfg_election_timeout(&node, 50);
+ raft_node_unblock(&node);
+ ts = raft_time();
+ raft_run_next_event();
+ ts = raft_time() - ts;
+ /* 50 + <= 10% random delay. */
+ ok(ts >= 50 && ts <= 55, "new election timeout works");
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 3 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 2}" /* Vclock. */
+ ), "new term is started with vote for self");
+
raft_node_destroy(&node);
raft_finish_test();
}
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 14689ea81..764d82969 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -220,7 +220,7 @@ ok 11 - subtests
ok 12 - subtests
*** raft_test_enable_disable: done ***
*** raft_test_too_long_wal_write ***
- 1..8
+ 1..22
ok 1 - vote for 2
ok 2 - vote is volatile
ok 3 - message from leader
@@ -229,6 +229,20 @@ ok 12 - subtests
ok 6 - wal write is finished
ok 7 - timer works again
ok 8 - became candidate
+ ok 9 - term is bumped
+ ok 10 - 2 votes for 3 in a new term
+ ok 11 - term is old
+ ok 12 - vote is used for self
+ ok 13 - volatile term is new
+ ok 14 - volatile vote is unused
+ ok 15 - new election timeout works
+ ok 16 - new term is started with vote for self
+ ok 17 - term is old
+ ok 18 - vote is unused
+ ok 19 - volatile term is new
+ ok 20 - volatile vote is self
+ ok 21 - new election timeout works
+ ok 22 - new term is started with vote for self
ok 13 - subtests
*** raft_test_too_long_wal_write: done ***
*** raft_test_promote_restore ***
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
` (2 subsequent siblings)
4 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
To detect split vote a node needs to see that number of free votes
is not enough for anyone to win even if it gets them all.
Hence every node needs to count votes for all other nodes.
The patch makes raft store votes in a bit more complicated manner
than a simple bitmap for just the current instance.
Part of #5285
---
src/lib/raft/raft.c | 41 +++++++++++++++++++++++++++++++++--------
src/lib/raft/raft.h | 24 +++++++++++++++++-------
test/unit/raft.c | 17 ++++++++++-------
test/unit/raft.result | 7 ++++---
4 files changed, 64 insertions(+), 25 deletions(-)
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 1ae8fe87f..289d53fd5 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,6 +152,16 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
return cmp == 0 || cmp == 1;
}
+static inline void
+raft_add_vote(struct raft *raft, int src, int dst)
+{
+ struct raft_vote *v = &raft->votes[src];
+ if (v->did_vote)
+ return;
+ v->did_vote = true;
+ ++raft->votes[dst].count;
+}
+
/** Schedule broadcast of the complete Raft state to all the followers. */
static void
raft_schedule_broadcast(struct raft *raft);
@@ -279,6 +289,12 @@ void
raft_process_recovery(struct raft *raft, const struct raft_msg *req)
{
say_verbose("RAFT: recover %s", raft_msg_to_string(req));
+ /*
+ * Instance ID is not unknown until recovery ends. Because apparently it
+ * can change during join. In Raft it is set only one time when recovery
+ * ends for good.
+ */
+ assert(raft->self == 0);
if (req->term != 0) {
raft->term = req->term;
raft->volatile_term = req->term;
@@ -335,6 +351,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
* persisted long time ago and still broadcasted. Or a vote response.
*/
if (req->vote != 0) {
+ raft_add_vote(raft, source, req->vote);
+
switch (raft->state) {
case RAFT_STATE_FOLLOWER:
case RAFT_STATE_LEADER:
@@ -395,11 +413,10 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
* and now was answered by some other instance.
*/
assert(raft->volatile_vote == raft->self);
- bool was_set = bit_set(&raft->vote_mask, source);
- raft->vote_count += !was_set;
- if (raft->vote_count < raft->election_quorum) {
+ int vote_count = raft_vote_count(raft);
+ if (vote_count < raft->election_quorum) {
say_info("RAFT: accepted vote for self, vote "
- "count is %d/%d", raft->vote_count,
+ "count is %d/%d", vote_count,
raft->election_quorum);
break;
}
@@ -640,13 +657,11 @@ raft_sm_become_candidate(struct raft *raft)
assert(raft->state == RAFT_STATE_FOLLOWER);
assert(raft->leader == 0);
assert(raft->vote == raft->self);
+ assert(raft_vote_count(raft) >= 1);
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
assert(raft->election_quorum > 1);
raft->state = RAFT_STATE_CANDIDATE;
- raft->vote_count = 1;
- raft->vote_mask = 0;
- bit_set(&raft->vote_mask, raft->self);
raft_sm_wait_election_end(raft);
/* State is visible and it is changed - broadcast. */
raft_schedule_broadcast(raft);
@@ -663,6 +678,7 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
raft->volatile_vote = 0;
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
+ memset(raft->votes, 0, sizeof(raft->votes));
/*
* The instance could be promoted for the previous term. But promotion
* has no effect on following terms.
@@ -685,6 +701,8 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote)
assert(raft->leader == 0);
assert(raft->state == RAFT_STATE_FOLLOWER);
raft->volatile_vote = new_vote;
+ assert(!raft->votes[raft->self].did_vote);
+ raft_add_vote(raft, raft->self, raft->self);
raft_sm_pause_and_dump(raft);
/* Nothing visible is changed - no broadcast. */
}
@@ -956,7 +974,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
assert(election_quorum > 0);
raft->election_quorum = election_quorum;
if (raft->state == RAFT_STATE_CANDIDATE &&
- raft->vote_count >= raft->election_quorum)
+ raft_vote_count(raft) >= raft->election_quorum)
raft_sm_become_leader(raft);
}
@@ -984,6 +1002,13 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
assert(raft->self == 0);
assert(instance_id != 0);
raft->self = instance_id;
+ /*
+ * Couldn't do that reliably during recovery. Instance ID can change
+ * more than once during join. Here instance ID is configured when it is
+ * known forever and is safe to use.
+ */
+ if (raft->vote != 0)
+ raft_add_vote(raft, instance_id, raft->vote);
}
void
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 53e2c58a8..4fbfaa91a 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -138,6 +138,14 @@ struct raft_vtab {
raft_schedule_async_f schedule_async;
};
+/** Vote descriptor of a single node. */
+struct raft_vote {
+ /** Whether the node voted for anybody. */
+ bool did_vote;
+ /** How many votes the node got in the current term. */
+ int count;
+};
+
struct raft {
/** Instance ID of this node. */
uint32_t self;
@@ -189,13 +197,8 @@ struct raft {
*/
uint64_t term;
uint32_t vote;
- /**
- * Bit 1 on position N means that a vote from instance with ID = N was
- * obtained.
- */
- vclock_map_t vote_mask;
- /** Number of votes for this instance. Valid only in candidate state. */
- int vote_count;
+ /** Statistics which node voted for who. */
+ struct raft_vote votes[VCLOCK_MAX];
/** Number of votes necessary for successful election. */
int election_quorum;
/**
@@ -243,6 +246,13 @@ raft_is_enabled(const struct raft *raft)
return raft->is_enabled;
}
+/** Number of votes for self. */
+static inline int
+raft_vote_count(const struct raft *raft)
+{
+ return raft->votes[raft->self].count;
+}
+
/** Process a raft entry stored in WAL/snapshot. */
void
raft_process_recovery(struct raft *raft, const struct raft_msg *req);
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 520b94466..9f18f4d8e 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -70,7 +70,7 @@ raft_test_leader_election(void)
1 /* Volatile vote. */,
"{0: 1}" /* Vclock. */
), "elections with a new term");
- is(node.raft.vote_count, 1, "single vote for self");
+ is(raft_vote_count(&node.raft), 1, "single vote for self");
ok(node.update_count > 0, "trigger worked");
node.update_count = 0;
@@ -100,7 +100,7 @@ raft_test_leader_election(void)
1 /* Vote. */,
2 /* Source. */
), 0, "vote response from 2");
- is(node.raft.vote_count, 2, "2 votes - 1 self and 1 foreign");
+ is(raft_vote_count(&node.raft), 2, "2 votes - 1 self and 1 foreign");
ok(!node.has_work, "no work to do - not enough votes yet");
raft_run_for(node.cfg_election_timeout / 2);
@@ -115,7 +115,7 @@ raft_test_leader_election(void)
1 /* Vote. */,
3 /* Source. */
), 0, "vote response from 3");
- is(node.raft.vote_count, 3, "2 votes - 1 self and 2 foreign");
+ is(raft_vote_count(&node.raft), 3, "2 votes - 1 self and 2 foreign");
is(node.raft.state, RAFT_STATE_LEADER, "became leader");
ok(node.update_count > 0, "trigger worked");
node.update_count = 0;
@@ -142,7 +142,7 @@ raft_test_leader_election(void)
static void
raft_test_recovery(void)
{
- raft_start_test(12);
+ raft_start_test(13);
struct raft_msg msg;
struct raft_node node;
raft_node_create(&node);
@@ -224,6 +224,8 @@ raft_test_recovery(void)
"{0: 1}" /* Vclock. */
), "restart always as a follower");
+ is(raft_vote_count(&node.raft), 1, "vote count is restored correctly");
+
raft_checkpoint_remote(&node.raft, &msg);
ok(raft_msg_check(&msg,
RAFT_STATE_FOLLOWER /* State. */,
@@ -383,7 +385,7 @@ raft_test_vote_skip(void)
1 /* Vote. */,
2 /* Source. */
), 0, "message is accepted");
- is(node.raft.vote_count, 1, "but ignored - too old term");
+ is(raft_vote_count(&node.raft), 1, "but ignored - too old term");
/* Competing vote requests are skipped. */
@@ -392,7 +394,8 @@ raft_test_vote_skip(void)
3 /* Vote. */,
2 /* Source. */
), 0, "message is accepted");
- is(node.raft.vote_count, 1, "but ignored - vote not for this node");
+ is(raft_vote_count(&node.raft), 1,
+ "but ignored - vote not for this node");
is(node.raft.state, RAFT_STATE_CANDIDATE, "this node does not give up");
/* Vote requests are ignored when node is disabled. */
@@ -1071,7 +1074,7 @@ raft_test_election_quorum(void)
1 /* Vote. */,
2 /* Source. */
), 0, "send vote response from second node");
- is(node.raft.vote_count, 2, "vote is accepted");
+ is(raft_vote_count(&node.raft), 2, "vote is accepted");
is(node.raft.state, RAFT_STATE_CANDIDATE, "but still candidate");
raft_node_cfg_election_quorum(&node, 2);
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 764d82969..3fcdf8180 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -29,7 +29,7 @@
ok 1 - subtests
*** raft_test_leader_election: done ***
*** raft_test_recovery ***
- 1..12
+ 1..13
ok 1 - became candidate
ok 2 - remote checkpoint of a candidate
ok 3 - local checkpoint of a candidate
@@ -40,8 +40,9 @@ ok 1 - subtests
ok 8 - remote checkpoint of a leader
ok 9 - local checkpoint of a leader
ok 10 - restart always as a follower
- ok 11 - remote checkpoint of a leader
- ok 12 - local checkpoint of a leader
+ ok 11 - vote count is restored correctly
+ ok 12 - remote checkpoint of a leader
+ ok 13 - local checkpoint of a leader
ok 2 - subtests
*** raft_test_recovery: done ***
*** raft_test_bad_msg ***
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches
4 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Split vote is a situation when during election nobody can win and
will not win in this term for sure. Not a single node could get
enough votes. For example, with 4 nodes one could get 2 votes and
other also 2 votes. Nobody will get quorum 3 in this term.
The patch makes raft able to notice that situation and speed up
the term bump. It is not bumped immediately though, because nodes
might do that simultaneously and will get the split vote again
after voting for self. There is a random delay. But it is just max
10% of election timeout, so it should speed up split vote
resolution on 90% at least.
Part of #5285
---
src/lib/raft/raft.c | 104 +++++++++++++++++-
src/lib/raft/raft.h | 6 ++
test/unit/raft.c | 207 +++++++++++++++++++++++++++++++++++-
test/unit/raft.result | 41 ++++++-
test/unit/raft_test_utils.c | 12 +++
test/unit/raft_test_utils.h | 5 +
6 files changed, 370 insertions(+), 5 deletions(-)
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 289d53fd5..5dcbc7821 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
return cmp == 0 || cmp == 1;
}
-static inline void
+static inline bool
raft_add_vote(struct raft *raft, int src, int dst)
{
struct raft_vote *v = &raft->votes[src];
if (v->did_vote)
- return;
+ return false;
v->did_vote = true;
++raft->votes[dst].count;
+ return true;
+}
+
+static bool
+raft_has_split_vote(const struct raft *raft)
+{
+ int max_vote = 0;
+ int vote_vac = raft->cluster_size;
+ int quorum = raft->election_quorum;
+ for (int i = 0; i < VCLOCK_MAX; ++i) {
+ int count = raft->votes[i].count;
+ vote_vac -= count;
+ if (count > max_vote)
+ max_vote = count;
+ }
+ return max_vote < quorum && max_vote + vote_vac < quorum;
+}
+
+static int
+raft_scores_snprintf(const struct raft *raft, char *buf, int size)
+{
+ int total = 0;
+ bool is_empty = true;
+ SNPRINT(total, snprintf, buf, size, "{");
+ for (int i = 0; i < VCLOCK_MAX; ++i) {
+ int count = raft->votes[i].count;
+ if (count == 0)
+ continue;
+ if (!is_empty)
+ SNPRINT(total, snprintf, buf, size, ", ");
+ is_empty = false;
+ SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
+ }
+ SNPRINT(total, snprintf, buf, size, "}");
+ return total;
+}
+
+static const char *
+raft_scores_str(const struct raft *raft)
+{
+ char *buf = tt_static_buf();
+ int rc = raft_scores_snprintf(raft, buf, TT_STATIC_BUF_LEN);
+ assert(rc >= 0);
+ (void)rc;
+ return buf;
}
/** Schedule broadcast of the complete Raft state to all the followers. */
static void
raft_schedule_broadcast(struct raft *raft);
+/** If there is split vote, the node might reduce the next term delay. */
+static void
+raft_check_split_vote(struct raft *raft);
+
/** Raft state machine methods. 'sm' stands for State Machine. */
/**
@@ -351,7 +400,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
* persisted long time ago and still broadcasted. Or a vote response.
*/
if (req->vote != 0) {
- raft_add_vote(raft, source, req->vote);
+ if (raft_add_vote(raft, source, req->vote))
+ raft_check_split_vote(raft);
switch (raft->state) {
case RAFT_STATE_FOLLOWER:
@@ -1019,6 +1069,13 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
raft->vclock = vclock;
}
+void
+raft_cfg_cluster_size(struct raft *raft, int size)
+{
+ raft->cluster_size = size;
+ raft_check_split_vote(raft);
+}
+
void
raft_new_term(struct raft *raft)
{
@@ -1043,6 +1100,46 @@ raft_schedule_broadcast(struct raft *raft)
raft_schedule_async(raft);
}
+static void
+raft_check_split_vote(struct raft *raft)
+{
+ /* When leader is known, there is no election. Thus no vote to split. */
+ if (raft->leader != 0)
+ return;
+ /* Not a candidate = can't trigger term bump anyway. */
+ if (!raft->is_candidate)
+ return;
+ /*
+ * WAL write in progress means the state is changing. All is rechecked
+ * when it is done.
+ */
+ if (raft->is_write_in_progress)
+ return;
+ if (!raft_has_split_vote(raft))
+ return;
+ assert(raft_ev_is_active(&raft->timer));
+ if (raft->timer.at < raft->election_timeout)
+ return;
+
+ assert(raft->state == RAFT_STATE_FOLLOWER ||
+ raft->state == RAFT_STATE_CANDIDATE);
+ struct ev_loop *loop = raft_loop();
+ struct ev_timer *timer = &raft->timer;
+ double delay = raft_new_random_election_shift(raft);
+ /*
+ * Could be too late to speed up anything - probably the term is almost
+ * over anyway.
+ */
+ double remaining = raft_ev_timer_remaining(loop, timer);
+ if (delay >= remaining)
+ delay = remaining;
+ say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
+ raft_scores_str(raft), delay);
+ raft_ev_timer_stop(loop, timer);
+ raft_ev_timer_set(timer, delay, delay);
+ raft_ev_timer_start(loop, timer);
+}
+
void
raft_create(struct raft *raft, const struct raft_vtab *vtab)
{
@@ -1053,6 +1150,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
.election_quorum = 1,
.election_timeout = 5,
.death_timeout = 5,
+ .cluster_size = VCLOCK_MAX,
.vtab = vtab,
};
raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 4fbfaa91a..33597d153 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -219,6 +219,8 @@ struct raft {
* elections can be started.
*/
double death_timeout;
+ /** Number of instances registered in the cluster. */
+ int cluster_size;
/** Virtual table to perform application-specific actions. */
const struct raft_vtab *vtab;
/**
@@ -333,6 +335,10 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
void
raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
+/** Configure number of registered instances in the cluster. */
+void
+raft_cfg_cluster_size(struct raft *raft, int size);
+
/**
* Bump the term. When it is persisted, the node checks if there is a leader,
* and if there is not, a new election is started. That said, this function can
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 9f18f4d8e..9b462f755 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1497,10 +1497,214 @@ raft_test_promote_restore(void)
raft_finish_test();
}
+static void
+raft_test_split_vote(void)
+{
+ raft_start_test(35);
+ struct raft_node node;
+ raft_node_create(&node);
+
+ /* Normal split vote. */
+
+ raft_node_cfg_cluster_size(&node, 4);
+ raft_node_cfg_election_quorum(&node, 3);
+ raft_run_next_event();
+
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "elections with a new term");
+
+ /* Make so node 1 has votes 1 and 2. Node 3 has votes 3 and 4. */
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 1 from 2");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 3 /* Vote. */,
+ 3 /* Source. */
+ ), 0, "vote response for 3 from 3");
+
+ ok(node.raft.timer.at >= node.raft.election_timeout, "term timeout >= "
+ "election timeout normally");
+
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 3 /* Vote. */,
+ 4 /* Source. */
+ ), 0, "vote response for 3 from 4");
+
+ ok(node.raft.timer.at < node.raft.election_timeout, "split vote "
+ "reduced the term timeout");
+
+ raft_run_next_event();
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 3 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 2}" /* Vclock. */
+ ), "a new term");
+
+ ok(node.raft.timer.at >= node.raft.election_timeout, "timeout is "
+ "normal again");
+
+ /*
+ * Cluster size change can make split vote.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 2);
+ raft_run_next_event();
+ is(node.raft.state, RAFT_STATE_CANDIDATE, "is candidate");
+ is(node.raft.vote, 1, "voted for self");
+
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+ ok(node.raft.timer.at >= node.raft.election_timeout, "the vote is not "
+ "split yet");
+
+ raft_node_cfg_cluster_size(&node, 2);
+ ok(node.raft.timer.at < node.raft.election_timeout, "cluster size "
+ "change makes split vote");
+
+ raft_run_next_event();
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 3 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 2}" /* Vclock. */
+ ), "a new term");
+
+ /*
+ * Split vote can be even when the leader is already known - then
+ * nothing to do. Votes are just left from the beginning of the term and
+ * then probably cluster size reduced a bit.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 2);
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+ /* There is also a vote from 3 for 2. But it wasn't delivered to 1. */
+ is(raft_node_send_leader(&node,
+ 2 /* Term. */,
+ 2 /* Source. */
+ ), 0, "message is accepted");
+ is(node.raft.leader, 2, "other node's leadership is accepted");
+ is(raft_vote_count(&node.raft), 1, "the only own vote was from self");
+
+ raft_node_cfg_cluster_size(&node, 2);
+ ok(node.raft.timer.at >= node.raft.death_timeout, "cluster change does "
+ "not affect the leader's death waiting");
+
+ /*
+ * Non-candidate should ignore split vote.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 3);
+ raft_node_cfg_is_candidate(&node, false);
+
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 3 /* Vote. */,
+ 3 /* Source. */
+ ), 0, "vote response for 3 from 3");
+
+ ok(!raft_ev_is_active(&node.raft.timer), "voter couldn't schedule "
+ "new term");
+
+ /*
+ * Split vote can get worse, but it shouldn't lead to new term delay
+ * restart.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 3);
+
+ raft_run_next_event();
+ is(node.raft.term, 2, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+
+ double delay = node.raft.timer.at;
+ ok(delay < node.raft.election_timeout, "split vote is already "
+ "inevitable");
+
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 3 /* Vote. */,
+ 3 /* Source. */
+ ), 0, "vote response for 3 from 3");
+
+ is(delay, node.raft.timer.at, "split vote got worse, but delay didn't "
+ "change");
+
+ /*
+ * Handle split vote when WAL write is in progress.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 2);
+ raft_node_cfg_election_quorum(&node, 2);
+
+ raft_node_block(&node);
+ raft_run_next_event();
+ is(node.raft.term, 1, "old term");
+ is(node.raft.vote, 0, "unused vote");
+ is(node.raft.volatile_term, 2, "new volatile term");
+ is(node.raft.volatile_vote, 1, "new volatile vote");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+
+ raft_node_unblock(&node);
+ raft_run_next_event();
+ is(node.raft.term, 3, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+
+ raft_node_destroy(&node);
+ raft_finish_test();
+}
+
static int
main_f(va_list ap)
{
- raft_start_test(14);
+ raft_start_test(15);
(void) ap;
fakeev_init();
@@ -1519,6 +1723,7 @@ main_f(va_list ap)
raft_test_enable_disable();
raft_test_too_long_wal_write();
raft_test_promote_restore();
+ raft_test_split_vote();
fakeev_free();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 3fcdf8180..d12aae710 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..14
+1..15
*** raft_test_leader_election ***
1..24
ok 1 - 1 pending message at start
@@ -262,4 +262,43 @@ ok 13 - subtests
ok 12 - not a candidate
ok 14 - subtests
*** raft_test_promote_restore: done ***
+ *** raft_test_split_vote ***
+ 1..35
+ ok 1 - elections with a new term
+ ok 2 - vote response for 1 from 2
+ ok 3 - vote response for 3 from 3
+ ok 4 - term timeout >= election timeout normally
+ ok 5 - vote response for 3 from 4
+ ok 6 - split vote reduced the term timeout
+ ok 7 - a new term
+ ok 8 - timeout is normal again
+ ok 9 - is candidate
+ ok 10 - voted for self
+ ok 11 - vote response for 2 from 2
+ ok 12 - the vote is not split yet
+ ok 13 - cluster size change makes split vote
+ ok 14 - a new term
+ ok 15 - vote response for 2 from 2
+ ok 16 - message is accepted
+ ok 17 - other node's leadership is accepted
+ ok 18 - the only own vote was from self
+ ok 19 - cluster change does not affect the leader's death waiting
+ ok 20 - vote response for 2 from 2
+ ok 21 - vote response for 3 from 3
+ ok 22 - voter couldn't schedule new term
+ ok 23 - bump term
+ ok 24 - vote for self
+ ok 25 - vote response for 2 from 2
+ ok 26 - split vote is already inevitable
+ ok 27 - vote response for 3 from 3
+ ok 28 - split vote got worse, but delay didn't change
+ ok 29 - old term
+ ok 30 - unused vote
+ ok 31 - new volatile term
+ ok 32 - new volatile vote
+ ok 33 - vote response for 2 from 2
+ ok 34 - bump term
+ ok 35 - vote for self
+ok 15 - subtests
+ *** raft_test_split_vote: done ***
*** main_f: done ***
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b5754cf78..1d7dd01ba 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -194,6 +194,7 @@ raft_node_create(struct raft_node *node)
node->cfg_election_quorum = 3;
node->cfg_death_timeout = 5;
node->cfg_instance_id = 1;
+ node->cfg_cluster_size = VCLOCK_MAX;
node->cfg_vclock = &node->journal.vclock;
raft_journal_create(&node->journal, node->cfg_instance_id);
raft_node_start(node);
@@ -365,6 +366,7 @@ raft_node_start(struct raft_node *node)
raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
raft_cfg_instance_id(&node->raft, node->cfg_instance_id);
+ raft_cfg_cluster_size(&node->raft, node->cfg_cluster_size);
raft_cfg_vclock(&node->raft, node->cfg_vclock);
raft_run_async_work();
}
@@ -423,6 +425,16 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
}
}
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value)
+{
+ node->cfg_cluster_size = value;
+ if (raft_node_is_started(node)) {
+ raft_cfg_cluster_size(&node->raft, value);
+ raft_run_async_work();
+ }
+}
+
void
raft_node_cfg_election_timeout(struct raft_node *node, double value)
{
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index c68dc3b22..2138a829e 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -32,6 +32,7 @@
#include "fakesys/fakeev.h"
#include "fiber.h"
#include "raft/raft.h"
+#include "raft/raft_ev.h"
#include "unit.h"
/** WAL simulation. It stores a list of rows which raft wanted to persist. */
@@ -105,6 +106,7 @@ struct raft_node {
int cfg_election_quorum;
double cfg_death_timeout;
uint32_t cfg_instance_id;
+ int cfg_cluster_size;
struct vclock *cfg_vclock;
};
@@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value);
void
raft_node_cfg_is_candidate(struct raft_node *node, bool value);
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value);
+
void
raft_node_cfg_election_timeout(struct raft_node *node, double value);
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 16+ messages in thread
* [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
` (2 preceding siblings ...)
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-15 0:48 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches
2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches
4 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-15 0:48 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
Raft needs to know cluster size in order to detect and handle
split vote. The patch uses registered server count as cluster
size.
It is not documented nor has a changelog file because this is an
optimization. Can't be observed except in logs or with a watch.
Closes #5285
---
src/box/raft.c | 4 +-
.../election_split_vote_test.lua | 92 +++++++++++++++++++
2 files changed, 95 insertions(+), 1 deletion(-)
create mode 100644 test/replication-luatest/election_split_vote_test.lua
diff --git a/src/box/raft.c b/src/box/raft.c
index 1e360dc88..1908b71b6 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -229,7 +229,9 @@ box_raft_update_election_quorum(void)
* be lost.
*/
int quorum = MIN(replication_synchro_quorum, max);
- raft_cfg_election_quorum(box_raft(), quorum);
+ struct raft *raft = box_raft();
+ raft_cfg_election_quorum(raft, quorum);
+ raft_cfg_cluster_size(raft, replicaset.registered_count);
}
void
diff --git a/test/replication-luatest/election_split_vote_test.lua b/test/replication-luatest/election_split_vote_test.lua
new file mode 100644
index 000000000..f31bfd7f3
--- /dev/null
+++ b/test/replication-luatest/election_split_vote_test.lua
@@ -0,0 +1,92 @@
+local t = require('luatest')
+local cluster = require('test.luatest_helpers.cluster')
+local helpers = require('test.luatest_helpers')
+local wait_timeout = 120
+
+--
+-- gh-5285: split vote is when in the current term there can't be winner of the
+-- leader role. Number of unused votes is not enough for anyone to get the
+-- quorum. It can be detected to speed up the term bump.
+--
+local g = t.group('split-vote')
+
+g.before_each(function()
+ g.cluster = cluster:new({})
+ local node1_uri = helpers.instance_uri('node1')
+ local node2_uri = helpers.instance_uri('node2')
+ local replication = {node1_uri, node2_uri}
+ local box_cfg = {
+ listen = node1_uri,
+ replication = replication,
+ -- To speed up new term when try to elect a first leader.
+ replication_timeout = 0.1,
+ replication_synchro_quorum = 2,
+ election_timeout = 1000000,
+ }
+ g.node1 = g.cluster:build_server({alias = 'node1', box_cfg = box_cfg})
+
+ box_cfg.listen = node2_uri
+ g.node2 = g.cluster:build_server({alias = 'node2', box_cfg = box_cfg})
+
+ g.cluster:add_server(g.node1)
+ g.cluster:add_server(g.node2)
+ g.cluster:start()
+end)
+
+g.after_each(function()
+ g.cluster:drop()
+end)
+
+g.test_split_vote = function(g)
+ -- Stop the replication so as the nodes can't request votes from each other.
+ local node1_repl = g.node1:exec(function()
+ local repl = box.cfg.replication
+ box.cfg{replication = {}}
+ return repl
+ end)
+ local node2_repl = g.node2:exec(function()
+ local repl = box.cfg.replication
+ box.cfg{replication = {}}
+ return repl
+ end)
+
+ -- Both vote for self but don't see the split-vote yet.
+ g.node1:exec(function()
+ box.cfg{election_mode = 'candidate'}
+ end)
+ g.node2:exec(function()
+ box.cfg{election_mode = 'candidate'}
+ end)
+
+ -- Wait for the votes to actually happen.
+ t.helpers.retrying({timeout = wait_timeout}, function()
+ local func = function()
+ return box.info.election.vote == box.info.id
+ end
+ assert(g.node1:exec(func))
+ assert(g.node2:exec(func))
+ end)
+
+ -- Now let the nodes notice the split vote.
+ g.node1:exec(function(repl)
+ box.cfg{replication = repl}
+ end, {node1_repl})
+ g.node2:exec(function(repl)
+ box.cfg{replication = repl}
+ end, {node2_repl})
+
+ t.helpers.retrying({timeout = wait_timeout}, function()
+ local msg = 'split vote is discovered'
+ assert(g.node1:grep_log(msg) or g.node2:grep_log(msg))
+ end)
+
+ -- Ensure a leader is eventually elected. Nothing is broken for good.
+ g.node1:exec(function()
+ box.cfg{election_timeout = 1}
+ end)
+ g.node2:exec(function()
+ box.cfg{election_timeout = 1}
+ end)
+ g.node1:wait_election_leader_found()
+ g.node2:wait_election_leader_found()
+end
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
` (3 preceding siblings ...)
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-16 14:10 ` Konstantin Osipov via Tarantool-patches
2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches
4 siblings, 1 reply; 16+ messages in thread
From: Konstantin Osipov via Tarantool-patches @ 2022-01-16 14:10 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
* Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [22/01/15 03:52]:
> Split vote handling in Raft, its usage in storage, and a bug fix found while
> working on this in the first commit.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
> Issue: https://github.com/tarantool/tarantool/issues/5285
While I don't mind this patch it is an unnecessary complication on
top of the spec. To prevent the deteriorating effects on liveness
from split vote one needs pre-voting, not split-vote detection.
> Vladislav Shpilevoy (4):
> raft: fix crash on election_timeout reconfig
> raft: track all votes, even not own
> raft: introduce split vote detection
> election: activate raft split vote handling
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote
2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches
@ 2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches
0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-17 22:57 UTC (permalink / raw)
To: Konstantin Osipov, tarantool-patches, sergepetrenko
On 16.01.2022 15:10, Konstantin Osipov wrote:
> * Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> [22/01/15 03:52]:
>> Split vote handling in Raft, its usage in storage, and a bug fix found while
>> working on this in the first commit.
>>
>> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
>> Issue: https://github.com/tarantool/tarantool/issues/5285
>
> While I don't mind this patch it is an unnecessary complication on
> top of the spec. To prevent the deteriorating effects on liveness
> from split vote one needs pre-voting, not split-vote detection.
I agree. I also don't like that patchset too much (not counting the first
commit, which fixes a crash). The pros are:
- can save some seconds in case of a split-vote;
- potentially can expose all the vote counts to the monitoring, not
just own count.
Cons is - too much hassle in the code just for that.
Pre-voting on the other hand solves a different problem. Only slightly
related.
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 0/4] Split vote
2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches
0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Osipov via Tarantool-patches @ 2022-01-18 7:18 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [22/01/18 02:16]:
> >> Split vote handling in Raft, its usage in storage, and a bug fix found while
> >> working on this in the first commit.
> >>
> >> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
> >> Issue: https://github.com/tarantool/tarantool/issues/5285
> >
> > While I don't mind this patch it is an unnecessary complication on
> > top of the spec. To prevent the deteriorating effects on liveness
> > from split vote one needs pre-voting, not split-vote detection.
>
> I agree. I also don't like that patchset too much (not counting the first
> commit, which fixes a crash). The pros are:
>
> - can save some seconds in case of a split-vote;
>
> - potentially can expose all the vote counts to the monitoring, not
> just own count.
>
> Cons is - too much hassle in the code just for that.
>
> Pre-voting on the other hand solves a different problem. Only slightly
> related.
Well, I don't doubt this hassle is not too harmful to the logic of
the protocol, but it's not solving all the liveness issues either.
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:12 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
15.01.2022 03:48, Vladislav Shpilevoy пишет:
> It used to crash if done during election on a node voted for
> anybody, it is a candidate, it doesn't know a leader yet, but has
> a WAL write in progress.
>
> Thus it could only happen if the term was bumped by a message from
> a non-leader node and wasn't flushed to the disk yet.
>
> The patch makes the reconfig check if there is a WAL write in
> progress. Then don't do anything.
>
> Could also check for volatile vote instead of persistent, but it
> would create the same problem for the case when started writing
> vote for self and didn't finish yet. Reconfig would crash.
> ---
> .../unreleased/election-timeout-cfg-crash.md | 5 ++
> src/lib/raft/raft.c | 3 +-
> test/unit/raft.c | 75 ++++++++++++++++++-
> test/unit/raft.result | 16 +++-
> 4 files changed, 96 insertions(+), 3 deletions(-)
> create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md
>
>
Good catch! LGTM.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches
2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:20 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
Thanks for the patch!
I don't think this optimisation is "too much of a hassle".
It's quite nice, and looks like a bunch of SLOC in the patch are used up
by verbose printing (I mean raft_scores_snprint).
In other words, I like the idea and I think we should have that on board.
(Just like pre-voting)
Please find my comments below.
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index 289d53fd5..5dcbc7821 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
> return cmp == 0 || cmp == 1;
> }
>
> -static inline void
> +static inline bool
> raft_add_vote(struct raft *raft, int src, int dst)
> {
> struct raft_vote *v = &raft->votes[src];
> if (v->did_vote)
> - return;
> + return false;
> v->did_vote = true;
> ++raft->votes[dst].count;
> + return true;
> +}
> +
You may check split_vote right in raft_add_vote:
simply track number of votes given in this term and
max votes given for one instance.
This way you won't have to run over all 32 nodes each time a vote
is casted.
> +static bool
> +raft_has_split_vote(const struct raft *raft)
> +{
> + int max_vote = 0;
> + int vote_vac = raft->cluster_size;
> + int quorum = raft->election_quorum;
> + for (int i = 0; i < VCLOCK_MAX; ++i) {
> + int count = raft->votes[i].count;
> + vote_vac -= count;
> + if (count > max_vote)
> + max_vote = count;
> + }
> + return max_vote < quorum && max_vote + vote_vac < quorum;
This is equal to `return max_vote + vote_vac < quorum`
> +}
> +
> +static int
> +raft_scores_snprintf(const struct raft *raft, char *buf, int size)
> +{
> + int total = 0;
> + bool is_empty = true;
> + SNPRINT(total, snprintf, buf, size, "{");
> + for (int i = 0; i < VCLOCK_MAX; ++i) {
> + int count = raft->votes[i].count;
> + if (count == 0)
> + continue;
> + if (!is_empty)
> + SNPRINT(total, snprintf, buf, size, ", ");
> + is_empty = false;
Nit: you may move is_empty = false into the 'else' branch.
> + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
> + }
> + SNPRINT(total, snprintf, buf, size, "}");
> + return total;
> +}
> +
...
>
> +static void
> +raft_check_split_vote(struct raft *raft)
> +{
> + /* When leader is known, there is no election. Thus no vote to split. */
> + if (raft->leader != 0)
> + return;
> + /* Not a candidate = can't trigger term bump anyway. */
> + if (!raft->is_candidate)
> + return;
> + /*
> + * WAL write in progress means the state is changing. All is rechecked
> + * when it is done.
> + */
> + if (raft->is_write_in_progress)
> + return;
> + if (!raft_has_split_vote(raft))
> + return;
> + assert(raft_ev_is_active(&raft->timer));
> + if (raft->timer.at < raft->election_timeout)
> + return;
I don't understand that. timer.at should point at current time,
shouldn't it?
> +
> + assert(raft->state == RAFT_STATE_FOLLOWER ||
> + raft->state == RAFT_STATE_CANDIDATE);
> + struct ev_loop *loop = raft_loop();
> + struct ev_timer *timer = &raft->timer;
> + double delay = raft_new_random_election_shift(raft);
> + /*
> + * Could be too late to speed up anything - probably the term is almost
> + * over anyway.
> + */
> + double remaining = raft_ev_timer_remaining(loop, timer);
> + if (delay >= remaining)
> + delay = remaining;
> + say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
> + raft_scores_str(raft), delay);
> + raft_ev_timer_stop(loop, timer);
> + raft_ev_timer_set(timer, delay, delay);
> + raft_ev_timer_start(loop, timer);
> +}
> +
> void
> raft_create(struct raft *raft, const struct raft_vtab *vtab)
> {
> @@ -1053,6 +1150,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
> .election_quorum = 1,
> .election_timeout = 5,
> .death_timeout = 5,
> + .cluster_size = VCLOCK_MAX,
> .vtab = vtab,
> };
> raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
...
>
--
Serge Petrenko
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches
2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-18 13:21 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
15.01.2022 03:48, Vladislav Shpilevoy пишет:
> Raft needs to know cluster size in order to detect and handle
> split vote. The patch uses registered server count as cluster
> size.
>
> It is not documented nor has a changelog file because this is an
> optimization. Can't be observed except in logs or with a watch.
>
> Closes #5285
Thanks for working on this!
Please, find one comment below.
> ---
> src/box/raft.c | 4 +-
> .../election_split_vote_test.lua | 92 +++++++++++++++++++
> 2 files changed, 95 insertions(+), 1 deletion(-)
> create mode 100644 test/replication-luatest/election_split_vote_test.lua
>
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 1e360dc88..1908b71b6 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -229,7 +229,9 @@ box_raft_update_election_quorum(void)
> * be lost.
> */
> int quorum = MIN(replication_synchro_quorum, max);
> - raft_cfg_election_quorum(box_raft(), quorum);
> + struct raft *raft = box_raft();
> + raft_cfg_election_quorum(raft, quorum);
> + raft_cfg_cluster_size(raft, replicaset.registered_count);
Better use `max` instead of `replicaset.registered_count` here.
I don't think something could go wrong with setting cluster size to zero
initially, but let's better avoid that.
> }
>
> void
...
--
Serge Petrenko
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection
2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches
@ 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 0:44 UTC (permalink / raw)
To: Serge Petrenko, tarantool-patches
Hi! Thanks for the review!
>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>> index 289d53fd5..5dcbc7821 100644
>> --- a/src/lib/raft/raft.c
>> +++ b/src/lib/raft/raft.c
>> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
>> return cmp == 0 || cmp == 1;
>> }
>> -static inline void
>> +static inline bool
>> raft_add_vote(struct raft *raft, int src, int dst)
>> {
>> struct raft_vote *v = &raft->votes[src];
>> if (v->did_vote)
>> - return;
>> + return false;
>> v->did_vote = true;
>> ++raft->votes[dst].count;
>> + return true;
>> +}
>> +
>
> You may check split_vote right in raft_add_vote:
> simply track number of votes given in this term and
> max votes given for one instance.
>
> This way you won't have to run over all 32 nodes each time a vote
> is casted.
I did the fullscan intentionally. Otherwise I need to introduce 2
new members to struct raft, keep them up to date, clear on term
bump. Too easy to miss something and introduce a bug. While in
the current version all the split-vote-specific details are in a
single function except for 'raft.votes' member. This thing I couldn't
get rid of.
As for perf, a couple of ifs or a loop over 32 structs - both would
take order of nanoseconds anyway. Here I wouldn't bother. Simplicity
matters most.
I did the proposal to see how it looks but then discarded as more
complex than necessary. However if you think it is still worth doing,
tell me and I will re-apply the diff.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index c4f5fb059..fcce3126a 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -159,23 +159,20 @@ raft_add_vote(struct raft *raft, int src, int dst)
if (v->did_vote)
return false;
v->did_vote = true;
- ++raft->votes[dst].count;
+ ++raft->voted_count;
+ int count = ++raft->votes[dst].count;
+ if (count > raft->max_vote_count)
+ raft->max_vote_count = count;
return true;
}
static bool
raft_has_split_vote(const struct raft *raft)
{
- int max_vote = 0;
- int vote_vac = raft->cluster_size;
- int quorum = raft->election_quorum;
- for (int i = 0; i < VCLOCK_MAX; ++i) {
- int count = raft->votes[i].count;
- vote_vac -= count;
- if (count > max_vote)
- max_vote = count;
- }
- return max_vote + vote_vac < quorum;
+ int vote_vac = raft->cluster_size - raft->voted_count;
+ if (vote_vac < 0)
+ vote_vac = 0;
+ return raft->max_vote_count + vote_vac < raft->election_quorum;
}
static int
@@ -730,6 +727,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
memset(raft->votes, 0, sizeof(raft->votes));
+ raft->max_vote_count = 0;
+ raft->voted_count = 0;
/*
* The instance could be promoted for the previous term. But promotion
* has no effect on following terms.
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 817148792..7115f658f 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -199,6 +199,10 @@ struct raft {
uint32_t vote;
/** Statistics which node voted for who. */
struct raft_vote votes[VCLOCK_MAX];
+ /** How many nodes voted in the current term. */
+ int voted_count;
+ /** Max vote count given to any node in the current term. */
+ int max_vote_count;
/** Number of votes necessary for successful election. */
int election_quorum;
/**
====================
>> +static bool
>> +raft_has_split_vote(const struct raft *raft)
>> +{
>> + int max_vote = 0;
>> + int vote_vac = raft->cluster_size;
>> + int quorum = raft->election_quorum;
>> + for (int i = 0; i < VCLOCK_MAX; ++i) {
>> + int count = raft->votes[i].count;
>> + vote_vac -= count;
>> + if (count > max_vote)
>> + max_vote = count;
>> + }
>> + return max_vote < quorum && max_vote + vote_vac < quorum;
>
> This is equal to `return max_vote + vote_vac < quorum`
Ouch, that was stupid indeed. I honestly did multiple self-reviews
and still missed it.
====================
@@ -175,7 +175,7 @@ raft_has_split_vote(const struct raft *raft)
if (count > max_vote)
max_vote = count;
}
- return max_vote < quorum && max_vote + vote_vac < quorum;
+ return max_vote + vote_vac < quorum;
}
====================
>> +}
>> +
>> +static int
>> +raft_scores_snprintf(const struct raft *raft, char *buf, int size)
>> +{
>> + int total = 0;
>> + bool is_empty = true;
>> + SNPRINT(total, snprintf, buf, size, "{");
>> + for (int i = 0; i < VCLOCK_MAX; ++i) {
>> + int count = raft->votes[i].count;
>> + if (count == 0)
>> + continue;
>> + if (!is_empty)
>> + SNPRINT(total, snprintf, buf, size, ", ");
>> + is_empty = false;
>
> Nit: you may move is_empty = false into the 'else' branch.
I put it here to write 1 line less. But I don't mind.
====================
@@ -190,7 +190,8 @@ raft_scores_snprintf(const struct raft *raft, char *buf, int size)
continue;
if (!is_empty)
SNPRINT(total, snprintf, buf, size, ", ");
- is_empty = false;
+ else
+ is_empty = false;
SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
}
====================
>> + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
>> + }
>> + SNPRINT(total, snprintf, buf, size, "}");
>> + return total;
>> +}
>> +
>
> ...
>
>> +static void
>> +raft_check_split_vote(struct raft *raft)
>> +{
>> + /* When leader is known, there is no election. Thus no vote to split. */
>> + if (raft->leader != 0)
>> + return;
>> + /* Not a candidate = can't trigger term bump anyway. */
>> + if (!raft->is_candidate)
>> + return;
>> + /*
>> + * WAL write in progress means the state is changing. All is rechecked
>> + * when it is done.
>> + */
>> + if (raft->is_write_in_progress)
>> + return;
>> + if (!raft_has_split_vote(raft))
>> + return;
>> + assert(raft_ev_is_active(&raft->timer));
>> + if (raft->timer.at < raft->election_timeout)
>> + return;
>
> I don't understand that. timer.at should point at current time, shouldn't it?
Yes, thanks for noticing. This is a bug. And in some other existing places too.
I made a separate commit fixing the existing places, see v2. And I fixed this
place below. Replaced with ev_timer.repeat.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 630f8c677..c4f5fb059 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -1124,7 +1124,7 @@ raft_check_split_vote(struct raft *raft)
if (!raft_has_split_vote(raft))
return;
assert(raft_ev_is_active(&raft->timer));
- if (raft->timer.at < raft->election_timeout)
+ if (raft->timer.repeat < raft->election_timeout)
return;
assert(raft->state == RAFT_STATE_FOLLOWER ||
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 9b462f755..f3ed453ed 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1532,8 +1532,8 @@ raft_test_split_vote(void)
3 /* Source. */
), 0, "vote response for 3 from 3");
- ok(node.raft.timer.at >= node.raft.election_timeout, "term timeout >= "
- "election timeout normally");
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "term "
+ "timeout >= election timeout normally");
is(raft_node_send_vote_response(&node,
2 /* Term. */,
@@ -1541,7 +1541,7 @@ raft_test_split_vote(void)
4 /* Source. */
), 0, "vote response for 3 from 4");
- ok(node.raft.timer.at < node.raft.election_timeout, "split vote "
+ ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
"reduced the term timeout");
raft_run_next_event();
@@ -1555,7 +1555,7 @@ raft_test_split_vote(void)
"{0: 2}" /* Vclock. */
), "a new term");
- ok(node.raft.timer.at >= node.raft.election_timeout, "timeout is "
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "timeout is "
"normal again");
/*
@@ -1574,11 +1574,11 @@ raft_test_split_vote(void)
2 /* Vote. */,
2 /* Source. */
), 0, "vote response for 2 from 2");
- ok(node.raft.timer.at >= node.raft.election_timeout, "the vote is not "
- "split yet");
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "the vote is "
+ "not split yet");
raft_node_cfg_cluster_size(&node, 2);
- ok(node.raft.timer.at < node.raft.election_timeout, "cluster size "
+ ok(node.raft.timer.repeat < node.raft.election_timeout, "cluster size "
"change makes split vote");
raft_run_next_event();
@@ -1615,8 +1615,8 @@ raft_test_split_vote(void)
is(raft_vote_count(&node.raft), 1, "the only own vote was from self");
raft_node_cfg_cluster_size(&node, 2);
- ok(node.raft.timer.at >= node.raft.death_timeout, "cluster change does "
- "not affect the leader's death waiting");
+ ok(node.raft.timer.repeat >= node.raft.death_timeout, "cluster change "
+ "does not affect the leader's death waiting");
/*
* Non-candidate should ignore split vote.
@@ -1659,7 +1659,7 @@ raft_test_split_vote(void)
2 /* Source. */
), 0, "vote response for 2 from 2");
- double delay = node.raft.timer.at;
+ double delay = node.raft.timer.repeat;
ok(delay < node.raft.election_timeout, "split vote is already "
"inevitable");
@@ -1669,8 +1669,8 @@ raft_test_split_vote(void)
3 /* Source. */
), 0, "vote response for 3 from 3");
- is(delay, node.raft.timer.at, "split vote got worse, but delay didn't "
- "change");
+ is(delay, node.raft.timer.repeat, "split vote got worse, but delay "
+ "didn't change");
/*
* Handle split vote when WAL write is in progress.
====================
Additionally, I disabled split vote detection for cluster < quorum.
It would only increase rate of term bumps and wouldn't solve the problem.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index c4f5fb059..098f60ed9 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -169,6 +169,12 @@ raft_has_split_vote(const struct raft *raft)
int max_vote = 0;
int vote_vac = raft->cluster_size;
int quorum = raft->election_quorum;
+ /*
+ * Quorum > cluster is either a misconfiguration or some instances
+ * didn't register yet. Anyway, speeding the elections up won't help.
+ */
+ if (vote_vac < quorum)
+ return false;
for (int i = 0; i < VCLOCK_MAX; ++i) {
int count = raft->votes[i].count;
vote_vac -= count;
diff --git a/test/unit/raft.c b/test/unit/raft.c
index f3ed453ed..e88d12a5c 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1500,7 +1500,7 @@ raft_test_promote_restore(void)
static void
raft_test_split_vote(void)
{
- raft_start_test(35);
+ raft_start_test(39);
struct raft_node node;
raft_node_create(&node);
@@ -1697,6 +1697,27 @@ raft_test_split_vote(void)
is(node.raft.term, 3, "bump term");
is(node.raft.vote, 1, "vote for self");
+ /*
+ * Split vote check is disabled when cluster size < quorum. Makes no
+ * sense to speed the elections up.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 1);
+ raft_node_cfg_election_quorum(&node, 2);
+
+ raft_run_next_event();
+ is(node.raft.term, 2, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+ "is not checked for cluster < quorum");
+
raft_node_destroy(&node);
raft_finish_test();
}
diff --git a/test/unit/raft.result b/test/unit/raft.result
index d12aae710..465bfb97d 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -263,7 +263,7 @@ ok 13 - subtests
ok 14 - subtests
*** raft_test_promote_restore: done ***
*** raft_test_split_vote ***
- 1..35
+ 1..39
ok 1 - elections with a new term
ok 2 - vote response for 1 from 2
ok 3 - vote response for 3 from 3
@@ -299,6 +299,10 @@ ok 14 - subtests
ok 33 - vote response for 2 from 2
ok 34 - bump term
ok 35 - vote for self
+ ok 36 - bump term
+ ok 37 - vote for self
+ ok 38 - vote response for 2 from 2
+ ok 39 - split vote is not checked for cluster < quorum
ok 15 - subtests
*** raft_test_split_vote: done ***
*** main_f: done ***
====================
Then I fixed a similar issue with voted node count > cluster size.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 098f60ed9..50e0dfe87 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -181,6 +181,13 @@ raft_has_split_vote(const struct raft *raft)
if (count > max_vote)
max_vote = count;
}
+ /*
+ * More nodes voted than there are nodes configured. The reason is the
+ * the same as with quorum > cluster. The action is also the same -
+ * faster term bumps won't help.
+ */
+ if (vote_vac < 0)
+ return false;
return max_vote + vote_vac < quorum;
}
diff --git a/test/unit/raft.c b/test/unit/raft.c
index e88d12a5c..3d5e2c777 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1500,7 +1500,7 @@ raft_test_promote_restore(void)
static void
raft_test_split_vote(void)
{
- raft_start_test(39);
+ raft_start_test(46);
struct raft_node node;
raft_node_create(&node);
@@ -1718,6 +1718,43 @@ raft_test_split_vote(void)
ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
"is not checked for cluster < quorum");
+ /*
+ * Split vote check is disabled when vote count > cluster size. The
+ * reason is the same as with quorum > cluster size - something is odd,
+ * more term bumps won't help.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 2);
+
+ raft_run_next_event();
+ is(node.raft.term, 2, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 3 /* Source. */
+ ), 0, "vote response for 2 from 3");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 3 /* Vote. */,
+ 4 /* Source. */
+ ), 0, "vote response for 3 from 4");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 4 /* Vote. */,
+ 5 /* Source. */
+ ), 0, "vote response for 4 from 5");
+
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+ "is not checked when vote count > cluster size");
+
raft_node_destroy(&node);
raft_finish_test();
}
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 465bfb97d..cce1891c4 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -263,7 +263,7 @@ ok 13 - subtests
ok 14 - subtests
*** raft_test_promote_restore: done ***
*** raft_test_split_vote ***
- 1..39
+ 1..46
ok 1 - elections with a new term
ok 2 - vote response for 1 from 2
ok 3 - vote response for 3 from 3
@@ -303,6 +303,13 @@ ok 14 - subtests
ok 37 - vote for self
ok 38 - vote response for 2 from 2
ok 39 - split vote is not checked for cluster < quorum
+ ok 40 - bump term
+ ok 41 - vote for self
+ ok 42 - vote response for 2 from 2
+ ok 43 - vote response for 2 from 3
+ ok 44 - vote response for 3 from 4
+ ok 45 - vote response for 4 from 5
+ ok 46 - split vote is not checked when vote count > cluster size
ok 15 - subtests
*** raft_test_split_vote: done ***
*** main_f: done ***
====================
Then I realized one of my tests wasn't accurate enough - split vote
during WAL write wasn't properly checked.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index f606f9e79..cd7c91c8b 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -835,6 +835,11 @@ raft_sm_wait_election_end(struct raft *raft)
raft_new_random_election_shift(raft);
raft_ev_timer_set(&raft->timer, election_timeout, election_timeout);
raft_ev_timer_start(raft_loop(), &raft->timer);
+ /*
+ * Could start the waiting after a WAL write during which the split vote
+ * could happen.
+ */
+ raft_check_split_vote(raft);
}
static void
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 3d5e2c777..4aa3f4959 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1500,7 +1500,7 @@ raft_test_promote_restore(void)
static void
raft_test_split_vote(void)
{
- raft_start_test(46);
+ raft_start_test(51);
struct raft_node node;
raft_node_create(&node);
@@ -1693,6 +1693,13 @@ raft_test_split_vote(void)
), 0, "vote response for 2 from 2");
raft_node_unblock(&node);
+ is(node.raft.term, 2, "new term");
+ is(node.raft.vote, 1, "voted for self");
+ is(node.raft.volatile_term, 2, "volatile term");
+ is(node.raft.volatile_vote, 1, "volatile vote");
+ ok(node.raft.timer.repeat < node.raft.election_timeout, "found split "
+ "vote after WAL write");
+
raft_run_next_event();
is(node.raft.term, 3, "bump term");
is(node.raft.vote, 1, "vote for self");
diff --git a/test/unit/raft.result b/test/unit/raft.result
index cce1891c4..3cebf2513 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -263,7 +263,7 @@ ok 13 - subtests
ok 14 - subtests
*** raft_test_promote_restore: done ***
*** raft_test_split_vote ***
- 1..46
+ 1..51
ok 1 - elections with a new term
ok 2 - vote response for 1 from 2
ok 3 - vote response for 3 from 3
@@ -297,19 +297,24 @@ ok 14 - subtests
ok 31 - new volatile term
ok 32 - new volatile vote
ok 33 - vote response for 2 from 2
- ok 34 - bump term
- ok 35 - vote for self
- ok 36 - bump term
- ok 37 - vote for self
- ok 38 - vote response for 2 from 2
- ok 39 - split vote is not checked for cluster < quorum
- ok 40 - bump term
- ok 41 - vote for self
- ok 42 - vote response for 2 from 2
- ok 43 - vote response for 2 from 3
- ok 44 - vote response for 3 from 4
- ok 45 - vote response for 4 from 5
- ok 46 - split vote is not checked when vote count > cluster size
+ ok 34 - new term
+ ok 35 - voted for self
+ ok 36 - volatile term
+ ok 37 - volatile vote
+ ok 38 - found split vote after WAL write
+ ok 39 - bump term
+ ok 40 - vote for self
+ ok 41 - bump term
+ ok 42 - vote for self
+ ok 43 - vote response for 2 from 2
+ ok 44 - split vote is not checked for cluster < quorum
+ ok 45 - bump term
+ ok 46 - vote for self
+ ok 47 - vote response for 2 from 2
+ ok 48 - vote response for 2 from 3
+ ok 49 - vote response for 3 from 4
+ ok 50 - vote response for 4 from 5
+ ok 51 - split vote is not checked when vote count > cluster size
ok 15 - subtests
*** raft_test_split_vote: done ***
*** main_f: done ***
====================
Then I realized I didn't check split vote on quorum change. When
it is increased, it can trigger split vote.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index cd7c91c8b..5dcc5beaf 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -1046,6 +1046,8 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
if (raft->state == RAFT_STATE_CANDIDATE &&
raft_vote_count(raft) >= raft->election_quorum)
raft_sm_become_leader(raft);
+ else
+ raft_check_split_vote(raft);
}
void
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 4aa3f4959..e264e864e 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1500,7 +1500,7 @@ raft_test_promote_restore(void)
static void
raft_test_split_vote(void)
{
- raft_start_test(51);
+ raft_start_test(58);
struct raft_node node;
raft_node_create(&node);
@@ -1762,6 +1762,34 @@ raft_test_split_vote(void)
ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
"is not checked when vote count > cluster size");
+ /*
+ * Split vote can happen if quorum was suddenly increased.
+ */
+ raft_node_destroy(&node);
+ raft_node_create(&node);
+ raft_node_cfg_cluster_size(&node, 3);
+ raft_node_cfg_election_quorum(&node, 2);
+
+ raft_run_next_event();
+ is(node.raft.term, 2, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+ is(raft_node_send_vote_response(&node,
+ 2 /* Term. */,
+ 2 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote response for 2 from 2");
+
+ ok(node.raft.timer.repeat >= node.raft.election_timeout, "not split "
+ "vote yet");
+
+ raft_node_cfg_election_quorum(&node, 3);
+ ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
+ "after quorum increase");
+
+ raft_run_next_event();
+ is(node.raft.term, 3, "bump term");
+ is(node.raft.vote, 1, "vote for self");
+
raft_node_destroy(&node);
raft_finish_test();
}
====================
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling
2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches
@ 2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 0:44 UTC (permalink / raw)
To: Serge Petrenko, tarantool-patches
Thanks for the review!
>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index 1e360dc88..1908b71b6 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -229,7 +229,9 @@ box_raft_update_election_quorum(void)
>> * be lost.
>> */
>> int quorum = MIN(replication_synchro_quorum, max);
>> - raft_cfg_election_quorum(box_raft(), quorum);
>> + struct raft *raft = box_raft();
>> + raft_cfg_election_quorum(raft, quorum);
>> + raft_cfg_cluster_size(raft, replicaset.registered_count);
>
> Better use `max` instead of `replicaset.registered_count` here.
> I don't think something could go wrong with setting cluster size to zero
> initially, but let's better avoid that.
Done:
====================
diff --git a/src/box/raft.c b/src/box/raft.c
index 1908b71b6..be6009cc1 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -231,7 +231,7 @@ box_raft_update_election_quorum(void)
int quorum = MIN(replication_synchro_quorum, max);
struct raft *raft = box_raft();
raft_cfg_election_quorum(raft, quorum);
- raft_cfg_cluster_size(raft, replicaset.registered_count);
+ raft_cfg_cluster_size(raft, max);
}
void
====================
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection
2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches
2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-20 10:21 UTC (permalink / raw)
To: Vladislav Shpilevoy, tarantool-patches
20.01.2022 03:44, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>>> index 289d53fd5..5dcbc7821 100644
>>> --- a/src/lib/raft/raft.c
>>> +++ b/src/lib/raft/raft.c
>>> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
>>> return cmp == 0 || cmp == 1;
>>> }
>>> -static inline void
>>> +static inline bool
>>> raft_add_vote(struct raft *raft, int src, int dst)
>>> {
>>> struct raft_vote *v = &raft->votes[src];
>>> if (v->did_vote)
>>> - return;
>>> + return false;
>>> v->did_vote = true;
>>> ++raft->votes[dst].count;
>>> + return true;
>>> +}
>>> +
>> You may check split_vote right in raft_add_vote:
>> simply track number of votes given in this term and
>> max votes given for one instance.
>>
>> This way you won't have to run over all 32 nodes each time a vote
>> is casted.
> I did the fullscan intentionally. Otherwise I need to introduce 2
> new members to struct raft, keep them up to date, clear on term
> bump. Too easy to miss something and introduce a bug. While in
> the current version all the split-vote-specific details are in a
> single function except for 'raft.votes' member. This thing I couldn't
> get rid of.
>
> As for perf, a couple of ifs or a loop over 32 structs - both would
> take order of nanoseconds anyway. Here I wouldn't bother. Simplicity
> matters most.
>
> I did the proposal to see how it looks but then discarded as more
> complex than necessary. However if you think it is still worth doing,
> tell me and I will re-apply the diff.
Thanks for trying this out!
TBH, this diff is exactly what I wanted and it still looks better
(not only preformance-wise, but simpler as well) in my opinion.
I understand your point about having to update 2 extra members every
now and then, so feel free to choose any option you like.
>
> ====================
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index c4f5fb059..fcce3126a 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -159,23 +159,20 @@ raft_add_vote(struct raft *raft, int src, int dst)
> if (v->did_vote)
> return false;
> v->did_vote = true;
> - ++raft->votes[dst].count;
> + ++raft->voted_count;
> + int count = ++raft->votes[dst].count;
> + if (count > raft->max_vote_count)
> + raft->max_vote_count = count;
> return true;
> }
>
> static bool
> raft_has_split_vote(const struct raft *raft)
> {
> - int max_vote = 0;
> - int vote_vac = raft->cluster_size;
> - int quorum = raft->election_quorum;
> - for (int i = 0; i < VCLOCK_MAX; ++i) {
> - int count = raft->votes[i].count;
> - vote_vac -= count;
> - if (count > max_vote)
> - max_vote = count;
> - }
> - return max_vote + vote_vac < quorum;
> + int vote_vac = raft->cluster_size - raft->voted_count;
> + if (vote_vac < 0)
> + vote_vac = 0;
> + return raft->max_vote_count + vote_vac < raft->election_quorum;
> }
>
> static int
> @@ -730,6 +727,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
> raft->leader = 0;
> raft->state = RAFT_STATE_FOLLOWER;
> memset(raft->votes, 0, sizeof(raft->votes));
> + raft->max_vote_count = 0;
> + raft->voted_count = 0;
> /*
> * The instance could be promoted for the previous term. But promotion
> * has no effect on following terms.
> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
> index 817148792..7115f658f 100644
> --- a/src/lib/raft/raft.h
> +++ b/src/lib/raft/raft.h
> @@ -199,6 +199,10 @@ struct raft {
> uint32_t vote;
> /** Statistics which node voted for who. */
> struct raft_vote votes[VCLOCK_MAX];
> + /** How many nodes voted in the current term. */
> + int voted_count;
> + /** Max vote count given to any node in the current term. */
> + int max_vote_count;
> /** Number of votes necessary for successful election. */
> int election_quorum;
> /**
> ====================
>
>>> +static bool
>>> +raft_has_split_vote(const struct raft *raft)
>>> +{
>>> + int max_vote = 0;
>>> + int vote_vac = raft->cluster_size;
>>> + int quorum = raft->election_quorum;
>>> + for (int i = 0; i < VCLOCK_MAX; ++i) {
>>> + int count = raft->votes[i].count;
>>> + vote_vac -= count;
>>> + if (count > max_vote)
>>> + max_vote = count;
>>> + }
>>> + return max_vote < quorum && max_vote + vote_vac < quorum;
>> This is equal to `return max_vote + vote_vac < quorum`
> Ouch, that was stupid indeed. I honestly did multiple self-reviews
> and still missed it.
Don't bother. Could happen to anyone.
> ====================
> @@ -175,7 +175,7 @@ raft_has_split_vote(const struct raft *raft)
> if (count > max_vote)
> max_vote = count;
> }
> - return max_vote < quorum && max_vote + vote_vac < quorum;
> + return max_vote + vote_vac < quorum;
> }
> ====================
>
...
--
Serge Petrenko
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection
2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches
@ 2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 23:02 UTC (permalink / raw)
To: Serge Petrenko, tarantool-patches
>>>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>>>> index 289d53fd5..5dcbc7821 100644
>>>> --- a/src/lib/raft/raft.c
>>>> +++ b/src/lib/raft/raft.c
>>>> @@ -152,20 +152,69 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
>>>> return cmp == 0 || cmp == 1;
>>>> }
>>>> -static inline void
>>>> +static inline bool
>>>> raft_add_vote(struct raft *raft, int src, int dst)
>>>> {
>>>> struct raft_vote *v = &raft->votes[src];
>>>> if (v->did_vote)
>>>> - return;
>>>> + return false;
>>>> v->did_vote = true;
>>>> ++raft->votes[dst].count;
>>>> + return true;
>>>> +}
>>>> +
>>> You may check split_vote right in raft_add_vote:
>>> simply track number of votes given in this term and
>>> max votes given for one instance.
>>>
>>> This way you won't have to run over all 32 nodes each time a vote
>>> is casted.
>> I did the fullscan intentionally. Otherwise I need to introduce 2
>> new members to struct raft, keep them up to date, clear on term
>> bump. Too easy to miss something and introduce a bug. While in
>> the current version all the split-vote-specific details are in a
>> single function except for 'raft.votes' member. This thing I couldn't
>> get rid of.
>>
>> As for perf, a couple of ifs or a loop over 32 structs - both would
>> take order of nanoseconds anyway. Here I wouldn't bother. Simplicity
>> matters most.
>>
>> I did the proposal to see how it looks but then discarded as more
>> complex than necessary. However if you think it is still worth doing,
>> tell me and I will re-apply the diff.
>
> Thanks for trying this out!
>
> TBH, this diff is exactly what I wanted and it still looks better
> (not only preformance-wise, but simpler as well) in my opinion.
>
> I understand your point about having to update 2 extra members every
> now and then, so feel free to choose any option you like.
I applied this diff:
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 5dcc5beaf..90ed01ca4 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -159,36 +159,29 @@ raft_add_vote(struct raft *raft, int src, int dst)
if (v->did_vote)
return false;
v->did_vote = true;
- ++raft->votes[dst].count;
+ ++raft->voted_count;
+ int count = ++raft->votes[dst].count;
+ if (count > raft->max_vote)
+ raft->max_vote = count;
return true;
}
static bool
raft_has_split_vote(const struct raft *raft)
{
- int max_vote = 0;
int vote_vac = raft->cluster_size;
int quorum = raft->election_quorum;
/*
* Quorum > cluster is either a misconfiguration or some instances
* didn't register yet. Anyway, speeding the elections up won't help.
+ * The same when more nodes voted than there are nodes configured.
*/
if (vote_vac < quorum)
return false;
- for (int i = 0; i < VCLOCK_MAX; ++i) {
- int count = raft->votes[i].count;
- vote_vac -= count;
- if (count > max_vote)
- max_vote = count;
- }
- /*
- * More nodes voted than there are nodes configured. The reason is the
- * the same as with quorum > cluster. The action is also the same -
- * faster term bumps won't help.
- */
+ vote_vac -= raft->voted_count;
if (vote_vac < 0)
return false;
- return max_vote + vote_vac < quorum;
+ return raft->max_vote + vote_vac < quorum;
}
static int
@@ -743,6 +736,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
memset(raft->votes, 0, sizeof(raft->votes));
+ raft->voted_count = 0;
+ raft->max_vote = 0;
/*
* The instance could be promoted for the previous term. But promotion
* has no effect on following terms.
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 817148792..05e373254 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -199,6 +199,10 @@ struct raft {
uint32_t vote;
/** Statistics which node voted for who. */
struct raft_vote votes[VCLOCK_MAX];
+ /** How many nodes voted in the current term. */
+ int voted_count;
+ /** Max vote count given to any node in the current term. */
+ int max_vote;
/** Number of votes necessary for successful election. */
int election_quorum;
/**
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-21 0:42 UTC (permalink / raw)
To: tarantool-patches, sergepetrenko
I see there is a crash in CI on this commit. I found and fixed the reason.
See diff and full new commit below.
====================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 8182e73e3..01b7d469f 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -700,8 +700,8 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote)
assert(raft->volatile_vote == 0);
assert(raft->leader == 0);
assert(raft->state == RAFT_STATE_FOLLOWER);
- raft->volatile_vote = new_vote;
assert(!raft->votes[raft->self].did_vote);
+ raft->volatile_vote = new_vote;
raft_add_vote(raft, raft->self, raft->self);
raft_sm_pause_and_dump(raft);
/* Nothing visible is changed - no broadcast. */
@@ -1012,8 +1012,8 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
* more than once during join. Here instance ID is configured when it is
* known forever and is safe to use.
*/
- if (raft->vote != 0)
- raft_add_vote(raft, instance_id, raft->vote);
+ if (raft->volatile_vote != 0)
+ raft_add_vote(raft, instance_id, raft->volatile_vote);
}
void
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 9f18f4d8e..1a3c81c64 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1497,10 +1497,75 @@ raft_test_promote_restore(void)
raft_finish_test();
}
+static void
+raft_test_bump_term_before_cfg()
+{
+ raft_start_test(6);
+ struct raft_node node;
+ raft_node_create(&node);
+ /*
+ * Term bump is started between recovery and instance ID configuration
+ * but WAL write is not finished yet.
+ */
+ raft_run_next_event();
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "new term is started with vote for self");
+
+ raft_node_stop(&node);
+ raft_node_recover(&node);
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ NULL /* Vclock. */
+ ), "recovered");
+ is(node.raft.self, 0, "instance id is unknown");
+
+ raft_node_block(&node);
+ is(raft_node_send_follower(&node,
+ 3 /* Term. */,
+ 2 /* Source. */
+ ), 0, "bump term externally");
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 0 /* Volatile vote. */,
+ NULL /* Vclock. */
+ ), "term write is in progress");
+
+ raft_node_cfg(&node);
+ raft_node_unblock(&node);
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 3 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 3}" /* Vclock. */
+ ), "started new term");
+
+ raft_node_destroy(&node);
+ raft_finish_test();
+}
+
static int
main_f(va_list ap)
{
- raft_start_test(14);
+ raft_start_test(15);
(void) ap;
fakeev_init();
@@ -1519,6 +1584,7 @@ main_f(va_list ap)
raft_test_enable_disable();
raft_test_too_long_wal_write();
raft_test_promote_restore();
+ raft_test_bump_term_before_cfg();
fakeev_free();
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b5754cf78..daf43797f 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -224,13 +224,18 @@ raft_node_check_full_state(const struct raft_node *node, enum raft_state state,
{
assert(raft_node_is_started(node));
const struct raft *raft = &node->raft;
- struct vclock v;
- raft_vclock_from_string(&v, vclock);
+ if (vclock != NULL) {
+ struct vclock v;
+ raft_vclock_from_string(&v, vclock);
+ if (vclock_compare(&v, raft->vclock) != 0)
+ return false;
+ } else if (raft->vclock != NULL) {
+ return false;
+ }
return raft->state == state && raft->leader == leader &&
raft->term == term && raft->vote == vote &&
raft->volatile_term == volatile_term &&
- raft->volatile_vote == volatile_vote &&
- vclock_compare(&v, raft->vclock) == 0;
+ raft->volatile_vote == volatile_vote;
}
bool
@@ -342,6 +347,13 @@ raft_node_stop(struct raft_node *node)
void
raft_node_start(struct raft_node *node)
+{
+ raft_node_recover(node);
+ raft_node_cfg(node);
+}
+
+void
+raft_node_recover(struct raft_node *node)
{
assert(!raft_node_is_started(node));
@@ -358,6 +370,13 @@ raft_node_start(struct raft_node *node)
for (int i = 0; i < node->journal.size; ++i)
raft_process_recovery(&node->raft, &node->journal.rows[i]);
+ raft_run_async_work();
+}
+
+void
+raft_node_cfg(struct raft_node *node)
+{
+ assert(raft_node_is_started(node));
raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index c68dc3b22..8d400923e 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -200,6 +200,14 @@ raft_node_stop(struct raft_node *node);
void
raft_node_start(struct raft_node *node);
+/** Start the node but not configure it yet. Only recover. */
+void
+raft_node_recover(struct raft_node *node);
+
+/** Apply the entire Raft config to a started. */
+void
+raft_node_cfg(struct raft_node *node);
+
/** Block async work execution. */
void
raft_node_block(struct raft_node *node);
====================
Full new commit:
==================================================
raft: track all votes, even not own
To detect split vote a node needs to see that number of free votes
is not enough for anyone to win even if it gets them all.
Hence every node needs to count votes for all other nodes.
The patch makes raft store votes in a bit more complicated manner
than a simple bitmap for just the current instance.
Part of #5285
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index d6a849503..01b7d469f 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,6 +152,16 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
return cmp == 0 || cmp == 1;
}
+static inline void
+raft_add_vote(struct raft *raft, int src, int dst)
+{
+ struct raft_vote *v = &raft->votes[src];
+ if (v->did_vote)
+ return;
+ v->did_vote = true;
+ ++raft->votes[dst].count;
+}
+
/** Schedule broadcast of the complete Raft state to all the followers. */
static void
raft_schedule_broadcast(struct raft *raft);
@@ -279,6 +289,12 @@ void
raft_process_recovery(struct raft *raft, const struct raft_msg *req)
{
say_verbose("RAFT: recover %s", raft_msg_to_string(req));
+ /*
+ * Instance ID is not unknown until recovery ends. Because apparently it
+ * can change during join. In Raft it is set only one time when recovery
+ * ends for good.
+ */
+ assert(raft->self == 0);
if (req->term != 0) {
raft->term = req->term;
raft->volatile_term = req->term;
@@ -335,6 +351,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
* persisted long time ago and still broadcasted. Or a vote response.
*/
if (req->vote != 0) {
+ raft_add_vote(raft, source, req->vote);
+
switch (raft->state) {
case RAFT_STATE_FOLLOWER:
case RAFT_STATE_LEADER:
@@ -395,11 +413,10 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
* and now was answered by some other instance.
*/
assert(raft->volatile_vote == raft->self);
- bool was_set = bit_set(&raft->vote_mask, source);
- raft->vote_count += !was_set;
- if (raft->vote_count < raft->election_quorum) {
+ int vote_count = raft_vote_count(raft);
+ if (vote_count < raft->election_quorum) {
say_info("RAFT: accepted vote for self, vote "
- "count is %d/%d", raft->vote_count,
+ "count is %d/%d", vote_count,
raft->election_quorum);
break;
}
@@ -640,13 +657,11 @@ raft_sm_become_candidate(struct raft *raft)
assert(raft->state == RAFT_STATE_FOLLOWER);
assert(raft->leader == 0);
assert(raft->vote == raft->self);
+ assert(raft_vote_count(raft) >= 1);
assert(raft->is_candidate);
assert(!raft->is_write_in_progress);
assert(raft->election_quorum > 1);
raft->state = RAFT_STATE_CANDIDATE;
- raft->vote_count = 1;
- raft->vote_mask = 0;
- bit_set(&raft->vote_mask, raft->self);
raft_sm_wait_election_end(raft);
/* State is visible and it is changed - broadcast. */
raft_schedule_broadcast(raft);
@@ -663,6 +678,7 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
raft->volatile_vote = 0;
raft->leader = 0;
raft->state = RAFT_STATE_FOLLOWER;
+ memset(raft->votes, 0, sizeof(raft->votes));
/*
* The instance could be promoted for the previous term. But promotion
* has no effect on following terms.
@@ -684,7 +700,9 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote)
assert(raft->volatile_vote == 0);
assert(raft->leader == 0);
assert(raft->state == RAFT_STATE_FOLLOWER);
+ assert(!raft->votes[raft->self].did_vote);
raft->volatile_vote = new_vote;
+ raft_add_vote(raft, raft->self, raft->self);
raft_sm_pause_and_dump(raft);
/* Nothing visible is changed - no broadcast. */
}
@@ -957,7 +975,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
assert(election_quorum > 0);
raft->election_quorum = election_quorum;
if (raft->state == RAFT_STATE_CANDIDATE &&
- raft->vote_count >= raft->election_quorum)
+ raft_vote_count(raft) >= raft->election_quorum)
raft_sm_become_leader(raft);
}
@@ -989,6 +1007,13 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
assert(raft->self == 0);
assert(instance_id != 0);
raft->self = instance_id;
+ /*
+ * Couldn't do that reliably during recovery. Instance ID can change
+ * more than once during join. Here instance ID is configured when it is
+ * known forever and is safe to use.
+ */
+ if (raft->volatile_vote != 0)
+ raft_add_vote(raft, instance_id, raft->volatile_vote);
}
void
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e844c2d07..1d064bef0 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -138,6 +138,14 @@ struct raft_vtab {
raft_schedule_async_f schedule_async;
};
+/** Vote descriptor of a single node. */
+struct raft_vote {
+ /** Whether the node voted for anybody. */
+ bool did_vote;
+ /** How many votes the node got in the current term. */
+ int count;
+};
+
struct raft {
/** Instance ID of this node. */
uint32_t self;
@@ -189,13 +197,8 @@ struct raft {
*/
uint64_t term;
uint32_t vote;
- /**
- * Bit 1 on position N means that a vote from instance with ID = N was
- * obtained.
- */
- vclock_map_t vote_mask;
- /** Number of votes for this instance. Valid only in candidate state. */
- int vote_count;
+ /** Statistics which node voted for who. */
+ struct raft_vote votes[VCLOCK_MAX];
/** Number of votes necessary for successful election. */
int election_quorum;
/**
@@ -243,6 +246,13 @@ raft_is_enabled(const struct raft *raft)
return raft->is_enabled;
}
+/** Number of votes for self. */
+static inline int
+raft_vote_count(const struct raft *raft)
+{
+ return raft->votes[raft->self].count;
+}
+
/** Process a raft entry stored in WAL/snapshot. */
void
raft_process_recovery(struct raft *raft, const struct raft_msg *req);
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 520b94466..1a3c81c64 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -70,7 +70,7 @@ raft_test_leader_election(void)
1 /* Volatile vote. */,
"{0: 1}" /* Vclock. */
), "elections with a new term");
- is(node.raft.vote_count, 1, "single vote for self");
+ is(raft_vote_count(&node.raft), 1, "single vote for self");
ok(node.update_count > 0, "trigger worked");
node.update_count = 0;
@@ -100,7 +100,7 @@ raft_test_leader_election(void)
1 /* Vote. */,
2 /* Source. */
), 0, "vote response from 2");
- is(node.raft.vote_count, 2, "2 votes - 1 self and 1 foreign");
+ is(raft_vote_count(&node.raft), 2, "2 votes - 1 self and 1 foreign");
ok(!node.has_work, "no work to do - not enough votes yet");
raft_run_for(node.cfg_election_timeout / 2);
@@ -115,7 +115,7 @@ raft_test_leader_election(void)
1 /* Vote. */,
3 /* Source. */
), 0, "vote response from 3");
- is(node.raft.vote_count, 3, "2 votes - 1 self and 2 foreign");
+ is(raft_vote_count(&node.raft), 3, "2 votes - 1 self and 2 foreign");
is(node.raft.state, RAFT_STATE_LEADER, "became leader");
ok(node.update_count > 0, "trigger worked");
node.update_count = 0;
@@ -142,7 +142,7 @@ raft_test_leader_election(void)
static void
raft_test_recovery(void)
{
- raft_start_test(12);
+ raft_start_test(13);
struct raft_msg msg;
struct raft_node node;
raft_node_create(&node);
@@ -224,6 +224,8 @@ raft_test_recovery(void)
"{0: 1}" /* Vclock. */
), "restart always as a follower");
+ is(raft_vote_count(&node.raft), 1, "vote count is restored correctly");
+
raft_checkpoint_remote(&node.raft, &msg);
ok(raft_msg_check(&msg,
RAFT_STATE_FOLLOWER /* State. */,
@@ -383,7 +385,7 @@ raft_test_vote_skip(void)
1 /* Vote. */,
2 /* Source. */
), 0, "message is accepted");
- is(node.raft.vote_count, 1, "but ignored - too old term");
+ is(raft_vote_count(&node.raft), 1, "but ignored - too old term");
/* Competing vote requests are skipped. */
@@ -392,7 +394,8 @@ raft_test_vote_skip(void)
3 /* Vote. */,
2 /* Source. */
), 0, "message is accepted");
- is(node.raft.vote_count, 1, "but ignored - vote not for this node");
+ is(raft_vote_count(&node.raft), 1,
+ "but ignored - vote not for this node");
is(node.raft.state, RAFT_STATE_CANDIDATE, "this node does not give up");
/* Vote requests are ignored when node is disabled. */
@@ -1071,7 +1074,7 @@ raft_test_election_quorum(void)
1 /* Vote. */,
2 /* Source. */
), 0, "send vote response from second node");
- is(node.raft.vote_count, 2, "vote is accepted");
+ is(raft_vote_count(&node.raft), 2, "vote is accepted");
is(node.raft.state, RAFT_STATE_CANDIDATE, "but still candidate");
raft_node_cfg_election_quorum(&node, 2);
@@ -1494,10 +1497,75 @@ raft_test_promote_restore(void)
raft_finish_test();
}
+static void
+raft_test_bump_term_before_cfg()
+{
+ raft_start_test(6);
+ struct raft_node node;
+ raft_node_create(&node);
+ /*
+ * Term bump is started between recovery and instance ID configuration
+ * but WAL write is not finished yet.
+ */
+ raft_run_next_event();
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "new term is started with vote for self");
+
+ raft_node_stop(&node);
+ raft_node_recover(&node);
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ NULL /* Vclock. */
+ ), "recovered");
+ is(node.raft.self, 0, "instance id is unknown");
+
+ raft_node_block(&node);
+ is(raft_node_send_follower(&node,
+ 3 /* Term. */,
+ 2 /* Source. */
+ ), 0, "bump term externally");
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 0 /* Volatile vote. */,
+ NULL /* Vclock. */
+ ), "term write is in progress");
+
+ raft_node_cfg(&node);
+ raft_node_unblock(&node);
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_CANDIDATE /* State. */,
+ 0 /* Leader. */,
+ 3 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 3}" /* Vclock. */
+ ), "started new term");
+
+ raft_node_destroy(&node);
+ raft_finish_test();
+}
+
static int
main_f(va_list ap)
{
- raft_start_test(14);
+ raft_start_test(15);
(void) ap;
fakeev_init();
@@ -1516,6 +1584,7 @@ main_f(va_list ap)
raft_test_enable_disable();
raft_test_too_long_wal_write();
raft_test_promote_restore();
+ raft_test_bump_term_before_cfg();
fakeev_free();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 764d82969..5e63e523d 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..14
+1..15
*** raft_test_leader_election ***
1..24
ok 1 - 1 pending message at start
@@ -29,7 +29,7 @@
ok 1 - subtests
*** raft_test_leader_election: done ***
*** raft_test_recovery ***
- 1..12
+ 1..13
ok 1 - became candidate
ok 2 - remote checkpoint of a candidate
ok 3 - local checkpoint of a candidate
@@ -40,8 +40,9 @@ ok 1 - subtests
ok 8 - remote checkpoint of a leader
ok 9 - local checkpoint of a leader
ok 10 - restart always as a follower
- ok 11 - remote checkpoint of a leader
- ok 12 - local checkpoint of a leader
+ ok 11 - vote count is restored correctly
+ ok 12 - remote checkpoint of a leader
+ ok 13 - local checkpoint of a leader
ok 2 - subtests
*** raft_test_recovery: done ***
*** raft_test_bad_msg ***
@@ -261,4 +262,14 @@ ok 13 - subtests
ok 12 - not a candidate
ok 14 - subtests
*** raft_test_promote_restore: done ***
+ *** raft_test_bump_term_before_cfg ***
+ 1..6
+ ok 1 - new term is started with vote for self
+ ok 2 - recovered
+ ok 3 - instance id is unknown
+ ok 4 - bump term externally
+ ok 5 - term write is in progress
+ ok 6 - started new term
+ok 15 - subtests
+ *** raft_test_bump_term_before_cfg: done ***
*** main_f: done ***
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b5754cf78..daf43797f 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -224,13 +224,18 @@ raft_node_check_full_state(const struct raft_node *node, enum raft_state state,
{
assert(raft_node_is_started(node));
const struct raft *raft = &node->raft;
- struct vclock v;
- raft_vclock_from_string(&v, vclock);
+ if (vclock != NULL) {
+ struct vclock v;
+ raft_vclock_from_string(&v, vclock);
+ if (vclock_compare(&v, raft->vclock) != 0)
+ return false;
+ } else if (raft->vclock != NULL) {
+ return false;
+ }
return raft->state == state && raft->leader == leader &&
raft->term == term && raft->vote == vote &&
raft->volatile_term == volatile_term &&
- raft->volatile_vote == volatile_vote &&
- vclock_compare(&v, raft->vclock) == 0;
+ raft->volatile_vote == volatile_vote;
}
bool
@@ -342,6 +347,13 @@ raft_node_stop(struct raft_node *node)
void
raft_node_start(struct raft_node *node)
+{
+ raft_node_recover(node);
+ raft_node_cfg(node);
+}
+
+void
+raft_node_recover(struct raft_node *node)
{
assert(!raft_node_is_started(node));
@@ -358,6 +370,13 @@ raft_node_start(struct raft_node *node)
for (int i = 0; i < node->journal.size; ++i)
raft_process_recovery(&node->raft, &node->journal.rows[i]);
+ raft_run_async_work();
+}
+
+void
+raft_node_cfg(struct raft_node *node)
+{
+ assert(raft_node_is_started(node));
raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index c68dc3b22..8d400923e 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -200,6 +200,14 @@ raft_node_stop(struct raft_node *node);
void
raft_node_start(struct raft_node *node);
+/** Start the node but not configure it yet. Only recover. */
+void
+raft_node_recover(struct raft_node *node);
+
+/** Apply the entire Raft config to a started. */
+void
+raft_node_cfg(struct raft_node *node);
+
/** Block async work execution. */
void
raft_node_block(struct raft_node *node);
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2022-01-21 0:42 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-15 0:48 [Tarantool-patches] [PATCH 0/4] Split vote Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 1/4] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:12 ` Serge Petrenko via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 2/4] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
2022-01-21 0:42 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 3/4] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:20 ` Serge Petrenko via Tarantool-patches
2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-20 10:21 ` Serge Petrenko via Tarantool-patches
2022-01-20 23:02 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-15 0:48 ` [Tarantool-patches] [PATCH 4/4] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
2022-01-18 13:21 ` Serge Petrenko via Tarantool-patches
2022-01-20 0:44 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-16 14:10 ` [Tarantool-patches] [PATCH 0/4] Split vote Konstantin Osipov via Tarantool-patches
2022-01-17 22:57 ` Vladislav Shpilevoy via Tarantool-patches
2022-01-18 7:18 ` Konstantin Osipov via Tarantool-patches
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox