Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs
@ 2022-01-20  0:43 Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 1/5] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
                   ` (6 more replies)
  0 siblings, 7 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Split vote handling in Raft, its usage in storage, and 2 bug fixes found while
working on this.

Changes in v2:
- dropped ev_timer.at usage;
- disabled split-vote detection for cluster < quorum;
- disabled split-vote for vote count > cluster;
- re-checked split vote after WAL write;
- re-checked split vote on quorum change.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
Issue: https://github.com/tarantool/tarantool/issues/5285

Vladislav Shpilevoy (5):
  raft: fix crash on election_timeout reconfig
  raft: fix ev_timer.at incorrect usage
  raft: track all votes, even not own
  raft: introduce split vote detection
  election: activate raft split vote handling

 .../unreleased/election-timeout-cfg-crash.md  |   5 +
 src/box/raft.c                                |   4 +-
 src/lib/fakesys/fakeev.c                      |  10 +-
 src/lib/raft/raft.c                           | 222 ++++++++--
 src/lib/raft/raft.h                           |  32 +-
 .../election_split_vote_test.lua              |  92 ++++
 test/unit/raft.c                              | 393 +++++++++++++++++-
 test/unit/raft.result                         |  87 +++-
 test/unit/raft_test_utils.c                   |  12 +
 test/unit/raft_test_utils.h                   |   5 +
 10 files changed, 803 insertions(+), 59 deletions(-)
 create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md
 create mode 100644 test/replication-luatest/election_split_vote_test.lua

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH v2 1/5] raft: fix crash on election_timeout reconfig
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20  0:43 ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 2/5] raft: fix ev_timer.at incorrect usage Vladislav Shpilevoy via Tarantool-patches
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

It used to crash if done during election on a node voted for
anybody, it is a candidate, it doesn't know a leader yet, but has
a WAL write in progress.

Thus it could only happen if the term was bumped by a message from
a non-leader node and wasn't flushed to the disk yet.

The patch makes the reconfig check if there is a WAL write in
progress. Then don't do anything.

Could also check for volatile vote instead of persistent, but it
would create the same problem for the case when started writing
vote for self and didn't finish yet. Reconfig would crash.
---
 .../unreleased/election-timeout-cfg-crash.md  |  5 ++
 src/lib/raft/raft.c                           |  3 +-
 test/unit/raft.c                              | 75 ++++++++++++++++++-
 test/unit/raft.result                         | 16 +++-
 4 files changed, 96 insertions(+), 3 deletions(-)
 create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md

diff --git a/changelogs/unreleased/election-timeout-cfg-crash.md b/changelogs/unreleased/election-timeout-cfg-crash.md
new file mode 100644
index 000000000..e870e3665
--- /dev/null
+++ b/changelogs/unreleased/election-timeout-cfg-crash.md
@@ -0,0 +1,5 @@
+## bugfix/raft
+
+* Reconfiguration of `box.cfg.election_timeout` could lead to a crash or
+  undefined behaviour if done during an ongoing election with a special WAL
+  write in progress.
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 0e6dc6155..1ae8fe87f 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -935,7 +935,8 @@ raft_cfg_election_timeout(struct raft *raft, double timeout)
 		return;
 
 	raft->election_timeout = timeout;
-	if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate) {
+	if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate &&
+	    !raft->is_write_in_progress) {
 		assert(raft_ev_is_active(&raft->timer));
 		struct ev_loop *loop = raft_loop();
 		double timeout = raft_ev_timer_remaining(loop, &raft->timer) -
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 8f4d2dd9a..520b94466 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1296,7 +1296,7 @@ raft_test_enable_disable(void)
 static void
 raft_test_too_long_wal_write(void)
 {
-	raft_start_test(8);
+	raft_start_test(22);
 	struct raft_node node;
 	raft_node_create(&node);
 
@@ -1362,6 +1362,79 @@ raft_test_too_long_wal_write(void)
 	ok(raft_time() - ts == node.cfg_death_timeout, "timer works again");
 	is(node.raft.state, RAFT_STATE_CANDIDATE, "became candidate");
 
+	/*
+	 * During WAL write it is possible to reconfigure election timeout.
+	 * The dangerous case is when the timer is active already. It happens
+	 * when the node voted and is a candidate, but leader is unknown.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+
+	raft_node_cfg_election_timeout(&node, 100);
+	raft_run_next_event();
+	is(node.raft.term, 2, "term is bumped");
+
+	/* Bump term again but it is not written to WAL yet. */
+	raft_node_block(&node);
+	is(raft_node_send_vote_response(&node,
+		3 /* Term. */,
+		3 /* Vote. */,
+		2 /* Source. */
+	), 0, "2 votes for 3 in a new term");
+	raft_run_next_event();
+	is(node.raft.term, 2, "term is old");
+	is(node.raft.vote, 1, "vote is used for self");
+	is(node.raft.volatile_term, 3, "volatile term is new");
+	is(node.raft.volatile_vote, 0, "volatile vote is unused");
+
+	raft_node_cfg_election_timeout(&node, 50);
+	raft_node_unblock(&node);
+	ts = raft_time();
+	raft_run_next_event();
+	ts = raft_time() - ts;
+	/* 50 + <= 10% random delay. */
+	ok(ts >= 50 && ts <= 55, "new election timeout works");
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		4 /* Term. */,
+		1 /* Vote. */,
+		4 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 4}" /* Vclock. */
+	), "new term is started with vote for self");
+
+	/*
+	 * Similar case when a vote is being written but not finished yet.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+
+	raft_node_cfg_election_timeout(&node, 100);
+	raft_node_block(&node);
+	raft_run_next_event();
+	is(node.raft.term, 1, "term is old");
+	is(node.raft.vote, 0, "vote is unused");
+	is(node.raft.volatile_term, 2, "volatile term is new");
+	is(node.raft.volatile_vote, 1, "volatile vote is self");
+
+	raft_node_cfg_election_timeout(&node, 50);
+	raft_node_unblock(&node);
+	ts = raft_time();
+	raft_run_next_event();
+	ts = raft_time() - ts;
+	/* 50 + <= 10% random delay. */
+	ok(ts >= 50 && ts <= 55, "new election timeout works");
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		3 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "new term is started with vote for self");
+
 	raft_node_destroy(&node);
 	raft_finish_test();
 }
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 14689ea81..764d82969 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -220,7 +220,7 @@ ok 11 - subtests
 ok 12 - subtests
 	*** raft_test_enable_disable: done ***
 	*** raft_test_too_long_wal_write ***
-    1..8
+    1..22
     ok 1 - vote for 2
     ok 2 - vote is volatile
     ok 3 - message from leader
@@ -229,6 +229,20 @@ ok 12 - subtests
     ok 6 - wal write is finished
     ok 7 - timer works again
     ok 8 - became candidate
+    ok 9 - term is bumped
+    ok 10 - 2 votes for 3 in a new term
+    ok 11 - term is old
+    ok 12 - vote is used for self
+    ok 13 - volatile term is new
+    ok 14 - volatile vote is unused
+    ok 15 - new election timeout works
+    ok 16 - new term is started with vote for self
+    ok 17 - term is old
+    ok 18 - vote is unused
+    ok 19 - volatile term is new
+    ok 20 - volatile vote is self
+    ok 21 - new election timeout works
+    ok 22 - new term is started with vote for self
 ok 13 - subtests
 	*** raft_test_too_long_wal_write: done ***
 	*** raft_test_promote_restore ***
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH v2 2/5] raft: fix ev_timer.at incorrect usage
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 1/5] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20  0:43 ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 3/5] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

ev_timer.at was used as timeout. But after ev_timer_start() it
turns into the deadline - totally different value.

The patch makes sure ev_timer.at is not used in raft at all.

To test that the fakeev subsystem is patched to start its time not
from 0. Otherwise ev_timer.at often really matched the timeout
even for an active timer.
---
 src/lib/fakesys/fakeev.c | 10 +++++--
 src/lib/raft/raft.c      | 59 ++++++++++++++++++++++------------------
 src/lib/raft/raft.h      |  2 +-
 3 files changed, 41 insertions(+), 30 deletions(-)

diff --git a/src/lib/fakesys/fakeev.c b/src/lib/fakesys/fakeev.c
index 5c3469486..333c1646f 100644
--- a/src/lib/fakesys/fakeev.c
+++ b/src/lib/fakesys/fakeev.c
@@ -37,8 +37,11 @@
 #include "say.h"
 #include <stdbool.h>
 
+/** Start not from 0 to catch bugs like incorrect usage of ev_timer members. */
+#define FAKEEV_START_TIME 10000
+
 /** Global watch, propagated by new events. */
-static double watch = 0;
+static double watch = FAKEEV_START_TIME;
 
 /**
  * Increasing event identifiers are used to preserve order of
@@ -278,6 +281,7 @@ fakeev_timer_start(struct ev_loop *loop, struct ev_timer *base)
 		return;
 	/* Create the periodic watcher and one event. */
 	fakeev_timer_event_new((struct ev_watcher *)base, base->at);
+	base->at += watch;
 }
 
 double
@@ -287,6 +291,7 @@ fakeev_timer_remaining(struct ev_loop *loop, struct ev_timer *base)
 	struct fakeev_event *e = fakeev_event_by_ev((struct ev_watcher *)base);
 	if (e == NULL)
 		return base->at;
+	assert(e->deadline == base->at);
 	return e->deadline - fakeev_time();
 }
 
@@ -298,6 +303,7 @@ fakeev_timer_again(struct ev_loop *loop, struct ev_timer *base)
 		return;
 	/* Create the periodic watcher and one event. */
 	fakeev_timer_event_new((struct ev_watcher *)base, base->repeat);
+	base->at = watch + base->repeat;
 }
 
 /** Time stop cancels the event if the timer is active. */
@@ -339,7 +345,7 @@ fakeev_reset(void)
 		fakeev_event_delete(e);
 	assert(mh_size(events_hash) == 0);
 	event_id = 0;
-	watch = 0;
+	watch = FAKEEV_START_TIME;
 }
 
 void
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 1ae8fe87f..d6a849503 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -931,22 +931,23 @@ raft_restore(struct raft *raft)
 void
 raft_cfg_election_timeout(struct raft *raft, double timeout)
 {
-	if (timeout == raft->election_timeout)
+	double old_timeout = raft->election_timeout;
+	if (timeout == old_timeout)
 		return;
 
 	raft->election_timeout = timeout;
-	if (raft->vote != 0 && raft->leader == 0 && raft->is_candidate &&
-	    !raft->is_write_in_progress) {
-		assert(raft_ev_is_active(&raft->timer));
-		struct ev_loop *loop = raft_loop();
-		double timeout = raft_ev_timer_remaining(loop, &raft->timer) -
-				 raft->timer.at + raft->election_timeout;
-		if (timeout < 0)
-			timeout = 0;
-		raft_ev_timer_stop(loop, &raft->timer);
-		raft_ev_timer_set(&raft->timer, timeout, timeout);
-		raft_ev_timer_start(loop, &raft->timer);
-	}
+	if (raft->vote == 0 || raft->leader != 0 || !raft->is_candidate ||
+	    raft->is_write_in_progress)
+		return;
+
+	assert(raft_ev_is_active(&raft->timer));
+	struct ev_loop *loop = raft_loop();
+	timeout += raft_ev_timer_remaining(loop, &raft->timer) - old_timeout;
+	if (timeout < 0)
+		timeout = 0;
+	raft_ev_timer_stop(loop, &raft->timer);
+	raft_ev_timer_set(&raft->timer, timeout, raft->election_timeout);
+	raft_ev_timer_start(loop, &raft->timer);
 }
 
 void
@@ -961,21 +962,25 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
 }
 
 void
-raft_cfg_death_timeout(struct raft *raft, double death_timeout)
+raft_cfg_death_timeout(struct raft *raft, double timeout)
 {
-	raft->death_timeout = death_timeout;
-	if (raft->state == RAFT_STATE_FOLLOWER && raft->is_candidate &&
-	    raft->leader != 0) {
-		assert(raft_ev_is_active(&raft->timer));
-		struct ev_loop *loop = raft_loop();
-		double timeout = raft_ev_timer_remaining(loop, &raft->timer) -
-				 raft->timer.at + raft->death_timeout;
-		if (timeout < 0)
-			timeout = 0;
-		raft_ev_timer_stop(loop, &raft->timer);
-		raft_ev_timer_set(&raft->timer, timeout, timeout);
-		raft_ev_timer_start(loop, &raft->timer);
-	}
+	double old_timeout = raft->death_timeout;
+	if (timeout == old_timeout)
+		return;
+
+	raft->death_timeout = timeout;
+	if (raft->state != RAFT_STATE_FOLLOWER || !raft->is_candidate ||
+	    raft->leader == 0)
+		return;
+
+	assert(raft_ev_is_active(&raft->timer));
+	struct ev_loop *loop = raft_loop();
+	timeout += raft_ev_timer_remaining(loop, &raft->timer) - old_timeout;
+	if (timeout < 0)
+		timeout = 0;
+	raft_ev_timer_stop(loop, &raft->timer);
+	raft_ev_timer_set(&raft->timer, timeout, raft->death_timeout);
+	raft_ev_timer_start(loop, &raft->timer);
 }
 
 void
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 53e2c58a8..e844c2d07 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -307,7 +307,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum);
  * heartbeats from the leader to consider it dead.
  */
 void
-raft_cfg_death_timeout(struct raft *raft, double death_timeout);
+raft_cfg_death_timeout(struct raft *raft, double timeout);
 
 /**
  * Configure ID of the given Raft instance. The ID can't be changed after it is
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH v2 3/5] raft: track all votes, even not own
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 1/5] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 2/5] raft: fix ev_timer.at incorrect usage Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20  0:43 ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

To detect split vote a node needs to see that number of free votes
is not enough for anyone to win even if it gets them all.

Hence every node needs to count votes for all other nodes.

The patch makes raft store votes in a bit more complicated manner
than a simple bitmap for just the current instance.

Part of #5285
---
 src/lib/raft/raft.c   | 41 +++++++++++++++++++++++++++++++++--------
 src/lib/raft/raft.h   | 24 +++++++++++++++++-------
 test/unit/raft.c      | 17 ++++++++++-------
 test/unit/raft.result |  7 ++++---
 4 files changed, 64 insertions(+), 25 deletions(-)

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index d6a849503..8182e73e3 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,6 +152,16 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
 	return cmp == 0 || cmp == 1;
 }
 
+static inline void
+raft_add_vote(struct raft *raft, int src, int dst)
+{
+	struct raft_vote *v = &raft->votes[src];
+	if (v->did_vote)
+		return;
+	v->did_vote = true;
+	++raft->votes[dst].count;
+}
+
 /** Schedule broadcast of the complete Raft state to all the followers. */
 static void
 raft_schedule_broadcast(struct raft *raft);
@@ -279,6 +289,12 @@ void
 raft_process_recovery(struct raft *raft, const struct raft_msg *req)
 {
 	say_verbose("RAFT: recover %s", raft_msg_to_string(req));
+	/*
+	 * Instance ID is not unknown until recovery ends. Because apparently it
+	 * can change during join. In Raft it is set only one time when recovery
+	 * ends for good.
+	 */
+	assert(raft->self == 0);
 	if (req->term != 0) {
 		raft->term = req->term;
 		raft->volatile_term = req->term;
@@ -335,6 +351,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 	 * persisted long time ago and still broadcasted. Or a vote response.
 	 */
 	if (req->vote != 0) {
+		raft_add_vote(raft, source, req->vote);
+
 		switch (raft->state) {
 		case RAFT_STATE_FOLLOWER:
 		case RAFT_STATE_LEADER:
@@ -395,11 +413,10 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 			 * and now was answered by some other instance.
 			 */
 			assert(raft->volatile_vote == raft->self);
-			bool was_set = bit_set(&raft->vote_mask, source);
-			raft->vote_count += !was_set;
-			if (raft->vote_count < raft->election_quorum) {
+			int vote_count = raft_vote_count(raft);
+			if (vote_count < raft->election_quorum) {
 				say_info("RAFT: accepted vote for self, vote "
-					 "count is %d/%d", raft->vote_count,
+					 "count is %d/%d", vote_count,
 					 raft->election_quorum);
 				break;
 			}
@@ -640,13 +657,11 @@ raft_sm_become_candidate(struct raft *raft)
 	assert(raft->state == RAFT_STATE_FOLLOWER);
 	assert(raft->leader == 0);
 	assert(raft->vote == raft->self);
+	assert(raft_vote_count(raft) >= 1);
 	assert(raft->is_candidate);
 	assert(!raft->is_write_in_progress);
 	assert(raft->election_quorum > 1);
 	raft->state = RAFT_STATE_CANDIDATE;
-	raft->vote_count = 1;
-	raft->vote_mask = 0;
-	bit_set(&raft->vote_mask, raft->self);
 	raft_sm_wait_election_end(raft);
 	/* State is visible and it is changed - broadcast. */
 	raft_schedule_broadcast(raft);
@@ -663,6 +678,7 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
 	raft->volatile_vote = 0;
 	raft->leader = 0;
 	raft->state = RAFT_STATE_FOLLOWER;
+	memset(raft->votes, 0, sizeof(raft->votes));
 	/*
 	 * The instance could be promoted for the previous term. But promotion
 	 * has no effect on following terms.
@@ -685,6 +701,8 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t new_vote)
 	assert(raft->leader == 0);
 	assert(raft->state == RAFT_STATE_FOLLOWER);
 	raft->volatile_vote = new_vote;
+	assert(!raft->votes[raft->self].did_vote);
+	raft_add_vote(raft, raft->self, raft->self);
 	raft_sm_pause_and_dump(raft);
 	/* Nothing visible is changed - no broadcast. */
 }
@@ -957,7 +975,7 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
 	assert(election_quorum > 0);
 	raft->election_quorum = election_quorum;
 	if (raft->state == RAFT_STATE_CANDIDATE &&
-	    raft->vote_count >= raft->election_quorum)
+	    raft_vote_count(raft) >= raft->election_quorum)
 		raft_sm_become_leader(raft);
 }
 
@@ -989,6 +1007,13 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id)
 	assert(raft->self == 0);
 	assert(instance_id != 0);
 	raft->self = instance_id;
+	/*
+	 * Couldn't do that reliably during recovery. Instance ID can change
+	 * more than once during join. Here instance ID is configured when it is
+	 * known forever and is safe to use.
+	 */
+	if (raft->vote != 0)
+		raft_add_vote(raft, instance_id, raft->vote);
 }
 
 void
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e844c2d07..1d064bef0 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -138,6 +138,14 @@ struct raft_vtab {
 	raft_schedule_async_f schedule_async;
 };
 
+/** Vote descriptor of a single node. */
+struct raft_vote {
+	/** Whether the node voted for anybody. */
+	bool did_vote;
+	/** How many votes the node got in the current term. */
+	int count;
+};
+
 struct raft {
 	/** Instance ID of this node. */
 	uint32_t self;
@@ -189,13 +197,8 @@ struct raft {
 	 */
 	uint64_t term;
 	uint32_t vote;
-	/**
-	 * Bit 1 on position N means that a vote from instance with ID = N was
-	 * obtained.
-	 */
-	vclock_map_t vote_mask;
-	/** Number of votes for this instance. Valid only in candidate state. */
-	int vote_count;
+	/** Statistics which node voted for who. */
+	struct raft_vote votes[VCLOCK_MAX];
 	/** Number of votes necessary for successful election. */
 	int election_quorum;
 	/**
@@ -243,6 +246,13 @@ raft_is_enabled(const struct raft *raft)
 	return raft->is_enabled;
 }
 
+/** Number of votes for self. */
+static inline int
+raft_vote_count(const struct raft *raft)
+{
+	return raft->votes[raft->self].count;
+}
+
 /** Process a raft entry stored in WAL/snapshot. */
 void
 raft_process_recovery(struct raft *raft, const struct raft_msg *req);
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 520b94466..9f18f4d8e 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -70,7 +70,7 @@ raft_test_leader_election(void)
 		1 /* Volatile vote. */,
 		"{0: 1}" /* Vclock. */
 	), "elections with a new term");
-	is(node.raft.vote_count, 1, "single vote for self");
+	is(raft_vote_count(&node.raft), 1, "single vote for self");
 	ok(node.update_count > 0, "trigger worked");
 	node.update_count = 0;
 
@@ -100,7 +100,7 @@ raft_test_leader_election(void)
 		1 /* Vote. */,
 		2 /* Source. */
 	), 0, "vote response from 2");
-	is(node.raft.vote_count, 2, "2 votes - 1 self and 1 foreign");
+	is(raft_vote_count(&node.raft), 2, "2 votes - 1 self and 1 foreign");
 	ok(!node.has_work, "no work to do - not enough votes yet");
 
 	raft_run_for(node.cfg_election_timeout / 2);
@@ -115,7 +115,7 @@ raft_test_leader_election(void)
 		1 /* Vote. */,
 		3 /* Source. */
 	), 0, "vote response from 3");
-	is(node.raft.vote_count, 3, "2 votes - 1 self and 2 foreign");
+	is(raft_vote_count(&node.raft), 3, "2 votes - 1 self and 2 foreign");
 	is(node.raft.state, RAFT_STATE_LEADER, "became leader");
 	ok(node.update_count > 0, "trigger worked");
 	node.update_count = 0;
@@ -142,7 +142,7 @@ raft_test_leader_election(void)
 static void
 raft_test_recovery(void)
 {
-	raft_start_test(12);
+	raft_start_test(13);
 	struct raft_msg msg;
 	struct raft_node node;
 	raft_node_create(&node);
@@ -224,6 +224,8 @@ raft_test_recovery(void)
 		"{0: 1}" /* Vclock. */
 	), "restart always as a follower");
 
+	is(raft_vote_count(&node.raft), 1, "vote count is restored correctly");
+
 	raft_checkpoint_remote(&node.raft, &msg);
 	ok(raft_msg_check(&msg,
 		RAFT_STATE_FOLLOWER /* State. */,
@@ -383,7 +385,7 @@ raft_test_vote_skip(void)
 		1 /* Vote. */,
 		2 /* Source. */
 	), 0, "message is accepted");
-	is(node.raft.vote_count, 1, "but ignored - too old term");
+	is(raft_vote_count(&node.raft), 1, "but ignored - too old term");
 
 	/* Competing vote requests are skipped. */
 
@@ -392,7 +394,8 @@ raft_test_vote_skip(void)
 		3 /* Vote. */,
 		2 /* Source. */
 	), 0, "message is accepted");
-	is(node.raft.vote_count, 1, "but ignored - vote not for this node");
+	is(raft_vote_count(&node.raft), 1,
+	   "but ignored - vote not for this node");
 	is(node.raft.state, RAFT_STATE_CANDIDATE, "this node does not give up");
 
 	/* Vote requests are ignored when node is disabled. */
@@ -1071,7 +1074,7 @@ raft_test_election_quorum(void)
 		1 /* Vote. */,
 		2 /* Source. */
 	), 0, "send vote response from second node");
-	is(node.raft.vote_count, 2, "vote is accepted");
+	is(raft_vote_count(&node.raft), 2, "vote is accepted");
 	is(node.raft.state, RAFT_STATE_CANDIDATE, "but still candidate");
 
 	raft_node_cfg_election_quorum(&node, 2);
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 764d82969..3fcdf8180 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -29,7 +29,7 @@
 ok 1 - subtests
 	*** raft_test_leader_election: done ***
 	*** raft_test_recovery ***
-    1..12
+    1..13
     ok 1 - became candidate
     ok 2 - remote checkpoint of a candidate
     ok 3 - local checkpoint of a candidate
@@ -40,8 +40,9 @@ ok 1 - subtests
     ok 8 - remote checkpoint of a leader
     ok 9 - local checkpoint of a leader
     ok 10 - restart always as a follower
-    ok 11 - remote checkpoint of a leader
-    ok 12 - local checkpoint of a leader
+    ok 11 - vote count is restored correctly
+    ok 12 - remote checkpoint of a leader
+    ok 13 - local checkpoint of a leader
 ok 2 - subtests
 	*** raft_test_recovery: done ***
 	*** raft_test_bad_msg ***
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
                   ` (2 preceding siblings ...)
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 3/5] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20  0:43 ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-20 13:22   ` Serge Petrenko via Tarantool-patches
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 5/5] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Split vote is a situation when during election nobody can win and
will not win in this term for sure. Not a single node could get
enough votes. For example, with 4 nodes one could get 2 votes and
other also 2 votes. Nobody will get quorum 3 in this term.

The patch makes raft able to notice that situation and speed up
the term bump. It is not bumped immediately though, because nodes
might do that simultaneously and will get the split vote again
after voting for self. There is a random delay. But it is just max
10% of election timeout, so it should speed up split vote
resolution on 90% at least.

Part of #5285
---
 src/lib/raft/raft.c         | 129 +++++++++++++++-
 src/lib/raft/raft.h         |   6 +
 test/unit/raft.c            | 301 +++++++++++++++++++++++++++++++++++-
 test/unit/raft.result       |  64 +++++++-
 test/unit/raft_test_utils.c |  12 ++
 test/unit/raft_test_utils.h |   5 +
 6 files changed, 512 insertions(+), 5 deletions(-)

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 8182e73e3..5dcc5beaf 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,20 +152,83 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
 	return cmp == 0 || cmp == 1;
 }
 
-static inline void
+static inline bool
 raft_add_vote(struct raft *raft, int src, int dst)
 {
 	struct raft_vote *v = &raft->votes[src];
 	if (v->did_vote)
-		return;
+		return false;
 	v->did_vote = true;
 	++raft->votes[dst].count;
+	return true;
+}
+
+static bool
+raft_has_split_vote(const struct raft *raft)
+{
+	int max_vote = 0;
+	int vote_vac = raft->cluster_size;
+	int quorum = raft->election_quorum;
+	/*
+	 * Quorum > cluster is either a misconfiguration or some instances
+	 * didn't register yet. Anyway, speeding the elections up won't help.
+	 */
+	if (vote_vac < quorum)
+		return false;
+	for (int i = 0; i < VCLOCK_MAX; ++i) {
+		int count = raft->votes[i].count;
+		vote_vac -= count;
+		if (count > max_vote)
+			max_vote = count;
+	}
+	/*
+	 * More nodes voted than there are nodes configured. The reason is the
+	 * the same as with quorum > cluster. The action is also the same -
+	 * faster term bumps won't help.
+	 */
+	if (vote_vac < 0)
+		return false;
+	return max_vote + vote_vac < quorum;
+}
+
+static int
+raft_scores_snprintf(const struct raft *raft, char *buf, int size)
+{
+	int total = 0;
+	bool is_empty = true;
+	SNPRINT(total, snprintf, buf, size, "{");
+	for (int i = 0; i < VCLOCK_MAX; ++i) {
+		int count = raft->votes[i].count;
+		if (count == 0)
+			continue;
+		if (!is_empty)
+			SNPRINT(total, snprintf, buf, size, ", ");
+		else
+			is_empty = false;
+		SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
+	}
+	SNPRINT(total, snprintf, buf, size, "}");
+	return total;
+}
+
+static const char *
+raft_scores_str(const struct raft *raft)
+{
+	char *buf = tt_static_buf();
+	int rc = raft_scores_snprintf(raft, buf, TT_STATIC_BUF_LEN);
+	assert(rc >= 0);
+	(void)rc;
+	return buf;
 }
 
 /** Schedule broadcast of the complete Raft state to all the followers. */
 static void
 raft_schedule_broadcast(struct raft *raft);
 
+/** If there is split vote, the node might reduce the next term delay. */
+static void
+raft_check_split_vote(struct raft *raft);
+
 /** Raft state machine methods. 'sm' stands for State Machine. */
 
 /**
@@ -351,7 +414,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 	 * persisted long time ago and still broadcasted. Or a vote response.
 	 */
 	if (req->vote != 0) {
-		raft_add_vote(raft, source, req->vote);
+		if (raft_add_vote(raft, source, req->vote))
+		    raft_check_split_vote(raft);
 
 		switch (raft->state) {
 		case RAFT_STATE_FOLLOWER:
@@ -771,6 +835,11 @@ raft_sm_wait_election_end(struct raft *raft)
 				  raft_new_random_election_shift(raft);
 	raft_ev_timer_set(&raft->timer, election_timeout, election_timeout);
 	raft_ev_timer_start(raft_loop(), &raft->timer);
+	/*
+	 * Could start the waiting after a WAL write during which the split vote
+	 * could happen.
+	 */
+	raft_check_split_vote(raft);
 }
 
 static void
@@ -977,6 +1046,8 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
 	if (raft->state == RAFT_STATE_CANDIDATE &&
 	    raft_vote_count(raft) >= raft->election_quorum)
 		raft_sm_become_leader(raft);
+	else
+		raft_check_split_vote(raft);
 }
 
 void
@@ -1024,6 +1095,13 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
 	raft->vclock = vclock;
 }
 
+void
+raft_cfg_cluster_size(struct raft *raft, int size)
+{
+	raft->cluster_size = size;
+	raft_check_split_vote(raft);
+}
+
 void
 raft_new_term(struct raft *raft)
 {
@@ -1048,6 +1126,50 @@ raft_schedule_broadcast(struct raft *raft)
 	raft_schedule_async(raft);
 }
 
+static void
+raft_check_split_vote(struct raft *raft)
+{
+	/* When leader is known, there is no election. Thus no vote to split. */
+	if (raft->leader != 0)
+		return;
+	/* Not a candidate = can't trigger term bump anyway. */
+	if (!raft->is_candidate)
+		return;
+	/*
+	 * WAL write in progress means the state is changing. All is rechecked
+	 * when it is done.
+	 */
+	if (raft->is_write_in_progress)
+		return;
+	if (!raft_has_split_vote(raft))
+		return;
+	assert(raft_ev_is_active(&raft->timer));
+	/*
+	 * Could be already detected before. The timeout would be updated by now
+	 * then.
+	 */
+	if (raft->timer.repeat < raft->election_timeout)
+		return;
+
+	assert(raft->state == RAFT_STATE_FOLLOWER ||
+	       raft->state == RAFT_STATE_CANDIDATE);
+	struct ev_loop *loop = raft_loop();
+	struct ev_timer *timer = &raft->timer;
+	double delay = raft_new_random_election_shift(raft);
+	/*
+	 * Could be too late to speed up anything - probably the term is almost
+	 * over anyway.
+	 */
+	double remaining = raft_ev_timer_remaining(loop, timer);
+	if (delay >= remaining)
+		delay = remaining;
+	say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
+		 raft_scores_str(raft), delay);
+	raft_ev_timer_stop(loop, timer);
+	raft_ev_timer_set(timer, delay, delay);
+	raft_ev_timer_start(loop, timer);
+}
+
 void
 raft_create(struct raft *raft, const struct raft_vtab *vtab)
 {
@@ -1058,6 +1180,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
 		.election_quorum = 1,
 		.election_timeout = 5,
 		.death_timeout = 5,
+		.cluster_size = VCLOCK_MAX,
 		.vtab = vtab,
 	};
 	raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 1d064bef0..817148792 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -219,6 +219,8 @@ struct raft {
 	 * elections can be started.
 	 */
 	double death_timeout;
+	/** Number of instances registered in the cluster. */
+	int cluster_size;
 	/** Virtual table to perform application-specific actions. */
 	const struct raft_vtab *vtab;
 	/**
@@ -333,6 +335,10 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
 void
 raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
 
+/** Configure number of registered instances in the cluster. */
+void
+raft_cfg_cluster_size(struct raft *raft, int size);
+
 /**
  * Bump the term. When it is persisted, the node checks if there is a leader,
  * and if there is not, a new election is started. That said, this function can
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 9f18f4d8e..2fe6354cd 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1497,10 +1497,308 @@ raft_test_promote_restore(void)
 	raft_finish_test();
 }
 
+static void
+raft_test_split_vote(void)
+{
+	raft_start_test(58);
+	struct raft_node node;
+	raft_node_create(&node);
+
+	/*
+	 * Normal split vote.
+	 */
+	raft_node_cfg_cluster_size(&node, 4);
+	raft_node_cfg_election_quorum(&node, 3);
+	raft_run_next_event();
+
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		2 /* Term. */,
+		1 /* Vote. */,
+		2 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 1}" /* Vclock. */
+	), "elections with a new term");
+
+	/* Make so node 1 has votes 1 and 2. Node 3 has votes 3 and 4. */
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		1 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 1 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "term "
+	   "timeout >= election timeout normally");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		4 /* Source. */
+	), 0, "vote response for 3 from 4");
+
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
+	   "reduced the term timeout");
+
+	raft_run_next_event();
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		3 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "a new term");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "timeout is "
+	   "normal again");
+
+	/*
+	 * Cluster size change can make split vote.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+	raft_run_next_event();
+	is(node.raft.state, RAFT_STATE_CANDIDATE, "is candidate");
+	is(node.raft.vote, 1, "voted for self");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "the vote is "
+	   "not split yet");
+
+	raft_node_cfg_cluster_size(&node, 2);
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "cluster size "
+	   "change makes split vote");
+
+	raft_run_next_event();
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		3 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "a new term");
+
+	/*
+	 * Split vote can be even when the leader is already known - then
+	 * nothing to do. Votes are just left from the beginning of the term and
+	 * then probably cluster size reduced a bit.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	/* There is also a vote from 3 for 2. But it wasn't delivered to 1. */
+	is(raft_node_send_leader(&node,
+		2 /* Term. */,
+		2 /* Source. */
+	), 0, "message is accepted");
+	is(node.raft.leader, 2, "other node's leadership is accepted");
+	is(raft_vote_count(&node.raft), 1, "the only own vote was from self");
+
+	raft_node_cfg_cluster_size(&node, 2);
+	ok(node.raft.timer.repeat >= node.raft.death_timeout, "cluster change "
+	   "does not affect the leader's death waiting");
+
+	/*
+	 * Non-candidate should ignore split vote.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 3);
+	raft_node_cfg_is_candidate(&node, false);
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	ok(!raft_ev_is_active(&node.raft.timer), "voter couldn't schedule "
+	    "new term");
+
+	/*
+	 * Split vote can get worse, but it shouldn't lead to new term delay
+	 * restart.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 3);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	double delay = node.raft.timer.repeat;
+	ok(delay < node.raft.election_timeout, "split vote is already "
+	   "inevitable");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	is(delay, node.raft.timer.repeat, "split vote got worse, but delay "
+	   "didn't change");
+
+	/*
+	 * Handle split vote when WAL write is in progress.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 2);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_node_block(&node);
+	raft_run_next_event();
+	is(node.raft.term, 1, "old term");
+	is(node.raft.vote, 0, "unused vote");
+	is(node.raft.volatile_term, 2, "new volatile term");
+	is(node.raft.volatile_vote, 1, "new volatile vote");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	raft_node_unblock(&node);
+	is(node.raft.term, 2, "new term");
+	is(node.raft.vote, 1, "voted for self");
+	is(node.raft.volatile_term, 2, "volatile term");
+	is(node.raft.volatile_vote, 1, "volatile vote");
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "found split "
+	   "vote after WAL write");
+
+	raft_run_next_event();
+	is(node.raft.term, 3, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+
+	/*
+	 * Split vote check is disabled when cluster size < quorum. Makes no
+	 * sense to speed the elections up.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 1);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+	   "is not checked for cluster < quorum");
+
+	/*
+	 * Split vote check is disabled when vote count > cluster size. The
+	 * reason is the same as with quorum > cluster size - something is odd,
+	 * more term bumps won't help.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 2 from 3");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		4 /* Source. */
+	), 0, "vote response for 3 from 4");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		4 /* Vote. */,
+		5 /* Source. */
+	), 0, "vote response for 4 from 5");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+	   "is not checked when vote count > cluster size");
+
+	/*
+	 * Split vote can happen if quorum was suddenly increased.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "not split "
+	   "vote yet");
+
+	raft_node_cfg_election_quorum(&node, 3);
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
+	   "after quorum increase");
+
+	raft_run_next_event();
+	is(node.raft.term, 3, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+
+	raft_node_destroy(&node);
+	raft_finish_test();
+}
+
 static int
 main_f(va_list ap)
 {
-	raft_start_test(14);
+	raft_start_test(15);
 
 	(void) ap;
 	fakeev_init();
@@ -1519,6 +1817,7 @@ main_f(va_list ap)
 	raft_test_enable_disable();
 	raft_test_too_long_wal_write();
 	raft_test_promote_restore();
+	raft_test_split_vote();
 
 	fakeev_free();
 
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 3fcdf8180..cfc9be733 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
 	*** main_f ***
-1..14
+1..15
 	*** raft_test_leader_election ***
     1..24
     ok 1 - 1 pending message at start
@@ -262,4 +262,66 @@ ok 13 - subtests
     ok 12 - not a candidate
 ok 14 - subtests
 	*** raft_test_promote_restore: done ***
+	*** raft_test_split_vote ***
+    1..58
+    ok 1 - elections with a new term
+    ok 2 - vote response for 1 from 2
+    ok 3 - vote response for 3 from 3
+    ok 4 - term timeout >= election timeout normally
+    ok 5 - vote response for 3 from 4
+    ok 6 - split vote reduced the term timeout
+    ok 7 - a new term
+    ok 8 - timeout is normal again
+    ok 9 - is candidate
+    ok 10 - voted for self
+    ok 11 - vote response for 2 from 2
+    ok 12 - the vote is not split yet
+    ok 13 - cluster size change makes split vote
+    ok 14 - a new term
+    ok 15 - vote response for 2 from 2
+    ok 16 - message is accepted
+    ok 17 - other node's leadership is accepted
+    ok 18 - the only own vote was from self
+    ok 19 - cluster change does not affect the leader's death waiting
+    ok 20 - vote response for 2 from 2
+    ok 21 - vote response for 3 from 3
+    ok 22 - voter couldn't schedule new term
+    ok 23 - bump term
+    ok 24 - vote for self
+    ok 25 - vote response for 2 from 2
+    ok 26 - split vote is already inevitable
+    ok 27 - vote response for 3 from 3
+    ok 28 - split vote got worse, but delay didn't change
+    ok 29 - old term
+    ok 30 - unused vote
+    ok 31 - new volatile term
+    ok 32 - new volatile vote
+    ok 33 - vote response for 2 from 2
+    ok 34 - new term
+    ok 35 - voted for self
+    ok 36 - volatile term
+    ok 37 - volatile vote
+    ok 38 - found split vote after WAL write
+    ok 39 - bump term
+    ok 40 - vote for self
+    ok 41 - bump term
+    ok 42 - vote for self
+    ok 43 - vote response for 2 from 2
+    ok 44 - split vote is not checked for cluster < quorum
+    ok 45 - bump term
+    ok 46 - vote for self
+    ok 47 - vote response for 2 from 2
+    ok 48 - vote response for 2 from 3
+    ok 49 - vote response for 3 from 4
+    ok 50 - vote response for 4 from 5
+    ok 51 - split vote is not checked when vote count > cluster size
+    ok 52 - bump term
+    ok 53 - vote for self
+    ok 54 - vote response for 2 from 2
+    ok 55 - not split vote yet
+    ok 56 - split vote after quorum increase
+    ok 57 - bump term
+    ok 58 - vote for self
+ok 15 - subtests
+	*** raft_test_split_vote: done ***
 	*** main_f: done ***
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b5754cf78..1d7dd01ba 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -194,6 +194,7 @@ raft_node_create(struct raft_node *node)
 	node->cfg_election_quorum = 3;
 	node->cfg_death_timeout = 5;
 	node->cfg_instance_id = 1;
+	node->cfg_cluster_size = VCLOCK_MAX;
 	node->cfg_vclock = &node->journal.vclock;
 	raft_journal_create(&node->journal, node->cfg_instance_id);
 	raft_node_start(node);
@@ -365,6 +366,7 @@ raft_node_start(struct raft_node *node)
 	raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
 	raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
 	raft_cfg_instance_id(&node->raft, node->cfg_instance_id);
+	raft_cfg_cluster_size(&node->raft, node->cfg_cluster_size);
 	raft_cfg_vclock(&node->raft, node->cfg_vclock);
 	raft_run_async_work();
 }
@@ -423,6 +425,16 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
 	}
 }
 
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value)
+{
+	node->cfg_cluster_size = value;
+	if (raft_node_is_started(node)) {
+		raft_cfg_cluster_size(&node->raft, value);
+		raft_run_async_work();
+	}
+}
+
 void
 raft_node_cfg_election_timeout(struct raft_node *node, double value)
 {
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index c68dc3b22..2138a829e 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -32,6 +32,7 @@
 #include "fakesys/fakeev.h"
 #include "fiber.h"
 #include "raft/raft.h"
+#include "raft/raft_ev.h"
 #include "unit.h"
 
 /** WAL simulation. It stores a list of rows which raft wanted to persist. */
@@ -105,6 +106,7 @@ struct raft_node {
 	int cfg_election_quorum;
 	double cfg_death_timeout;
 	uint32_t cfg_instance_id;
+	int cfg_cluster_size;
 	struct vclock *cfg_vclock;
 };
 
@@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value);
 void
 raft_node_cfg_is_candidate(struct raft_node *node, bool value);
 
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value);
+
 void
 raft_node_cfg_election_timeout(struct raft_node *node, double value);
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [Tarantool-patches] [PATCH v2 5/5] election: activate raft split vote handling
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
                   ` (3 preceding siblings ...)
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20  0:43 ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-25 10:18 ` [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Serge Petrenko via Tarantool-patches
  2022-01-25 22:51 ` Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20  0:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Raft needs to know cluster size in order to detect and handle
split vote. The patch uses registered server count as cluster
size.

It is not documented nor has a changelog file because this is an
optimization. Can't be observed except in logs or with a watch.

Closes #5285
---
 src/box/raft.c                                |  4 +-
 .../election_split_vote_test.lua              | 92 +++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletion(-)
 create mode 100644 test/replication-luatest/election_split_vote_test.lua

diff --git a/src/box/raft.c b/src/box/raft.c
index 1e360dc88..be6009cc1 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -229,7 +229,9 @@ box_raft_update_election_quorum(void)
 	 *   be lost.
 	 */
 	int quorum = MIN(replication_synchro_quorum, max);
-	raft_cfg_election_quorum(box_raft(), quorum);
+	struct raft *raft = box_raft();
+	raft_cfg_election_quorum(raft, quorum);
+	raft_cfg_cluster_size(raft, max);
 }
 
 void
diff --git a/test/replication-luatest/election_split_vote_test.lua b/test/replication-luatest/election_split_vote_test.lua
new file mode 100644
index 000000000..f31bfd7f3
--- /dev/null
+++ b/test/replication-luatest/election_split_vote_test.lua
@@ -0,0 +1,92 @@
+local t = require('luatest')
+local cluster = require('test.luatest_helpers.cluster')
+local helpers = require('test.luatest_helpers')
+local wait_timeout = 120
+
+--
+-- gh-5285: split vote is when in the current term there can't be winner of the
+-- leader role. Number of unused votes is not enough for anyone to get the
+-- quorum. It can be detected to speed up the term bump.
+--
+local g = t.group('split-vote')
+
+g.before_each(function()
+    g.cluster = cluster:new({})
+    local node1_uri = helpers.instance_uri('node1')
+    local node2_uri = helpers.instance_uri('node2')
+    local replication = {node1_uri, node2_uri}
+    local box_cfg = {
+        listen = node1_uri,
+        replication = replication,
+        -- To speed up new term when try to elect a first leader.
+        replication_timeout = 0.1,
+        replication_synchro_quorum = 2,
+        election_timeout = 1000000,
+    }
+    g.node1 = g.cluster:build_server({alias = 'node1', box_cfg = box_cfg})
+
+    box_cfg.listen = node2_uri
+    g.node2 = g.cluster:build_server({alias = 'node2', box_cfg = box_cfg})
+
+    g.cluster:add_server(g.node1)
+    g.cluster:add_server(g.node2)
+    g.cluster:start()
+end)
+
+g.after_each(function()
+    g.cluster:drop()
+end)
+
+g.test_split_vote = function(g)
+    -- Stop the replication so as the nodes can't request votes from each other.
+    local node1_repl = g.node1:exec(function()
+        local repl = box.cfg.replication
+        box.cfg{replication = {}}
+        return repl
+    end)
+    local node2_repl = g.node2:exec(function()
+        local repl = box.cfg.replication
+        box.cfg{replication = {}}
+        return repl
+    end)
+
+    -- Both vote for self but don't see the split-vote yet.
+    g.node1:exec(function()
+        box.cfg{election_mode = 'candidate'}
+    end)
+    g.node2:exec(function()
+        box.cfg{election_mode = 'candidate'}
+    end)
+
+    -- Wait for the votes to actually happen.
+    t.helpers.retrying({timeout = wait_timeout}, function()
+        local func = function()
+            return box.info.election.vote == box.info.id
+        end
+        assert(g.node1:exec(func))
+        assert(g.node2:exec(func))
+    end)
+
+    -- Now let the nodes notice the split vote.
+    g.node1:exec(function(repl)
+        box.cfg{replication = repl}
+    end, {node1_repl})
+    g.node2:exec(function(repl)
+        box.cfg{replication = repl}
+    end, {node2_repl})
+
+    t.helpers.retrying({timeout = wait_timeout}, function()
+        local msg = 'split vote is discovered'
+        assert(g.node1:grep_log(msg) or g.node2:grep_log(msg))
+    end)
+
+    -- Ensure a leader is eventually elected. Nothing is broken for good.
+    g.node1:exec(function()
+        box.cfg{election_timeout = 1}
+    end)
+    g.node2:exec(function()
+        box.cfg{election_timeout = 1}
+    end)
+    g.node1:wait_election_leader_found()
+    g.node2:wait_election_leader_found()
+end
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-20 13:22   ` Serge Petrenko via Tarantool-patches
  2022-01-20 23:02     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-20 13:22 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches



20.01.2022 03:43, Vladislav Shpilevoy пишет:
> Split vote is a situation when during election nobody can win and
> will not win in this term for sure. Not a single node could get
> enough votes. For example, with 4 nodes one could get 2 votes and
> other also 2 votes. Nobody will get quorum 3 in this term.
>
> The patch makes raft able to notice that situation and speed up
> the term bump. It is not bumped immediately though, because nodes
> might do that simultaneously and will get the split vote again
> after voting for self. There is a random delay. But it is just max
> 10% of election timeout, so it should speed up split vote
> resolution on 90% at least.
>
> Part of #5285
> ---
>   src/lib/raft/raft.c         | 129 +++++++++++++++-
>   src/lib/raft/raft.h         |   6 +
>   test/unit/raft.c            | 301 +++++++++++++++++++++++++++++++++++-
>   test/unit/raft.result       |  64 +++++++-
>   test/unit/raft_test_utils.c |  12 ++
>   test/unit/raft_test_utils.h |   5 +
>   6 files changed, 512 insertions(+), 5 deletions(-)
>

Thanks for the fixes! Please find 2 comments below.

> +static void
> +raft_check_split_vote(struct raft *raft)
> +{
> +	/* When leader is known, there is no election. Thus no vote to split. */
> +	if (raft->leader != 0)
> +		return;
> +	/* Not a candidate = can't trigger term bump anyway. */
> +	if (!raft->is_candidate)
> +		return;
> +	/*
> +	 * WAL write in progress means the state is changing. All is rechecked
> +	 * when it is done.
> +	 */
> +	if (raft->is_write_in_progress)
> +		return;
> +	if (!raft_has_split_vote(raft))
> +		return;
> +	assert(raft_ev_is_active(&raft->timer));
> +	/*
> +	 * Could be already detected before. The timeout would be updated by now
> +	 * then.
> +	 */
> +	if (raft->timer.repeat < raft->election_timeout)
> +		return;

I don't think you should decrease timer.repeat.
This 'vote speedup' is for a single term only.

Besides the check below about delay >= remaining is enough
to test if split vote detection was already triggered.

> +
> +	assert(raft->state == RAFT_STATE_FOLLOWER ||
> +	       raft->state == RAFT_STATE_CANDIDATE);
> +	struct ev_loop *loop = raft_loop();
> +	struct ev_timer *timer = &raft->timer;
> +	double delay = raft_new_random_election_shift(raft);
> +	/*
> +	 * Could be too late to speed up anything - probably the term is almost
> +	 * over anyway.
> +	 */
> +	double remaining = raft_ev_timer_remaining(loop, timer);
> +	if (delay >= remaining)
> +		delay = remaining;
> +	say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
> +		 raft_scores_str(raft), delay);
> +	raft_ev_timer_stop(loop, timer);
> +	raft_ev_timer_set(timer, delay, delay);


...
> diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
> index c68dc3b22..2138a829e 100644
> --- a/test/unit/raft_test_utils.h
> +++ b/test/unit/raft_test_utils.h
> @@ -32,6 +32,7 @@
>   #include "fakesys/fakeev.h"
>   #include "fiber.h"
>   #include "raft/raft.h"
> +#include "raft/raft_ev.h"

Why do you need it here?

>   #include "unit.h"
>   
>   /** WAL simulation. It stores a list of rows which raft wanted to persist. */
> @@ -105,6 +106,7 @@ struct raft_node {
>   	int cfg_election_quorum;
>   	double cfg_death_timeout;
>   	uint32_t cfg_instance_id;
> +	int cfg_cluster_size;
>   	struct vclock *cfg_vclock;
>   };
>   
> @@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value);
>   void
>   raft_node_cfg_is_candidate(struct raft_node *node, bool value);
>   
> +void
> +raft_node_cfg_cluster_size(struct raft_node *node, int value);
> +
>   void
>   raft_node_cfg_election_timeout(struct raft_node *node, double value);
>   

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection
  2022-01-20 13:22   ` Serge Petrenko via Tarantool-patches
@ 2022-01-20 23:02     ` Vladislav Shpilevoy via Tarantool-patches
  2022-01-25 10:17       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-20 23:02 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches

>> +static void
>> +raft_check_split_vote(struct raft *raft)
>> +{
>> +    /* When leader is known, there is no election. Thus no vote to split. */
>> +    if (raft->leader != 0)
>> +        return;
>> +    /* Not a candidate = can't trigger term bump anyway. */
>> +    if (!raft->is_candidate)
>> +        return;
>> +    /*
>> +     * WAL write in progress means the state is changing. All is rechecked
>> +     * when it is done.
>> +     */
>> +    if (raft->is_write_in_progress)
>> +        return;
>> +    if (!raft_has_split_vote(raft))
>> +        return;
>> +    assert(raft_ev_is_active(&raft->timer));
>> +    /*
>> +     * Could be already detected before. The timeout would be updated by now
>> +     * then.
>> +     */
>> +    if (raft->timer.repeat < raft->election_timeout)
>> +        return;
> 
> I don't think you should decrease timer.repeat.
> This 'vote speedup' is for a single term only.
> 
> Besides the check below about delay >= remaining is enough
> to test if split vote detection was already triggered.

I update timer.repeat as a flag that the split vote was already seen
during this term. If I drop this check, each next vote after the first
detection of split vote will lead to potential timeout decrease depending
on random. I wanted to decrease it only once - when split vote is seen
first time. Even managed to add a test for it.

The alternative was to store a flag 'has_split_vote' somewhere in struct
raft, but I didn't want to add it.

>> +
>> +    assert(raft->state == RAFT_STATE_FOLLOWER ||
>> +           raft->state == RAFT_STATE_CANDIDATE);
>> +    struct ev_loop *loop = raft_loop();
>> +    struct ev_timer *timer = &raft->timer;
>> +    double delay = raft_new_random_election_shift(raft);
>> +    /*
>> +     * Could be too late to speed up anything - probably the term is almost
>> +     * over anyway.
>> +     */
>> +    double remaining = raft_ev_timer_remaining(loop, timer);
>> +    if (delay >= remaining)
>> +        delay = remaining;
>> +    say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
>> +         raft_scores_str(raft), delay);
>> +    raft_ev_timer_stop(loop, timer);
>> +    raft_ev_timer_set(timer, delay, delay);
> 
> 
> ...
>> diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
>> index c68dc3b22..2138a829e 100644
>> --- a/test/unit/raft_test_utils.h
>> +++ b/test/unit/raft_test_utils.h
>> @@ -32,6 +32,7 @@
>>   #include "fakesys/fakeev.h"
>>   #include "fiber.h"
>>   #include "raft/raft.h"
>> +#include "raft/raft_ev.h"
> 
> Why do you need it here?

The tests now use raft_ev_is_active() which is defined in this header. I
decided to add it here instead of unit/raft.c because the idea of
raft_test_utils.h is, among other things, to do all the boilerplate work
such as necessary inclusions.

New version of the commit after the comment from v1 about new raft
members:

====================
    raft: introduce split vote detection
    
    Split vote is a situation when during election nobody can win and
    will not win in this term for sure. Not a single node could get
    enough votes. For example, with 4 nodes one could get 2 votes and
    other also 2 votes. Nobody will get quorum 3 in this term.
    
    The patch makes raft able to notice that situation and speed up
    the term bump. It is not bumped immediately though, because nodes
    might do that simultaneously and will get the split vote again
    after voting for self. There is a random delay. But it is just max
    10% of election timeout, so it should speed up split vote
    resolution on 90% at least.
    
    Part of #5285

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 8182e73e3..90ed01ca4 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -152,20 +152,76 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v)
 	return cmp == 0 || cmp == 1;
 }
 
-static inline void
+static inline bool
 raft_add_vote(struct raft *raft, int src, int dst)
 {
 	struct raft_vote *v = &raft->votes[src];
 	if (v->did_vote)
-		return;
+		return false;
 	v->did_vote = true;
-	++raft->votes[dst].count;
+	++raft->voted_count;
+	int count = ++raft->votes[dst].count;
+	if (count > raft->max_vote)
+		raft->max_vote = count;
+	return true;
+}
+
+static bool
+raft_has_split_vote(const struct raft *raft)
+{
+	int vote_vac = raft->cluster_size;
+	int quorum = raft->election_quorum;
+	/*
+	 * Quorum > cluster is either a misconfiguration or some instances
+	 * didn't register yet. Anyway, speeding the elections up won't help.
+	 * The same when more nodes voted than there are nodes configured.
+	 */
+	if (vote_vac < quorum)
+		return false;
+	vote_vac -= raft->voted_count;
+	if (vote_vac < 0)
+		return false;
+	return raft->max_vote + vote_vac < quorum;
+}
+
+static int
+raft_scores_snprintf(const struct raft *raft, char *buf, int size)
+{
+	int total = 0;
+	bool is_empty = true;
+	SNPRINT(total, snprintf, buf, size, "{");
+	for (int i = 0; i < VCLOCK_MAX; ++i) {
+		int count = raft->votes[i].count;
+		if (count == 0)
+			continue;
+		if (!is_empty)
+			SNPRINT(total, snprintf, buf, size, ", ");
+		else
+			is_empty = false;
+		SNPRINT(total, snprintf, buf, size, "%d: %d", i, count);
+	}
+	SNPRINT(total, snprintf, buf, size, "}");
+	return total;
+}
+
+static const char *
+raft_scores_str(const struct raft *raft)
+{
+	char *buf = tt_static_buf();
+	int rc = raft_scores_snprintf(raft, buf, TT_STATIC_BUF_LEN);
+	assert(rc >= 0);
+	(void)rc;
+	return buf;
 }
 
 /** Schedule broadcast of the complete Raft state to all the followers. */
 static void
 raft_schedule_broadcast(struct raft *raft);
 
+/** If there is split vote, the node might reduce the next term delay. */
+static void
+raft_check_split_vote(struct raft *raft);
+
 /** Raft state machine methods. 'sm' stands for State Machine. */
 
 /**
@@ -351,7 +407,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source)
 	 * persisted long time ago and still broadcasted. Or a vote response.
 	 */
 	if (req->vote != 0) {
-		raft_add_vote(raft, source, req->vote);
+		if (raft_add_vote(raft, source, req->vote))
+		    raft_check_split_vote(raft);
 
 		switch (raft->state) {
 		case RAFT_STATE_FOLLOWER:
@@ -679,6 +736,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term)
 	raft->leader = 0;
 	raft->state = RAFT_STATE_FOLLOWER;
 	memset(raft->votes, 0, sizeof(raft->votes));
+	raft->voted_count = 0;
+	raft->max_vote = 0;
 	/*
 	 * The instance could be promoted for the previous term. But promotion
 	 * has no effect on following terms.
@@ -771,6 +830,11 @@ raft_sm_wait_election_end(struct raft *raft)
 				  raft_new_random_election_shift(raft);
 	raft_ev_timer_set(&raft->timer, election_timeout, election_timeout);
 	raft_ev_timer_start(raft_loop(), &raft->timer);
+	/*
+	 * Could start the waiting after a WAL write during which the split vote
+	 * could happen.
+	 */
+	raft_check_split_vote(raft);
 }
 
 static void
@@ -977,6 +1041,8 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum)
 	if (raft->state == RAFT_STATE_CANDIDATE &&
 	    raft_vote_count(raft) >= raft->election_quorum)
 		raft_sm_become_leader(raft);
+	else
+		raft_check_split_vote(raft);
 }
 
 void
@@ -1024,6 +1090,13 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock)
 	raft->vclock = vclock;
 }
 
+void
+raft_cfg_cluster_size(struct raft *raft, int size)
+{
+	raft->cluster_size = size;
+	raft_check_split_vote(raft);
+}
+
 void
 raft_new_term(struct raft *raft)
 {
@@ -1048,6 +1121,50 @@ raft_schedule_broadcast(struct raft *raft)
 	raft_schedule_async(raft);
 }
 
+static void
+raft_check_split_vote(struct raft *raft)
+{
+	/* When leader is known, there is no election. Thus no vote to split. */
+	if (raft->leader != 0)
+		return;
+	/* Not a candidate = can't trigger term bump anyway. */
+	if (!raft->is_candidate)
+		return;
+	/*
+	 * WAL write in progress means the state is changing. All is rechecked
+	 * when it is done.
+	 */
+	if (raft->is_write_in_progress)
+		return;
+	if (!raft_has_split_vote(raft))
+		return;
+	assert(raft_ev_is_active(&raft->timer));
+	/*
+	 * Could be already detected before. The timeout would be updated by now
+	 * then.
+	 */
+	if (raft->timer.repeat < raft->election_timeout)
+		return;
+
+	assert(raft->state == RAFT_STATE_FOLLOWER ||
+	       raft->state == RAFT_STATE_CANDIDATE);
+	struct ev_loop *loop = raft_loop();
+	struct ev_timer *timer = &raft->timer;
+	double delay = raft_new_random_election_shift(raft);
+	/*
+	 * Could be too late to speed up anything - probably the term is almost
+	 * over anyway.
+	 */
+	double remaining = raft_ev_timer_remaining(loop, timer);
+	if (delay >= remaining)
+		delay = remaining;
+	say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
+		 raft_scores_str(raft), delay);
+	raft_ev_timer_stop(loop, timer);
+	raft_ev_timer_set(timer, delay, delay);
+	raft_ev_timer_start(loop, timer);
+}
+
 void
 raft_create(struct raft *raft, const struct raft_vtab *vtab)
 {
@@ -1058,6 +1175,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
 		.election_quorum = 1,
 		.election_timeout = 5,
 		.death_timeout = 5,
+		.cluster_size = VCLOCK_MAX,
 		.vtab = vtab,
 	};
 	raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 1d064bef0..05e373254 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -199,6 +199,10 @@ struct raft {
 	uint32_t vote;
 	/** Statistics which node voted for who. */
 	struct raft_vote votes[VCLOCK_MAX];
+	/** How many nodes voted in the current term. */
+	int voted_count;
+	/** Max vote count given to any node in the current term. */
+	int max_vote;
 	/** Number of votes necessary for successful election. */
 	int election_quorum;
 	/**
@@ -219,6 +223,8 @@ struct raft {
 	 * elections can be started.
 	 */
 	double death_timeout;
+	/** Number of instances registered in the cluster. */
+	int cluster_size;
 	/** Virtual table to perform application-specific actions. */
 	const struct raft_vtab *vtab;
 	/**
@@ -333,6 +339,10 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id);
 void
 raft_cfg_vclock(struct raft *raft, const struct vclock *vclock);
 
+/** Configure number of registered instances in the cluster. */
+void
+raft_cfg_cluster_size(struct raft *raft, int size);
+
 /**
  * Bump the term. When it is persisted, the node checks if there is a leader,
  * and if there is not, a new election is started. That said, this function can
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 9f18f4d8e..2fe6354cd 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1497,10 +1497,308 @@ raft_test_promote_restore(void)
 	raft_finish_test();
 }
 
+static void
+raft_test_split_vote(void)
+{
+	raft_start_test(58);
+	struct raft_node node;
+	raft_node_create(&node);
+
+	/*
+	 * Normal split vote.
+	 */
+	raft_node_cfg_cluster_size(&node, 4);
+	raft_node_cfg_election_quorum(&node, 3);
+	raft_run_next_event();
+
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		2 /* Term. */,
+		1 /* Vote. */,
+		2 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 1}" /* Vclock. */
+	), "elections with a new term");
+
+	/* Make so node 1 has votes 1 and 2. Node 3 has votes 3 and 4. */
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		1 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 1 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "term "
+	   "timeout >= election timeout normally");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		4 /* Source. */
+	), 0, "vote response for 3 from 4");
+
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
+	   "reduced the term timeout");
+
+	raft_run_next_event();
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		3 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "a new term");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "timeout is "
+	   "normal again");
+
+	/*
+	 * Cluster size change can make split vote.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+	raft_run_next_event();
+	is(node.raft.state, RAFT_STATE_CANDIDATE, "is candidate");
+	is(node.raft.vote, 1, "voted for self");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "the vote is "
+	   "not split yet");
+
+	raft_node_cfg_cluster_size(&node, 2);
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "cluster size "
+	   "change makes split vote");
+
+	raft_run_next_event();
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_CANDIDATE /* State. */,
+		0 /* Leader. */,
+		3 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "a new term");
+
+	/*
+	 * Split vote can be even when the leader is already known - then
+	 * nothing to do. Votes are just left from the beginning of the term and
+	 * then probably cluster size reduced a bit.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	/* There is also a vote from 3 for 2. But it wasn't delivered to 1. */
+	is(raft_node_send_leader(&node,
+		2 /* Term. */,
+		2 /* Source. */
+	), 0, "message is accepted");
+	is(node.raft.leader, 2, "other node's leadership is accepted");
+	is(raft_vote_count(&node.raft), 1, "the only own vote was from self");
+
+	raft_node_cfg_cluster_size(&node, 2);
+	ok(node.raft.timer.repeat >= node.raft.death_timeout, "cluster change "
+	   "does not affect the leader's death waiting");
+
+	/*
+	 * Non-candidate should ignore split vote.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 3);
+	raft_node_cfg_is_candidate(&node, false);
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	ok(!raft_ev_is_active(&node.raft.timer), "voter couldn't schedule "
+	    "new term");
+
+	/*
+	 * Split vote can get worse, but it shouldn't lead to new term delay
+	 * restart.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 3);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	double delay = node.raft.timer.repeat;
+	ok(delay < node.raft.election_timeout, "split vote is already "
+	   "inevitable");
+
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 3 from 3");
+
+	is(delay, node.raft.timer.repeat, "split vote got worse, but delay "
+	   "didn't change");
+
+	/*
+	 * Handle split vote when WAL write is in progress.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 2);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_node_block(&node);
+	raft_run_next_event();
+	is(node.raft.term, 1, "old term");
+	is(node.raft.vote, 0, "unused vote");
+	is(node.raft.volatile_term, 2, "new volatile term");
+	is(node.raft.volatile_vote, 1, "new volatile vote");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	raft_node_unblock(&node);
+	is(node.raft.term, 2, "new term");
+	is(node.raft.vote, 1, "voted for self");
+	is(node.raft.volatile_term, 2, "volatile term");
+	is(node.raft.volatile_vote, 1, "volatile vote");
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "found split "
+	   "vote after WAL write");
+
+	raft_run_next_event();
+	is(node.raft.term, 3, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+
+	/*
+	 * Split vote check is disabled when cluster size < quorum. Makes no
+	 * sense to speed the elections up.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 1);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+	   "is not checked for cluster < quorum");
+
+	/*
+	 * Split vote check is disabled when vote count > cluster size. The
+	 * reason is the same as with quorum > cluster size - something is odd,
+	 * more term bumps won't help.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote response for 2 from 3");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		3 /* Vote. */,
+		4 /* Source. */
+	), 0, "vote response for 3 from 4");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		4 /* Vote. */,
+		5 /* Source. */
+	), 0, "vote response for 4 from 5");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote "
+	   "is not checked when vote count > cluster size");
+
+	/*
+	 * Split vote can happen if quorum was suddenly increased.
+	 */
+	raft_node_destroy(&node);
+	raft_node_create(&node);
+	raft_node_cfg_cluster_size(&node, 3);
+	raft_node_cfg_election_quorum(&node, 2);
+
+	raft_run_next_event();
+	is(node.raft.term, 2, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+	is(raft_node_send_vote_response(&node,
+		2 /* Term. */,
+		2 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote response for 2 from 2");
+
+	ok(node.raft.timer.repeat >= node.raft.election_timeout, "not split "
+	   "vote yet");
+
+	raft_node_cfg_election_quorum(&node, 3);
+	ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote "
+	   "after quorum increase");
+
+	raft_run_next_event();
+	is(node.raft.term, 3, "bump term");
+	is(node.raft.vote, 1, "vote for self");
+
+	raft_node_destroy(&node);
+	raft_finish_test();
+}
+
 static int
 main_f(va_list ap)
 {
-	raft_start_test(14);
+	raft_start_test(15);
 
 	(void) ap;
 	fakeev_init();
@@ -1519,6 +1817,7 @@ main_f(va_list ap)
 	raft_test_enable_disable();
 	raft_test_too_long_wal_write();
 	raft_test_promote_restore();
+	raft_test_split_vote();
 
 	fakeev_free();
 
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 3fcdf8180..cfc9be733 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
 	*** main_f ***
-1..14
+1..15
 	*** raft_test_leader_election ***
     1..24
     ok 1 - 1 pending message at start
@@ -262,4 +262,66 @@ ok 13 - subtests
     ok 12 - not a candidate
 ok 14 - subtests
 	*** raft_test_promote_restore: done ***
+	*** raft_test_split_vote ***
+    1..58
+    ok 1 - elections with a new term
+    ok 2 - vote response for 1 from 2
+    ok 3 - vote response for 3 from 3
+    ok 4 - term timeout >= election timeout normally
+    ok 5 - vote response for 3 from 4
+    ok 6 - split vote reduced the term timeout
+    ok 7 - a new term
+    ok 8 - timeout is normal again
+    ok 9 - is candidate
+    ok 10 - voted for self
+    ok 11 - vote response for 2 from 2
+    ok 12 - the vote is not split yet
+    ok 13 - cluster size change makes split vote
+    ok 14 - a new term
+    ok 15 - vote response for 2 from 2
+    ok 16 - message is accepted
+    ok 17 - other node's leadership is accepted
+    ok 18 - the only own vote was from self
+    ok 19 - cluster change does not affect the leader's death waiting
+    ok 20 - vote response for 2 from 2
+    ok 21 - vote response for 3 from 3
+    ok 22 - voter couldn't schedule new term
+    ok 23 - bump term
+    ok 24 - vote for self
+    ok 25 - vote response for 2 from 2
+    ok 26 - split vote is already inevitable
+    ok 27 - vote response for 3 from 3
+    ok 28 - split vote got worse, but delay didn't change
+    ok 29 - old term
+    ok 30 - unused vote
+    ok 31 - new volatile term
+    ok 32 - new volatile vote
+    ok 33 - vote response for 2 from 2
+    ok 34 - new term
+    ok 35 - voted for self
+    ok 36 - volatile term
+    ok 37 - volatile vote
+    ok 38 - found split vote after WAL write
+    ok 39 - bump term
+    ok 40 - vote for self
+    ok 41 - bump term
+    ok 42 - vote for self
+    ok 43 - vote response for 2 from 2
+    ok 44 - split vote is not checked for cluster < quorum
+    ok 45 - bump term
+    ok 46 - vote for self
+    ok 47 - vote response for 2 from 2
+    ok 48 - vote response for 2 from 3
+    ok 49 - vote response for 3 from 4
+    ok 50 - vote response for 4 from 5
+    ok 51 - split vote is not checked when vote count > cluster size
+    ok 52 - bump term
+    ok 53 - vote for self
+    ok 54 - vote response for 2 from 2
+    ok 55 - not split vote yet
+    ok 56 - split vote after quorum increase
+    ok 57 - bump term
+    ok 58 - vote for self
+ok 15 - subtests
+	*** raft_test_split_vote: done ***
 	*** main_f: done ***
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b5754cf78..1d7dd01ba 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -194,6 +194,7 @@ raft_node_create(struct raft_node *node)
 	node->cfg_election_quorum = 3;
 	node->cfg_death_timeout = 5;
 	node->cfg_instance_id = 1;
+	node->cfg_cluster_size = VCLOCK_MAX;
 	node->cfg_vclock = &node->journal.vclock;
 	raft_journal_create(&node->journal, node->cfg_instance_id);
 	raft_node_start(node);
@@ -365,6 +366,7 @@ raft_node_start(struct raft_node *node)
 	raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
 	raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
 	raft_cfg_instance_id(&node->raft, node->cfg_instance_id);
+	raft_cfg_cluster_size(&node->raft, node->cfg_cluster_size);
 	raft_cfg_vclock(&node->raft, node->cfg_vclock);
 	raft_run_async_work();
 }
@@ -423,6 +425,16 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
 	}
 }
 
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value)
+{
+	node->cfg_cluster_size = value;
+	if (raft_node_is_started(node)) {
+		raft_cfg_cluster_size(&node->raft, value);
+		raft_run_async_work();
+	}
+}
+
 void
 raft_node_cfg_election_timeout(struct raft_node *node, double value)
 {
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index c68dc3b22..2138a829e 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -32,6 +32,7 @@
 #include "fakesys/fakeev.h"
 #include "fiber.h"
 #include "raft/raft.h"
+#include "raft/raft_ev.h"
 #include "unit.h"
 
 /** WAL simulation. It stores a list of rows which raft wanted to persist. */
@@ -105,6 +106,7 @@ struct raft_node {
 	int cfg_election_quorum;
 	double cfg_death_timeout;
 	uint32_t cfg_instance_id;
+	int cfg_cluster_size;
 	struct vclock *cfg_vclock;
 };
 
@@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value);
 void
 raft_node_cfg_is_candidate(struct raft_node *node, bool value);
 
+void
+raft_node_cfg_cluster_size(struct raft_node *node, int value);
+
 void
 raft_node_cfg_election_timeout(struct raft_node *node, double value);
 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection
  2022-01-20 23:02     ` Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-25 10:17       ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-25 10:17 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches



21.01.2022 02:02, Vladislav Shpilevoy пишет:
>>> +static void
>>> +raft_check_split_vote(struct raft *raft)
>>> +{
>>> +    /* When leader is known, there is no election. Thus no vote to split. */
>>> +    if (raft->leader != 0)
>>> +        return;
>>> +    /* Not a candidate = can't trigger term bump anyway. */
>>> +    if (!raft->is_candidate)
>>> +        return;
>>> +    /*
>>> +     * WAL write in progress means the state is changing. All is rechecked
>>> +     * when it is done.
>>> +     */
>>> +    if (raft->is_write_in_progress)
>>> +        return;
>>> +    if (!raft_has_split_vote(raft))
>>> +        return;
>>> +    assert(raft_ev_is_active(&raft->timer));
>>> +    /*
>>> +     * Could be already detected before. The timeout would be updated by now
>>> +     * then.
>>> +     */
>>> +    if (raft->timer.repeat < raft->election_timeout)
>>> +        return;
>> I don't think you should decrease timer.repeat.
>> This 'vote speedup' is for a single term only.
>>
>> Besides the check below about delay >= remaining is enough
>> to test if split vote detection was already triggered.
> I update timer.repeat as a flag that the split vote was already seen
> during this term. If I drop this check, each next vote after the first
> detection of split vote will lead to potential timeout decrease depending
> on random. I wanted to decrease it only once - when split vote is seen
> first time. Even managed to add a test for it.
>
> The alternative was to store a flag 'has_split_vote' somewhere in struct
> raft, but I didn't want to add it.

Ok, I see now. Thanks for the explanation!

>
>>> +
>>> +    assert(raft->state == RAFT_STATE_FOLLOWER ||
>>> +           raft->state == RAFT_STATE_CANDIDATE);
>>> +    struct ev_loop *loop = raft_loop();
>>> +    struct ev_timer *timer = &raft->timer;
>>> +    double delay = raft_new_random_election_shift(raft);
>>> +    /*
>>> +     * Could be too late to speed up anything - probably the term is almost
>>> +     * over anyway.
>>> +     */
>>> +    double remaining = raft_ev_timer_remaining(loop, timer);
>>> +    if (delay >= remaining)
>>> +        delay = remaining;
>>> +    say_info("RAFT: split vote is discovered - %s, new term in %lf sec",
>>> +         raft_scores_str(raft), delay);
>>> +    raft_ev_timer_stop(loop, timer);
>>> +    raft_ev_timer_set(timer, delay, delay);
>>
>> ...
>>> diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
>>> index c68dc3b22..2138a829e 100644
>>> --- a/test/unit/raft_test_utils.h
>>> +++ b/test/unit/raft_test_utils.h
>>> @@ -32,6 +32,7 @@
>>>    #include "fakesys/fakeev.h"
>>>    #include "fiber.h"
>>>    #include "raft/raft.h"
>>> +#include "raft/raft_ev.h"
>> Why do you need it here?
> The tests now use raft_ev_is_active() which is defined in this header. I
> decided to add it here instead of unit/raft.c because the idea of
> raft_test_utils.h is, among other things, to do all the boilerplate work
> such as necessary inclusions.

Ok.

>
> New version of the commit after the comment from v1 about new raft
> members:
>
> ====================
>      raft: introduce split vote detection

Thanks for the fixes! LGTM.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
                   ` (4 preceding siblings ...)
  2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 5/5] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
@ 2022-01-25 10:18 ` Serge Petrenko via Tarantool-patches
  2022-01-25 22:51 ` Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 0 replies; 11+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2022-01-25 10:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches



20.01.2022 03:43, Vladislav Shpilevoy пишет:
> Split vote handling in Raft, its usage in storage, and 2 bug fixes found while
> working on this.
>
> Changes in v2:
> - dropped ev_timer.at usage;
> - disabled split-vote detection for cluster < quorum;
> - disabled split-vote for vote count > cluster;
> - re-checked split vote after WAL write;
> - re-checked split vote on quorum change.
>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5285-raft-split-vote
> Issue: https://github.com/tarantool/tarantool/issues/5285
>
> Vladislav Shpilevoy (5):
>    raft: fix crash on election_timeout reconfig
>    raft: fix ev_timer.at incorrect usage
>    raft: track all votes, even not own
>    raft: introduce split vote detection
>    election: activate raft split vote handling
>
>   .../unreleased/election-timeout-cfg-crash.md  |   5 +
>   src/box/raft.c                                |   4 +-
>   src/lib/fakesys/fakeev.c                      |  10 +-
>   src/lib/raft/raft.c                           | 222 ++++++++--
>   src/lib/raft/raft.h                           |  32 +-
>   .../election_split_vote_test.lua              |  92 ++++
>   test/unit/raft.c                              | 393 +++++++++++++++++-
>   test/unit/raft.result                         |  87 +++-
>   test/unit/raft_test_utils.c                   |  12 +
>   test/unit/raft_test_utils.h                   |   5 +
>   10 files changed, 803 insertions(+), 59 deletions(-)
>   create mode 100644 changelogs/unreleased/election-timeout-cfg-crash.md
>   create mode 100644 test/replication-luatest/election_split_vote_test.lua
>
LGTM. Thanks!

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs
  2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
                   ` (5 preceding siblings ...)
  2022-01-25 10:18 ` [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Serge Petrenko via Tarantool-patches
@ 2022-01-25 22:51 ` Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 0 replies; 11+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2022-01-25 22:51 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Pushed to master.

The first 2 commits are also cherry-picked to 2.8.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-25 22:51 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-20  0:43 [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Vladislav Shpilevoy via Tarantool-patches
2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 1/5] raft: fix crash on election_timeout reconfig Vladislav Shpilevoy via Tarantool-patches
2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 2/5] raft: fix ev_timer.at incorrect usage Vladislav Shpilevoy via Tarantool-patches
2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 3/5] raft: track all votes, even not own Vladislav Shpilevoy via Tarantool-patches
2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection Vladislav Shpilevoy via Tarantool-patches
2022-01-20 13:22   ` Serge Petrenko via Tarantool-patches
2022-01-20 23:02     ` Vladislav Shpilevoy via Tarantool-patches
2022-01-25 10:17       ` Serge Petrenko via Tarantool-patches
2022-01-20  0:43 ` [Tarantool-patches] [PATCH v2 5/5] election: activate raft split vote handling Vladislav Shpilevoy via Tarantool-patches
2022-01-25 10:18 ` [Tarantool-patches] [PATCH v2 0/5] Split vote and bugs Serge Petrenko via Tarantool-patches
2022-01-25 22:51 ` Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox