Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 0/3] Raft follow-ups
@ 2020-09-25 21:29 Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 1/3] [tosquash] raft: vote without a new term when possible Vladislav Shpilevoy
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-25 21:29 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The patchset is a couple of fixes for the existing Raft code, and
also a new leader election policy - the third patch makes so an
instance restart does not lead to a new election immediately
anymore.

See the commit message and comments why.

Branch: http://github.com/tarantool/tarantool/tree/gh-1146-raft
Issue: https://github.com/tarantool/tarantool/issues/1146

Vladislav Shpilevoy (3):
  [tosquash] raft: vote without a new term when possible
  [tosquash] raft: fix 99% CPU issue in raft io worker
  [tosquash] raft: don't start new election immediately after restart

 src/box/box.cc | 12 ++++++++++-
 src/box/raft.c | 54 +++++++++++++++++++++++++++++++++++++++++++-------
 src/box/raft.h |  8 ++++++++
 3 files changed, 66 insertions(+), 8 deletions(-)

-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Tarantool-patches] [PATCH 1/3] [tosquash] raft: vote without a new term when possible
  2020-09-25 21:29 [Tarantool-patches] [PATCH 0/3] Raft follow-ups Vladislav Shpilevoy
@ 2020-09-25 21:29 ` Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 2/3] [tosquash] raft: fix 99% CPU issue in raft io worker Vladislav Shpilevoy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-25 21:29 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

When instance wrote something to WAL and saw that there is no a
leader, and no vote, it started new term + wrote a vote for self.
Really the new term wasn't necessary. Since the node didn't vote
in the current term yet, it can vote right away.

It is not about saving term numbers. More about making the logs
more understandable. Otherwise it looks strange, that in some
terms a node didn't participate at all and just skipped them even
though it started tehm.
---
 src/box/raft.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index ce90ed533..9ef03d8c4 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -584,7 +584,7 @@ end_dump:
 			raft_sm_wait_election_end();
 		} else {
 			/* No leaders, no votes. */
-			raft_sm_schedule_new_election();
+			raft_sm_schedule_new_vote(instance_id);
 		}
 	} else {
 		memset(&req, 0, sizeof(req));
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Tarantool-patches] [PATCH 2/3] [tosquash] raft: fix 99% CPU issue in raft io worker
  2020-09-25 21:29 [Tarantool-patches] [PATCH 0/3] Raft follow-ups Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 1/3] [tosquash] raft: vote without a new term when possible Vladislav Shpilevoy
@ 2020-09-25 21:29 ` Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 3/3] [tosquash] raft: don't start new election immediately after restart Vladislav Shpilevoy
  2020-09-28  7:09 ` [Tarantool-patches] [PATCH 0/3] Raft follow-ups Serge Petrenko
  3 siblings, 0 replies; 5+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-25 21:29 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The idle flag was never set to true after being set to false. So
the fiber infinitely called fiber_sleep(0) when didn't have
anything else to do.
---
 src/box/raft.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/box/raft.c b/src/box/raft.c
index 9ef03d8c4..8e5f7803c 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -631,8 +631,9 @@ static int
 raft_worker_f(va_list args)
 {
 	(void)args;
-	bool is_idle = true;
+	bool is_idle;
 	while (!fiber_is_cancelled()) {
+		is_idle = true;
 		if (raft.is_write_in_progress) {
 			raft_worker_handle_io();
 			is_idle = false;
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Tarantool-patches] [PATCH 3/3] [tosquash] raft: don't start new election immediately after restart
  2020-09-25 21:29 [Tarantool-patches] [PATCH 0/3] Raft follow-ups Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 1/3] [tosquash] raft: vote without a new term when possible Vladislav Shpilevoy
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 2/3] [tosquash] raft: fix 99% CPU issue in raft io worker Vladislav Shpilevoy
@ 2020-09-25 21:29 ` Vladislav Shpilevoy
  2020-09-28  7:09 ` [Tarantool-patches] [PATCH 0/3] Raft follow-ups Serge Petrenko
  3 siblings, 0 replies; 5+ messages in thread
From: Vladislav Shpilevoy @ 2020-09-25 21:29 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

That creates problems in a cluster when followers restart.
Because if they would bump terms after restart, they would turn
the current leader into follower, therefore interrupting the
cluster progression.

Better let them wait for the leader death timeout if a leader
will appear. Because this is what will happen in 99% cases.
---
 src/box/box.cc | 12 +++++++++++-
 src/box/raft.c | 49 ++++++++++++++++++++++++++++++++++++++++++++-----
 src/box/raft.h |  8 ++++++++
 3 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 1169f2cd5..957bf4d7a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2799,8 +2799,18 @@ box_cfg_xc(void)
 	title("running");
 	say_info("ready to accept requests");
 
-	if (!is_bootstrap_leader)
+	if (!is_bootstrap_leader) {
 		replicaset_sync();
+	} else {
+		/*
+		 * When the cluster is just bootstrapped and this instance is a
+		 * leader, it makes no sense to wait for a leader appearance.
+		 * There is no one. Moreover this node *is* a leader, so it
+		 * should take the control over the situation and start a new
+		 * term immediately.
+		 */
+		raft_new_term();
+	}
 
 	/* box.cfg.read_only is not read yet. */
 	assert(box_is_ro());
diff --git a/src/box/raft.c b/src/box/raft.c
index 8e5f7803c..608722824 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -209,6 +209,17 @@ raft_sm_stop(void);
 static void
 raft_sm_wait_leader_dead(void);
 
+/**
+ * Wait for the leader death timeout until a leader lets the node know he is
+ * alive. Otherwise the node will start a new term. Can be useful when it is not
+ * known whether the leader is alive, but it is undesirable to start a new term
+ * immediately. Because in case the leader is alive, a new term would stun him
+ * and therefore would stun DB write requests. Usually happens when a follower
+ * restarts and may need some time to hear something from the leader.
+ */
+static void
+raft_sm_wait_leader_found(void);
+
 /**
  * If election is started by this node, or it voted for some other node started
  * the election, and it can be a leader itself, it will wait until the current
@@ -785,6 +796,19 @@ raft_sm_wait_leader_dead(void)
 	ev_timer_start(loop(), &raft.timer);
 }
 
+static void
+raft_sm_wait_leader_found(void)
+{
+	assert(!ev_is_active(&raft.timer));
+	assert(!raft.is_write_in_progress);
+	assert(raft.is_candidate);
+	assert(raft.state == RAFT_STATE_FOLLOWER);
+	assert(raft.leader == 0);
+	double death_timeout = replication_disconnect_timeout();
+	ev_timer_set(&raft.timer, death_timeout, death_timeout);
+	ev_timer_start(loop(), &raft.timer);
+}
+
 static void
 raft_sm_wait_election_end(void)
 {
@@ -811,12 +835,20 @@ raft_sm_start(void)
 	assert(raft.state == RAFT_STATE_FOLLOWER);
 	raft.is_enabled = true;
 	raft.is_candidate = raft.is_cfg_candidate;
-	if (!raft.is_candidate)
+	if (!raft.is_candidate) {
 		/* Nop. */;
-	else if (raft.leader != 0)
+	} else if (raft.leader != 0) {
 		raft_sm_wait_leader_dead();
-	else
-		raft_sm_schedule_new_election();
+	} else {
+		/*
+		 * Don't start new election. The situation is most likely
+		 * happened because this node was restarted. Instance restarts
+		 * may happen in the cluster, and each restart shouldn't
+		 * disturb the current leader. Give it time to notify this node
+		 * that there is a leader.
+		 */
+		raft_sm_wait_leader_found();
+	}
 	box_update_ro_summary();
 }
 
@@ -892,7 +924,7 @@ raft_cfg_is_candidate(bool is_candidate)
 		 * node who sent newer data to this node.
 		 */
 		if (raft.leader == 0 && raft_is_fully_on_disk())
-			raft_sm_schedule_new_election();
+			raft_sm_wait_leader_found();
 	} else if (raft.state != RAFT_STATE_FOLLOWER) {
 		if (raft.state == RAFT_STATE_LEADER)
 			raft.leader = 0;
@@ -946,6 +978,13 @@ raft_cfg_death_timeout(void)
 	}
 }
 
+void
+raft_new_term(void)
+{
+	if (raft.is_enabled)
+		raft_sm_schedule_new_term(raft.volatile_term + 1);
+}
+
 static void
 raft_schedule_broadcast(void)
 {
diff --git a/src/box/raft.h b/src/box/raft.h
index 8c1cf467f..be77a5473 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -229,6 +229,14 @@ raft_cfg_election_quorum(void);
 void
 raft_cfg_death_timeout(void);
 
+/**
+ * Bump the term. When it is persisted, the node checks if there is a leader,
+ * and if there is not, a new election is started. That said, this function can
+ * be used as tool to forcefully start new election, or restart an existing.
+ */
+void
+raft_new_term(void);
+
 /**
  * Save complete Raft state into a request to be sent to other instances of the
  * cluster. It is allowed to save anything here, not only persistent state.
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] Raft follow-ups
  2020-09-25 21:29 [Tarantool-patches] [PATCH 0/3] Raft follow-ups Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2020-09-25 21:29 ` [Tarantool-patches] [PATCH 3/3] [tosquash] raft: don't start new election immediately after restart Vladislav Shpilevoy
@ 2020-09-28  7:09 ` Serge Petrenko
  3 siblings, 0 replies; 5+ messages in thread
From: Serge Petrenko @ 2020-09-28  7:09 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


26.09.2020 00:29, Vladislav Shpilevoy пишет:
> The patchset is a couple of fixes for the existing Raft code, and
> also a new leader election policy - the third patch makes so an
> instance restart does not lead to a new election immediately
> anymore.
>
> See the commit message and comments why.
>
> Branch: http://github.com/tarantool/tarantool/tree/gh-1146-raft
> Issue: https://github.com/tarantool/tarantool/issues/1146
>
> Vladislav Shpilevoy (3):
>    [tosquash] raft: vote without a new term when possible
>    [tosquash] raft: fix 99% CPU issue in raft io worker
>    [tosquash] raft: don't start new election immediately after restart
>
>   src/box/box.cc | 12 ++++++++++-
>   src/box/raft.c | 54 +++++++++++++++++++++++++++++++++++++++++++-------
>   src/box/raft.h |  8 ++++++++
>   3 files changed, 66 insertions(+), 8 deletions(-)

Hi! Thanks for the fixes!

The patchset LGTM.

>
-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-09-28  7:09 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-25 21:29 [Tarantool-patches] [PATCH 0/3] Raft follow-ups Vladislav Shpilevoy
2020-09-25 21:29 ` [Tarantool-patches] [PATCH 1/3] [tosquash] raft: vote without a new term when possible Vladislav Shpilevoy
2020-09-25 21:29 ` [Tarantool-patches] [PATCH 2/3] [tosquash] raft: fix 99% CPU issue in raft io worker Vladislav Shpilevoy
2020-09-25 21:29 ` [Tarantool-patches] [PATCH 3/3] [tosquash] raft: don't start new election immediately after restart Vladislav Shpilevoy
2020-09-28  7:09 ` [Tarantool-patches] [PATCH 0/3] Raft follow-ups Serge Petrenko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox