Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write
@ 2021-06-07 20:48 Vladislav Shpilevoy via Tarantool-patches
  2021-06-10 14:24 ` Serge Petrenko via Tarantool-patches
  2021-06-10 20:01 ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 3+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-07 20:48 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

If Raft state machine sees the current leader has explicitly
resigned from its role, it starts a new election round right away.

But in the code starting a new round there was an assumption that
there is no a volatile state. There was, in fact.

The patch makes the election start code use the volatile state to
bump the term. It should be safe, because the other nodes won't
receive it anyway until the new term is persisted.

There was an alternative - do not schedule new election until the
current WAL write ends. It wasn't done, because would achieve the
same (the term would be bumped and persisted) but with bigger a
latency.

Another reason is that if the leader would appear and resign
during WAL write on another candidate, in the end of its WAL write
the latter would see 0 leader and would think this term didn't
have one yet. And would try to elect self now, in the current
term. It makes little sense, because it won't win - the current
term had already had a leader and the majority of votes is
already taken.

Closes #6129
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash
Issue: https://github.com/tarantool/tarantool/issues/6129

 .../gh-6129-raft-resign-during-wal.md         |  7 ++
 src/lib/raft/raft.c                           |  8 +-
 test/unit/raft.c                              | 80 ++++++++++++++++++-
 test/unit/raft.result                         | 10 ++-
 4 files changed, 101 insertions(+), 4 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md

diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
new file mode 100644
index 000000000..eb6cf3984
--- /dev/null
+++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
@@ -0,0 +1,7 @@
+## bugfix/raft
+
+* Fixed a rare crash with the leader election enabled (any mode except `off`),
+  which could happen if a leader resigned from its role at the same time as some
+  other node was writing something related to the elections to WAL. The crash
+  was in debug build and in the release build it would lead to undefined
+  behaviour (gh-6129).
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 247c7a71a..eacdddb7e 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -687,10 +687,9 @@ static void
 raft_sm_schedule_new_election(struct raft *raft)
 {
 	say_info("RAFT: begin new election round");
-	assert(raft_is_fully_on_disk(raft));
 	assert(raft->is_candidate);
 	/* Everyone is a follower until its vote for self is persisted. */
-	raft_sm_schedule_new_term(raft, raft->term + 1);
+	raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
 	raft_sm_schedule_new_vote(raft, raft->self);
 }
 
@@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer,
 	(void)events;
 	struct raft *raft = timer->data;
 	assert(timer == &raft->timer);
+	/*
+	 * Otherwise the timer would be stopped and the callback wouldn't be
+	 * invoked.
+	 */
+	assert(raft_is_fully_on_disk(raft));
 	raft_ev_timer_stop(loop, timer);
 	raft_sm_schedule_new_election(raft);
 }
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 29e48da13..6369c42d3 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -572,7 +572,7 @@ raft_test_vote_skip(void)
 static void
 raft_test_leader_resign(void)
 {
-	raft_start_test(15);
+	raft_start_test(23);
 	struct raft_node node;
 
 	/*
@@ -672,6 +672,84 @@ raft_test_leader_resign(void)
 		NULL /* Vclock. */
 	), "resign notification is sent");
 
+	/*
+	 * gh-6129: resign of a remote leader during a local WAL write should
+	 * schedule a new election after the WAL write.
+	 *
+	 * Firstly start a new term.
+	 */
+	raft_node_block(&node);
+	raft_node_cfg_is_candidate(&node, true);
+	raft_run_next_event();
+	/* Volatile term is new, but the persistent one is not updated yet. */
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_FOLLOWER /* State. */,
+		0 /* Leader. */,
+		2 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 1}" /* Vclock. */
+	), "new election is waiting for WAL write");
+
+	/* Now another node wins the election earlier. */
+	is(raft_node_send_leader(&node,
+		3 /* Term. */,
+		2 /* Source. */
+	), 0, "message is accepted");
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_FOLLOWER /* State. */,
+		2 /* Leader. */,
+		2 /* Term. */,
+		1 /* Vote. */,
+		3 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 1}" /* Vclock. */
+	), "the leader is accepted");
+
+	/*
+	 * The leader resigns and triggers a new election round on the first
+	 * node. A new election is triggered, but still waiting for the previous
+	 * WAL write to end.
+	 */
+	is(raft_node_send_follower(&node,
+		3 /* Term. */,
+		2 /* Source. */
+	), 0, "message is accepted");
+	/* Note how the volatile term is updated again. */
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_FOLLOWER /* State. */,
+		0 /* Leader. */,
+		2 /* Term. */,
+		1 /* Vote. */,
+		4 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 1}" /* Vclock. */
+	), "the leader has resigned, new election is scheduled");
+	raft_node_unblock(&node);
+
+	/* Ensure the node still collects votes after the WAL write. */
+	is(raft_node_send_vote_response(&node,
+		4 /* Term. */,
+		1 /* Vote. */,
+		2 /* Source. */
+	), 0, "vote from 2");
+	is(raft_node_send_vote_response(&node,
+		4 /* Term. */,
+		1 /* Vote. */,
+		3 /* Source. */
+	), 0, "vote from 3");
+	raft_run_next_event();
+	ok(raft_node_check_full_state(&node,
+		RAFT_STATE_LEADER /* State. */,
+		1 /* Leader. */,
+		4 /* Term. */,
+		1 /* Vote. */,
+		4 /* Volatile term. */,
+		1 /* Volatile vote. */,
+		"{0: 2}" /* Vclock. */
+	), "the leader is elected");
+
 	raft_node_destroy(&node);
 
 	raft_finish_test();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 3a3dc5dd2..598a7e760 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -111,7 +111,7 @@ ok 4 - subtests
 ok 5 - subtests
 	*** raft_test_vote_skip: done ***
 	*** raft_test_leader_resign ***
-    1..15
+    1..23
     ok 1 - message is accepted
     ok 2 - leader is elected
     ok 3 - message is accepted
@@ -127,6 +127,14 @@ ok 5 - subtests
     ok 13 - became leader
     ok 14 - the leader has resigned
     ok 15 - resign notification is sent
+    ok 16 - new election is waiting for WAL write
+    ok 17 - message is accepted
+    ok 18 - the leader is accepted
+    ok 19 - message is accepted
+    ok 20 - the leader has resigned, new election is scheduled
+    ok 21 - vote from 2
+    ok 22 - vote from 3
+    ok 23 - the leader is elected
 ok 6 - subtests
 	*** raft_test_leader_resign: done ***
 	*** raft_test_split_brain ***
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write
  2021-06-07 20:48 [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-10 14:24 ` Serge Petrenko via Tarantool-patches
  2021-06-10 20:01 ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 3+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-10 14:24 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



07.06.2021 23:48, Vladislav Shpilevoy пишет:
> If Raft state machine sees the current leader has explicitly
> resigned from its role, it starts a new election round right away.
>
> But in the code starting a new round there was an assumption that
> there is no a volatile state. There was, in fact.
>
> The patch makes the election start code use the volatile state to
> bump the term. It should be safe, because the other nodes won't
> receive it anyway until the new term is persisted.
>
> There was an alternative - do not schedule new election until the
> current WAL write ends. It wasn't done, because would achieve the
> same (the term would be bumped and persisted) but with bigger a
> latency.
>
> Another reason is that if the leader would appear and resign
> during WAL write on another candidate, in the end of its WAL write
> the latter would see 0 leader and would think this term didn't
> have one yet. And would try to elect self now, in the current
> term. It makes little sense, because it won't win - the current
> term had already had a leader and the majority of votes is
> already taken.
>
> Closes #6129

Thanks for the patch!
LGTM.

> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash
> Issue: https://github.com/tarantool/tarantool/issues/6129
>
>   .../gh-6129-raft-resign-during-wal.md         |  7 ++
>   src/lib/raft/raft.c                           |  8 +-
>   test/unit/raft.c                              | 80 ++++++++++++++++++-
>   test/unit/raft.result                         | 10 ++-
>   4 files changed, 101 insertions(+), 4 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md
>
> diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> new file mode 100644
> index 000000000..eb6cf3984
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> @@ -0,0 +1,7 @@
> +## bugfix/raft
> +
> +* Fixed a rare crash with the leader election enabled (any mode except `off`),
> +  which could happen if a leader resigned from its role at the same time as some
> +  other node was writing something related to the elections to WAL. The crash
> +  was in debug build and in the release build it would lead to undefined
> +  behaviour (gh-6129).
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index 247c7a71a..eacdddb7e 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -687,10 +687,9 @@ static void
>   raft_sm_schedule_new_election(struct raft *raft)
>   {
>   	say_info("RAFT: begin new election round");
> -	assert(raft_is_fully_on_disk(raft));
>   	assert(raft->is_candidate);
>   	/* Everyone is a follower until its vote for self is persisted. */
> -	raft_sm_schedule_new_term(raft, raft->term + 1);
> +	raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
>   	raft_sm_schedule_new_vote(raft, raft->self);
>   }
>   
> @@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer,
>   	(void)events;
>   	struct raft *raft = timer->data;
>   	assert(timer == &raft->timer);
> +	/*
> +	 * Otherwise the timer would be stopped and the callback wouldn't be
> +	 * invoked.
> +	 */
> +	assert(raft_is_fully_on_disk(raft));
>   	raft_ev_timer_stop(loop, timer);
>   	raft_sm_schedule_new_election(raft);
>   }
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index 29e48da13..6369c42d3 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -572,7 +572,7 @@ raft_test_vote_skip(void)
>   static void
>   raft_test_leader_resign(void)
>   {
> -	raft_start_test(15);
> +	raft_start_test(23);
>   	struct raft_node node;
>   
>   	/*
> @@ -672,6 +672,84 @@ raft_test_leader_resign(void)
>   		NULL /* Vclock. */
>   	), "resign notification is sent");
>   
> +	/*
> +	 * gh-6129: resign of a remote leader during a local WAL write should
> +	 * schedule a new election after the WAL write.
> +	 *
> +	 * Firstly start a new term.
> +	 */
> +	raft_node_block(&node);
> +	raft_node_cfg_is_candidate(&node, true);
> +	raft_run_next_event();
> +	/* Volatile term is new, but the persistent one is not updated yet. */
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		0 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "new election is waiting for WAL write");
> +
> +	/* Now another node wins the election earlier. */
> +	is(raft_node_send_leader(&node,
> +		3 /* Term. */,
> +		2 /* Source. */
> +	), 0, "message is accepted");
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		2 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "the leader is accepted");
> +
> +	/*
> +	 * The leader resigns and triggers a new election round on the first
> +	 * node. A new election is triggered, but still waiting for the previous
> +	 * WAL write to end.
> +	 */
> +	is(raft_node_send_follower(&node,
> +		3 /* Term. */,
> +		2 /* Source. */
> +	), 0, "message is accepted");
> +	/* Note how the volatile term is updated again. */
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		0 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		4 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "the leader has resigned, new election is scheduled");
> +	raft_node_unblock(&node);
> +
> +	/* Ensure the node still collects votes after the WAL write. */
> +	is(raft_node_send_vote_response(&node,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		2 /* Source. */
> +	), 0, "vote from 2");
> +	is(raft_node_send_vote_response(&node,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Source. */
> +	), 0, "vote from 3");
> +	raft_run_next_event();
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_LEADER /* State. */,
> +		1 /* Leader. */,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		4 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 2}" /* Vclock. */
> +	), "the leader is elected");
> +
>   	raft_node_destroy(&node);
>   
>   	raft_finish_test();
> diff --git a/test/unit/raft.result b/test/unit/raft.result
> index 3a3dc5dd2..598a7e760 100644
> --- a/test/unit/raft.result
> +++ b/test/unit/raft.result
> @@ -111,7 +111,7 @@ ok 4 - subtests
>   ok 5 - subtests
>   	*** raft_test_vote_skip: done ***
>   	*** raft_test_leader_resign ***
> -    1..15
> +    1..23
>       ok 1 - message is accepted
>       ok 2 - leader is elected
>       ok 3 - message is accepted
> @@ -127,6 +127,14 @@ ok 5 - subtests
>       ok 13 - became leader
>       ok 14 - the leader has resigned
>       ok 15 - resign notification is sent
> +    ok 16 - new election is waiting for WAL write
> +    ok 17 - message is accepted
> +    ok 18 - the leader is accepted
> +    ok 19 - message is accepted
> +    ok 20 - the leader has resigned, new election is scheduled
> +    ok 21 - vote from 2
> +    ok 22 - vote from 3
> +    ok 23 - the leader is elected
>   ok 6 - subtests
>   	*** raft_test_leader_resign: done ***
>   	*** raft_test_split_brain ***

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write
  2021-06-07 20:48 [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write Vladislav Shpilevoy via Tarantool-patches
  2021-06-10 14:24 ` Serge Petrenko via Tarantool-patches
@ 2021-06-10 20:01 ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 3+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-10 20:01 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Thanks for the review!

Pushed to master, 2.8, 2.7.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-10 20:01 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-07 20:48 [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write Vladislav Shpilevoy via Tarantool-patches
2021-06-10 14:24 ` Serge Petrenko via Tarantool-patches
2021-06-10 20:01 ` Vladislav Shpilevoy via Tarantool-patches

Tarantool development patches archive

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://lists.tarantool.org/tarantool-patches/0 tarantool-patches/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 tarantool-patches tarantool-patches/ https://lists.tarantool.org/tarantool-patches \
		tarantool-patches@dev.tarantool.org.
	public-inbox-index tarantool-patches

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git