[Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write

Serge Petrenko sergepetrenko at tarantool.org
Thu Jun 10 17:24:44 MSK 2021



07.06.2021 23:48, Vladislav Shpilevoy пишет:
> If Raft state machine sees the current leader has explicitly
> resigned from its role, it starts a new election round right away.
>
> But in the code starting a new round there was an assumption that
> there is no a volatile state. There was, in fact.
>
> The patch makes the election start code use the volatile state to
> bump the term. It should be safe, because the other nodes won't
> receive it anyway until the new term is persisted.
>
> There was an alternative - do not schedule new election until the
> current WAL write ends. It wasn't done, because would achieve the
> same (the term would be bumped and persisted) but with bigger a
> latency.
>
> Another reason is that if the leader would appear and resign
> during WAL write on another candidate, in the end of its WAL write
> the latter would see 0 leader and would think this term didn't
> have one yet. And would try to elect self now, in the current
> term. It makes little sense, because it won't win - the current
> term had already had a leader and the majority of votes is
> already taken.
>
> Closes #6129

Thanks for the patch!
LGTM.

> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash
> Issue: https://github.com/tarantool/tarantool/issues/6129
>
>   .../gh-6129-raft-resign-during-wal.md         |  7 ++
>   src/lib/raft/raft.c                           |  8 +-
>   test/unit/raft.c                              | 80 ++++++++++++++++++-
>   test/unit/raft.result                         | 10 ++-
>   4 files changed, 101 insertions(+), 4 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md
>
> diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> new file mode 100644
> index 000000000..eb6cf3984
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> @@ -0,0 +1,7 @@
> +## bugfix/raft
> +
> +* Fixed a rare crash with the leader election enabled (any mode except `off`),
> +  which could happen if a leader resigned from its role at the same time as some
> +  other node was writing something related to the elections to WAL. The crash
> +  was in debug build and in the release build it would lead to undefined
> +  behaviour (gh-6129).
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index 247c7a71a..eacdddb7e 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -687,10 +687,9 @@ static void
>   raft_sm_schedule_new_election(struct raft *raft)
>   {
>   	say_info("RAFT: begin new election round");
> -	assert(raft_is_fully_on_disk(raft));
>   	assert(raft->is_candidate);
>   	/* Everyone is a follower until its vote for self is persisted. */
> -	raft_sm_schedule_new_term(raft, raft->term + 1);
> +	raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
>   	raft_sm_schedule_new_vote(raft, raft->self);
>   }
>   
> @@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer,
>   	(void)events;
>   	struct raft *raft = timer->data;
>   	assert(timer == &raft->timer);
> +	/*
> +	 * Otherwise the timer would be stopped and the callback wouldn't be
> +	 * invoked.
> +	 */
> +	assert(raft_is_fully_on_disk(raft));
>   	raft_ev_timer_stop(loop, timer);
>   	raft_sm_schedule_new_election(raft);
>   }
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index 29e48da13..6369c42d3 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -572,7 +572,7 @@ raft_test_vote_skip(void)
>   static void
>   raft_test_leader_resign(void)
>   {
> -	raft_start_test(15);
> +	raft_start_test(23);
>   	struct raft_node node;
>   
>   	/*
> @@ -672,6 +672,84 @@ raft_test_leader_resign(void)
>   		NULL /* Vclock. */
>   	), "resign notification is sent");
>   
> +	/*
> +	 * gh-6129: resign of a remote leader during a local WAL write should
> +	 * schedule a new election after the WAL write.
> +	 *
> +	 * Firstly start a new term.
> +	 */
> +	raft_node_block(&node);
> +	raft_node_cfg_is_candidate(&node, true);
> +	raft_run_next_event();
> +	/* Volatile term is new, but the persistent one is not updated yet. */
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		0 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "new election is waiting for WAL write");
> +
> +	/* Now another node wins the election earlier. */
> +	is(raft_node_send_leader(&node,
> +		3 /* Term. */,
> +		2 /* Source. */
> +	), 0, "message is accepted");
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		2 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "the leader is accepted");
> +
> +	/*
> +	 * The leader resigns and triggers a new election round on the first
> +	 * node. A new election is triggered, but still waiting for the previous
> +	 * WAL write to end.
> +	 */
> +	is(raft_node_send_follower(&node,
> +		3 /* Term. */,
> +		2 /* Source. */
> +	), 0, "message is accepted");
> +	/* Note how the volatile term is updated again. */
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_FOLLOWER /* State. */,
> +		0 /* Leader. */,
> +		2 /* Term. */,
> +		1 /* Vote. */,
> +		4 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 1}" /* Vclock. */
> +	), "the leader has resigned, new election is scheduled");
> +	raft_node_unblock(&node);
> +
> +	/* Ensure the node still collects votes after the WAL write. */
> +	is(raft_node_send_vote_response(&node,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		2 /* Source. */
> +	), 0, "vote from 2");
> +	is(raft_node_send_vote_response(&node,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		3 /* Source. */
> +	), 0, "vote from 3");
> +	raft_run_next_event();
> +	ok(raft_node_check_full_state(&node,
> +		RAFT_STATE_LEADER /* State. */,
> +		1 /* Leader. */,
> +		4 /* Term. */,
> +		1 /* Vote. */,
> +		4 /* Volatile term. */,
> +		1 /* Volatile vote. */,
> +		"{0: 2}" /* Vclock. */
> +	), "the leader is elected");
> +
>   	raft_node_destroy(&node);
>   
>   	raft_finish_test();
> diff --git a/test/unit/raft.result b/test/unit/raft.result
> index 3a3dc5dd2..598a7e760 100644
> --- a/test/unit/raft.result
> +++ b/test/unit/raft.result
> @@ -111,7 +111,7 @@ ok 4 - subtests
>   ok 5 - subtests
>   	*** raft_test_vote_skip: done ***
>   	*** raft_test_leader_resign ***
> -    1..15
> +    1..23
>       ok 1 - message is accepted
>       ok 2 - leader is elected
>       ok 3 - message is accepted
> @@ -127,6 +127,14 @@ ok 5 - subtests
>       ok 13 - became leader
>       ok 14 - the leader has resigned
>       ok 15 - resign notification is sent
> +    ok 16 - new election is waiting for WAL write
> +    ok 17 - message is accepted
> +    ok 18 - the leader is accepted
> +    ok 19 - message is accepted
> +    ok 20 - the leader has resigned, new election is scheduled
> +    ok 21 - vote from 2
> +    ok 22 - vote from 3
> +    ok 23 - the leader is elected
>   ok 6 - subtests
>   	*** raft_test_leader_resign: done ***
>   	*** raft_test_split_brain ***

-- 
Serge Petrenko



More information about the Tarantool-patches mailing list