[Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write
Serge Petrenko
sergepetrenko at tarantool.org
Thu Jun 10 17:24:44 MSK 2021
07.06.2021 23:48, Vladislav Shpilevoy пишет:
> If Raft state machine sees the current leader has explicitly
> resigned from its role, it starts a new election round right away.
>
> But in the code starting a new round there was an assumption that
> there is no a volatile state. There was, in fact.
>
> The patch makes the election start code use the volatile state to
> bump the term. It should be safe, because the other nodes won't
> receive it anyway until the new term is persisted.
>
> There was an alternative - do not schedule new election until the
> current WAL write ends. It wasn't done, because would achieve the
> same (the term would be bumped and persisted) but with bigger a
> latency.
>
> Another reason is that if the leader would appear and resign
> during WAL write on another candidate, in the end of its WAL write
> the latter would see 0 leader and would think this term didn't
> have one yet. And would try to elect self now, in the current
> term. It makes little sense, because it won't win - the current
> term had already had a leader and the majority of votes is
> already taken.
>
> Closes #6129
Thanks for the patch!
LGTM.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash
> Issue: https://github.com/tarantool/tarantool/issues/6129
>
> .../gh-6129-raft-resign-during-wal.md | 7 ++
> src/lib/raft/raft.c | 8 +-
> test/unit/raft.c | 80 ++++++++++++++++++-
> test/unit/raft.result | 10 ++-
> 4 files changed, 101 insertions(+), 4 deletions(-)
> create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md
>
> diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> new file mode 100644
> index 000000000..eb6cf3984
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
> @@ -0,0 +1,7 @@
> +## bugfix/raft
> +
> +* Fixed a rare crash with the leader election enabled (any mode except `off`),
> + which could happen if a leader resigned from its role at the same time as some
> + other node was writing something related to the elections to WAL. The crash
> + was in debug build and in the release build it would lead to undefined
> + behaviour (gh-6129).
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index 247c7a71a..eacdddb7e 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -687,10 +687,9 @@ static void
> raft_sm_schedule_new_election(struct raft *raft)
> {
> say_info("RAFT: begin new election round");
> - assert(raft_is_fully_on_disk(raft));
> assert(raft->is_candidate);
> /* Everyone is a follower until its vote for self is persisted. */
> - raft_sm_schedule_new_term(raft, raft->term + 1);
> + raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
> raft_sm_schedule_new_vote(raft, raft->self);
> }
>
> @@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer,
> (void)events;
> struct raft *raft = timer->data;
> assert(timer == &raft->timer);
> + /*
> + * Otherwise the timer would be stopped and the callback wouldn't be
> + * invoked.
> + */
> + assert(raft_is_fully_on_disk(raft));
> raft_ev_timer_stop(loop, timer);
> raft_sm_schedule_new_election(raft);
> }
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index 29e48da13..6369c42d3 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -572,7 +572,7 @@ raft_test_vote_skip(void)
> static void
> raft_test_leader_resign(void)
> {
> - raft_start_test(15);
> + raft_start_test(23);
> struct raft_node node;
>
> /*
> @@ -672,6 +672,84 @@ raft_test_leader_resign(void)
> NULL /* Vclock. */
> ), "resign notification is sent");
>
> + /*
> + * gh-6129: resign of a remote leader during a local WAL write should
> + * schedule a new election after the WAL write.
> + *
> + * Firstly start a new term.
> + */
> + raft_node_block(&node);
> + raft_node_cfg_is_candidate(&node, true);
> + raft_run_next_event();
> + /* Volatile term is new, but the persistent one is not updated yet. */
> + ok(raft_node_check_full_state(&node,
> + RAFT_STATE_FOLLOWER /* State. */,
> + 0 /* Leader. */,
> + 2 /* Term. */,
> + 1 /* Vote. */,
> + 3 /* Volatile term. */,
> + 1 /* Volatile vote. */,
> + "{0: 1}" /* Vclock. */
> + ), "new election is waiting for WAL write");
> +
> + /* Now another node wins the election earlier. */
> + is(raft_node_send_leader(&node,
> + 3 /* Term. */,
> + 2 /* Source. */
> + ), 0, "message is accepted");
> + ok(raft_node_check_full_state(&node,
> + RAFT_STATE_FOLLOWER /* State. */,
> + 2 /* Leader. */,
> + 2 /* Term. */,
> + 1 /* Vote. */,
> + 3 /* Volatile term. */,
> + 1 /* Volatile vote. */,
> + "{0: 1}" /* Vclock. */
> + ), "the leader is accepted");
> +
> + /*
> + * The leader resigns and triggers a new election round on the first
> + * node. A new election is triggered, but still waiting for the previous
> + * WAL write to end.
> + */
> + is(raft_node_send_follower(&node,
> + 3 /* Term. */,
> + 2 /* Source. */
> + ), 0, "message is accepted");
> + /* Note how the volatile term is updated again. */
> + ok(raft_node_check_full_state(&node,
> + RAFT_STATE_FOLLOWER /* State. */,
> + 0 /* Leader. */,
> + 2 /* Term. */,
> + 1 /* Vote. */,
> + 4 /* Volatile term. */,
> + 1 /* Volatile vote. */,
> + "{0: 1}" /* Vclock. */
> + ), "the leader has resigned, new election is scheduled");
> + raft_node_unblock(&node);
> +
> + /* Ensure the node still collects votes after the WAL write. */
> + is(raft_node_send_vote_response(&node,
> + 4 /* Term. */,
> + 1 /* Vote. */,
> + 2 /* Source. */
> + ), 0, "vote from 2");
> + is(raft_node_send_vote_response(&node,
> + 4 /* Term. */,
> + 1 /* Vote. */,
> + 3 /* Source. */
> + ), 0, "vote from 3");
> + raft_run_next_event();
> + ok(raft_node_check_full_state(&node,
> + RAFT_STATE_LEADER /* State. */,
> + 1 /* Leader. */,
> + 4 /* Term. */,
> + 1 /* Vote. */,
> + 4 /* Volatile term. */,
> + 1 /* Volatile vote. */,
> + "{0: 2}" /* Vclock. */
> + ), "the leader is elected");
> +
> raft_node_destroy(&node);
>
> raft_finish_test();
> diff --git a/test/unit/raft.result b/test/unit/raft.result
> index 3a3dc5dd2..598a7e760 100644
> --- a/test/unit/raft.result
> +++ b/test/unit/raft.result
> @@ -111,7 +111,7 @@ ok 4 - subtests
> ok 5 - subtests
> *** raft_test_vote_skip: done ***
> *** raft_test_leader_resign ***
> - 1..15
> + 1..23
> ok 1 - message is accepted
> ok 2 - leader is elected
> ok 3 - message is accepted
> @@ -127,6 +127,14 @@ ok 5 - subtests
> ok 13 - became leader
> ok 14 - the leader has resigned
> ok 15 - resign notification is sent
> + ok 16 - new election is waiting for WAL write
> + ok 17 - message is accepted
> + ok 18 - the leader is accepted
> + ok 19 - message is accepted
> + ok 20 - the leader has resigned, new election is scheduled
> + ok 21 - vote from 2
> + ok 22 - vote from 3
> + ok 23 - the leader is elected
> ok 6 - subtests
> *** raft_test_leader_resign: done ***
> *** raft_test_split_brain ***
--
Serge Petrenko
More information about the Tarantool-patches
mailing list