[Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Mon Jun 7 23:48:02 MSK 2021
If Raft state machine sees the current leader has explicitly
resigned from its role, it starts a new election round right away.
But in the code starting a new round there was an assumption that
there is no a volatile state. There was, in fact.
The patch makes the election start code use the volatile state to
bump the term. It should be safe, because the other nodes won't
receive it anyway until the new term is persisted.
There was an alternative - do not schedule new election until the
current WAL write ends. It wasn't done, because would achieve the
same (the term would be bumped and persisted) but with bigger a
latency.
Another reason is that if the leader would appear and resign
during WAL write on another candidate, in the end of its WAL write
the latter would see 0 leader and would think this term didn't
have one yet. And would try to elect self now, in the current
term. It makes little sense, because it won't win - the current
term had already had a leader and the majority of votes is
already taken.
Closes #6129
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash
Issue: https://github.com/tarantool/tarantool/issues/6129
.../gh-6129-raft-resign-during-wal.md | 7 ++
src/lib/raft/raft.c | 8 +-
test/unit/raft.c | 80 ++++++++++++++++++-
test/unit/raft.result | 10 ++-
4 files changed, 101 insertions(+), 4 deletions(-)
create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md
diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
new file mode 100644
index 000000000..eb6cf3984
--- /dev/null
+++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md
@@ -0,0 +1,7 @@
+## bugfix/raft
+
+* Fixed a rare crash with the leader election enabled (any mode except `off`),
+ which could happen if a leader resigned from its role at the same time as some
+ other node was writing something related to the elections to WAL. The crash
+ was in debug build and in the release build it would lead to undefined
+ behaviour (gh-6129).
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 247c7a71a..eacdddb7e 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -687,10 +687,9 @@ static void
raft_sm_schedule_new_election(struct raft *raft)
{
say_info("RAFT: begin new election round");
- assert(raft_is_fully_on_disk(raft));
assert(raft->is_candidate);
/* Everyone is a follower until its vote for self is persisted. */
- raft_sm_schedule_new_term(raft, raft->term + 1);
+ raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
raft_sm_schedule_new_vote(raft, raft->self);
}
@@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer,
(void)events;
struct raft *raft = timer->data;
assert(timer == &raft->timer);
+ /*
+ * Otherwise the timer would be stopped and the callback wouldn't be
+ * invoked.
+ */
+ assert(raft_is_fully_on_disk(raft));
raft_ev_timer_stop(loop, timer);
raft_sm_schedule_new_election(raft);
}
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 29e48da13..6369c42d3 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -572,7 +572,7 @@ raft_test_vote_skip(void)
static void
raft_test_leader_resign(void)
{
- raft_start_test(15);
+ raft_start_test(23);
struct raft_node node;
/*
@@ -672,6 +672,84 @@ raft_test_leader_resign(void)
NULL /* Vclock. */
), "resign notification is sent");
+ /*
+ * gh-6129: resign of a remote leader during a local WAL write should
+ * schedule a new election after the WAL write.
+ *
+ * Firstly start a new term.
+ */
+ raft_node_block(&node);
+ raft_node_cfg_is_candidate(&node, true);
+ raft_run_next_event();
+ /* Volatile term is new, but the persistent one is not updated yet. */
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "new election is waiting for WAL write");
+
+ /* Now another node wins the election earlier. */
+ is(raft_node_send_leader(&node,
+ 3 /* Term. */,
+ 2 /* Source. */
+ ), 0, "message is accepted");
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 2 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "the leader is accepted");
+
+ /*
+ * The leader resigns and triggers a new election round on the first
+ * node. A new election is triggered, but still waiting for the previous
+ * WAL write to end.
+ */
+ is(raft_node_send_follower(&node,
+ 3 /* Term. */,
+ 2 /* Source. */
+ ), 0, "message is accepted");
+ /* Note how the volatile term is updated again. */
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_FOLLOWER /* State. */,
+ 0 /* Leader. */,
+ 2 /* Term. */,
+ 1 /* Vote. */,
+ 4 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 1}" /* Vclock. */
+ ), "the leader has resigned, new election is scheduled");
+ raft_node_unblock(&node);
+
+ /* Ensure the node still collects votes after the WAL write. */
+ is(raft_node_send_vote_response(&node,
+ 4 /* Term. */,
+ 1 /* Vote. */,
+ 2 /* Source. */
+ ), 0, "vote from 2");
+ is(raft_node_send_vote_response(&node,
+ 4 /* Term. */,
+ 1 /* Vote. */,
+ 3 /* Source. */
+ ), 0, "vote from 3");
+ raft_run_next_event();
+ ok(raft_node_check_full_state(&node,
+ RAFT_STATE_LEADER /* State. */,
+ 1 /* Leader. */,
+ 4 /* Term. */,
+ 1 /* Vote. */,
+ 4 /* Volatile term. */,
+ 1 /* Volatile vote. */,
+ "{0: 2}" /* Vclock. */
+ ), "the leader is elected");
+
raft_node_destroy(&node);
raft_finish_test();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 3a3dc5dd2..598a7e760 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -111,7 +111,7 @@ ok 4 - subtests
ok 5 - subtests
*** raft_test_vote_skip: done ***
*** raft_test_leader_resign ***
- 1..15
+ 1..23
ok 1 - message is accepted
ok 2 - leader is elected
ok 3 - message is accepted
@@ -127,6 +127,14 @@ ok 5 - subtests
ok 13 - became leader
ok 14 - the leader has resigned
ok 15 - resign notification is sent
+ ok 16 - new election is waiting for WAL write
+ ok 17 - message is accepted
+ ok 18 - the leader is accepted
+ ok 19 - message is accepted
+ ok 20 - the leader has resigned, new election is scheduled
+ ok 21 - vote from 2
+ ok 22 - vote from 3
+ ok 23 - the leader is elected
ok 6 - subtests
*** raft_test_leader_resign: done ***
*** raft_test_split_brain ***
--
2.24.3 (Apple Git-128)
More information about the Tarantool-patches
mailing list