From: Vladislav Shpilevoy via Tarantool-patches <tarantool-patches@dev.tarantool.org> To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com, sergepetrenko@tarantool.org Subject: [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write Date: Mon, 7 Jun 2021 22:48:02 +0200 [thread overview] Message-ID: <08a836b17506f5b6e9edfcaee36bbeea5754f917.1623098844.git.v.shpilevoy@tarantool.org> (raw) If Raft state machine sees the current leader has explicitly resigned from its role, it starts a new election round right away. But in the code starting a new round there was an assumption that there is no a volatile state. There was, in fact. The patch makes the election start code use the volatile state to bump the term. It should be safe, because the other nodes won't receive it anyway until the new term is persisted. There was an alternative - do not schedule new election until the current WAL write ends. It wasn't done, because would achieve the same (the term would be bumped and persisted) but with bigger a latency. Another reason is that if the leader would appear and resign during WAL write on another candidate, in the end of its WAL write the latter would see 0 leader and would think this term didn't have one yet. And would try to elect self now, in the current term. It makes little sense, because it won't win - the current term had already had a leader and the majority of votes is already taken. Closes #6129 --- Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash Issue: https://github.com/tarantool/tarantool/issues/6129 .../gh-6129-raft-resign-during-wal.md | 7 ++ src/lib/raft/raft.c | 8 +- test/unit/raft.c | 80 ++++++++++++++++++- test/unit/raft.result | 10 ++- 4 files changed, 101 insertions(+), 4 deletions(-) create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md new file mode 100644 index 000000000..eb6cf3984 --- /dev/null +++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md @@ -0,0 +1,7 @@ +## bugfix/raft + +* Fixed a rare crash with the leader election enabled (any mode except `off`), + which could happen if a leader resigned from its role at the same time as some + other node was writing something related to the elections to WAL. The crash + was in debug build and in the release build it would lead to undefined + behaviour (gh-6129). diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 247c7a71a..eacdddb7e 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -687,10 +687,9 @@ static void raft_sm_schedule_new_election(struct raft *raft) { say_info("RAFT: begin new election round"); - assert(raft_is_fully_on_disk(raft)); assert(raft->is_candidate); /* Everyone is a follower until its vote for self is persisted. */ - raft_sm_schedule_new_term(raft, raft->term + 1); + raft_sm_schedule_new_term(raft, raft->volatile_term + 1); raft_sm_schedule_new_vote(raft, raft->self); } @@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer, (void)events; struct raft *raft = timer->data; assert(timer == &raft->timer); + /* + * Otherwise the timer would be stopped and the callback wouldn't be + * invoked. + */ + assert(raft_is_fully_on_disk(raft)); raft_ev_timer_stop(loop, timer); raft_sm_schedule_new_election(raft); } diff --git a/test/unit/raft.c b/test/unit/raft.c index 29e48da13..6369c42d3 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -572,7 +572,7 @@ raft_test_vote_skip(void) static void raft_test_leader_resign(void) { - raft_start_test(15); + raft_start_test(23); struct raft_node node; /* @@ -672,6 +672,84 @@ raft_test_leader_resign(void) NULL /* Vclock. */ ), "resign notification is sent"); + /* + * gh-6129: resign of a remote leader during a local WAL write should + * schedule a new election after the WAL write. + * + * Firstly start a new term. + */ + raft_node_block(&node); + raft_node_cfg_is_candidate(&node, true); + raft_run_next_event(); + /* Volatile term is new, but the persistent one is not updated yet. */ + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "new election is waiting for WAL write"); + + /* Now another node wins the election earlier. */ + is(raft_node_send_leader(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 2 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "the leader is accepted"); + + /* + * The leader resigns and triggers a new election round on the first + * node. A new election is triggered, but still waiting for the previous + * WAL write to end. + */ + is(raft_node_send_follower(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + /* Note how the volatile term is updated again. */ + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 4 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "the leader has resigned, new election is scheduled"); + raft_node_unblock(&node); + + /* Ensure the node still collects votes after the WAL write. */ + is(raft_node_send_vote_response(&node, + 4 /* Term. */, + 1 /* Vote. */, + 2 /* Source. */ + ), 0, "vote from 2"); + is(raft_node_send_vote_response(&node, + 4 /* Term. */, + 1 /* Vote. */, + 3 /* Source. */ + ), 0, "vote from 3"); + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_LEADER /* State. */, + 1 /* Leader. */, + 4 /* Term. */, + 1 /* Vote. */, + 4 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "the leader is elected"); + raft_node_destroy(&node); raft_finish_test(); diff --git a/test/unit/raft.result b/test/unit/raft.result index 3a3dc5dd2..598a7e760 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -111,7 +111,7 @@ ok 4 - subtests ok 5 - subtests *** raft_test_vote_skip: done *** *** raft_test_leader_resign *** - 1..15 + 1..23 ok 1 - message is accepted ok 2 - leader is elected ok 3 - message is accepted @@ -127,6 +127,14 @@ ok 5 - subtests ok 13 - became leader ok 14 - the leader has resigned ok 15 - resign notification is sent + ok 16 - new election is waiting for WAL write + ok 17 - message is accepted + ok 18 - the leader is accepted + ok 19 - message is accepted + ok 20 - the leader has resigned, new election is scheduled + ok 21 - vote from 2 + ok 22 - vote from 3 + ok 23 - the leader is elected ok 6 - subtests *** raft_test_leader_resign: done *** *** raft_test_split_brain *** -- 2.24.3 (Apple Git-128)
next reply other threads:[~2021-06-07 20:48 UTC|newest] Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top 2021-06-07 20:48 Vladislav Shpilevoy via Tarantool-patches [this message] 2021-06-10 14:24 ` Serge Petrenko via Tarantool-patches 2021-06-10 20:01 ` Vladislav Shpilevoy via Tarantool-patches
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=08a836b17506f5b6e9edfcaee36bbeea5754f917.1623098844.git.v.shpilevoy@tarantool.org \ --to=tarantool-patches@dev.tarantool.org \ --cc=gorcunov@gmail.com \ --cc=sergepetrenko@tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox