From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id E14BD6EC40; Mon, 7 Jun 2021 23:48:05 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org E14BD6EC40 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1623098885; bh=1y5njpttAgqSJWqmFvPaXuPKFAEnqv0qlXUP8N6DfSw=; h=To:Date:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=soRAY/vIiITVKHHg9BGPl1WVinyD7uK+CvIOHTAXd7Xc0X5gMPgXy2DeqkjR/kzB6 L4PrltqHVWNpc3F9cUWWaLh283CYV4f+gQZqVXnIH1lEFE5Qf3LUEoANP9HROe2+J9 aqiXo9k5idNjzrdg8SKJFgvdlePMET1bLptbhxoE= Received: from smtp63.i.mail.ru (smtp63.i.mail.ru [217.69.128.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id B503B6EC40 for ; Mon, 7 Jun 2021 23:48:04 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org B503B6EC40 Received: by smtp63.i.mail.ru with esmtpa (envelope-from ) id 1lqMAF-00082u-N2; Mon, 07 Jun 2021 23:48:04 +0300 To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com, sergepetrenko@tarantool.org Date: Mon, 7 Jun 2021 22:48:02 +0200 Message-Id: <08a836b17506f5b6e9edfcaee36bbeea5754f917.1623098844.git.v.shpilevoy@tarantool.org> X-Mailer: git-send-email 2.24.3 (Apple Git-128) MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD9D5B0DA836B685C54907A7AE9C1BA82BC67C1327DFB87C6A6182A05F538085040FF9779B0228DD6EB081808D3B785E9D34E17EBD95F85E7DDC3A55FE8C8F83EDB X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE798269FE5AA193785EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637B997C8222C70C3D98638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D80DA5094DCBF569B271265B6920D8D653117882F4460429724CE54428C33FAD305F5C1EE8F4F765FC2EE5AD8F952D28FBA471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F446042972877693876707352026055571C92BF10F618001F51B5FD3F9D2E47CDBA5A96583BA9C0B312567BB2376E601842F6C81A19E625A9149C048EE0AC5B80A05675ACD9935A1E27F592749D8FC6C240DEA7642DBF02ECDB25306B2B78CF848AE20165D0A6AB1C7CE11FEE30CABCCA60F52D7EB040F9FF01DFDA4A8C4224003CC836476EA7A3FFF5B025636E2021AF6380DFAD1A18204E546F3947CB11811A4A51E3B096D1867E19FE1407959CC434672EE6371089D37D7C0E48F6C8AA50765F7900637AD0424077D726551EFF80C71ABB335746BA297DBC24807EABDAD6C7F3747799A X-B7AD71C0: AC4F5C86D027EB782CDD5689AFBDA7A24209795067102C07E8F7B195E1C9783137C172E39E2163612D680659EFB54793 X-C1DE0DAB: 8BD88D57C5CADBC8B2710865C3867510E3B0F63F495E1F96A3B1A56EE2B804F6B226C914C9968946695E9D90444CEC264DCC8C77FBA9901322D2CEDE4E95CF1BDBE8DEE28BC9005C095FFBCAB1CFE8AAD6CF32B5F8F9D40488EC6805C9BC01C4307E9670061ACE4A589120F7DAE46353205367B2BCC23E5BE99843EE6E79C5DCBDAD6C7F3747799A X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34C2BE7381A0AD3DF127C9C55576E23B859930407D14A363A93699A2585D13F44DFBC020BFE39758931D7E09C32AA3244CDDB5719DB574401089F075300D5BD4EF97FE24653F78E668FACE5A9C96DEB163 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2bioj/099YK7Sav3KEt0gPhVffQ== X-Mailru-Sender: 504CC1E875BF3E7D9BC0E5172ADA3110F8AF0990236F8A6426399E01F921D92CC4627B3BD372EAD207784C02288277CA03E0582D3806FB6A5317862B1921BA260ED6CFD6382C13A6112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: [Tarantool-patches] [PATCH 1/1] raft: handle remote leader resign during WAL write X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Vladislav Shpilevoy via Tarantool-patches Reply-To: Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" If Raft state machine sees the current leader has explicitly resigned from its role, it starts a new election round right away. But in the code starting a new round there was an assumption that there is no a volatile state. There was, in fact. The patch makes the election start code use the volatile state to bump the term. It should be safe, because the other nodes won't receive it anyway until the new term is persisted. There was an alternative - do not schedule new election until the current WAL write ends. It wasn't done, because would achieve the same (the term would be bumped and persisted) but with bigger a latency. Another reason is that if the leader would appear and resign during WAL write on another candidate, in the end of its WAL write the latter would see 0 leader and would think this term didn't have one yet. And would try to elect self now, in the current term. It makes little sense, because it won't win - the current term had already had a leader and the majority of votes is already taken. Closes #6129 --- Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6129-raft-crash Issue: https://github.com/tarantool/tarantool/issues/6129 .../gh-6129-raft-resign-during-wal.md | 7 ++ src/lib/raft/raft.c | 8 +- test/unit/raft.c | 80 ++++++++++++++++++- test/unit/raft.result | 10 ++- 4 files changed, 101 insertions(+), 4 deletions(-) create mode 100644 changelogs/unreleased/gh-6129-raft-resign-during-wal.md diff --git a/changelogs/unreleased/gh-6129-raft-resign-during-wal.md b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md new file mode 100644 index 000000000..eb6cf3984 --- /dev/null +++ b/changelogs/unreleased/gh-6129-raft-resign-during-wal.md @@ -0,0 +1,7 @@ +## bugfix/raft + +* Fixed a rare crash with the leader election enabled (any mode except `off`), + which could happen if a leader resigned from its role at the same time as some + other node was writing something related to the elections to WAL. The crash + was in debug build and in the release build it would lead to undefined + behaviour (gh-6129). diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 247c7a71a..eacdddb7e 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -687,10 +687,9 @@ static void raft_sm_schedule_new_election(struct raft *raft) { say_info("RAFT: begin new election round"); - assert(raft_is_fully_on_disk(raft)); assert(raft->is_candidate); /* Everyone is a follower until its vote for self is persisted. */ - raft_sm_schedule_new_term(raft, raft->term + 1); + raft_sm_schedule_new_term(raft, raft->volatile_term + 1); raft_sm_schedule_new_vote(raft, raft->self); } @@ -701,6 +700,11 @@ raft_sm_schedule_new_election_cb(struct ev_loop *loop, struct ev_timer *timer, (void)events; struct raft *raft = timer->data; assert(timer == &raft->timer); + /* + * Otherwise the timer would be stopped and the callback wouldn't be + * invoked. + */ + assert(raft_is_fully_on_disk(raft)); raft_ev_timer_stop(loop, timer); raft_sm_schedule_new_election(raft); } diff --git a/test/unit/raft.c b/test/unit/raft.c index 29e48da13..6369c42d3 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -572,7 +572,7 @@ raft_test_vote_skip(void) static void raft_test_leader_resign(void) { - raft_start_test(15); + raft_start_test(23); struct raft_node node; /* @@ -672,6 +672,84 @@ raft_test_leader_resign(void) NULL /* Vclock. */ ), "resign notification is sent"); + /* + * gh-6129: resign of a remote leader during a local WAL write should + * schedule a new election after the WAL write. + * + * Firstly start a new term. + */ + raft_node_block(&node); + raft_node_cfg_is_candidate(&node, true); + raft_run_next_event(); + /* Volatile term is new, but the persistent one is not updated yet. */ + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "new election is waiting for WAL write"); + + /* Now another node wins the election earlier. */ + is(raft_node_send_leader(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 2 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "the leader is accepted"); + + /* + * The leader resigns and triggers a new election round on the first + * node. A new election is triggered, but still waiting for the previous + * WAL write to end. + */ + is(raft_node_send_follower(&node, + 3 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + /* Note how the volatile term is updated again. */ + ok(raft_node_check_full_state(&node, + RAFT_STATE_FOLLOWER /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 4 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "the leader has resigned, new election is scheduled"); + raft_node_unblock(&node); + + /* Ensure the node still collects votes after the WAL write. */ + is(raft_node_send_vote_response(&node, + 4 /* Term. */, + 1 /* Vote. */, + 2 /* Source. */ + ), 0, "vote from 2"); + is(raft_node_send_vote_response(&node, + 4 /* Term. */, + 1 /* Vote. */, + 3 /* Source. */ + ), 0, "vote from 3"); + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_LEADER /* State. */, + 1 /* Leader. */, + 4 /* Term. */, + 1 /* Vote. */, + 4 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "the leader is elected"); + raft_node_destroy(&node); raft_finish_test(); diff --git a/test/unit/raft.result b/test/unit/raft.result index 3a3dc5dd2..598a7e760 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -111,7 +111,7 @@ ok 4 - subtests ok 5 - subtests *** raft_test_vote_skip: done *** *** raft_test_leader_resign *** - 1..15 + 1..23 ok 1 - message is accepted ok 2 - leader is elected ok 3 - message is accepted @@ -127,6 +127,14 @@ ok 5 - subtests ok 13 - became leader ok 14 - the leader has resigned ok 15 - resign notification is sent + ok 16 - new election is waiting for WAL write + ok 17 - message is accepted + ok 18 - the leader is accepted + ok 19 - message is accepted + ok 20 - the leader has resigned, new election is scheduled + ok 21 - vote from 2 + ok 22 - vote from 3 + ok 23 - the leader is elected ok 6 - subtests *** raft_test_leader_resign: done *** *** raft_test_split_brain *** -- 2.24.3 (Apple Git-128)