From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id 01CBE469719
	for ; Fri, 16 Oct 2020 01:47:15 +0300 (MSK)
From: Vladislav Shpilevoy
Date: Fri, 16 Oct 2020 00:47:14 +0200
Message-Id: <470cb12f9bcce14409c28de104e4e442fc9d37d2.1602801983.git.v.shpilevoy@tarantool.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role
List-Id: Tarantool development patches
To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org

Nodes with disabled Raft keep listening for Raft events and persist
them, so that they can quickly join the process if they are later
configured to be candidates. The same goes for voter nodes: they can't
become a leader, but they watch and persist everything that happens.

However, when a leader resigned from its role, the voter and disabled
nodes tried to start a new election round, even though they were not
supposed to. That led to a crash, which this patch fixes.

Closes #5426
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5426-raft-crash-on-election-toogle
Issue: https://github.com/tarantool/tarantool/issues/5426

@ChangeLog
* Fixed a crash on followers when a leader voluntarily resigned from
  its role (gh-5426).

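For reviewers, the behaviour change described above can be sketched as a toy model. This is not the code from src/box/raft.c: `struct toy_raft`, `toy_schedule_new_election()` and `toy_on_resign()` are hypothetical names invented for illustration, and the assertion inside the scheduler is only an assumed stand-in for whatever precondition the real scheduler enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a node's Raft state; not the real struct from raft.c. */
struct toy_raft {
	/* ID of the currently known leader, 0 if there is none. */
	uint32_t leader;
	/* True only when the node is allowed to be elected. */
	bool is_candidate;
	/* Number of election rounds this node has scheduled. */
	int elections_started;
};

/*
 * Hypothetical scheduler. It assumes the caller is a candidate, which
 * is why calling it on a voter or a disabled node is an error.
 */
static void
toy_schedule_new_election(struct toy_raft *r)
{
	assert(r->is_candidate);
	r->leader = 0;
	r->elections_started++;
}

/* Handle a "no longer a leader" notification from node `source`. */
static void
toy_on_resign(struct toy_raft *r, uint32_t source)
{
	/* Resignation of a node that is not our leader changes nothing. */
	if (source != r->leader)
		return;
	/* Every node forgets the leader, candidate or not. */
	r->leader = 0;
	/* Only a candidate may start a new election round. */
	if (r->is_candidate)
		toy_schedule_new_election(r);
}
```

In this model a voter or disabled node that sees its leader resign ends up with `leader == 0` and no election scheduled, while a candidate also bumps `elections_started`. Without the `is_candidate` check, the non-candidate case would trip the assertion in the toy scheduler.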
 src/box/raft.c                                |   9 +-
 .../gh-5426-election-on-off.result            | 134 ++++++++++++++++++
 .../gh-5426-election-on-off.test.lua          |  57 ++++++++
 test/replication/suite.cfg                    |   1 +
 4 files changed, 200 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5426-election-on-off.result
 create mode 100644 test/replication/gh-5426-election-on-off.test.lua

diff --git a/src/box/raft.c b/src/box/raft.c
index 24f65ada7..b70f47006 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -449,7 +449,14 @@ raft_process_msg(const struct raft_request *req, uint32_t source)
 		if (source == raft.leader) {
 			say_info("RAFT: the node %u has resigned from the "
 				 "leader role", raft.leader);
-			raft_sm_schedule_new_election();
+			/*
+			 * A candidate node clears the leader implicitly when
+			 * it starts a new term, but a non-candidate does not,
+			 * so clear it manually.
+			 */
+			raft.leader = 0;
+			if (raft.is_candidate)
+				raft_sm_schedule_new_election();
 		}
 		return 0;
 	}
diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
new file mode 100644
index 000000000..1abfb9154
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.result
@@ -0,0 +1,134 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{ \
+    replication_timeout = 0.1, \
+    election_mode = 'candidate', \
+}
+ | ---
+ | ...
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'candidate'}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{ \
+    election_mode = old_election_mode, \
+    replication_timeout = old_replication_timeout, \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
new file mode 100644
index 000000000..d6b980d0a
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.test.lua
@@ -0,0 +1,57 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_timeout = box.cfg.replication_timeout
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{ \
+    replication_timeout = 0.1, \
+    election_mode = 'candidate', \
+}
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+
+test_run:switch('default')
+box.cfg{election_mode = 'candidate'}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{ \
+    election_mode = old_election_mode, \
+    replication_timeout = old_replication_timeout, \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index a862f5a97..766f276a2 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -14,6 +14,7 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5426-election-on-off.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
     "status.test.lua": {},
-- 
2.21.1 (Apple Git-122.3)