From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org Subject: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role Date: Fri, 16 Oct 2020 00:47:14 +0200 [thread overview] Message-ID: <470cb12f9bcce14409c28de104e4e442fc9d37d2.1602801983.git.v.shpilevoy@tarantool.org> (raw) Nodes with disabled Raft keep listening for Raft events and persist them. To be able to quickly enroll into the process if they are configured to be candidates. The same for the voter nodes - they can't be a leader, but watch and persist all what is happening. However when a leader resigned from its role, the voter and disabled nodes tried to start a new election round, even though they were not supposed to. That led to a crash, and is fixed in this patch. Closes #5426 --- Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5426-raft-crash-on-election-toogle Issue: https://github.com/tarantool/tarantool/issues/5426 @ChangeLog * Fixed a crash on followers, when a leader resigned from its role voluntarily (gh-5426). src/box/raft.c | 9 +- .../gh-5426-election-on-off.result | 134 ++++++++++++++++++ .../gh-5426-election-on-off.test.lua | 57 ++++++++ test/replication/suite.cfg | 1 + 4 files changed, 200 insertions(+), 1 deletion(-) create mode 100644 test/replication/gh-5426-election-on-off.result create mode 100644 test/replication/gh-5426-election-on-off.test.lua diff --git a/src/box/raft.c b/src/box/raft.c index 24f65ada7..b70f47006 100644 --- a/src/box/raft.c +++ b/src/box/raft.c @@ -449,7 +449,14 @@ raft_process_msg(const struct raft_request *req, uint32_t source) if (source == raft.leader) { say_info("RAFT: the node %u has resigned from the " "leader role", raft.leader); - raft_sm_schedule_new_election(); + /* + * Candidate node clears leader implicitly when starts a + * new term, but non-candidate won't do that, so clear + * it manually. + */ + raft.leader = 0; + if (raft.is_candidate) + raft_sm_schedule_new_election(); } return 0; } diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result new file mode 100644 index 000000000..1abfb9154 --- /dev/null +++ b/test/replication/gh-5426-election-on-off.result @@ -0,0 +1,134 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... +box.schema.user.grant('guest', 'super') + | --- + | ... + +old_election_mode = box.cfg.election_mode + | --- + | ... +old_replication_timeout = box.cfg.replication_timeout + | --- + | ... + +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') + | --- + | - true + | ... +test_run:cmd('start server replica with wait=True, wait_load=True') + | --- + | - true + | ... + +-- +-- gh-5426: leader resignation could crash non-candidate nodes. +-- +-- Small timeout to speed up the election. +box.cfg{ \ + replication_timeout = 0.1, \ + election_mode = 'candidate', \ +} + | --- + | ... + +-- First crash could happen when the election was disabled on the non-leader +-- node. +test_run:wait_cond(function() return box.info.election.state == 'leader' end) + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +box.cfg{election_mode = 'off'} + | --- + | ... + +test_run:switch('replica') + | --- + | - true + | ... +test_run:wait_cond(function() return box.info.election.leader == 0 end) + | --- + | - true + | ... + +-- Another crash could happen if election mode was 'voter' on the non-leader +-- node. +box.cfg{election_mode = 'voter'} + | --- + | ... + +test_run:switch('default') + | --- + | - true + | ... +box.cfg{election_mode = 'candidate'} + | --- + | ... +test_run:wait_cond(function() return box.info.election.state == 'leader' end) + | --- + | - true + | ... + +test_run:switch('replica') + | --- + | - true + | ... +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +box.cfg{election_mode = 'off'} + | --- + | ... + +test_run:switch('replica') + | --- + | - true + | ... +test_run:wait_cond(function() return box.info.election.leader == 0 end) + | --- + | - true + | ... + +test_run:switch('default') + | --- + | - true + | ... +test_run:cmd('stop server replica') + | --- + | - true + | ... +test_run:cmd('delete server replica') + | --- + | - true + | ... +box.cfg{ \ + election_mode = old_election_mode, \ + replication_timeout = old_replication_timeout, \ +} + | --- + | ... +box.schema.user.revoke('guest', 'super') + | --- + | ... diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua new file mode 100644 index 000000000..d6b980d0a --- /dev/null +++ b/test/replication/gh-5426-election-on-off.test.lua @@ -0,0 +1,57 @@ +test_run = require('test_run').new() +box.schema.user.grant('guest', 'super') + +old_election_mode = box.cfg.election_mode +old_replication_timeout = box.cfg.replication_timeout + +test_run:cmd('create server replica with rpl_master=default,\ + script="replication/replica.lua"') +test_run:cmd('start server replica with wait=True, wait_load=True') + +-- +-- gh-5426: leader resignation could crash non-candidate nodes. +-- +-- Small timeout to speed up the election. +box.cfg{ \ + replication_timeout = 0.1, \ + election_mode = 'candidate', \ +} + +-- First crash could happen when the election was disabled on the non-leader +-- node. +test_run:wait_cond(function() return box.info.election.state == 'leader' end) + +test_run:switch('replica') +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) + +test_run:switch('default') +box.cfg{election_mode = 'off'} + +test_run:switch('replica') +test_run:wait_cond(function() return box.info.election.leader == 0 end) + +-- Another crash could happen if election mode was 'voter' on the non-leader +-- node. +box.cfg{election_mode = 'voter'} + +test_run:switch('default') +box.cfg{election_mode = 'candidate'} +test_run:wait_cond(function() return box.info.election.state == 'leader' end) + +test_run:switch('replica') +test_run:wait_cond(function() return box.info.election.leader ~= 0 end) + +test_run:switch('default') +box.cfg{election_mode = 'off'} + +test_run:switch('replica') +test_run:wait_cond(function() return box.info.election.leader == 0 end) + +test_run:switch('default') +test_run:cmd('stop server replica') +test_run:cmd('delete server replica') +box.cfg{ \ + election_mode = old_election_mode, \ + replication_timeout = old_replication_timeout, \ +} +box.schema.user.revoke('guest', 'super') diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index a862f5a97..766f276a2 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -14,6 +14,7 @@ "gh-3760-misc-return-on-quorum-0.test.lua": {}, "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {}, "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {}, + "gh-5426-election-on-off.test.lua": {}, "once.test.lua": {}, "on_replace.test.lua": {}, "status.test.lua": {}, -- 2.21.1 (Apple Git-122.3)
next reply other threads:[~2020-10-15 22:47 UTC|newest] Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-10-15 22:47 Vladislav Shpilevoy [this message] 2020-10-16 7:16 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=470cb12f9bcce14409c28de104e4e442fc9d37d2.1602801983.git.v.shpilevoy@tarantool.org \ --to=v.shpilevoy@tarantool.org \ --cc=sergepetrenko@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox