Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org
Subject: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role
Date: Fri, 16 Oct 2020 00:47:14 +0200	[thread overview]
Message-ID: <470cb12f9bcce14409c28de104e4e442fc9d37d2.1602801983.git.v.shpilevoy@tarantool.org> (raw)

Nodes with disabled Raft keep listening for Raft events and
persist them. To be able to quickly enroll into the process if
they are configured to be candidates.

The same for the voter nodes - they can't be a leader, but watch
and persist all what is happening.

However when a leader resigned from its role, the voter and
disabled nodes tried to start a new election round, even though
they were not supposed to. That led to a crash, and is fixed in
this patch.

Closes #5426
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5426-raft-crash-on-election-toogle
Issue: https://github.com/tarantool/tarantool/issues/5426

@ChangeLog
* Fixed a crash on followers, when a leader resigned from its role voluntarily (gh-5426).

 src/box/raft.c                                |   9 +-
 .../gh-5426-election-on-off.result            | 134 ++++++++++++++++++
 .../gh-5426-election-on-off.test.lua          |  57 ++++++++
 test/replication/suite.cfg                    |   1 +
 4 files changed, 200 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5426-election-on-off.result
 create mode 100644 test/replication/gh-5426-election-on-off.test.lua

diff --git a/src/box/raft.c b/src/box/raft.c
index 24f65ada7..b70f47006 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -449,7 +449,14 @@ raft_process_msg(const struct raft_request *req, uint32_t source)
 		if (source == raft.leader) {
 			say_info("RAFT: the node %u has resigned from the "
 				 "leader role", raft.leader);
-			raft_sm_schedule_new_election();
+			/*
+			 * Candidate node clears leader implicitly when starts a
+			 * new term, but non-candidate won't do that, so clear
+			 * it manually.
+			 */
+			raft.leader = 0;
+			if (raft.is_candidate)
+				raft_sm_schedule_new_election();
 		}
 		return 0;
 	}
diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
new file mode 100644
index 000000000..1abfb9154
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.result
@@ -0,0 +1,134 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+ | ---
+ | ...
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'candidate'}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
new file mode 100644
index 000000000..d6b980d0a
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.test.lua
@@ -0,0 +1,57 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_timeout = box.cfg.replication_timeout
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+
+test_run:switch('default')
+box.cfg{election_mode = 'candidate'}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index a862f5a97..766f276a2 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -14,6 +14,7 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5426-election-on-off.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
     "status.test.lua": {},
-- 
2.21.1 (Apple Git-122.3)

             reply	other threads:[~2020-10-15 22:47 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-15 22:47 Vladislav Shpilevoy [this message]
2020-10-16  7:16 ` Kirill Yukhin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=470cb12f9bcce14409c28de104e4e442fc9d37d2.1602801983.git.v.shpilevoy@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=sergepetrenko@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox