Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role
@ 2020-10-15 22:47 Vladislav Shpilevoy
  2020-10-16  7:16 ` Kirill Yukhin
  0 siblings, 1 reply; 2+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-15 22:47 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Nodes with disabled Raft keep listening for Raft events and
persist them. To be able to quickly enroll into the process if
they are configured to be candidates.

The same for the voter nodes - they can't be a leader, but watch
and persist all what is happening.

However when a leader resigned from its role, the voter and
disabled nodes tried to start a new election round, even though
they were not supposed to. That led to a crash, and is fixed in
this patch.

Closes #5426
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5426-raft-crash-on-election-toogle
Issue: https://github.com/tarantool/tarantool/issues/5426

@ChangeLog
* Fixed a crash on followers, when a leader resigned from its role voluntarily (gh-5426).

 src/box/raft.c                                |   9 +-
 .../gh-5426-election-on-off.result            | 134 ++++++++++++++++++
 .../gh-5426-election-on-off.test.lua          |  57 ++++++++
 test/replication/suite.cfg                    |   1 +
 4 files changed, 200 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5426-election-on-off.result
 create mode 100644 test/replication/gh-5426-election-on-off.test.lua

diff --git a/src/box/raft.c b/src/box/raft.c
index 24f65ada7..b70f47006 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -449,7 +449,14 @@ raft_process_msg(const struct raft_request *req, uint32_t source)
 		if (source == raft.leader) {
 			say_info("RAFT: the node %u has resigned from the "
 				 "leader role", raft.leader);
-			raft_sm_schedule_new_election();
+			/*
+			 * Candidate node clears leader implicitly when starts a
+			 * new term, but non-candidate won't do that, so clear
+			 * it manually.
+			 */
+			raft.leader = 0;
+			if (raft.is_candidate)
+				raft_sm_schedule_new_election();
 		}
 		return 0;
 	}
diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
new file mode 100644
index 000000000..1abfb9154
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.result
@@ -0,0 +1,134 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+ | ---
+ | ...
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'candidate'}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode = 'off'}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
new file mode 100644
index 000000000..d6b980d0a
--- /dev/null
+++ b/test/replication/gh-5426-election-on-off.test.lua
@@ -0,0 +1,57 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_timeout = box.cfg.replication_timeout
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+--
+-- gh-5426: leader resignation could crash non-candidate nodes.
+--
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+
+-- First crash could happen when the election was disabled on the non-leader
+-- node.
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+-- Another crash could happen if election mode was 'voter' on the non-leader
+-- node.
+box.cfg{election_mode = 'voter'}
+
+test_run:switch('default')
+box.cfg{election_mode = 'candidate'}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+box.cfg{election_mode = 'off'}
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index a862f5a97..766f276a2 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -14,6 +14,7 @@
     "gh-3760-misc-return-on-quorum-0.test.lua": {},
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
+    "gh-5426-election-on-off.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
     "status.test.lua": {},
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role
  2020-10-15 22:47 [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role Vladislav Shpilevoy
@ 2020-10-16  7:16 ` Kirill Yukhin
  0 siblings, 0 replies; 2+ messages in thread
From: Kirill Yukhin @ 2020-10-16  7:16 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hello,

On 16 окт 00:47, Vladislav Shpilevoy wrote:
> Nodes with disabled Raft keep listening for Raft events and
> persist them. To be able to quickly enroll into the process if
> they are configured to be candidates.
> 
> The same for the voter nodes - they can't be a leader, but watch
> and persist all what is happening.
> 
> However when a leader resigned from its role, the voter and
> disabled nodes tried to start a new election round, even though
> they were not supposed to. That led to a crash, and is fixed in
> this patch.
> 
> Closes #5426
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5426-raft-crash-on-election-toogle
> Issue: https://github.com/tarantool/tarantool/issues/5426

LGTM. I've checked your patch into master.

--
Regards, Kirill Yukhin

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2020-10-16  7:16 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-15 22:47 [Tarantool-patches] [PATCH 1/1] raft: fix crash when leader resigned from its role Vladislav Shpilevoy
2020-10-16  7:16 ` Kirill Yukhin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox