Tarantool development patches archive
* [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
@ 2020-10-17 17:17 Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-17 17:17 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

There were 2 issues with the relay restarting the recovery cursor when the node is
elected as a leader. They are fixed in the last 2 commits. The first was about the
local LSN not being set, the second about the GC state not being propagated.

The first patch is not directly related to the bugs above. It was just found while
working on this. In theory, without the first patch we can get flakiness in the
tests changed in this series, but only if a replication connection breaks without
a reason.

Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
machine when I start tens of its instances. All workers, after executing it several
times, hang. But!!! not in anything related to Raft - they hang in the first
box.snapshot(), where the election is not even enabled yet. From some debug prints
I see it hangs somewhere in engine_begin_checkpoint() and consumes ~80% of the CPU.
But it may just be a consequence of the corrupted memory on Mac, due to libeio
being broken. I don't know what to do with that now.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5433-raft-leader-recovery-restart
Issue: https://github.com/tarantool/tarantool/issues/5433

Vladislav Shpilevoy (3):
  raft: send state to new subscribers if Raft worked
  raft: use local LSN in relay recovery restart
  raft: don't drop GC when restart relay recovery

 src/box/box.cc                                |  14 +-
 src/box/raft.h                                |  10 +
 src/box/relay.cc                              |  22 ++-
 .../gh-5426-election-on-off.result            |  59 ++++--
 .../gh-5426-election-on-off.test.lua          |  26 ++-
 .../gh-5433-election-restart-recovery.result  | 174 ++++++++++++++++++
 ...gh-5433-election-restart-recovery.test.lua |  87 +++++++++
 test/replication/suite.cfg                    |   1 +
 8 files changed, 367 insertions(+), 26 deletions(-)
 create mode 100644 test/replication/gh-5433-election-restart-recovery.result
 create mode 100644 test/replication/gh-5433-election-restart-recovery.test.lua

-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
  2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
@ 2020-10-17 17:17 ` Vladislav Shpilevoy
  2020-10-20 20:43   ` Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart Vladislav Shpilevoy
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-17 17:17 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

Before the patch a node sent its complete Raft state to new
replication subscribers only when Raft was enabled: voter or
candidate. When the mode was off, it didn't send anything.

This led to a bug (or maybe just an inconvenience): if the node
was a leader and couldn't deliver the 'off' message to all the
followers, some of them would still think the node was a leader
until their restart or a new leader election.

If the followers were just voters or also had Raft disabled, they
would treat the old node as a leader indefinitely until a new one
was elected somewhere.

The state wasn't sent so as not to break connections with old
versions, which can't understand Raft messages.

The patch makes the Raft state be sent even if Raft is disabled on
the node, as long as it was ever enabled in the past (the term is
not the initial one, i.e. != 1). This way the old leader keeps
sending resignation messages even in case of a restart or any other
issue. And it doesn't break old versions when Raft is not used and
never was.

The issue was found while working on #5433, but is not strictly
related.
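
For illustration only (not part of the patch), here is a minimal Lua sketch of
the symptom, using the public box.cfg{election_mode = ...} and box.info.election
APIs; the wait loops are omitted and the comments describe the pre-patch
behaviour:

    -- On the node that used to be the leader.
    box.cfg{election_mode = 'candidate'}
    -- ... wait until box.info.election.state == 'leader' ...
    box.cfg{election_mode = 'off'}
    -- Restart the node or let a follower re-subscribe.

    -- On a follower that re-subscribes afterwards: before this patch the node
    -- sent no Raft state at all, so the follower kept reporting the stale
    -- leader here indefinitely.
    box.info.election.leader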
---
 src/box/box.cc                                | 14 +++--
 src/box/raft.h                                | 10 ++++
 .../gh-5426-election-on-off.result            | 59 +++++++++++++++----
 .../gh-5426-election-on-off.test.lua          | 26 +++++---
 4 files changed, 84 insertions(+), 25 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 18568df3b..6d2ca3349 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2154,12 +2154,16 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		 tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
 	say_info("remote vclock %s local vclock %s",
 		 vclock_to_string(&replica_clock), vclock_to_string(&vclock));
-	if (raft_is_enabled()) {
+	if (raft_was_enabled()) {
 		/*
-		 * Send out the current raft state of the instance. Don't do
-		 * that if Raft is disabled. It can be that a part of the
-		 * cluster still contains old versions, which can't handle Raft
-		 * messages. So when it is disabled, its network footprint
+		 * Send Raft state only if it was ever used in the cluster. That
+		 * is because if it was used, then this node could be a leader
+		 * in the past. And there may be nodes, who still consider it a
+		 * leader. So even if it is disabled now, but was probably
+		 * enabled in the past, the state must be sent. Can't send it
+		 * always because it can be a part of a cluster with old
+		 * Tarantool versions, which can't handle Raft messages. When it
+		 * is disabled and was never enabled, its network footprint
 		 * should be 0.
 		 */
 		struct raft_request req;
diff --git a/src/box/raft.h b/src/box/raft.h
index 82d5aa442..74255e627 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -185,6 +185,16 @@ raft_is_enabled(void)
 	return raft.is_enabled;
 }
 
+/**
+ * Check if the node ever heard anything from anyone about Raft. When it is
+ * true, the node, in theory, could be a leader in the past.
+ */
+static inline bool
+raft_was_enabled(void)
+{
+	return raft.volatile_term > 1;
+}
+
 /** Process a raft entry stored in WAL/snapshot. */
 void
 raft_process_recovery(const struct raft_request *req);
diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
index 1abfb9154..08253443c 100644
--- a/test/replication/gh-5426-election-on-off.result
+++ b/test/replication/gh-5426-election-on-off.result
@@ -6,13 +6,6 @@ box.schema.user.grant('guest', 'super')
  | ---
  | ...
 
-old_election_mode = box.cfg.election_mode
- | ---
- | ...
-old_replication_timeout = box.cfg.replication_timeout
- | ---
- | ...
-
 test_run:cmd('create server replica with rpl_master=default,\
               script="replication/replica.lua"')
  | ---
@@ -111,23 +104,63 @@ test_run:wait_cond(function() return box.info.election.leader == 0 end)
  | - true
  | ...
 
+-- When turned off, the former leader could stop sending its state to new
+-- subscribers. That could make them think the leader is still alive.
 test_run:switch('default')
  | ---
  | - true
  | ...
-test_run:cmd('stop server replica')
+box.cfg{election_mode = 'candidate'}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
  | ---
  | - true
  | ...
-test_run:cmd('delete server replica')
+
+test_run:switch('replica')
  | ---
  | - true
  | ...
-box.cfg{                                                                        \
-    election_mode = old_election_mode,                                          \
-    replication_timeout = old_replication_timeout,                              \
-}
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server default')
+ | 
+-- All settings are reset, the election is 'off'. The replica re-subscribes,
+-- and the former leader should send its state. Even though the election is
+-- disabled on it.
+assert(box.cfg.election_mode == 'off')
  | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
  | ...
 box.schema.user.revoke('guest', 'super')
  | ---
diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
index d6b980d0a..adbfcd97d 100644
--- a/test/replication/gh-5426-election-on-off.test.lua
+++ b/test/replication/gh-5426-election-on-off.test.lua
@@ -1,9 +1,6 @@
 test_run = require('test_run').new()
 box.schema.user.grant('guest', 'super')
 
-old_election_mode = box.cfg.election_mode
-old_replication_timeout = box.cfg.replication_timeout
-
 test_run:cmd('create server replica with rpl_master=default,\
               script="replication/replica.lua"')
 test_run:cmd('start server replica with wait=True, wait_load=True')
@@ -47,11 +44,26 @@ box.cfg{election_mode = 'off'}
 test_run:switch('replica')
 test_run:wait_cond(function() return box.info.election.leader == 0 end)
 
+-- When turned off, the former leader could stop sending its state to new
+-- subscribers. That could make them think the leader is still alive.
+test_run:switch('default')
+box.cfg{election_mode = 'candidate'}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+test_run:cmd('restart server default')
+-- All settings are reset, the election is 'off'. The replica re-subscribes,
+-- and the former leader should send its state. Even though the election is
+-- disabled on it.
+assert(box.cfg.election_mode == 'off')
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
 test_run:switch('default')
 test_run:cmd('stop server replica')
 test_run:cmd('delete server replica')
-box.cfg{                                                                        \
-    election_mode = old_election_mode,                                          \
-    replication_timeout = old_replication_timeout,                              \
-}
 box.schema.user.revoke('guest', 'super')
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart
  2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
@ 2020-10-17 17:17 ` Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 3/3] raft: don't drop GC when restart relay recovery Vladislav Shpilevoy
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-17 17:17 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

When a Raft node is elected as a leader, it should resend all its
data to the followers from the last acked vclock, because while
the node was not a leader, the other instances ignored all the
changes coming from it.

The resending is done via a restart of the recovery cursor in the
relay thread. When the cursor was restarted, it used the last
acked vclock to find the needed xlog file, but it didn't set the
local LSN component, which was 0 (replicas don't send it).

When in reality the component was not zero, the recovery cursor
still tried to find the oldest xlog file containing the first
local row. And it couldn't: the first created local row may have
been gone for a long time already. That ended in an XlogGapError.

The patch makes the restart keep the local LSN component
unchanged, the same as was used by the previous recovery cursor
before the restart.

Closes #5433
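
For context, a minimal Lua sketch of where the local LSN lives (this assumes,
as the new test below does, that rows of a space created with is_local = true
are accounted in vclock component 0 and are never replicated; the space name
is made up):

    s = box.schema.create_space('loc', {is_local = true})
    _ = s:create_index('pk')
    for i = 1, 10 do s:replace({i}) end
    box.info.vclock[0] -- grew by 10: the local LSN component.
    -- A replica never receives these rows, so the vclock it acks back to the
    -- relay has component 0 unset. That is the hole the relay must fill in
    -- before restarting the recovery cursor.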
---
 src/box/relay.cc                              |  20 ++-
 .../gh-5433-election-restart-recovery.result  | 119 ++++++++++++++++++
 ...gh-5433-election-restart-recovery.test.lua |  59 +++++++++
 test/replication/suite.cfg                    |   1 +
 4 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5433-election-restart-recovery.result
 create mode 100644 test/replication/gh-5433-election-restart-recovery.test.lua

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 096f455a1..285f96108 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -864,8 +864,26 @@ relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
 static void
 relay_restart_recovery(struct relay *relay)
 {
+	/*
+	 * Need to keep the 0 component unchanged. The remote vclock has it
+	 * unset, because it is a local LSN, and it is not sent between the
+	 * instances. Nonetheless, the recovery procedures still need the local
+	 * LSN set to find a proper xlog to start from. If it is not set, the
+	 * recovery process will treat it as 0, and will try to find an xlog
+	 * file having the first local row. Which of course may not exist for a
+	 * long time already. And then it will fail with an xlog gap error.
+	 *
+	 * The currently used local LSN is ok to use, even if it is outdated a
+	 * bit. Because the existing recovery object keeps the current xlog file
+	 * not deleted thanks to xlog GC subsystem. So by using the current
+	 * local LSN and the last confirmed remote LSNs we can be sure such an
+	 * xlog file still exists.
+	 */
+	struct vclock restart_vclock;
+	vclock_copy(&restart_vclock, &relay->recv_vclock);
+	vclock_reset(&restart_vclock, 0, vclock_get(&relay->r->vclock, 0));
 	recovery_delete(relay->r);
-	relay->r = recovery_new(wal_dir(), false, &relay->recv_vclock);
+	relay->r = recovery_new(wal_dir(), false, &restart_vclock);
 	recover_remaining_wals(relay->r, &relay->stream, NULL, true);
 }
 
diff --git a/test/replication/gh-5433-election-restart-recovery.result b/test/replication/gh-5433-election-restart-recovery.result
new file mode 100644
index 000000000..a8d7893bd
--- /dev/null
+++ b/test/replication/gh-5433-election-restart-recovery.result
@@ -0,0 +1,119 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5433: leader, when elected, should resend rows to replicas starting from
+-- the last acked vclock. Because they ignored all what was sent afterwards as
+-- it didn't come from a leader.
+--
+-- There was a bug, that the local LSN wasn't passed to the restarted recovery
+-- iterator, and it tried to find not existing files. This caused a reconnect
+-- and a scary error message.
+--
+
+-- Increase local LSN so as without it being accounted during recovery cursor
+-- restart it would fail, as it won't find the xlogs with first local LSN. Do
+-- snapshots to mark the old files for drop.
+s = box.schema.create_space('test', {is_local = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+for i = 1, 10 do s:replace({i}) end
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+for i = 11, 20 do s:replace({i}) end
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+-- The log may contain old irrelevant records. Add some garbage and in the next
+-- lines check for not more than the added garbage. So the old text won't be
+-- touched.
+require('log').info(string.rep('a', 1000))
+ | ---
+ | ...
+
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- When leader was elected, it should have restarted the recovery iterator in
+-- the relay. And should have accounted the local LSN. Otherwise will see the
+-- XlogGapError.
+assert(not test_run:grep_log('default', 'XlogGapError', 1000))
+ | ---
+ | - true
+ | ...
+s:drop()
+ | ---
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-5433-election-restart-recovery.test.lua b/test/replication/gh-5433-election-restart-recovery.test.lua
new file mode 100644
index 000000000..0339a5504
--- /dev/null
+++ b/test/replication/gh-5433-election-restart-recovery.test.lua
@@ -0,0 +1,59 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_timeout = box.cfg.replication_timeout
+
+test_run:cmd('create server replica with rpl_master=default,\
+              script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+--
+-- gh-5433: leader, when elected, should resend rows to replicas starting from
+-- the last acked vclock. Because they ignored all what was sent afterwards as
+-- it didn't come from a leader.
+--
+-- There was a bug, that the local LSN wasn't passed to the restarted recovery
+-- iterator, and it tried to find not existing files. This caused a reconnect
+-- and a scary error message.
+--
+
+-- Increase local LSN so as without it being accounted during recovery cursor
+-- restart it would fail, as it won't find the xlogs with first local LSN. Do
+-- snapshots to mark the old files for drop.
+s = box.schema.create_space('test', {is_local = true})
+_ = s:create_index('pk')
+for i = 1, 10 do s:replace({i}) end
+box.snapshot()
+for i = 11, 20 do s:replace({i}) end
+box.snapshot()
+
+-- The log may contain old irrelevant records. Add some garbage and in the next
+-- lines check for not more than the added garbage. So the old text won't be
+-- touched.
+require('log').info(string.rep('a', 1000))
+
+-- Small timeout to speed up the election.
+box.cfg{                                                                        \
+    replication_timeout = 0.1,                                                  \
+    election_mode = 'candidate',                                                \
+}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+-- When leader was elected, it should have restarted the recovery iterator in
+-- the relay. And should have accounted the local LSN. Otherwise will see the
+-- XlogGapError.
+assert(not test_run:grep_log('default', 'XlogGapError', 1000))
+s:drop()
+
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{                                                                        \
+    election_mode = old_election_mode,                                          \
+    replication_timeout = old_replication_timeout,                              \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 766f276a2..8fd62fdb8 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -15,6 +15,7 @@
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
+    "gh-5433-election-restart-recovery.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
     "status.test.lua": {},
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Tarantool-patches] [PATCH 3/3] raft: don't drop GC when restart relay recovery
  2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart Vladislav Shpilevoy
@ 2020-10-17 17:17 ` Vladislav Shpilevoy
  2020-10-19  9:36 ` [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Serge Petrenko
  2020-10-22  8:55 ` Alexander V. Tikhonov
  4 siblings, 0 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-17 17:17 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

When a node becomes a leader, it restarts relay recovery cursors
to re-send all the data since the last acked row.

But during the recovery restart the relay lost the on_close_log
trigger, which is used to update the GC state in the TX thread.

The patch preserves the trigger.

Follow up for #5433
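
For illustration, a hedged Lua sketch of the check the new test below performs
on the master (indexing consumers[1] assumes exactly one replica is connected):

    consumers = box.info.gc().consumers
    assert(#consumers == 1)
    -- With the trigger preserved, the consumer's vclock keeps advancing as the
    -- relay scans xlog files. Without it, old xlogs would never be released.
    consumers[1].vclock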
---
 src/box/relay.cc                              |  4 +-
 .../gh-5433-election-restart-recovery.result  | 55 +++++++++++++++++++
 ...gh-5433-election-restart-recovery.test.lua | 28 ++++++++++
 3 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 285f96108..b68b45e00 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -882,8 +882,10 @@ relay_restart_recovery(struct relay *relay)
 	struct vclock restart_vclock;
 	vclock_copy(&restart_vclock, &relay->recv_vclock);
 	vclock_reset(&restart_vclock, 0, vclock_get(&relay->r->vclock, 0));
+	struct recovery *r = recovery_new(wal_dir(), false, &restart_vclock);
+	rlist_swap(&relay->r->on_close_log, &r->on_close_log);
 	recovery_delete(relay->r);
-	relay->r = recovery_new(wal_dir(), false, &restart_vclock);
+	relay->r = r;
 	recover_remaining_wals(relay->r, &relay->stream, NULL, true);
 }
 
diff --git a/test/replication/gh-5433-election-restart-recovery.result b/test/replication/gh-5433-election-restart-recovery.result
index a8d7893bd..f8f32416e 100644
--- a/test/replication/gh-5433-election-restart-recovery.result
+++ b/test/replication/gh-5433-election-restart-recovery.result
@@ -100,6 +100,61 @@ s:drop()
  | ---
  | ...
 
+-- Ensure the restarted recovery correctly propagates GC state. For that create
+-- some noise xlog files, snapshots, and check if the relay reports to GC that
+-- it does not use them anymore after scanning.
+fiber = require('fiber')
+ | ---
+ | ...
+s = box.schema.create_space('test')
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+s:replace{1}
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+s:replace{2}
+ | ---
+ | - [2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+lsn = test_run:get_lsn('replica', box.info.id)
+ | ---
+ | ...
+-- Eventually GC should get the last relayed LSN as it is reported on each
+-- relayed xlog file.
+test_run:wait_cond(function()                                                   \
+    local consumers = box.info.gc().consumers                                   \
+    assert(#consumers == 1)                                                     \
+    local vclock = consumers[1].vclock                                          \
+    if vclock[box.info.id] >= lsn then                                          \
+        return true                                                             \
+    end                                                                         \
+    s:replace{3}                                                                \
+    box.snapshot()                                                              \
+    test_run:wait_lsn('replica', 'default')                                     \
+    return false                                                                \
+end)
+ | ---
+ | - true
+ | ...
+s:drop()
+ | ---
+ | ...
+
 test_run:cmd('stop server replica')
  | ---
  | - true
diff --git a/test/replication/gh-5433-election-restart-recovery.test.lua b/test/replication/gh-5433-election-restart-recovery.test.lua
index 0339a5504..4aff000bf 100644
--- a/test/replication/gh-5433-election-restart-recovery.test.lua
+++ b/test/replication/gh-5433-election-restart-recovery.test.lua
@@ -50,6 +50,34 @@ test_run:switch('default')
 assert(not test_run:grep_log('default', 'XlogGapError', 1000))
 s:drop()
 
+-- Ensure the restarted recovery correctly propagates GC state. For that create
+-- some noise xlog files, snapshots, and check if the relay reports to GC that
+-- it does not use them anymore after scanning.
+fiber = require('fiber')
+s = box.schema.create_space('test')
+_ = s:create_index('pk')
+s:replace{1}
+box.snapshot()
+s:replace{2}
+box.snapshot()
+test_run:wait_lsn('replica', 'default')
+lsn = test_run:get_lsn('replica', box.info.id)
+-- Eventually GC should get the last relayed LSN as it is reported on each
+-- relayed xlog file.
+test_run:wait_cond(function()                                                   \
+    local consumers = box.info.gc().consumers                                   \
+    assert(#consumers == 1)                                                     \
+    local vclock = consumers[1].vclock                                          \
+    if vclock[box.info.id] >= lsn then                                          \
+        return true                                                             \
+    end                                                                         \
+    s:replace{3}                                                                \
+    box.snapshot()                                                              \
+    test_run:wait_lsn('replica', 'default')                                     \
+    return false                                                                \
+end)
+s:drop()
+
 test_run:cmd('stop server replica')
 test_run:cmd('delete server replica')
 box.cfg{                                                                        \
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
  2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 3/3] raft: don't drop GC when restart relay recovery Vladislav Shpilevoy
@ 2020-10-19  9:36 ` Serge Petrenko
  2020-10-19 20:26   ` Vladislav Shpilevoy
  2020-10-22  8:55 ` Alexander V. Tikhonov
  4 siblings, 1 reply; 12+ messages in thread
From: Serge Petrenko @ 2020-10-19  9:36 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


17.10.2020 20:17, Vladislav Shpilevoy wrote:
> There were 2 issues with the relay restarting the recovery cursor when the node is
> elected as a leader. They are fixed in the last 2 commits. The first was about the
> local LSN not being set, the second about the GC state not being propagated.
>
> The first patch is not directly related to the bugs above. It was just found while
> working on this. In theory, without the first patch we can get flakiness in the
> tests changed in this series, but only if a replication connection breaks without
> a reason.
>
> Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
> machine when I start tens of its instances. All workers, after executing it several
> times, hang. But!!! not in anything related to Raft - they hang in the first
> box.snapshot(), where the election is not even enabled yet. From some debug prints
> I see it hangs somewhere in engine_begin_checkpoint() and consumes ~80% of the CPU.
> But it may just be a consequence of the corrupted memory on Mac, due to libeio
> being broken. I don't know what to do with that now.

Hi! Thanks for the patchset!

Patches 2 and 3 LGTM.

Patch 1 looks ok, but I have one question.
What happens when a user accidentally enables Raft during a cluster upgrade,
when some of the instances support Raft and some don't?
Looks like it'll lead to even more inconvenience.

In my opinion it's fine if the leader just disappears without further notice.
We have an election timeout set up for this anyway.

>
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5433-raft-leader-recovery-restart
> Issue: https://github.com/tarantool/tarantool/issues/5433
>
> Vladislav Shpilevoy (3):
>    raft: send state to new subscribers if Raft worked
>    raft: use local LSN in relay recovery restart
>    raft: don't drop GC when restart relay recovery
>
>   src/box/box.cc                                |  14 +-
>   src/box/raft.h                                |  10 +
>   src/box/relay.cc                              |  22 ++-
>   .../gh-5426-election-on-off.result            |  59 ++++--
>   .../gh-5426-election-on-off.test.lua          |  26 ++-
>   .../gh-5433-election-restart-recovery.result  | 174 ++++++++++++++++++
>   ...gh-5433-election-restart-recovery.test.lua |  87 +++++++++
>   test/replication/suite.cfg                    |   1 +
>   8 files changed, 367 insertions(+), 26 deletions(-)
>   create mode 100644 test/replication/gh-5433-election-restart-recovery.result
>   create mode 100644 test/replication/gh-5433-election-restart-recovery.test.lua
>
-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
  2020-10-19  9:36 ` [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Serge Petrenko
@ 2020-10-19 20:26   ` Vladislav Shpilevoy
  2020-10-20  8:18     ` Serge Petrenko
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-19 20:26 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches

Hi! Thanks for the patch!

> 17.10.2020 20:17, Vladislav Shpilevoy wrote:
>> There were 2 issues with the relay restarting the recovery cursor when the node is
>> elected as a leader. They are fixed in the last 2 commits. The first was about the
>> local LSN not being set, the second about the GC state not being propagated.
>>
>> The first patch is not directly related to the bugs above. It was just found while
>> working on this. In theory, without the first patch we can get flakiness in the
>> tests changed in this series, but only if a replication connection breaks without
>> a reason.
>>
>> Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
>> machine when I start tens of its instances. All workers, after executing it several
>> times, hang. But!!! not in anything related to Raft - they hang in the first
>> box.snapshot(), where the election is not even enabled yet. From some debug prints
>> I see it hangs somewhere in engine_begin_checkpoint() and consumes ~80% of the CPU.
>> But it may just be a consequence of the corrupted memory on Mac, due to libeio
>> being broken. I don't know what to do with that now.
> 
> Hi! Thanks for the patchset!
> 
> Patches 2 and 3 LGTM.
> 
> Patch 1 looks ok, but I have one question.
> What happens when a user accidentally enables Raft during a cluster upgrade,
> when some of the instances support Raft and some don't?
> Looks like it'll lead to even more inconvenience.
> 
> In my opinion it's fine if the leader just disappears without further notice.
> We have an election timeout set up for this anyway.

Election timeout won't work if Raft is disabled or is in 'voter' mode on the other
nodes. Moreover, even if we enable the timeout, they will see the leader alive!
Even if it is not a leader. Because with Raft disabled, the node does not send
Raft state, but keeps sending regular replication heartbeats.

Also then the box.info.election output will freeze. Even after a connection
to the old master is established, the other nodes will show it as a leader
in box.info.election.

I don't see how to fix it except by always sending the Raft state. This looks more
confusing than the problems from accidentally enabling Raft. It may affect monitoring
depending on box.info.election, and routing, if someone makes it depend on
box.info.election.
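
For example, a monitoring probe as trivial as the sketch below (hypothetical;
only box.info.election, box.info.id and the log module are real APIs) would
keep reporting the resigned node as the leader:

    log = require('log')
    e = box.info.election
    if e.leader ~= 0 and e.leader ~= box.info.id then
        -- With an 'off' node not sending its Raft state, this keeps firing
        -- for a leader that resigned long ago.
        log.info('following leader %d in term %d', e.leader, e.term)
    end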

Another option - we could add a new Raft mode: 'mute'. Such a node can't vote and
can't become a leader, but it sends the Raft state and is read-only.

'off' would mean that you totally understand the consequences - the node won't
send the Raft state at all, and the others may still think it is a leader.

This would help protect against accidentally enabling Raft. You would set it
to 'off' and nothing would be sent. I will ask in the chats.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
  2020-10-19 20:26   ` Vladislav Shpilevoy
@ 2020-10-20  8:18     ` Serge Petrenko
  0 siblings, 0 replies; 12+ messages in thread
From: Serge Petrenko @ 2020-10-20  8:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


19.10.2020 23:26, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patch!
No problem =)
>
>> 17.10.2020 20:17, Vladislav Shpilevoy wrote:
>>> There were 2 issues with the relay restarting the recovery cursor when the node is
>>> elected as a leader. They are fixed in the last 2 commits. The first was about the
>>> local LSN not being set, the second about the GC state not being propagated.
>>>
>>> The first patch is not directly related to the bugs above. It was just found while
>>> working on this. In theory, without the first patch we can get flakiness in the
>>> tests changed in this series, but only if a replication connection breaks without
>>> a reason.
>>>
>>> Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
>>> machine when I start tens of its instances. All workers, after executing it several
>>> times, hang. But!!! not in anything related to Raft - they hang in the first
>>> box.snapshot(), where the election is not even enabled yet. From some debug prints
>>> I see it hangs somewhere in engine_begin_checkpoint() and consumes ~80% of the CPU.
>>> But it may just be a consequence of the corrupted memory on Mac, due to libeio
>>> being broken. I don't know what to do with that now.
>> Hi! Thanks for the patchset!
>>
>> Patches 2 and 3 LGTM.
>>
>> Patch 1 looks ok, but I have one question.
>> What happens when a user accidentally enables Raft during a cluster upgrade,
>> when some of the instances support Raft and some don't?
>> Looks like it'll lead to even more inconvenience.
>>
>> In my opinion it's fine if the leader just disappears without further notice.
>> We have an election timeout set up for this anyway.
> Election timeout won't work if Raft is disabled or is in 'voter' mode on the other
> nodes. Moreover, even if we enable the timeout, they will see the leader alive!
> Even if it is not a leader. Because with Raft disabled, the node does not send
> Raft state, but keeps sending regular replication heartbeats.
Oh, I see
>
> Also then the box.info.election output will freeze. Even after a connection
> to the old master is established, the other nodes will show it as a leader
> in box.info.election.
>
> I don't see how to fix it except by always sending the Raft state. This looks more
> confusing than the problems from accidentally enabling Raft. It may affect monitoring
> depending on box.info.election, and routing, if someone makes it depend on
> box.info.election.
>
> Another option - we could add a new Raft mode: 'mute'. Such a node can't vote and
> can't become a leader, but it sends the Raft state and is read-only.
>
> 'off' would mean that you totally understand the consequences - the node won't
> send the Raft state at all, and the others may still think it is a leader.
>
> This would help protect against accidentally enabling Raft. You would set it
> to 'off' and nothing would be sent. I will ask in the chats.

Ok then, let's just send raft state always, if raft was ever used.

The new mode looks more like a crutch.

The whole patchset LGTM then.

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
  2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
@ 2020-10-20 20:43   ` Vladislav Shpilevoy
  2020-10-21 11:41     ` Serge Petrenko
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-20 20:43 UTC (permalink / raw)
  To: tarantool-patches, sergepetrenko

The first patch was reworked to look at the subscriber's version, and to send
the Raft state if it is supported, even when the mode is 'off'.

====================
    raft: send state always if version is >= 2.6.0
    
    Before the patch a node sent its complete Raft state to new
    replication subscribers only when Raft was enabled: voter or
    candidate. When the mode was off, it didn't send anything.
    
    This led to a bug (or maybe just an inconvenience): if the node
    was a leader, was turned off, and couldn't deliver the 'off'
    message to all the followers, some of them would still think the
    node was a leader until their restart or a new leader election.

    If the followers were just voters or also had Raft disabled, they
    would treat the old node as a leader indefinitely until a new one
    was elected somewhere.

    The state wasn't sent so as not to break connections with old
    versions, which can't understand Raft messages.

    The patch makes the Raft state be sent even if Raft is disabled on
    the node, as long as the connected node supports Raft (its version
    is >= 2.6.0). This way the old leader keeps sending resignation
    messages even in case of a restart or any other issue. And it
    doesn't break old versions when Raft is not used and never was.
    
    The issue was found while working on #5433, but is not strictly
    related.

diff --git a/src/box/box.cc b/src/box/box.cc
index 18568df3b..86daaf626 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -80,6 +80,7 @@
 #include "msgpack.h"
 #include "raft.h"
 #include "trivia/util.h"
+#include "version.h"
 
 static char status[64] = "unknown";
 
@@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		 tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
 	say_info("remote vclock %s local vclock %s",
 		 vclock_to_string(&replica_clock), vclock_to_string(&vclock));
-	if (raft_is_enabled()) {
+	if (replica_version_id >= version_id(2, 6, 0)) {
 		/*
-		 * Send out the current raft state of the instance. Don't do
-		 * that if Raft is disabled. It can be that a part of the
-		 * cluster still contains old versions, which can't handle Raft
-		 * messages. So when it is disabled, its network footprint
-		 * should be 0.
+		 * Send Raft state always to all instances supporting it.
+		 * Regardless of its mode, even when disabled. That is because
+		 * if it was ever enabled, then this node could be a leader in
+		 * the past. And there may be nodes, who still consider it a
+		 * leader. So even if it is disabled now, but was probably
+		 * enabled in the past, the state must be sent.
 		 */
 		struct raft_request req;
 		raft_serialize_for_network(&req, &vclock);
diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
index 1abfb9154..08253443c 100644
--- a/test/replication/gh-5426-election-on-off.result
+++ b/test/replication/gh-5426-election-on-off.result
@@ -6,13 +6,6 @@ box.schema.user.grant('guest', 'super')
  | ---
  | ...
 
-old_election_mode = box.cfg.election_mode
- | ---
- | ...
-old_replication_timeout = box.cfg.replication_timeout
- | ---
- | ...
-
 test_run:cmd('create server replica with rpl_master=default,\
               script="replication/replica.lua"')
  | ---
@@ -111,23 +104,63 @@ test_run:wait_cond(function() return box.info.election.leader == 0 end)
  | - true
  | ...
 
+-- When turned off, the former leader could stop sending its state to new
+-- subscribers. That could make them think the leader is still alive.
 test_run:switch('default')
  | ---
  | - true
  | ...
-test_run:cmd('stop server replica')
+box.cfg{election_mode = 'candidate'}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
  | ---
  | - true
  | ...
-test_run:cmd('delete server replica')
+
+test_run:switch('replica')
  | ---
  | - true
  | ...
-box.cfg{                                                                        \
-    election_mode = old_election_mode,                                          \
-    replication_timeout = old_replication_timeout,                              \
-}
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('restart server default')
+ | 
+-- All settings are reset, the election is 'off'. The replica re-subscribes,
+-- and the former leader should send its state. Even though the election is
+-- disabled on it.
+assert(box.cfg.election_mode == 'off')
  | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
  | ...
 box.schema.user.revoke('guest', 'super')
  | ---
diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
index d6b980d0a..adbfcd97d 100644
--- a/test/replication/gh-5426-election-on-off.test.lua
+++ b/test/replication/gh-5426-election-on-off.test.lua
@@ -1,9 +1,6 @@
 test_run = require('test_run').new()
 box.schema.user.grant('guest', 'super')
 
-old_election_mode = box.cfg.election_mode
-old_replication_timeout = box.cfg.replication_timeout
-
 test_run:cmd('create server replica with rpl_master=default,\
               script="replication/replica.lua"')
 test_run:cmd('start server replica with wait=True, wait_load=True')
@@ -47,11 +44,26 @@ box.cfg{election_mode = 'off'}
 test_run:switch('replica')
 test_run:wait_cond(function() return box.info.election.leader == 0 end)
 
+-- When turned off, the former leader could stop sending its state to new
+-- subscribers. That could make them think the leader is still alive.
+test_run:switch('default')
+box.cfg{election_mode = 'candidate'}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+test_run:cmd('restart server default')
+-- All settings are reset, the election is 'off'. The replica re-subscribes,
+-- and the former leader should send its state. Even though the election is
+-- disabled on it.
+assert(box.cfg.election_mode == 'off')
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader == 0 end)
+
 test_run:switch('default')
 test_run:cmd('stop server replica')
 test_run:cmd('delete server replica')
-box.cfg{                                                                        \
-    election_mode = old_election_mode,                                          \
-    replication_timeout = old_replication_timeout,                              \
-}
 box.schema.user.revoke('guest', 'super')

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
  2020-10-20 20:43   ` Vladislav Shpilevoy
@ 2020-10-21 11:41     ` Serge Petrenko
  2020-10-21 21:41       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 12+ messages in thread
From: Serge Petrenko @ 2020-10-21 11:41 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches


20.10.2020 23:43, Vladislav Shpilevoy wrote:
> The first patch was reworked to look at the subscriber's version, and to send
> the Raft state if it is supported, even when the mode is 'off'.


Hi! Thanks for the patch!

Please see my comments below.

>
> ====================
>      raft: send state always if version is >= 2.6.0
>      
>      Before the patch a node sent its complete Raft state to new
>      replication subscribers only when Raft was enabled: voter or
>      candidate. When the mode was off, it didn't send anything.
>      
>      This led to a bug (or maybe just an inconvenience): if the node
>      was a leader, was turned off, and couldn't deliver the 'off'
>      message to all the followers, some of them would still think the
>      node was a leader until their restart or a new leader election.
>      
>      If the followers were just voters or also had Raft disabled, they
>      would treat the old node as a leader indefinitely until a new one
>      was elected somewhere.
>      
>      The state wasn't sent so as not to break connections with old
>      versions, which can't understand Raft messages.
>      
>      The patch makes the Raft state be sent even if Raft is disabled on
>      the node, as long as the connected node supports Raft (its version
>      is >= 2.6.0). This way the old leader keeps sending resignation
>      messages even in case of a restart or any other issue. And it
>      doesn't break old versions when Raft is not used and never was.
>      
>      The issue was found while working on #5433, but is not strictly
>      related.
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 18568df3b..86daaf626 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -80,6 +80,7 @@
>   #include "msgpack.h"
>   #include "raft.h"
>   #include "trivia/util.h"
> +#include "version.h"
>   
>   static char status[64] = "unknown";
>   
> @@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>   		 tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
>   	say_info("remote vclock %s local vclock %s",
>   		 vclock_to_string(&replica_clock), vclock_to_string(&vclock));
> -	if (raft_is_enabled()) {
> +	if (replica_version_id >= version_id(2, 6, 0)) {

Why don't you use `raft_was_used()` here together with the version check?

I've just realized we should also check for the relay version in
relay_push_raft(), shouldn't we?


>   		/*
> -		 * Send out the current raft state of the instance. Don't do
> -		 * that if Raft is disabled. It can be that a part of the
> -		 * cluster still contains old versions, which can't handle Raft
> -		 * messages. So when it is disabled, its network footprint
> -		 * should be 0.
> +		 * Send Raft state always to all instances supporting it.
> +		 * Regardless of its mode, even when disabled. That is because
> +		 * if it was ever enabled, then this node could be a leader in
> +		 * the past. And there may be nodes, who still consider it a
> +		 * leader. So even if it is disabled now, but was probably
> +		 * enabled in the past, the state must be sent.
>   		 */
>   		struct raft_request req;
>   		raft_serialize_for_network(&req, &vclock);
> diff --git a/test/replication/gh-5426-election-on-off.result b/test/replication/gh-5426-election-on-off.result
> index 1abfb9154..08253443c 100644
> --- a/test/replication/gh-5426-election-on-off.result
> +++ b/test/replication/gh-5426-election-on-off.result
> @@ -6,13 +6,6 @@ box.schema.user.grant('guest', 'super')
>    | ---
>    | ...
>   
> -old_election_mode = box.cfg.election_mode
> - | ---
> - | ...
> -old_replication_timeout = box.cfg.replication_timeout
> - | ---
> - | ...
> -
>   test_run:cmd('create server replica with rpl_master=default,\
>                 script="replication/replica.lua"')
>    | ---
> @@ -111,23 +104,63 @@ test_run:wait_cond(function() return box.info.election.leader == 0 end)
>    | - true
>    | ...
>   
> +-- When turned off, the former leader could stop sending its state to new
> +-- subscribers. That could make them think the leader is still alive.
>   test_run:switch('default')
>    | ---
>    | - true
>    | ...
> -test_run:cmd('stop server replica')
> +box.cfg{election_mode = 'candidate'}
> + | ---
> + | ...
> +test_run:wait_cond(function() return box.info.election.state == 'leader' end)
>    | ---
>    | - true
>    | ...
> -test_run:cmd('delete server replica')
> +
> +test_run:switch('replica')
>    | ---
>    | - true
>    | ...
> -box.cfg{                                                                        \
> -    election_mode = old_election_mode,                                          \
> -    replication_timeout = old_replication_timeout,                              \
> -}
> +test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('restart server default')
> + |
> +-- All settings are reset, the election is 'off'. The replica re-subscribes,
> +-- and the former leader should send its state. Even though the election is
> +-- disabled on it.
> +assert(box.cfg.election_mode == 'off')
>    | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.info.election.leader == 0 end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
>    | ...
>   box.schema.user.revoke('guest', 'super')
>    | ---
> diff --git a/test/replication/gh-5426-election-on-off.test.lua b/test/replication/gh-5426-election-on-off.test.lua
> index d6b980d0a..adbfcd97d 100644
> --- a/test/replication/gh-5426-election-on-off.test.lua
> +++ b/test/replication/gh-5426-election-on-off.test.lua
> @@ -1,9 +1,6 @@
>   test_run = require('test_run').new()
>   box.schema.user.grant('guest', 'super')
>   
> -old_election_mode = box.cfg.election_mode
> -old_replication_timeout = box.cfg.replication_timeout
> -
>   test_run:cmd('create server replica with rpl_master=default,\
>                 script="replication/replica.lua"')
>   test_run:cmd('start server replica with wait=True, wait_load=True')
> @@ -47,11 +44,26 @@ box.cfg{election_mode = 'off'}
>   test_run:switch('replica')
>   test_run:wait_cond(function() return box.info.election.leader == 0 end)
>   
> +-- When turned off, the former leader could stop sending its state to new
> +-- subscribers. That could make them think the leader is still alive.
> +test_run:switch('default')
> +box.cfg{election_mode = 'candidate'}
> +test_run:wait_cond(function() return box.info.election.state == 'leader' end)
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
> +
> +test_run:switch('default')
> +test_run:cmd('restart server default')
> +-- All settings are reset, the election is 'off'. The replica re-subscribes,
> +-- and the former leader should send its state. Even though the election is
> +-- disabled on it.
> +assert(box.cfg.election_mode == 'off')
> +
> +test_run:switch('replica')
> +test_run:wait_cond(function() return box.info.election.leader == 0 end)
> +
>   test_run:switch('default')
>   test_run:cmd('stop server replica')
>   test_run:cmd('delete server replica')
> -box.cfg{                                                                        \
> -    election_mode = old_election_mode,                                          \
> -    replication_timeout = old_replication_timeout,                              \
> -}
>   box.schema.user.revoke('guest', 'super')

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
  2020-10-21 11:41     ` Serge Petrenko
@ 2020-10-21 21:41       ` Vladislav Shpilevoy
  2020-10-22  8:53         ` Alexander V. Tikhonov
  0 siblings, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-10-21 21:41 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches

Hi! Thanks for the review!

>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 18568df3b..86daaf626 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>>   @@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>>            tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
>>       say_info("remote vclock %s local vclock %s",
>>            vclock_to_string(&replica_clock), vclock_to_string(&vclock));
>> -    if (raft_is_enabled()) {
>> +    if (replica_version_id >= version_id(2, 6, 0)) {
> 
> Why don't you use `raft_was_used()` here together with the version check?

It stops making sense. Its purpose was not to send anything while there can be
old versions, assuming that in this case Raft was never enabled. After we
start checking the subscriber's version, we can always send the Raft state when
it is supported.

> I've just realized we should also check for the relay version in relay_push_raft(), shouldn't we?

Perhaps, yeah.

The patch is dropped from the patchset, and has forked into a new issue:
https://github.com/tarantool/tarantool/issues/5438

The main reason is that the patch breaks the tests. The problem is that when
Raft is enabled and its state is delivered to a subscriber, it bumps the
subscriber's local LSN. But after that the subscriber may get an XlogGapError
from the master, if some of the master's logs were deleted. And the follower
won't be able to auto-rejoin, because its local LSN is already > 0 - Raft
wrote something to the WAL.

That breaks replication/replica_rejoin.test.lua. Not sure what to do with
that yet. Auto-rejoin in a cluster with election is not so vital, I think, so
we can work on it in the next release. For now we just don't send anything
when Raft is off. It is bearable.
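
A hedged sketch of why auto-rejoin breaks, assuming (as above) that the Raft
WAL write is accounted in the replica's local vclock component 0:

    -- On a freshly bootstrapped replica, before it wrote anything of its own:
    box.info.vclock[0] -- nil, so rebootstrap (auto-rejoin) is still possible.
    -- After the master's Raft state is delivered and persisted to the WAL:
    box.info.vclock[0] -- >= 1, so a later XlogGapError can no longer be
                       -- resolved by an automatic rejoin.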

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked
  2020-10-21 21:41       ` Vladislav Shpilevoy
@ 2020-10-22  8:53         ` Alexander V. Tikhonov
  0 siblings, 0 replies; 12+ messages in thread
From: Alexander V. Tikhonov @ 2020-10-22  8:53 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi Vlad, thanks for dropping the patch and creating the new issue.
Without that patch the test passes again. Check your branch rebased
onto my branch:

https://gitlab.com/tarantool/tarantool/-/pipelines/206144990

On Wed, Oct 21, 2020 at 11:41:52PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
> 
> >> diff --git a/src/box/box.cc b/src/box/box.cc
> >> index 18568df3b..86daaf626 100644
> >> --- a/src/box/box.cc
> >> +++ b/src/box/box.cc
> >>   @@ -2154,13 +2155,14 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
> >>            tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
> >>       say_info("remote vclock %s local vclock %s",
> >>            vclock_to_string(&replica_clock), vclock_to_string(&vclock));
> >> -    if (raft_is_enabled()) {
> >> +    if (replica_version_id >= version_id(2, 6, 0)) {
> > 
> > Why don't you use `raft_was_used()` here together with the version check?
> 
> It stops making sense. Its purpose was not to send anything while there can be
> old versions, assuming that in this case Raft was never enabled. After we
> start checking the subscriber's version, we can always send the Raft state when
> it is supported.
> 
> > I've just realized we should also check for the relay version in relay_push_raft(), shouldn't we?
> 
> Perhaps, yeah.
> 
> The patch is dropped from the patchset, and has forked into a new issue:
> https://github.com/tarantool/tarantool/issues/5438
> 
> The main reason is that the patch breaks the tests. The problem is that when
> Raft is enabled and its state is delivered to a subscriber, it bumps the
> subscriber's local LSN. But after that the subscriber may get an XlogGapError
> from the master, if some of the master's logs were deleted. And the follower
> won't be able to auto-rejoin, because its local LSN is already > 0 - Raft
> wrote something to the WAL.
> 
> That breaks replication/replica_rejoin.test.lua. Not sure what to do with
> that yet. Auto-rejoin in a cluster with election is not so vital, I think, so
> we can work on it in the next release. For now we just don't send anything
> when Raft is off. It is bearable.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart
  2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
                   ` (3 preceding siblings ...)
  2020-10-19  9:36 ` [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Serge Petrenko
@ 2020-10-22  8:55 ` Alexander V. Tikhonov
  4 siblings, 0 replies; 12+ messages in thread
From: Alexander V. Tikhonov @ 2020-10-22  8:55 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi Vlad, thanks for the patchset. After you removed one of the patches
from it, the test began to pass [1]. No new degradations found. Patchset
LGTM.

[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/206144990

On Sat, Oct 17, 2020 at 07:17:54PM +0200, Vladislav Shpilevoy wrote:
> There were 2 issues with the relay restarting the recovery cursor when the node is
> elected as a leader. They are fixed in the last 2 commits. The first was about the
> local LSN not being set, the second about the GC state not being propagated.
> 
> The first patch is not directly related to the bugs above. It was just found while
> working on this. In theory, without the first patch we can get flakiness in the
> tests changed in this series, but only if a replication connection breaks without
> a reason.
> 
> Additionally, the new test - gh-5433-election-restart-recovery - hangs on my
> machine when I start tens of its instances. All workers, after executing it several
> times, hang. But!!! not in anything related to Raft - they hang in the first
> box.snapshot(), where the election is not even enabled yet. From some debug prints
> I see it hangs somewhere in engine_begin_checkpoint() and consumes ~80% of the CPU.
> But it may just be a consequence of the corrupted memory on Mac, due to libeio
> being broken. I don't know what to do with that now.
> 
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5433-raft-leader-recovery-restart
> Issue: https://github.com/tarantool/tarantool/issues/5433
> 
> Vladislav Shpilevoy (3):
>   raft: send state to new subscribers if Raft worked
>   raft: use local LSN in relay recovery restart
>   raft: don't drop GC when restart relay recovery
> 
>  src/box/box.cc                                |  14 +-
>  src/box/raft.h                                |  10 +
>  src/box/relay.cc                              |  22 ++-
>  .../gh-5426-election-on-off.result            |  59 ++++--
>  .../gh-5426-election-on-off.test.lua          |  26 ++-
>  .../gh-5433-election-restart-recovery.result  | 174 ++++++++++++++++++
>  ...gh-5433-election-restart-recovery.test.lua |  87 +++++++++
>  test/replication/suite.cfg                    |   1 +
>  8 files changed, 367 insertions(+), 26 deletions(-)
>  create mode 100644 test/replication/gh-5433-election-restart-recovery.result
>  create mode 100644 test/replication/gh-5433-election-restart-recovery.test.lua
> 
> -- 
> 2.21.1 (Apple Git-122.3)
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-10-22  8:55 UTC | newest]

Thread overview: 12+ messages
2020-10-17 17:17 [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Vladislav Shpilevoy
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 1/3] raft: send state to new subscribers if Raft worked Vladislav Shpilevoy
2020-10-20 20:43   ` Vladislav Shpilevoy
2020-10-21 11:41     ` Serge Petrenko
2020-10-21 21:41       ` Vladislav Shpilevoy
2020-10-22  8:53         ` Alexander V. Tikhonov
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart Vladislav Shpilevoy
2020-10-17 17:17 ` [Tarantool-patches] [PATCH 3/3] raft: don't drop GC when restart relay recovery Vladislav Shpilevoy
2020-10-19  9:36 ` [Tarantool-patches] [PATCH 0/3] Raft on leader election recovery restart Serge Petrenko
2020-10-19 20:26   ` Vladislav Shpilevoy
2020-10-20  8:18     ` Serge Petrenko
2020-10-22  8:55 ` Alexander V. Tikhonov
