From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vladislav Shpilevoy
Date: Sat, 17 Oct 2020 19:17:56 +0200
Message-Id: <8d052e2a773dace9788603a4d0f85e89dfa99b5e.1602954690.git.v.shpilevoy@tarantool.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH 2/3] raft: use local LSN in relay recovery restart
List-Id: Tarantool development patches
To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org

When a Raft node is elected as a leader, it should resend all its data
to the followers starting from the last acked vclock, because while the
node was not a leader, the other instances ignored all changes coming
from it.

The resending is done by restarting the recovery cursor in the relay
thread. When the cursor was restarted, it used the last acked vclock to
find the needed xlog file, but it did not set the local LSN component,
which was 0 in the acked vclock (replicas don't send it). When the real
component was not zero, the recovery cursor tried to find the oldest
xlog file containing the first local row, and it couldn't: the xlog
with the first local row may have been deleted long ago.

The patch makes the restart keep the local LSN component unchanged: the
same value the previous recovery cursor used before the restart.

Closes #5433
---
 src/box/relay.cc                              |  20 ++-
 .../gh-5433-election-restart-recovery.result  | 119 ++++++++++++++++++
 ...gh-5433-election-restart-recovery.test.lua |  59 +++++++++
 test/replication/suite.cfg                    |   1 +
 4 files changed, 198 insertions(+), 1 deletion(-)
 create mode 100644 test/replication/gh-5433-election-restart-recovery.result
 create mode 100644 test/replication/gh-5433-election-restart-recovery.test.lua

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 096f455a1..285f96108 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -864,8 +864,26 @@ relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
 
 static void
 relay_restart_recovery(struct relay *relay)
 {
+	/*
+	 * Need to keep the 0 component unchanged. The remote vclock has it
+	 * unset, because it is a local LSN, and it is not sent between the
+	 * instances. Nonetheless, the recovery procedures still need the local
+	 * LSN set to find a proper xlog to start from. If it is not set, the
+	 * recovery process will treat it as 0, and will try to find an xlog
+	 * file having the first local row. Which of course may not exist for a
+	 * long time already. And then it will fail with an xlog gap error.
+	 *
+	 * The currently used local LSN is ok to use, even if it is outdated a
+	 * bit. Because the existing recovery object keeps the current xlog file
+	 * not deleted thanks to xlog GC subsystem. So by using the current
+	 * local LSN and the last confirmed remote LSNs we can be sure such an
+	 * xlog file still exists.
+	 */
+	struct vclock restart_vclock;
+	vclock_copy(&restart_vclock, &relay->recv_vclock);
+	vclock_reset(&restart_vclock, 0, vclock_get(&relay->r->vclock, 0));
 	recovery_delete(relay->r);
-	relay->r = recovery_new(wal_dir(), false, &relay->recv_vclock);
+	relay->r = recovery_new(wal_dir(), false, &restart_vclock);
 	recover_remaining_wals(relay->r, &relay->stream, NULL, true);
 }
 
diff --git a/test/replication/gh-5433-election-restart-recovery.result b/test/replication/gh-5433-election-restart-recovery.result
new file mode 100644
index 000000000..a8d7893bd
--- /dev/null
+++ b/test/replication/gh-5433-election-restart-recovery.result
@@ -0,0 +1,119 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+old_election_mode = box.cfg.election_mode
+ | ---
+ | ...
+old_replication_timeout = box.cfg.replication_timeout
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default,\
+             script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5433: leader, when elected, should resend rows to replicas starting from
+-- the last acked vclock. Because they ignored all what was sent afterwards as
+-- it didn't come from a leader.
+--
+-- There was a bug, that the local LSN wasn't passed to the restarted recovery
+-- iterator, and it tried to find not existing files. This caused a reconnect
+-- and a scary error message.
+--
+
+-- Increase local LSN so as without it being accounted during recovery cursor
+-- restart it would fail, as it won't find the xlogs with first local LSN. Do
+-- snapshots to mark the old files for drop.
+s = box.schema.create_space('test', {is_local = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+for i = 1, 10 do s:replace({i}) end
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+for i = 11, 20 do s:replace({i}) end
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+-- The log may contain old irrelevant records. Add some garbage and in the next
+-- lines check for not more than the added garbage. So the old text won't be
+-- touched.
+require('log').info(string.rep('a', 1000))
+ | ---
+ | ...
+
+-- Small timeout to speed up the election.
+box.cfg{                             \
+    replication_timeout = 0.1,       \
+    election_mode = 'candidate',     \
+}
+ | ---
+ | ...
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- When leader was elected, it should have restarted the recovery iterator in
+-- the relay. And should have accounted the local LSN. Otherwise will see the
+-- XlogGapError.
+assert(not test_run:grep_log('default', 'XlogGapError', 1000))
+ | ---
+ | - true
+ | ...
+s:drop()
+ | ---
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.cfg{                                            \
+    election_mode = old_election_mode,              \
+    replication_timeout = old_replication_timeout,  \
+}
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-5433-election-restart-recovery.test.lua b/test/replication/gh-5433-election-restart-recovery.test.lua
new file mode 100644
index 000000000..0339a5504
--- /dev/null
+++ b/test/replication/gh-5433-election-restart-recovery.test.lua
@@ -0,0 +1,59 @@
+test_run = require('test_run').new()
+box.schema.user.grant('guest', 'super')
+
+old_election_mode = box.cfg.election_mode
+old_replication_timeout = box.cfg.replication_timeout
+
+test_run:cmd('create server replica with rpl_master=default,\
+             script="replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+--
+-- gh-5433: leader, when elected, should resend rows to replicas starting from
+-- the last acked vclock. Because they ignored all what was sent afterwards as
+-- it didn't come from a leader.
+--
+-- There was a bug, that the local LSN wasn't passed to the restarted recovery
+-- iterator, and it tried to find not existing files. This caused a reconnect
+-- and a scary error message.
+--
+
+-- Increase local LSN so as without it being accounted during recovery cursor
+-- restart it would fail, as it won't find the xlogs with first local LSN. Do
+-- snapshots to mark the old files for drop.
+s = box.schema.create_space('test', {is_local = true})
+_ = s:create_index('pk')
+for i = 1, 10 do s:replace({i}) end
+box.snapshot()
+for i = 11, 20 do s:replace({i}) end
+box.snapshot()
+
+-- The log may contain old irrelevant records. Add some garbage and in the next
+-- lines check for not more than the added garbage. So the old text won't be
+-- touched.
+require('log').info(string.rep('a', 1000))
+
+-- Small timeout to speed up the election.
+box.cfg{                             \
+    replication_timeout = 0.1,       \
+    election_mode = 'candidate',     \
+}
+test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+
+test_run:switch('replica')
+test_run:wait_cond(function() return box.info.election.leader ~= 0 end)
+
+test_run:switch('default')
+-- When leader was elected, it should have restarted the recovery iterator in
+-- the relay. And should have accounted the local LSN. Otherwise will see the
+-- XlogGapError.
+assert(not test_run:grep_log('default', 'XlogGapError', 1000))
+s:drop()
+
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.cfg{                                            \
+    election_mode = old_election_mode,              \
+    replication_timeout = old_replication_timeout,  \
+}
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 766f276a2..8fd62fdb8 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -15,6 +15,7 @@
     "gh-4399-misc-no-failure-on-error-reading-wal.test.lua": {},
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
+    "gh-5433-election-restart-recovery.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
     "status.test.lua": {},
-- 
2.21.1 (Apple Git-122.3)
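
For context, the "local LSN" discussed above is component 0 of the vclock: it
grows only from writes to local (non-replicated) spaces and never appears in
replica acks, which is why it has to be taken from the relay's own recovery
vclock rather than from relay->recv_vclock. A minimal sketch of observing it
on a running Tarantool instance (not part of the patch; the space name is
arbitrary and the exact numbers depend on the instance's history):

    -- Writes to a local space bump only component 0 of the vclock.
    s = box.schema.create_space('loc', {is_local = true})
    s:create_index('pk')
    s:replace{1}
    box.info.vclock -- e.g. {0: 1, 1: N}; component 0 is the local LSN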