[Tarantool-patches] [PATCH v2 2/4] replication: retry in case of XlogGapError

Wed Sep 16 00:23:10 MSK 2020

Hi! Thanks for the review!

> I propose you rework the tests so that they expect upstream.status == 'loading'
> instead of deleting them altogether.

I return them back for now with a few amendments. However here is
the motivation why I dropped them earlier:

- Their functionality was fully covered by some other tests doing exactly the same;
- Both removed tests were also doing the same;
- Xlog/panic_on_wal_error is in a wrong folder, we should have such tests in
  replication/ folder. Also the test is close to useless because it seems it
  tries to see what happens with recovery when box.cfg{force_recovery = false}
  on the replica, but the setting is installed on the master. So it does not mean
  anything.

Anyway, now they are back. Maybe we will drop them later someday.

To make them work with the 'loading' status I needed to move one line in
xlog/replica.lua and replication/replica.lua about starting the admin console
before box.cfg. So as test-run could attach before box.cfg is finished.
Otherwise it hangs because of 'loading'.

See the full new patch below. The diff was too big.

====================
    replication: retry in case of XlogGapError
    
    Previously XlogGapError was considered a critical error stopping
    the replication. That may be not so good as it looks.
    
    XlogGapError is a perfectly fine error, which should not kill the
    replication connection. It should be retried instead.
    
    Because here is an example, when the gap can be recovered on its
    own. Consider the case: node1 is a leader, it is booted with
    vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
    also gets vclock {1: 3}. Then node1 writes something and its
    vclock becomes {1: 4}. Now node3 boots from node1, and gets the
    same vclock. Vclocks now look like this:
    
      - node1: {1: 4}, leader, has {1: 3} snap.
      - node2: {1: 3}, booted from node1, has only snap.
      - node3: {1: 4}, booted from node1, has only snap.
    
    If the cluster is a fullmesh, node2 will send subscribe requests
    with vclock {1: 3}. If node3 receives it, it will respond with
    xlog gap error, because it only has a snap with {1: 4}, nothing
    else. In that case node2 should retry connecting to node3, and in
    the meantime try to get newer changes from node1.
    
    The example is totally valid. However it is unreachable now
    because master registers all replicas in _cluster before allowing
    them to make a join. So they all bootstrap from a snapshot
    containing all their IDs. This is a bug, because such
    auto-registration leads to registration of anonymous replicas, if
    they are present during bootstrap. Also it blocks Raft, which
    can't work if there are registered, but not yet joined nodes.
    
    Once the registration problem will be solved in a next commit, the
    XlogGapError will strike quite often during bootstrap. This patch
    won't allow that happen.
    
    Needed for #5287

diff --git a/src/box/applier.cc b/src/box/applier.cc
index c1d07ca54..96dd48c0d 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -106,6 +106,7 @@ applier_log_error(struct applier *applier, struct error *e)
 	case ER_SYSTEM:
 	case ER_UNKNOWN_REPLICA:
 	case ER_PASSWORD_MISMATCH:
+	case ER_XLOG_GAP:
 		say_info("will retry every %.2lf second",
 			 replication_reconnect_interval());
 		break;
@@ -1333,6 +1334,32 @@ applier_f(va_list ap)
 				applier_disconnect(applier, APPLIER_STOPPED);
 				return -1;
 			}
+		} catch (XlogGapError *e) {
+			/*
+			 * Xlog gap error can't be a critical error. Because it
+			 * is totally normal during bootstrap. Consider the
+			 * case: node1 is a leader, it is booted with vclock
+			 * {1: 3}. Node2 connects and fetches snapshot of node1,
+			 * it also gets vclock {1: 3}. Then node1 writes
+			 * something and its vclock becomes {1: 4}. Now node3
+			 * boots from node1, and gets the same vclock. Vclocks
+			 * now look like this:
+			 *
+			 * - node1: {1: 4}, leader, has {1: 3} snap.
+			 * - node2: {1: 3}, booted from node1, has only snap.
+			 * - node3: {1: 4}, booted from node1, has only snap.
+			 *
+			 * If the cluster is a fullmesh, node2 will send
+			 * subscribe requests with vclock {1: 3}. If node3
+			 * receives it, it will respond with xlog gap error,
+			 * because it only has a snap with {1: 4}, nothing else.
+			 * In that case node2 should retry connecting to node3,
+			 * and in the meantime try to get newer changes from
+			 * node1.
+			 */
+			applier_log_error(applier, e);
+			applier_disconnect(applier, APPLIER_LOADING);
+			goto reconnect;
 		} catch (FiberIsCancelled *e) {
 			if (!diag_is_empty(&applier->diag)) {
 				diag_move(&applier->diag, &fiber()->diag);
diff --git a/test/replication/force_recovery.result b/test/replication/force_recovery.result
index f50452858..e142e829a 100644
--- a/test/replication/force_recovery.result
+++ b/test/replication/force_recovery.result
@@ -63,7 +63,7 @@ fio.unlink(xlog)
 box.cfg{force_recovery = true}
 ---
 ...
-test_run:cmd("start server test")
+test_run:cmd("start server test with wait=False")
 ---
 - true
 ...
@@ -71,13 +71,13 @@ test_run:cmd("switch test")
 ---
 - true
 ...
-box.space.test:select()
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 ---
-- []
+- true
 ...
-box.info.replication[1].upstream.status == 'stopped' or box.info
+box.space.test:select()
 ---
-- true
+- []
 ...
 test_run:cmd("switch default")
 ---
diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
index 54307814b..bd3b439d2 100644
--- a/test/replication/force_recovery.test.lua
+++ b/test/replication/force_recovery.test.lua
@@ -27,10 +27,10 @@ fio.unlink(xlog)
 -- Check that even though box.cfg.force_recovery is set,
 -- replication will still fail due to LSN gap.
 box.cfg{force_recovery = true}
-test_run:cmd("start server test")
+test_run:cmd("start server test with wait=False")
 test_run:cmd("switch test")
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
-box.info.replication[1].upstream.status == 'stopped' or box.info
 test_run:cmd("switch default")
 box.cfg{force_recovery = false}
 
diff --git a/test/replication/replica.lua b/test/replication/replica.lua
index f3a6dfe58..76949a568 100644
--- a/test/replication/replica.lua
+++ b/test/replication/replica.lua
@@ -9,11 +9,13 @@ else
     repl_list = os.getenv("MASTER")
 end
 
+-- Start the console first to allow test-run to attach even before
+-- box.cfg is finished.
+require('console').listen(os.getenv('ADMIN'))
+
 box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = repl_list,
     memtx_memory        = 107374182,
     replication_timeout = 0.1,
 })
-
-require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index dd04ae297..f6e74eae1 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -213,7 +213,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 box.cfg{checkpoint_count = checkpoint_count}
 ---
 ...
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica with args='true', wait=False")
 ---
 - true
 ...
@@ -221,9 +221,17 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-box.info.status -- orphan
+-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
+-- wait for the upstream status sometimes.
+test_run:wait_cond(function()                                                   \
+    return box.info ~= nil and box.info.replication[1] ~= nil                   \
+end)
 ---
-- orphan
+- true
+...
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
+---
+- true
 ...
 box.space.test:select()
 ---
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 410ef44d7..0feea152e 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -79,9 +79,14 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
-test_run:cmd("start server replica with args='true'")
+test_run:cmd("start server replica with args='true', wait=False")
 test_run:cmd("switch replica")
-box.info.status -- orphan
+-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
+-- wait for the upstream status sometimes.
+test_run:wait_cond(function()                                                   \
+    return box.info ~= nil and box.info.replication[1] ~= nil                   \
+end)
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select()
 
 --
diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
index 48003db06..19d3886e4 100644
--- a/test/replication/show_error_on_disconnect.result
+++ b/test/replication/show_error_on_disconnect.result
@@ -72,7 +72,7 @@ box.space.test:select()
 other_id = box.info.id % 2 + 1
 ---
 ...
-test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
+test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
 ---
 - true
 ...
diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
index 1b0ea4373..dce926a34 100644
--- a/test/replication/show_error_on_disconnect.test.lua
+++ b/test/replication/show_error_on_disconnect.test.lua
@@ -29,7 +29,7 @@ test_run:cmd("switch master_quorum1")
 box.cfg{replication = repl}
 box.space.test:select()
 other_id = box.info.id % 2 + 1
-test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
+test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
 test_run:cmd("switch master_quorum2")
 box.space.test:select()
 other_id = box.info.id % 2 + 1
diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
index 22f14f912..c4494ac87 100644
--- a/test/xlog/panic_on_wal_error.result
+++ b/test/xlog/panic_on_wal_error.result
@@ -121,7 +121,7 @@ box.cfg.force_recovery
 -- try to start the replica, ha-ha
 -- (replication should fail, some rows are missing)
 --
-test_run:cmd("start server replica")
+test_run:cmd("start server replica with wait=False")
 ---
 - true
 ...
@@ -129,19 +129,17 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
-fiber = require('fiber')
----
-...
-while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
+-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
+-- wait for the upstream status sometimes.
+test_run:wait_cond(function()                                                   \
+    return box.info ~= nil and box.info.replication[1] ~= nil                   \
+end)
 ---
+- true
 ...
-box.info.replication[1].upstream.status
----
-- stopped
-...
-box.info.replication[1].upstream.message
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 ---
-- 'Missing .xlog file between LSN 6 {1: 6} and 8 {1: 8}'
+- true
 ...
 box.space.test:select{}
 ---
diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
index 2e95431c6..eea6aad30 100644
--- a/test/xlog/panic_on_wal_error.test.lua
+++ b/test/xlog/panic_on_wal_error.test.lua
@@ -57,12 +57,14 @@ box.cfg.force_recovery
 -- try to start the replica, ha-ha
 -- (replication should fail, some rows are missing)
 --
-test_run:cmd("start server replica")
+test_run:cmd("start server replica with wait=False")
 test_run:cmd("switch replica")
-fiber = require('fiber')
-while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
-box.info.replication[1].upstream.status
-box.info.replication[1].upstream.message
+-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
+-- wait for the upstream status sometimes.
+test_run:wait_cond(function()                                                   \
+    return box.info ~= nil and box.info.replication[1] ~= nil                   \
+end)
+test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
 box.space.test:select{}
 --
 --
diff --git a/test/xlog/replica.lua b/test/xlog/replica.lua
index b76c2f776..3268023dc 100644
--- a/test/xlog/replica.lua
+++ b/test/xlog/replica.lua
@@ -1,5 +1,9 @@
 #!/usr/bin/env tarantool
 
+-- Start the console first to allow test-run to attach even before
+-- box.cfg is finished.
+require('console').listen(os.getenv('ADMIN'))
+
 box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
@@ -7,5 +11,3 @@ box.cfg({
 --    pid_file            = "tarantool.pid",
 --    logger              = "tarantool.log",
 })
-
-require('console').listen(os.getenv('ADMIN'))