[Tarantool-patches] [PATCH v2 2/4] replication: retry in case of XlogGapError

Serge Petrenko sergepetrenko at tarantool.org
Wed Sep 16 13:59:11 MSK 2020


16.09.2020 00:23, Vladislav Shpilevoy пишет:
> Hi! Thanks for the review!
>
>> I propose you rework the tests so that they expect upstream.status == 'loading'
>> instead of deleting them altogether.
> I've returned them for now with a few amendments. However, here is
> the motivation for why I dropped them earlier:
>
> - Their functionality was fully covered by other tests doing exactly the same thing;
> - Both removed tests also duplicated each other;
> - xlog/panic_on_wal_error is in the wrong folder; such tests belong in the
>    replication/ folder. The test is also close to useless: it seems to check
>    what happens to recovery with box.cfg{force_recovery = false} on the
>    replica, but the setting is installed on the master, so it does not test
>    anything meaningful.
>
> Anyway, now they are back. Maybe we will drop them later someday.
Thanks for your answer!
Still, I believe we should delete the tests as part of some separate
activity.

The patch LGTM.
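
By the way, the new behaviour is easy to check by hand from a console
attached to the replica. Below is a minimal sketch, assuming the
box.info.replication layout the tests rely on; wait_gap_retry() is a
made-up helper, test_run:wait_upstream() does the same job in the test
suite:

local fiber = require('fiber')

-- Poll the first upstream until the applier reports that it keeps
-- retrying after an xlog gap: with this patch the status stays
-- 'loading' instead of switching to 'stopped'.
local function wait_gap_retry(timeout)
    local deadline = fiber.clock() + (timeout or 10)
    while fiber.clock() < deadline do
        local replication = box.info.replication[1]
        local upstream = replication and replication.upstream
        if upstream ~= nil and upstream.status == 'loading' and
           upstream.message ~= nil and
           upstream.message:match('Missing %.xlog file') then
            return true
        end
        fiber.sleep(0.01)
    end
    return false
end

wait_gap_retry(10) -- returns true once the retrying 'loading' state is observed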

>
> To make them work with the 'loading' status I needed to move one line in
> xlog/replica.lua and replication/replica.lua that starts the admin console
> before box.cfg, so that test-run can attach before box.cfg is finished.
> Otherwise it hangs because of 'loading'.
>
> See the full new patch below. The diff was too big.
>
> ====================
>      replication: retry in case of XlogGapError
>      
>      Previously XlogGapError was considered a critical error that stops
>      replication. That is not as good as it looks.
>      
>      XlogGapError is a perfectly fine error, which should not kill the
>      replication connection. It should be retried instead.
>      
>      Here is an example where the gap can be recovered on its own.
>      Consider the case: node1 is a leader, it is booted with
>      vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
>      also gets vclock {1: 3}. Then node1 writes something and its
>      vclock becomes {1: 4}. Now node3 boots from node1, and gets the
>      same vclock. Vclocks now look like this:
>      
>        - node1: {1: 4}, leader, has {1: 3} snap.
>        - node2: {1: 3}, booted from node1, has only snap.
>        - node3: {1: 4}, booted from node1, has only snap.
>      
>      If the cluster is a full mesh, node2 will send subscribe requests
>      with vclock {1: 3}. If node3 receives one, it will respond with an
>      xlog gap error, because it only has a snap with {1: 4}, nothing
>      else. In that case node2 should retry connecting to node3, and in
>      the meantime try to get newer changes from node1.
>      
>      The example is totally valid. However, it is unreachable now
>      because the master registers all replicas in _cluster before
>      allowing them to join. So they all bootstrap from a snapshot
>      containing all their IDs. This is a bug, because such
>      auto-registration also registers anonymous replicas, if they are
>      present during bootstrap. It also blocks Raft, which can't work
>      while there are registered, but not yet joined, nodes.
>      
>      Once the registration problem is solved in a following commit,
>      XlogGapError will strike quite often during bootstrap. This patch
>      prevents that from happening.
>      
>      Needed for #5287
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index c1d07ca54..96dd48c0d 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -106,6 +106,7 @@ applier_log_error(struct applier *applier, struct error *e)
>   	case ER_SYSTEM:
>   	case ER_UNKNOWN_REPLICA:
>   	case ER_PASSWORD_MISMATCH:
> +	case ER_XLOG_GAP:
>   		say_info("will retry every %.2lf second",
>   			 replication_reconnect_interval());
>   		break;
> @@ -1333,6 +1334,32 @@ applier_f(va_list ap)
>   				applier_disconnect(applier, APPLIER_STOPPED);
>   				return -1;
>   			}
> +		} catch (XlogGapError *e) {
> +			/*
> +			 * Xlog gap error can't be a critical error. Because it
> +			 * is totally normal during bootstrap. Consider the
> +			 * case: node1 is a leader, it is booted with vclock
> +			 * {1: 3}. Node2 connects and fetches snapshot of node1,
> +			 * it also gets vclock {1: 3}. Then node1 writes
> +			 * something and its vclock becomes {1: 4}. Now node3
> +			 * boots from node1, and gets the same vclock. Vclocks
> +			 * now look like this:
> +			 *
> +			 * - node1: {1: 4}, leader, has {1: 3} snap.
> +			 * - node2: {1: 3}, booted from node1, has only snap.
> +			 * - node3: {1: 4}, booted from node1, has only snap.
> +			 *
> +			 * If the cluster is a fullmesh, node2 will send
> +			 * subscribe requests with vclock {1: 3}. If node3
> +			 * receives it, it will respond with xlog gap error,
> +			 * because it only has a snap with {1: 4}, nothing else.
> +			 * In that case node2 should retry connecting to node3,
> +			 * and in the meantime try to get newer changes from
> +			 * node1.
> +			 */
> +			applier_log_error(applier, e);
> +			applier_disconnect(applier, APPLIER_LOADING);
> +			goto reconnect;
>   		} catch (FiberIsCancelled *e) {
>   			if (!diag_is_empty(&applier->diag)) {
>   				diag_move(&applier->diag, &fiber()->diag);
> diff --git a/test/replication/force_recovery.result b/test/replication/force_recovery.result
> index f50452858..e142e829a 100644
> --- a/test/replication/force_recovery.result
> +++ b/test/replication/force_recovery.result
> @@ -63,7 +63,7 @@ fio.unlink(xlog)
>   box.cfg{force_recovery = true}
>   ---
>   ...
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
>   ---
>   - true
>   ...
> @@ -71,13 +71,13 @@ test_run:cmd("switch test")
>   ---
>   - true
>   ...
> -box.space.test:select()
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>   ---
> -- []
> +- true
>   ...
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +box.space.test:select()
>   ---
> -- true
> +- []
>   ...
>   test_run:cmd("switch default")
>   ---
> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 54307814b..bd3b439d2 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -27,10 +27,10 @@ fio.unlink(xlog)
>   -- Check that even though box.cfg.force_recovery is set,
>   -- replication will still fail due to LSN gap.
>   box.cfg{force_recovery = true}
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
>   test_run:cmd("switch test")
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>   box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
>   test_run:cmd("switch default")
>   box.cfg{force_recovery = false}
>   
> diff --git a/test/replication/replica.lua b/test/replication/replica.lua
> index f3a6dfe58..76949a568 100644
> --- a/test/replication/replica.lua
> +++ b/test/replication/replica.lua
> @@ -9,11 +9,13 @@ else
>       repl_list = os.getenv("MASTER")
>   end
>   
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
>   box.cfg({
>       listen              = os.getenv("LISTEN"),
>       replication         = repl_list,
>       memtx_memory        = 107374182,
>       replication_timeout = 0.1,
>   })
> -
> -require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
> index dd04ae297..f6e74eae1 100644
> --- a/test/replication/replica_rejoin.result
> +++ b/test/replication/replica_rejoin.result
> @@ -213,7 +213,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
>   box.cfg{checkpoint_count = checkpoint_count}
>   ---
>   ...
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
>   ---
>   - true
>   ...
> @@ -221,9 +221,17 @@ test_run:cmd("switch replica")
>   ---
>   - true
>   ...
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function()                                                   \
> +    return box.info ~= nil and box.info.replication[1] ~= nil                   \
> +end)
>   ---
> -- orphan
> +- true
> +...
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> +---
> +- true
>   ...
>   box.space.test:select()
>   ---
> diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
> index 410ef44d7..0feea152e 100644
> --- a/test/replication/replica_rejoin.test.lua
> +++ b/test/replication/replica_rejoin.test.lua
> @@ -79,9 +79,14 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
>   fio = require('fio')
>   test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
>   box.cfg{checkpoint_count = checkpoint_count}
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
>   test_run:cmd("switch replica")
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function()                                                   \
> +    return box.info ~= nil and box.info.replication[1] ~= nil                   \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>   box.space.test:select()
>   
>   --
> diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
> index 48003db06..19d3886e4 100644
> --- a/test/replication/show_error_on_disconnect.result
> +++ b/test/replication/show_error_on_disconnect.result
> @@ -72,7 +72,7 @@ box.space.test:select()
>   other_id = box.info.id % 2 + 1
>   ---
>   ...
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
>   ---
>   - true
>   ...
> diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
> index 1b0ea4373..dce926a34 100644
> --- a/test/replication/show_error_on_disconnect.test.lua
> +++ b/test/replication/show_error_on_disconnect.test.lua
> @@ -29,7 +29,7 @@ test_run:cmd("switch master_quorum1")
>   box.cfg{replication = repl}
>   box.space.test:select()
>   other_id = box.info.id % 2 + 1
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
>   test_run:cmd("switch master_quorum2")
>   box.space.test:select()
>   other_id = box.info.id % 2 + 1
> diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
> index 22f14f912..c4494ac87 100644
> --- a/test/xlog/panic_on_wal_error.result
> +++ b/test/xlog/panic_on_wal_error.result
> @@ -121,7 +121,7 @@ box.cfg.force_recovery
>   -- try to start the replica, ha-ha
>   -- (replication should fail, some rows are missing)
>   --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
>   ---
>   - true
>   ...
> @@ -129,19 +129,17 @@ test_run:cmd("switch replica")
>   ---
>   - true
>   ...
> -fiber = require('fiber')
> ----
> -...
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function()                                                   \
> +    return box.info ~= nil and box.info.replication[1] ~= nil                   \
> +end)
>   ---
> +- true
>   ...
> -box.info.replication[1].upstream.status
> ----
> -- stopped
> -...
> -box.info.replication[1].upstream.message
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>   ---
> -- 'Missing .xlog file between LSN 6 {1: 6} and 8 {1: 8}'
> +- true
>   ...
>   box.space.test:select{}
>   ---
> diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
> index 2e95431c6..eea6aad30 100644
> --- a/test/xlog/panic_on_wal_error.test.lua
> +++ b/test/xlog/panic_on_wal_error.test.lua
> @@ -57,12 +57,14 @@ box.cfg.force_recovery
>   -- try to start the replica, ha-ha
>   -- (replication should fail, some rows are missing)
>   --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
>   test_run:cmd("switch replica")
> -fiber = require('fiber')
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> -box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.message
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function()                                                   \
> +    return box.info ~= nil and box.info.replication[1] ~= nil                   \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>   box.space.test:select{}
>   --
>   --
> diff --git a/test/xlog/replica.lua b/test/xlog/replica.lua
> index b76c2f776..3268023dc 100644
> --- a/test/xlog/replica.lua
> +++ b/test/xlog/replica.lua
> @@ -1,5 +1,9 @@
>   #!/usr/bin/env tarantool
>   
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
>   box.cfg({
>       listen              = os.getenv("LISTEN"),
>       replication         = os.getenv("MASTER"),
> @@ -7,5 +11,3 @@ box.cfg({
>   --    pid_file            = "tarantool.pid",
>   --    logger              = "tarantool.log",
>   })
> -
> -require('console').listen(os.getenv('ADMIN'))
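
P.S. Just to illustrate the gap condition described in the commit
message above, here is a rough sketch; xlog_gap() and its arguments are
made up for illustration, the real check is done by the recovery code
on the node serving the subscribe:

-- A subscriber can only be served if its vclock is not older than the
-- oldest data still available on this node. Node3 from the example has
-- only a snap at {1: 4}, so node2 subscribing with {1: 3} hits a gap.
local function xlog_gap(subscriber_vclock, oldest_available_vclock)
    for replica_id, lsn in pairs(oldest_available_vclock) do
        if (subscriber_vclock[replica_id] or 0) < lsn then
            return true
        end
    end
    return false
end

xlog_gap({[1] = 3}, {[1] = 4}) -- true: node2 should retry, not stop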

-- 
Serge Petrenko


