[Tarantool-patches] [PATCH v2 2/4] replication: retry in case of XlogGapError
Serge Petrenko
sergepetrenko at tarantool.org
Wed Sep 16 13:59:11 MSK 2020
16.09.2020 00:23, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>> I propose you rework the tests so that they expect upstream.status == 'loading'
>> instead of deleting them altogether.
> I return them back for now with a few amendments. However here is
> the motivation why I dropped them earlier:
>
> - Their functionality was fully covered by other tests doing exactly the same;
> - Both removed tests also duplicated each other;
> - xlog/panic_on_wal_error is in the wrong folder; such tests belong in the
> replication/ folder. Also, the test is close to useless: it seems to check
> what happens to recovery when box.cfg{force_recovery = false} is set on the
> replica, but the setting is actually installed on the master. So it does not
> test anything.
>
> Anyway, now they are back. Maybe we will drop them later someday.
Thanks for your answer!
Still, I believe we should delete the tests as part of a separate
activity.
The patch LGTM.
>
> To make them work with the 'loading' status I needed to move the one line in
> xlog/replica.lua and replication/replica.lua that starts the admin console, so
> that it runs before box.cfg and test-run can attach before box.cfg is
> finished. Otherwise it hangs because of 'loading'.
>
> See the full new patch below. The diff was too big.
>
> ====================
> replication: retry in case of XlogGapError
>
> Previously XlogGapError was considered a critical error stopping
> the replication. That may not be as good as it looks.
>
> XlogGapError is a perfectly fine error, which should not kill the
> replication connection. It should be retried instead.
>
> Here is an example where the gap can be recovered on its
> own. Consider the case: node1 is a leader, it is booted with
> vclock {1: 3}. Node2 connects and fetches snapshot of node1, it
> also gets vclock {1: 3}. Then node1 writes something and its
> vclock becomes {1: 4}. Now node3 boots from node1, and gets the
> same vclock. Vclocks now look like this:
>
> - node1: {1: 4}, leader, has {1: 3} snap.
> - node2: {1: 3}, booted from node1, has only snap.
> - node3: {1: 4}, booted from node1, has only snap.
>
> If the cluster is a fullmesh, node2 will send subscribe requests
> with vclock {1: 3}. If node3 receives it, it will respond with
> xlog gap error, because it only has a snap with {1: 4}, nothing
> else. In that case node2 should retry connecting to node3, and in
> the meantime try to get newer changes from node1.
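
To illustrate, here is a minimal sketch of node2's instance file in
such a fullmesh; the URIs and the file name are hypothetical, only the
pattern matters:

    -- node2.lua: hypothetical instance file for node2.
    -- Fullmesh: node2 replicates from both node1 and node3.
    box.cfg({
        listen = 'localhost:3302',
        replication = {'localhost:3301', 'localhost:3303'},
        -- Also serves as the retry interval used when a recoverable
        -- error (such as XlogGapError after this patch) keeps the
        -- upstream in 'loading'.
        replication_timeout = 0.1,
    })
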
>
> The example is totally valid. However, it is unreachable now
> because the master registers all replicas in _cluster before
> allowing them to join. So they all bootstrap from a snapshot
> containing all their IDs. This is a bug, because such
> auto-registration leads to registration of anonymous replicas if
> they are present during bootstrap. It also blocks Raft, which
> can't work if there are registered but not yet joined nodes.
>
> Once the registration problem is solved in the next commit,
> XlogGapError will strike quite often during bootstrap. This patch
> won't allow that to happen.
>
> Needed for #5287
>
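
In practice, the observable difference on a replica that hits the gap
is roughly this (the expected statuses mirror the test changes in the
diff below):

    -- Before the patch, the gap stopped replication for good:
    box.info.replication[1].upstream.status  -- 'stopped'
    -- After the patch, the applier keeps retrying instead:
    box.info.replication[1].upstream.status  -- 'loading'
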
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index c1d07ca54..96dd48c0d 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -106,6 +106,7 @@ applier_log_error(struct applier *applier, struct error *e)
> case ER_SYSTEM:
> case ER_UNKNOWN_REPLICA:
> case ER_PASSWORD_MISMATCH:
> + case ER_XLOG_GAP:
> say_info("will retry every %.2lf second",
> replication_reconnect_interval());
> break;
> @@ -1333,6 +1334,32 @@ applier_f(va_list ap)
> applier_disconnect(applier, APPLIER_STOPPED);
> return -1;
> }
> + } catch (XlogGapError *e) {
> + /*
> + * Xlog gap error can't be a critical error. Because it
> + * is totally normal during bootstrap. Consider the
> + * case: node1 is a leader, it is booted with vclock
> + * {1: 3}. Node2 connects and fetches snapshot of node1,
> + * it also gets vclock {1: 3}. Then node1 writes
> + * something and its vclock becomes {1: 4}. Now node3
> + * boots from node1, and gets the same vclock. Vclocks
> + * now look like this:
> + *
> + * - node1: {1: 4}, leader, has {1: 3} snap.
> + * - node2: {1: 3}, booted from node1, has only snap.
> + * - node3: {1: 4}, booted from node1, has only snap.
> + *
> + * If the cluster is a fullmesh, node2 will send
> + * subscribe requests with vclock {1: 3}. If node3
> + * receives it, it will respond with xlog gap error,
> + * because it only has a snap with {1: 4}, nothing else.
> + * In that case node2 should retry connecting to node3,
> + * and in the meantime try to get newer changes from
> + * node1.
> + */
> + applier_log_error(applier, e);
> + applier_disconnect(applier, APPLIER_LOADING);
> + goto reconnect;
> } catch (FiberIsCancelled *e) {
> if (!diag_is_empty(&applier->diag)) {
> diag_move(&applier->diag, &fiber()->diag);
> diff --git a/test/replication/force_recovery.result b/test/replication/force_recovery.result
> index f50452858..e142e829a 100644
> --- a/test/replication/force_recovery.result
> +++ b/test/replication/force_recovery.result
> @@ -63,7 +63,7 @@ fio.unlink(xlog)
> box.cfg{force_recovery = true}
> ---
> ...
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
> ---
> - true
> ...
> @@ -71,13 +71,13 @@ test_run:cmd("switch test")
> ---
> - true
> ...
> -box.space.test:select()
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> ---
> -- []
> +- true
> ...
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +box.space.test:select()
> ---
> -- true
> +- []
> ...
> test_run:cmd("switch default")
> ---
> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 54307814b..bd3b439d2 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -27,10 +27,10 @@ fio.unlink(xlog)
> -- Check that even though box.cfg.force_recovery is set,
> -- replication will still fail due to LSN gap.
> box.cfg{force_recovery = true}
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
> test_run:cmd("switch test")
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> test_run:cmd("switch default")
> box.cfg{force_recovery = false}
>
> diff --git a/test/replication/replica.lua b/test/replication/replica.lua
> index f3a6dfe58..76949a568 100644
> --- a/test/replication/replica.lua
> +++ b/test/replication/replica.lua
> @@ -9,11 +9,13 @@ else
> repl_list = os.getenv("MASTER")
> end
>
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
> box.cfg({
> listen = os.getenv("LISTEN"),
> replication = repl_list,
> memtx_memory = 107374182,
> replication_timeout = 0.1,
> })
> -
> -require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
> index dd04ae297..f6e74eae1 100644
> --- a/test/replication/replica_rejoin.result
> +++ b/test/replication/replica_rejoin.result
> @@ -213,7 +213,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
> box.cfg{checkpoint_count = checkpoint_count}
> ---
> ...
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
> ---
> - true
> ...
> @@ -221,9 +221,17 @@ test_run:cmd("switch replica")
> ---
> - true
> ...
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> + return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> ---
> -- orphan
> +- true
> +...
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> +---
> +- true
> ...
> box.space.test:select()
> ---
> diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
> index 410ef44d7..0feea152e 100644
> --- a/test/replication/replica_rejoin.test.lua
> +++ b/test/replication/replica_rejoin.test.lua
> @@ -79,9 +79,14 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
> fio = require('fio')
> test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
> box.cfg{checkpoint_count = checkpoint_count}
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
> test_run:cmd("switch replica")
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> + return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> box.space.test:select()
>
> --
> diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
> index 48003db06..19d3886e4 100644
> --- a/test/replication/show_error_on_disconnect.result
> +++ b/test/replication/show_error_on_disconnect.result
> @@ -72,7 +72,7 @@ box.space.test:select()
> other_id = box.info.id % 2 + 1
> ---
> ...
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
> ---
> - true
> ...
> diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
> index 1b0ea4373..dce926a34 100644
> --- a/test/replication/show_error_on_disconnect.test.lua
> +++ b/test/replication/show_error_on_disconnect.test.lua
> @@ -29,7 +29,7 @@ test_run:cmd("switch master_quorum1")
> box.cfg{replication = repl}
> box.space.test:select()
> other_id = box.info.id % 2 + 1
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
> test_run:cmd("switch master_quorum2")
> box.space.test:select()
> other_id = box.info.id % 2 + 1
> diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
> index 22f14f912..c4494ac87 100644
> --- a/test/xlog/panic_on_wal_error.result
> +++ b/test/xlog/panic_on_wal_error.result
> @@ -121,7 +121,7 @@ box.cfg.force_recovery
> -- try to start the replica, ha-ha
> -- (replication should fail, some rows are missing)
> --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
> ---
> - true
> ...
> @@ -129,19 +129,17 @@ test_run:cmd("switch replica")
> ---
> - true
> ...
> -fiber = require('fiber')
> ----
> -...
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> + return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> ---
> +- true
> ...
> -box.info.replication[1].upstream.status
> ----
> -- stopped
> -...
> -box.info.replication[1].upstream.message
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> ---
> -- 'Missing .xlog file between LSN 6 {1: 6} and 8 {1: 8}'
> +- true
> ...
> box.space.test:select{}
> ---
> diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
> index 2e95431c6..eea6aad30 100644
> --- a/test/xlog/panic_on_wal_error.test.lua
> +++ b/test/xlog/panic_on_wal_error.test.lua
> @@ -57,12 +57,14 @@ box.cfg.force_recovery
> -- try to start the replica, ha-ha
> -- (replication should fail, some rows are missing)
> --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
> test_run:cmd("switch replica")
> -fiber = require('fiber')
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> -box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.message
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> + return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> box.space.test:select{}
> --
> --
> diff --git a/test/xlog/replica.lua b/test/xlog/replica.lua
> index b76c2f776..3268023dc 100644
> --- a/test/xlog/replica.lua
> +++ b/test/xlog/replica.lua
> @@ -1,5 +1,9 @@
> #!/usr/bin/env tarantool
>
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
> box.cfg({
> listen = os.getenv("LISTEN"),
> replication = os.getenv("MASTER"),
> @@ -7,5 +11,3 @@ box.cfg({
> -- pid_file = "tarantool.pid",
> -- logger = "tarantool.log",
> })
> -
> -require('console').listen(os.getenv('ADMIN'))
--
Serge Petrenko