From: Serge Petrenko
Date: Wed, 16 Sep 2020 13:59:11 +0300
Subject: Re: [Tarantool-patches] [PATCH v2 2/4] replication: retry in case of XlogGapError
To: Vladislav Shpilevoy, tarantool-patches@dev.tarantool.org, gorcunov@gmail.com

On 16.09.2020 00:23, Vladislav Shpilevoy wrote:
> Hi! Thanks for the review!
>
>> I propose you rework the tests so that they expect upstream.status == 'loading'
>> instead of deleting them altogether.
> I've brought them back for now with a few amendments. However, here is
> why I dropped them earlier:
>
> - Their functionality was fully covered by other tests doing exactly
>   the same thing;
> - Both removed tests were also doing the same thing;
> - Xlog/panic_on_wal_error is in the wrong folder; such tests belong in
>   the replication/ folder. Also, the test is close to useless: it seems
>   to check what happens to recovery when box.cfg{force_recovery = false}
>   is set on the replica, but the setting is actually installed on the
>   master, so it does not test anything.
>
> Anyway, now they are back. Maybe we will drop them later someday.

Thanks for your answer!
Still, I believe we should delete the tests as part of some separate
activity. The patch LGTM.

>
> To make them work with the 'loading' status I needed to move one line
> in xlog/replica.lua and replication/replica.lua so that the admin
> console starts before box.cfg. That way test-run can attach before
> box.cfg is finished; otherwise it hangs because of 'loading'.
>
> See the full new patch below. The diff was too big.
>
> ====================
> replication: retry in case of XlogGapError
>
> Previously, XlogGapError was considered a critical error stopping
> replication. That may not be as good as it seems.
>
> XlogGapError is a perfectly fine error which should not kill the
> replication connection. It should be retried instead.
>
> Here is an example where the gap can be recovered on its own.
> Consider the case: node1 is a leader, it is booted with vclock
> {1: 3}. Node2 connects and fetches the snapshot of node1, it also
> gets vclock {1: 3}. Then node1 writes something and its vclock
> becomes {1: 4}. Now node3 boots from node1, and gets the same
> vclock. Vclocks now look like this:
>
> - node1: {1: 4}, leader, has {1: 3} snap.
> - node2: {1: 3}, booted from node1, has only snap.
> - node3: {1: 4}, booted from node1, has only snap.
>
> If the cluster is a fullmesh, node2 will send subscribe requests
> with vclock {1: 3}. If node3 receives it, it will respond with an
> xlog gap error, because it only has a snap with {1: 4}, nothing
> else. In that case node2 should retry connecting to node3, and in
> the meantime try to get newer changes from node1.
>
> The example is totally valid.
> However, it is unreachable now because the master registers all
> replicas in _cluster before allowing them to join. So they all
> bootstrap from a snapshot containing all their IDs. This is a bug,
> because such auto-registration leads to registration of anonymous
> replicas, if they are present during bootstrap. It also blocks Raft,
> which can't work if there are registered but not yet joined nodes.
>
> Once the registration problem is solved in a following commit,
> XlogGapError will strike quite often during bootstrap. This patch
> won't allow that to happen.
>
> Needed for #5287
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index c1d07ca54..96dd48c0d 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -106,6 +106,7 @@ applier_log_error(struct applier *applier, struct error *e)
>      case ER_SYSTEM:
>      case ER_UNKNOWN_REPLICA:
>      case ER_PASSWORD_MISMATCH:
> +    case ER_XLOG_GAP:
>          say_info("will retry every %.2lf second",
>                   replication_reconnect_interval());
>          break;
> @@ -1333,6 +1334,32 @@ applier_f(va_list ap)
>                  applier_disconnect(applier, APPLIER_STOPPED);
>                  return -1;
>              }
> +        } catch (XlogGapError *e) {
> +            /*
> +             * Xlog gap error can't be a critical error. Because it
> +             * is totally normal during bootstrap. Consider the
> +             * case: node1 is a leader, it is booted with vclock
> +             * {1: 3}. Node2 connects and fetches snapshot of node1,
> +             * it also gets vclock {1: 3}. Then node1 writes
> +             * something and its vclock becomes {1: 4}. Now node3
> +             * boots from node1, and gets the same vclock. Vclocks
> +             * now look like this:
> +             *
> +             * - node1: {1: 4}, leader, has {1: 3} snap.
> +             * - node2: {1: 3}, booted from node1, has only snap.
> +             * - node3: {1: 4}, booted from node1, has only snap.
> +             *
> +             * If the cluster is a fullmesh, node2 will send
> +             * subscribe requests with vclock {1: 3}. If node3
> +             * receives it, it will respond with xlog gap error,
> +             * because it only has a snap with {1: 4}, nothing else.
> +             * In that case node2 should retry connecting to node3,
> +             * and in the meantime try to get newer changes from
> +             * node1.
> +             */
> +            applier_log_error(applier, e);
> +            applier_disconnect(applier, APPLIER_LOADING);
> +            goto reconnect;
>          } catch (FiberIsCancelled *e) {
>              if (!diag_is_empty(&applier->diag)) {
>                  diag_move(&applier->diag, &fiber()->diag);
> diff --git a/test/replication/force_recovery.result b/test/replication/force_recovery.result
> index f50452858..e142e829a 100644
> --- a/test/replication/force_recovery.result
> +++ b/test/replication/force_recovery.result
> @@ -63,7 +63,7 @@ fio.unlink(xlog)
>  box.cfg{force_recovery = true}
>  ---
>  ...
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
>  ---
>  - true
>  ...
> @@ -71,13 +71,13 @@ test_run:cmd("switch test")
>  ---
>  - true
>  ...
> -box.space.test:select()
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>  ---
> -- []
> +- true
>  ...
> -box.info.replication[1].upstream.status == 'stopped' or box.info
> +box.space.test:select()
>  ---
> -- true
> +- []
>  ...
>  test_run:cmd("switch default")
>  ---
> diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
> index 54307814b..bd3b439d2 100644
> --- a/test/replication/force_recovery.test.lua
> +++ b/test/replication/force_recovery.test.lua
> @@ -27,10 +27,10 @@ fio.unlink(xlog)
>  -- Check that even though box.cfg.force_recovery is set,
>  -- replication will still fail due to LSN gap.
>  box.cfg{force_recovery = true}
> -test_run:cmd("start server test")
> +test_run:cmd("start server test with wait=False")
>  test_run:cmd("switch test")
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>  box.space.test:select()
> -box.info.replication[1].upstream.status == 'stopped' or box.info
>  test_run:cmd("switch default")
>  box.cfg{force_recovery = false}
>
> diff --git a/test/replication/replica.lua b/test/replication/replica.lua
> index f3a6dfe58..76949a568 100644
> --- a/test/replication/replica.lua
> +++ b/test/replication/replica.lua
> @@ -9,11 +9,13 @@ else
>      repl_list = os.getenv("MASTER")
>  end
>
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
>  box.cfg({
>      listen = os.getenv("LISTEN"),
>      replication = repl_list,
>      memtx_memory = 107374182,
>      replication_timeout = 0.1,
>  })
> -
> -require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
> index dd04ae297..f6e74eae1 100644
> --- a/test/replication/replica_rejoin.result
> +++ b/test/replication/replica_rejoin.result
> @@ -213,7 +213,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
>  box.cfg{checkpoint_count = checkpoint_count}
>  ---
>  ...
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
>  ---
>  - true
>  ...
> @@ -221,9 +221,17 @@ test_run:cmd("switch replica")
>  ---
>  - true
>  ...
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> +    return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
>  ---
> -- orphan
> +- true
> +...
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
> +---
> +- true
>  ...
>  box.space.test:select()
>  ---
> diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
> index 410ef44d7..0feea152e 100644
> --- a/test/replication/replica_rejoin.test.lua
> +++ b/test/replication/replica_rejoin.test.lua
> @@ -79,9 +79,14 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
>  fio = require('fio')
>  test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
>  box.cfg{checkpoint_count = checkpoint_count}
> -test_run:cmd("start server replica with args='true'")
> +test_run:cmd("start server replica with args='true', wait=False")
>  test_run:cmd("switch replica")
> -box.info.status -- orphan
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> +    return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>  box.space.test:select()
>
>  --
> diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
> index 48003db06..19d3886e4 100644
> --- a/test/replication/show_error_on_disconnect.result
> +++ b/test/replication/show_error_on_disconnect.result
> @@ -72,7 +72,7 @@ box.space.test:select()
>  other_id = box.info.id % 2 + 1
>  ---
>  ...
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
>  ---
>  - true
>  ...
> diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
> index 1b0ea4373..dce926a34 100644
> --- a/test/replication/show_error_on_disconnect.test.lua
> +++ b/test/replication/show_error_on_disconnect.test.lua
> @@ -29,7 +29,7 @@ test_run:cmd("switch master_quorum1")
>  box.cfg{replication = repl}
>  box.space.test:select()
>  other_id = box.info.id % 2 + 1
> -test_run:wait_upstream(other_id, {status = 'stopped', message_re = 'Missing'})
> +test_run:wait_upstream(other_id, {status = 'loading', message_re = 'Missing'})
>  test_run:cmd("switch master_quorum2")
>  box.space.test:select()
>  other_id = box.info.id % 2 + 1
> diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
> index 22f14f912..c4494ac87 100644
> --- a/test/xlog/panic_on_wal_error.result
> +++ b/test/xlog/panic_on_wal_error.result
> @@ -121,7 +121,7 @@ box.cfg.force_recovery
>  -- try to start the replica, ha-ha
>  -- (replication should fail, some rows are missing)
>  --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
>  ---
>  - true
>  ...
> @@ -129,19 +129,17 @@ test_run:cmd("switch replica")
>  ---
>  - true
>  ...
> -fiber = require('fiber')
> ----
> -...
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> +    return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
>  ---
> +- true
>  ...
> -box.info.replication[1].upstream.status
> ----
> -- stopped
> -...
> -box.info.replication[1].upstream.message
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>  ---
> -- 'Missing .xlog file between LSN 6 {1: 6} and 8 {1: 8}'
> +- true
>  ...
>  box.space.test:select{}
>  ---
> diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
> index 2e95431c6..eea6aad30 100644
> --- a/test/xlog/panic_on_wal_error.test.lua
> +++ b/test/xlog/panic_on_wal_error.test.lua
> @@ -57,12 +57,14 @@ box.cfg.force_recovery
>  -- try to start the replica, ha-ha
>  -- (replication should fail, some rows are missing)
>  --
> -test_run:cmd("start server replica")
> +test_run:cmd("start server replica with wait=False")
>  test_run:cmd("switch replica")
> -fiber = require('fiber')
> -while box.info.replication[1].upstream.status ~= "stopped" do fiber.sleep(0.001) end
> -box.info.replication[1].upstream.status
> -box.info.replication[1].upstream.message
> +-- Need to wait for box.info.replication[1] defined, otherwise test-run fails to
> +-- wait for the upstream status sometimes.
> +test_run:wait_cond(function() \
> +    return box.info ~= nil and box.info.replication[1] ~= nil \
> +end)
> +test_run:wait_upstream(1, {message_re = 'Missing %.xlog file', status = 'loading'})
>  box.space.test:select{}
>  --
>  --
> diff --git a/test/xlog/replica.lua b/test/xlog/replica.lua
> index b76c2f776..3268023dc 100644
> --- a/test/xlog/replica.lua
> +++ b/test/xlog/replica.lua
> @@ -1,5 +1,9 @@
>  #!/usr/bin/env tarantool
>
> +-- Start the console first to allow test-run to attach even before
> +-- box.cfg is finished.
> +require('console').listen(os.getenv('ADMIN'))
> +
>  box.cfg({
>      listen = os.getenv("LISTEN"),
>      replication = os.getenv("MASTER"),
> @@ -7,5 +11,3 @@ box.cfg({
>  -- pid_file = "tarantool.pid",
>  -- logger = "tarantool.log",
>  })
> -
> -require('console').listen(os.getenv('ADMIN'))

-- 
Serge Petrenko