Date: Fri, 30 Mar 2018 14:33:56 +0300
From: Vladimir Davydov
Subject: Re: [tarantool-patches] [PATCH 2/2] [replication] [recovery] recover missing data
Message-ID: <20180330113356.lsmfwwmtprugwywh@esperanza>
In-Reply-To: <734ad912f840868e94e7e34048795afee209b78f.1522339565.git.k.belyavskiy@tarantool.org>
To: Konstantin Belyavskiy
Cc: tarantool-patches@freelists.org

On Thu, Mar 29, 2018 at 07:15:16PM +0300, Konstantin Belyavskiy wrote:
> Part 2 of 2.
> Recover missing local data from a replica.
> In case of a sudden power loss, if data was not written to the WAL
> but was already sent to a remote replica, the local instance can't
> recover properly and the two datasets diverge.
> Fix it by using the remote replica's data and LSN comparison.
> Based on @GeorgyKirichenko's proposal and @locker's race-free check.
>
> Closes #3210
> ---
> branch: gh-3210-recover-missing-local-data-master-master
>  src/box/relay.cc                          |  16 ++++-
>  src/box/wal.cc                            |  15 +++-
>  test/replication/recover_missing.result   | 116 ++++++++++++++++++++++++++++++
>  test/replication/recover_missing.test.lua |  41 +++++++++++
>  test/replication/suite.ini                |   2 +-
>  5 files changed, 185 insertions(+), 5 deletions(-)
>  create mode 100644 test/replication/recover_missing.result
>  create mode 100644 test/replication/recover_missing.test.lua

Nit: please rename the test to recover_missing_xlog.test.lua.

> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 000000000..775d23a0b
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,41 @@
> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
> +-- Start servers.
> +test_run:create_cluster(SERVERS)
> +-- Wait for a full mesh.
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch autobootstrap1")
> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end
> +box.space.test:count()
> +
> +test_run:cmd('switch default')
> +vclock1 = test_run:get_vclock('autobootstrap1')
> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
> +
> +test_run:cmd("switch autobootstrap2")
> +box.space.test:count()
> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.01)
> +test_run:cmd("stop server autobootstrap1")
> +fio = require('fio')
> +-- This test checks the ability to recover missing local data
> +-- from a remote replica. See #3210.
> +-- Delete data on the first master and check that after restart,
> +-- thanks to the difference in vclocks, it is able to recover
> +-- all missing data from the replica.
> +-- Also check that there are no concurrent writes, i.e. the master
> +-- stays in read-only mode until it has received all the data.
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('autobootstrap1/%020d.xlog', 8)))
> +test_run:cmd("start server autobootstrap1")
> +
> +test_run:cmd("switch autobootstrap1")
> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
> +fiber = require('fiber')
> +fiber.sleep(0.1)

I don't think you need this 'sleep' anymore, not after patch 1.

> +box.space.test:select()
> +
> +-- Cleanup.
> +test_run:cmd('switch default')
> +test_run:drop_cluster(SERVERS)
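
One more remark for readers following the thread: the condition the
patch detects can be spelled out in a few lines of Lua. This is
illustrative only, not the actual relay.cc code; local_vclock and
remote_vclock stand for tables in the box.info.vclock format, and
instance_id is the restarted master's own id:

    -- The master has lost rows iff the replica has seen more of the
    -- master's own LSNs than the master itself has on disk.
    function master_lost_rows(local_vclock, remote_vclock, instance_id)
        return (remote_vclock[instance_id] or 0) > (local_vclock[instance_id] or 0)
    end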
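
Another thought: hard-coding the xlog signature (8) in the fio.unlink
call makes the test fragile if the amount of data written before the
restart ever changes. A possible alternative (just an untested sketch)
is to remove whatever xlog happens to be the newest:

    -- fio.glob + lexicographic sort works here because xlog file
    -- names are zero-padded to 20 digits.
    xlogs = fio.glob(fio.pathjoin(fio.abspath("."), 'autobootstrap1/*.xlog'))
    table.sort(xlogs)
    fio.unlink(xlogs[#xlogs])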
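
Also, the comment in the test promises a read-only check, but I don't
see the test actually asserting it. If you want to make the check
explicit, something like this could go right after restarting
autobootstrap1 (an untested sketch; it relies on the
ERRINJ_RELAY_TIMEOUT set earlier to keep recovery slow enough to
observe, so please make sure it isn't racy):

    test_run:cmd("switch autobootstrap1")
    -- Must be true until all missing rows have been fetched back
    -- from the replica.
    box.info.ro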
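
And if it turns out some wait is still needed where the sleep was,
please use a condition wait rather than a fixed timeout, e.g.
(assuming the test-run version in use provides wait_cond):

    -- Poll until both the recovered rows (0..9) and the new ones
    -- (10..19) are visible instead of hoping 0.1 seconds is enough.
    test_run:wait_cond(function() return box.space.test:count() == 20 end, 10)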