Date: Fri, 30 Mar 2018 14:33:56 +0300
From: Vladimir Davydov
Subject: Re: [tarantool-patches] [PATCH 2/2] [replication] [recovery] recover missing data
Message-ID: <20180330113356.lsmfwwmtprugwywh@esperanza>
In-Reply-To: <734ad912f840868e94e7e34048795afee209b78f.1522339565.git.k.belyavskiy@tarantool.org>
To: Konstantin Belyavskiy
Cc: tarantool-patches@freelists.org

On Thu, Mar 29, 2018 at 07:15:16PM +0300, Konstantin Belyavskiy wrote:
> Part 2 of 2.
> Recover missing local data from a replica.
> In case of a sudden power loss, if data was not written to the WAL
> but was already sent to a remote replica, the local instance can't
> recover properly and the two datasets diverge.
> Fix it by using the remote replica's data and LSN comparison.
> Based on @GeorgyKirichenko's proposal and @locker's race-free check.
>
> Closes #3210
> ---
> branch: gh-3210-recover-missing-local-data-master-master
>  src/box/relay.cc                          |  16 ++++-
>  src/box/wal.cc                            |  15 +++-
>  test/replication/recover_missing.result   | 116 ++++++++++++++++++++++++++++++
>  test/replication/recover_missing.test.lua |  41 +++++++++++
>  test/replication/suite.ini                |   2 +-
>  5 files changed, 185 insertions(+), 5 deletions(-)
>  create mode 100644 test/replication/recover_missing.result
>  create mode 100644 test/replication/recover_missing.test.lua

Nit: please rename the test to recover_missing_xlog.test.lua.

> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 000000000..775d23a0b
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,41 @@
> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
> +-- Start servers.
> +test_run:create_cluster(SERVERS)
> +-- Wait for a full mesh.
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch autobootstrap1")
> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end
> +box.space.test:count()
> +
> +test_run:cmd('switch default')
> +vclock1 = test_run:get_vclock('autobootstrap1')
> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
> +
> +test_run:cmd("switch autobootstrap2")
> +box.space.test:count()
> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.01)
> +test_run:cmd("stop server autobootstrap1")
> +fio = require('fio')
> +-- This test checks the ability to recover missing local data
> +-- from a remote replica. See #3210.
> +-- Delete data on the first master and check that after restart,
> +-- thanks to the difference in vclocks, it is able to recover
> +-- all missing data from the replica.
> +-- Also check that there are no concurrent writes, i.e. the master
> +-- stays in read-only mode until it has received all the data.
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('autobootstrap1/%020d.xlog', 8)))
> +test_run:cmd("start server autobootstrap1")
> +
> +test_run:cmd("switch autobootstrap1")
> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
> +fiber = require('fiber')
> +fiber.sleep(0.1)

I don't think you need this 'sleep' anymore, not after patch 1.

> +box.space.test:select()
> +
> +-- Cleanup.
> +test_run:cmd('switch default')
> +test_run:drop_cluster(SERVERS)
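
One more remark for readers following the thread: the condition the
patch detects can be spelled out in a few lines of Lua. This is
illustrative only, not the actual relay.cc code; local_vclock and
remote_vclock stand for tables in the box.info.vclock format, and
instance_id is the restarted master's own id:

    -- The master has lost rows iff the replica has seen more of the
    -- master's own LSNs than the master itself has on disk.
    function master_lost_rows(local_vclock, remote_vclock, instance_id)
        return (remote_vclock[instance_id] or 0) > (local_vclock[instance_id] or 0)
    end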
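
Another thought: hard-coding the xlog signature (8) in the fio.unlink
call makes the test fragile if the amount of data written before the
restart ever changes. A possible alternative (just an untested sketch)
is to remove whatever xlog happens to be the newest:

    -- fio.glob + lexicographic sort works here because xlog file
    -- names are zero-padded to 20 digits.
    xlogs = fio.glob(fio.pathjoin(fio.abspath("."), 'autobootstrap1/*.xlog'))
    table.sort(xlogs)
    fio.unlink(xlogs[#xlogs])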
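
Also, the comment in the test promises a read-only check, but I don't
see the test actually asserting it. If you want to make the check
explicit, something like this could go right after restarting
autobootstrap1 (an untested sketch; it relies on the
ERRINJ_RELAY_TIMEOUT set earlier to keep recovery slow enough to
observe, so please make sure it isn't racy):

    test_run:cmd("switch autobootstrap1")
    -- Must be true until all missing rows have been fetched back
    -- from the replica.
    box.info.ro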
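
And if it turns out some wait is still needed where the sleep was,
please use a condition wait rather than a fixed timeout, e.g.
(assuming the test-run version in use provides wait_cond):

    -- Poll until both the recovered rows (0..9) and the new ones
    -- (10..19) are visible instead of hoping 0.1 seconds is enough.
    test_run:wait_cond(function() return box.space.test:count() == 20 end, 10)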