Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data

Vladimir Davydov vdavydov.dev at gmail.com
Wed Mar 28 12:25:18 MSK 2018


On Mon, Mar 26, 2018 at 06:24:36PM +0300, Konstantin Belyavskiy wrote:
> Please check most recent version.
> branch: gh-3210-recover-missing-local-data-master-master

Please don't send or submit a patch for 1.6 until we commit it to
the trunk.

I failed to find the patch in the mailing list. Pasting it here for
review.

> From 391448a496fd769ff6724cadab4d333d37a9088e Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> Date: Tue, 13 Mar 2018 17:51:52 +0300
> Subject: [PATCH] [replication] [recovery] recover missing data
> 
> Recover missing local data from a replica.
> In case of a sudden power loss, if data had not yet been written
> to the WAL but had already been sent to a remote replica, the
> local instance can't recover properly and the datasets diverge.
> Fix this by using the remote replica's data and an LSN comparison.
> Based on @GeorgyKirichenko's proposal and @locker's race-free check.

> Switch off replication/catch.test.lua

Why? If the test is broken, please fix it. If you find the test
pointless, delete it with a proper explanation.

> 
> Closes #3210
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 6bfe5a99..5f0b3069 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -453,7 +453,8 @@ applier_subscribe(struct applier *applier)
>  		}
>  
>  		if (applier->state == APPLIER_SYNC &&
> -		    applier->lag <= replication_sync_lag) {
> +		    applier->lag <= replication_sync_lag &&
> +		    vclock_compare(&applier->vclock, &replicaset.vclock) <= 0) {

First, you use the wrong vclock: applier->vclock is the vclock at
connect (the name is rather misleading, true; we should probably
rename it to remote_vclock_at_connect). You should use the vclock
received in the SUBSCRIBE request.

Second, this new condition could use a comment.

Third, this is a worthwhile change in its own right, so I think it
should be submitted as a separate patch.
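
Putting the first two points together, it could look something like
this (an untested sketch; 'remote_vclock_at_subscribe' is only a
placeholder name for wherever the vclock received in SUBSCRIBE ends
up being stored):

  if (applier->state == APPLIER_SYNC &&
      applier->lag <= replication_sync_lag &&
      /*
       * Don't switch to FOLLOW until this instance has caught
       * up with the vclock the remote peer sent at subscribe:
       * rows lost locally but present on the peer may not have
       * been received back yet.
       */
      vclock_compare(&applier->remote_vclock_at_subscribe,
                     &replicaset.vclock) <= 0) {
          /* Applier is synced, switch to "follow". */
          applier_set_state(applier, APPLIER_FOLLOW);
  }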

>  			/* Applier is synced, switch to "follow". */
>  			applier_set_state(applier, APPLIER_FOLLOW);
>  		}
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 2bd05ad5..344a8e01 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -110,6 +110,11 @@ struct relay {
>  	struct vclock recv_vclock;
>  	/** Replicatoin slave version. */
>  	uint32_t version_id;
> +	/**
> +	 * Local master's LSN at the moment of subscribe, used to check
> +	 * dataset on the other side and send missing data rows if any.
> +	 */
> +	int64_t masters_lsn_at_subscribe;

Why did you change the member name? I only asked to update the comment.
'local_lsn_at_subscribe' is not perfect, but still a better name IMO.
Actually, I'm thinking about storing a whole vclock here instead of just
one LSN - that would help avoid confusion:

  /** Local vclock at the time of subscribe. */
  struct vclock local_vclock_at_subscribe;
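
relay_subscribe() would then fill it in before starting the relay,
e.g. (sketch):

  vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);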

>  
>  	/** Relay endpoint */
>  	struct cbus_endpoint endpoint;
> diff --git a/src/box/wal.cc b/src/box/wal.cc
> index 4576cfe0..4a43775d 100644
> --- a/src/box/wal.cc
> +++ b/src/box/wal.cc
> @@ -768,8 +768,15 @@ wal_write(struct journal *journal, struct journal_entry *entry)
>  			/*
>  			 * Find last row from local instance id
>  			 * and promote vclock.
> +			 * In master-master configuration, during sudden
> +			 * power-loss if data was not written to WAL but
> +			 * already sent to others they will send back.
> +			 * In this case we should update only local
> +			 * vclock but not the replicaset one. Could be
> +			 * checked by simple lsn comparison.

I still don't understand this comment.

>  			 */
> -			if ((*last)->replica_id == instance_id) {
> +			if ((*last)->replica_id == instance_id &&
> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {

Use vclock_get() for this.
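
E.g. (sketch):

  if ((*last)->replica_id == instance_id &&
      vclock_get(&replicaset.vclock, instance_id) < (*last)->lsn) {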

Also, we agreed to move this to applier AFAIR.

>  				vclock_follow(&replicaset.vclock, instance_id,
>  					      (*last)->lsn);
>  				break;
> diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
> index 7e49efe1..c5855892 100644
> --- a/test/replication/on_replace.lua
> +++ b/test/replication/on_replace.lua
> @@ -22,13 +22,10 @@ box.cfg({
>      };
>  })
>  
> -env = require('test_run')
> -test_run = env.new()
> -engine = test_run:get_cfg('engine')
> -
>  box.once("bootstrap", function()
> +    local test_run = require('test_run').new()
>      box.schema.user.create(USER, { password = PASSWORD })
>      box.schema.user.grant(USER, 'replication')
> -    box.schema.space.create('test', {engine = engine})
> +    box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
>      box.space.test:create_index('primary')
>  end)
> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 00000000..d5b0e0ad
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,42 @@
> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'on_replace1', 'on_replace2' }

Please use autobootstrap.lua - I want to see how it works with multiple
masters.

> +-- Start servers
> +test_run:create_cluster(SERVERS)
> +-- Wait for full mesh
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch on_replace1")
> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end

Nit: please start counting from 1, as this is common in Lua.

> +box.space.test:count()
> +
> +test_run:cmd('switch default')
> +vclock1 = test_run:get_vclock('on_replace1')
> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
> +
> +test_run:cmd("switch on_replace2")
> +box.space.test:count()
> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.1)

Decrease the timeout to speed up test execution.

> +test_run:cmd("stop server on_replace1")
> +fio = require('fio')
> +-- This test checks ability to recover missing local data
> +-- from remote replica. See #3210.
> +-- Delete data on first master and test that after restart,
> +-- due to difference in vclock it will be able to recover
> +-- all missing data from replica.
> +-- Also check that there is no concurrency, i.e. master is
> +-- in 'read-only' mode unless it receives all data.
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('on_replace1/%020d.xlog', 8)))
> +test_run:cmd("start server on_replace1")
> +
> +test_run:cmd("switch on_replace1")
> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
> +fiber = require('fiber')
> +fiber.sleep(1)
> +box.space.test:count()

Use 'select' here to make sure the received data is correct.

> +
> +-- Cleanup.
> +test_run:cmd('switch default')
> +test_run:drop_cluster(SERVERS)
> +

Nit: extra line at the end of the file.


