[patches] [PATCH] [replication] [recovery] recover missing data

Vladimir Davydov vdavydov.dev at gmail.com
Wed Mar 21 19:34:43 MSK 2018


On Wed, Mar 21, 2018 at 04:47:46PM +0300, Konstantin Belyavskiy wrote:
> Please take a look at the new version.

I don't see the new patch in the mailing list or in this email, so I
have to check out the branch, which is inconvenient for me. Please
always make sure that your patches can be found in the mailing list.

> 
> >Thursday, March 15, 2018, 17:10 +03:00 from Konstantin Osipov <kostja at tarantool.org>:
> >
> >* Konstantin Belyavskiy <k.belyavskiy at tarantool.org> [18/03/15 11:39]:
> >
> >The patch has to go to 1.6, since that's where the customer
> >complained about it.
> >
> >Is it feasible to backport? 
> for 1.9 branch: gh-3210-recover-missing-local-data-master-master
> for 1.6 branch: gh-3210-recover-missing-local-data-master-master-16 *

$ git log --oneline origin/gh-3210-recover-missing-local-data-master-master-16 | head -n3
2aa60f15 remove test
4e2fa3b9 [replication] [recovery] recover missing data
b80868f2 [replication] [recovery] recover missing data

Squash them, please.
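For instance, an interactive rebase (assuming the branch is based on
origin/1.6):

$ git rebase -i origin/1.6

and mark the duplicate commit as "fixup" in the editor.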

Pasting the patch from the branch here for review:

> From 20e1f76c9ec829f48a0899fa71f805a01ef5be42 Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> Date: Tue, 13 Mar 2018 17:51:52 +0300
> Subject: [PATCH] [replication] [recovery] recover missing data
> 
> Recover missing local data from a replica.
> In case of a sudden power loss, if some data had not been written
> to the WAL but had already been sent to a remote replica, the
> local instance can't recover it on its own, so the two instances
> end up with different datasets.
> Fix this by fetching the missing rows back from the remote
> replica, using LSN comparison to tell which rows are missing.
> Based on @GeorgyKirichenko's proposal.
> 
> Closes #3210
> 
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 2bd05ad5..86c18ecb 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -110,6 +110,11 @@ struct relay {
>  	struct vclock recv_vclock;
>  	/** Replication slave version. */
>  	uint32_t version_id;
> +	/**
> +	 * LSN at the moment of subscribe, used to check dataset
> +	 * on the other side and send missing data rows if any.
> +	 */
> +	int64_t local_lsn_before_subscribe;

It isn't clear from the comment what LSN this is, namely that it is
the LSN of the replica as the master sees it.
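For example, something along these lines would be clearer (just a
sketch, reword as you see fit):

	/**
	 * LSN of the last row received from this replica, as the
	 * master sees it at the moment of subscribe. The replica's
	 * own rows up to this LSN are relayed back to it so that
	 * it can recover rows lost in a power outage.
	 */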

>  
>  	/** Relay endpoint */
>  	struct cbus_endpoint endpoint;
> @@ -541,6 +546,7 @@ relay_subscribe(int fd, uint64_t sync, struct replica *replica,
>  	relay.version_id = replica_version_id;
>  	relay.replica = replica;
>  	replica_set_relay(replica, &relay);
> +	relay.local_lsn_before_subscribe = vclock_get(&replicaset.vclock, replica->id);
>  
>  	int rc = cord_costart(&relay.cord, tt_sprintf("relay_%p", &relay),
>  			      relay_subscribe_f, &relay);
> @@ -583,10 +589,15 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
>  	/*
>  	 * We're feeding a WAL, thus responding to SUBSCRIBE request.
>  	 * In that case, only send a row if it is not from the same replica
> -	 * (i.e. don't send replica's own rows back).
> +	 * (i.e. don't send replica's own rows back) or if this row is
> +	 * missing on the other side (i.e. in case of sudden power-loss,
> +	 * data was not written to WAL, so remote replica can't recover
> +	 * it). In this case the packet's LSN is lower than or equal to
> +	 * the local replica's LSN at the moment it issued 'SUBSCRIBE'.
>  	 */

To be honest, I don't understand why we should fix this at all. If WAL
is written without syncing, there's no guarantee that there will be no
data loss in case of power outage.

Anyway, the fix seems to be inherently racy, because the master that
experienced a data loss might start writing new records to WAL right
after local recovery, even before it receives its own lost records from
the replica. What will happen then?

>  	if (relay->replica == NULL ||
> -	    packet->replica_id != relay->replica->id) {
> +	    packet->replica_id != relay->replica->id ||
> +	    packet->lsn <= relay->local_lsn_before_subscribe) {
>  		relay_send(relay, packet);
>  	}
>  }
> diff --git a/src/box/wal.cc b/src/box/wal.cc
> index 4576cfe0..c321b2de 100644
> --- a/src/box/wal.cc
> +++ b/src/box/wal.cc
> @@ -768,8 +768,16 @@ wal_write(struct journal *journal, struct journal_entry *entry)
>  			/*
>  			 * Find last row from local instance id
>  			 * and promote vclock.
> +			 * In master-master configuration, during sudden
> +			 * power-loss if data was not written to WAL but
> +			 * already sent to others they will send back.
> +			 * In this case we should update only local
> +			 * vclock but not the replicaset one. Could be
> +			 * checked by simple lsn comparison.

I'm afraid I don't understand the comment. Is this a counter-measure
against the race condition I wrote about above?

> +			 * See #3210 for details.

Please don't refer to GitHub tickets in the code - comments alone should
be enough.

>  			 */
> -			if ((*last)->replica_id == instance_id) {
> +			if ((*last)->replica_id == instance_id &&
> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {

There's vclock_get() helper for this.
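I.e. something along these lines:

	if ((*last)->replica_id == instance_id &&
	    vclock_get(&replicaset.vclock, instance_id) < (*last)->lsn) {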

>  				vclock_follow(&replicaset.vclock, instance_id,
>  					      (*last)->lsn);
>  				break;

> diff --git a/test/replication/master1.lua b/test/replication/master1.lua
> new file mode 120000
> diff --git a/test/replication/master2.lua b/test/replication/master2.lua
> new file mode 120000
> diff --git a/test/replication/master_master.lua b/test/replication/master_master.lua
> new file mode 100644

Do you really need all these new files? Can't you reuse
autobootstrap.lua?

> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 00000000..2f36fc32
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,30 @@

Please add a comment describing what you're testing here and the GitHub
ticket number, as we usually do.
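For example, something like:

-- gh-3210: recover missing local data from a replica.
-- Check that rows lost from the master's WAL after a sudden
-- power loss are recovered from the replica on restart.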

> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'master1', 'master2' }
> +-- Start servers
> +test_run:create_cluster(SERVERS)
> +-- Wait for full mesh
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch master1")
> +box.space._schema:insert({'1'})
> +box.space._schema:select('1')

Please don't use system spaces for tests.
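Create a dedicated space instead, e.g. (an untested sketch; adjust
the rest of the test accordingly):

box.schema.space.create('test')
box.space.test:create_index('primary')
box.space.test:insert({1})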

> +
> +fiber = require('fiber')
> +fiber.sleep(0.1)

Waiting for a change to be relayed like this is racy. Either use
wait_cluster_vclock() helper or loop on the replica until it sees the
change.
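For instance, on the replica side (a sketch, assuming the dedicated
'test' space suggested above):

test_run:cmd("switch master2")
fiber = require('fiber')
while box.space.test:get(1) == nil do fiber.sleep(0.01) end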

> +
> +test_run:cmd("switch master2")
> +box.space._schema:select('1')
> +test_run:cmd("stop server master1")
> +fio = require('fio')
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master1/%020d.xlog', 8)))
> +test_run:cmd("start server master1")
> +
> +test_run:cmd("switch master1")
> +box.space._schema:select('1')
> +
> +test_run:cmd("switch default")
> +test_run:cmd("stop server master1")
> +test_run:cmd("stop server master2")

Use drop_cluster() for cleanup.
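That is, instead of stopping the servers one by one:

test_run:drop_cluster(SERVERS)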


