[patches] [PATCH] [replication] [recovery] recover missing data

Konstantin Osipov kostja at tarantool.org
Mon Mar 26 13:38:08 MSK 2018


* Konstantin Belyavskiy <k.belyavskiy at tarantool.org> [18/03/26 13:20]:

Konstantin, we have moved to tarantool-patches at freelist.org

> Please check the most recent version:
> branch: gh-3210-recover-missing-local-data-master-master
> 
> Only the branch for Tarantool 1.9 is discussed here; please see a separate email for the 1.6 branch.
> 
> >Wednesday, March 21, 2018, 19:34 +03:00 from Vladimir Davydov <vdavydov.dev at gmail.com>:
> >
> >On Wed, Mar 21, 2018 at 04:47:46PM +0300, Konstantin Belyavskiy wrote:
> >> Please take a look at the new version.
> >
> >I don't see the new patch either in the mailing list or in this
> >email, so I have to check out the branch, which is inconvenient for
> >me. Please always make sure that your patches can be found in the
> >mailing list.
> Separated
> >
> >> 
> >> >Thursday, March 15, 2018, 17:10 +03:00 from Konstantin Osipov < kostja at tarantool.org >:
> >> >
> >> >* Konstantin Belyavskiy <  k.belyavskiy at tarantool.org > [18/03/15 11:39]:
> >> >
> >> >The patch has to go to 1.6, since that's where the customer
> >> >complained about it.
> >> >
> >> >Is it feasible to backport? 
> >> for 1.9 branch: gh-3210-recover-missing-local-data-master-master
> >> for 1.6 branch: gh-3210-recover-missing-local-data-master-master-16 *
> >
> >$ git log --oneline origin/gh-3210-recover-missing-local-data-master-master-16 | head -n3
> >2aa60f15 remove test
> >4e2fa3b9 [replication] [recovery] recover missing data
> >b80868f2 [replication] [recovery] recover missing data
> >
> >Squash them, please.
> >
> >Pasting the patch from the branch here for review:
> >
> >> From 20e1f76c9ec829f48a0899fa71f805a01ef5be42 Mon Sep 17 00:00:00 2001
> >> From: Konstantin Belyavskiy < k.belyavskiy at tarantool.org >
> >> Date: Tue, 13 Mar 2018 17:51:52 +0300
> >> Subject: [PATCH] [replication] [recovery] recover missing data
> >> 
> >> Recover missing local data from a replica.
> >> In case of a sudden power loss, if data was not written to the WAL
> >> but was already sent to a remote replica, the local instance can't
> >> recover it on its own, and the datasets diverge.
> >> Fix this by using the remote replica's data and an LSN comparison.
> >> Based on @GeorgyKirichenko's proposal.
> >> 
> >> Closes #3210
> >> 
> >> diff --git a/src/box/relay.cc b/src/box/relay.cc
> >> index 2bd05ad5..86c18ecb 100644
> >> --- a/src/box/relay.cc
> >> +++ b/src/box/relay.cc
> >> @@ -110,6 +110,11 @@ struct relay {
> >>  	struct vclock recv_vclock;
> >>  	/** Replicatoin slave version. */
> >>  	uint32_t version_id;
> >> +	/**
> >> +	 * LSN at the moment of subscribe, used to check dataset
> >> +	 * on the other side and send missing data rows if any.
> >> +	 */
> >> +	int64_t local_lsn_before_subscribe;
> >
> >It isn't clear from the comment what LSN this is, namely that it is
> >the LSN of the replica as the master sees it. 
> Updated
> >
> >
> >> 
> >>  	/** Relay endpoint */
> >>  	struct cbus_endpoint endpoint;
> >> @@ -541,6 +546,7 @@ relay_subscribe(int fd, uint64_t sync, struct replica *replica,
> >>  	relay.version_id = replica_version_id;
> >>  	relay.replica = replica;
> >>  	replica_set_relay(replica, &relay);
> >> +	relay.local_lsn_before_subscribe = vclock_get(&replicaset.vclock, replica->id);
> >> 
> >>  	int rc = cord_costart(&relay.cord, tt_sprintf("relay_%p", &relay),
> >>  			      relay_subscribe_f, &relay);
> >> @@ -583,10 +589,15 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
> >>  	/*
> >>  	 * We're feeding a WAL, thus responding to SUBSCRIBE request.
> >>  	 * In that case, only send a row if it is not from the same replica
> >> -	 * (i.e. don't send replica's own rows back).
> >> +	 * (i.e. don't send replica's own rows back) or if this row is
> >> +	 * missing on the other side (i.e. in case of sudden power-loss,
> >> +	 * data was not written to WAL, so remote replica can't recover
> >> +	 * it). In this case the packet's LSN is less than or equal to
> >> +	 * the local replica's LSN at the moment it issued 'SUBSCRIBE'.
> >>  	 */
> >
> >To be honest, I don't understand why we should fix this at all. If WAL
> >is written without syncing, there's no guarantee that there will be no
> >data loss in case of power outage.
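> >(For reference: with box.cfg{wal_mode = 'write'}, rows are written
> >to the WAL without fsync, so a power loss can drop the tail of the
> >WAL; only wal_mode = 'fsync' syncs every write.)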
> >
> >Anyway, the fix seems to be inherently racy, because the master that
> >experienced a data loss might start writing new records to WAL right
> >after local recovery, even before it receives its own lost records from
> >the replica. What will happen then?
> Fixed, following @locker's proposal based on 'read-only' mode.
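> To illustrate the idea (a rough, hypothetical sketch, not the actual
> patch; it assumes the peer has replica id 2 and is reachable at
> 'replica:3301'): the recovered master stays read-only until
> replication from the peer is re-established, so it cannot generate
> new rows that would race with its own lost rows coming back:
> 
>     -- hypothetical sketch of the 'read-only' recovery mode
>     box.cfg{read_only = true, replication = {'replica:3301'}}
>     local fiber = require('fiber')
>     -- wait until we are following the peer; a real implementation
>     -- would also compare vclocks before leaving read-only mode
>     while box.info.replication[2] == nil or
>           box.info.replication[2].upstream == nil or
>           box.info.replication[2].upstream.status ~= 'follow' do
>         fiber.sleep(0.01)
>     end
>     box.cfg{read_only = false}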
> >
> >>  	if (relay->replica == NULL ||
> >> -	    packet->replica_id != relay->replica->id) {
> >> +	    packet->replica_id != relay->replica->id ||
> >> +	    packet->lsn <= relay->local_lsn_before_subscribe) {
> >>  		relay_send(relay, packet);
> >>  	}
> >>  }
> >> diff --git a/src/box/wal.cc b/src/box/wal.cc
> >> index 4576cfe0..c321b2de 100644
> >> --- a/src/box/wal.cc
> >> +++ b/src/box/wal.cc
> >> @@ -768,8 +768,16 @@ wal_write(struct journal *journal, struct journal_entry *entry)
> >>  			/*
> >>  			 * Find last row from local instance id
> >>  			 * and promote vclock.
> >> +			 * In master-master configuration, during sudden
> >> +			 * power-loss if data was not written to WAL but
> >> +			 * already sent to others they will send back.
> >> +			 * In this case we should update only local
> >> +			 * vclock but not the replicaset one. Could be
> >> +			 * checked by simple lsn comparison.
> >
> >I'm afraid I don't understand the comment. Is this a counter-measure
> >against the race condition I wrote about above? 
> I'm trying to explain here that if we receive the master's own row
> back (because it was lost locally), we should update only the local
> vclock, not the replicaset one, i.e. we should skip vclock_follow()
> here.
> >
> >
> >> +			 * See #3210 for details.
> >
> >Please don't refer to GitHub tickets in the code - comments alone should
> >be enough. 
> Fixed.
> >
> >>  			 */
> >> -			if ((*last)->replica_id == instance_id) {
> >> +			if ((*last)->replica_id == instance_id &&
> >> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {
> >
> >There's vclock_get() helper for this.
> >
> >>  				vclock_follow(&replicaset.vclock, instance_id,
> >>  					      (*last)->lsn);
> >>  				break;
> >
> >> diff --git a/test/replication/master1.lua b/test/replication/master1.lua
> >> new file mode 120000
> >> diff --git a/test/replication/master2.lua b/test/replication/master2.lua
> >> new file mode 120000
> >> diff --git a/test/replication/master_master.lua b/test/replication/master_master.lua
> >> new file mode 100644
> >
> >Do you really need all these new files? Can't you reuse
> >autobootstrap.lua? 
> Fixed; based on 'on_replace.lua', since it's closer than
> autobootstrap.lua (2 instances vs. 3).
> >
> >
> >> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> >> new file mode 100644
> >> index 00000000..2f36fc32
> >> --- /dev/null
> >> +++ b/test/replication/recover_missing.test.lua
> >> @@ -0,0 +1,30 @@
> >
> >Please add a comment describing what you're testing here and the GitHub
> >ticket number, as we usually do.
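> >For example (the wording here is just a suggestion):
> >
> >  --
> >  -- gh-3210: recover missing local data from replica.
> >  -- Check that rows which were relayed to the other master but were
> >  -- lost from the local WAL due to a power loss are recovered from
> >  -- the remote replica on restart.
> >  --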
> >
> >> +env = require('test_run')
> >> +test_run = env.new()
> >> +
> >> +SERVERS = { 'master1', 'master2' }
> >> +-- Start servers
> >> +test_run:create_cluster(SERVERS)
> >> +-- Wait for full mesh
> >> +test_run:wait_fullmesh(SERVERS)
> >> +
> >> +test_run:cmd("switch master1")
> >> +box.space._schema:insert({'1'})
> >> +box.space._schema:select('1')
> >
> >Please don't use system spaces for tests. 
> Fixed.
> >
> >
> >> +
> >> +fiber = require('fiber')
> >> +fiber.sleep(0.1)
> >
> >Waiting for a change to be relayed like this is racy. Either use
> >wait_cluster_vclock() helper or loop on the replica until it sees the
> >change. 
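> >For example (an untested sketch; it assumes the test writes to a
> >space named 'test' instead of _schema):
> >
> >  test_run:cmd("switch master2")
> >  fiber = require('fiber')
> >  while box.space.test:get(1) == nil do fiber.sleep(0.01) end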
> Fixed.
> >
> >
> >> +
> >> +test_run:cmd("switch master2")
> >> +box.space._schema:select('1')
> >> +test_run:cmd("stop server master1")
> >> +fio = require('fio')
> >> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master1/%020d.xlog', 8)))
> >> +test_run:cmd("start server master1")
> >> +
> >> +test_run:cmd("switch master1")
> >> +box.space._schema:select('1')
> >> +
> >> +test_run:cmd("switch default")
> >> +test_run:cmd("stop server master1")
> >> +test_run:cmd("stop server master2")
> >
> >Use drop_cluster() for cleanup.
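> >That is, replace the two 'stop server' commands above with:
> >
> >  test_run:drop_cluster(SERVERS)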
> Fixed.
> 
> Best regards,
> Konstantin Belyavskiy
> k.belyavskiy at tarantool.org

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.org - www.twitter.com/kostja_osipov


