Tarantool development patches archive
* Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data
@ 2018-03-26 15:24 Konstantin Belyavskiy
  2018-03-28  9:25 ` Vladimir Davydov
  0 siblings, 1 reply; 3+ messages in thread
From: Konstantin Belyavskiy @ 2018-03-26 15:24 UTC (permalink / raw)
  To: Vladimir Davydov, georgy; +Cc: tarantool-patches


Vladimir,

Please check most recent version.
branch: gh-3210-recover-missing-local-data-master-master

Please note that only the branch for Tarantool 1.9 is discussed here; please see a separate letter for the 1.6 branch.

>Wednesday, March 21, 2018, 19:34 +03:00 from Vladimir Davydov < vdavydov.dev@gmail.com >:
>
>On Wed, Mar 21, 2018 at 04:47:46PM +0300, Konstantin Belyavskiy wrote:
>> Please take a look at the new version.
>
>I don't see the new patch in the mailing list nor in this email so I
>have to check out the branch, which is inconvenient for me. Please
>always make sure that your patches can be found in the mailing list.
Separated
>
>> 
>> >Thursday, March 15, 2018, 17:10 +03:00 from Konstantin Osipov < kostja@tarantool.org >:
>> >
>> >* Konstantin Belyavskiy <  k.belyavskiy@tarantool.org > [18/03/15 11:39]:
>> >
>> >The patch has to go to 1.6, since that's where the customer
>> >complained about it.
>> >
>> >Is it feasible to backport? 
>> for 1.9 branch: gh-3210-recover-missing-local-data-master-master
>> for 1.6 branch: gh-3210-recover-missing-local-data-master-master-16 *
>
>$ git log --oneline origin/gh-3210-recover-missing-local-data-master-master-16 | head -n3
>2aa60f15 remove test
>4e2fa3b9 [replication] [recovery] recover missing data
>b80868f2 [replication] [recovery] recover missing data
>
>Squash them, please.
>
>Pasting the patch from the branch here for review:
>
>> From 20e1f76c9ec829f48a0899fa71f805a01ef5be42 Mon Sep 17 00:00:00 2001
>> From: Konstantin Belyavskiy < k.belyavskiy@tarantool.org >
>> Date: Tue, 13 Mar 2018 17:51:52 +0300
>> Subject: [PATCH] [replication] [recovery] recover missing data
>> 
>> Recover missing local data from replica.
>> In case of sudden power-loss, if data was not written to WAL but
>> already sent to remote replica, local can't recover properly and
>> we have different datasets.
>> Fix it by using remote replica's data and LSN comparison.
>> Based on @GeorgyKirichenko proposal
>> 
>> Closes #3210
>> 
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 2bd05ad5..86c18ecb 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -110,6 +110,11 @@ struct relay {
>>  	struct vclock recv_vclock;
>>  	/** Replicatoin slave version. */
>>  	uint32_t version_id;
>> +	/**
>> +	 * LSN at the moment of subscribe, used to check dataset
>> +	 * on the other side and send missing data rows if any.
>> +	 */
>> +	int64_t local_lsn_before_subscribe;
>
>It isn't clear from the comment what LSN it is, namely that this is the LSN
>of the replica as the master sees it.
Updated
>
>
>> 
>>  	/** Relay endpoint */
>>  	struct cbus_endpoint endpoint;
>> @@ -541,6 +546,7 @@ relay_subscribe(int fd, uint64_t sync, struct replica *replica,
>>  	relay.version_id = replica_version_id;
>>  	relay.replica = replica;
>>  	replica_set_relay(replica, &relay);
>> +	relay.local_lsn_before_subscribe = vclock_get(&replicaset.vclock, replica->id);
>> 
>>  	int rc = cord_costart(&relay.cord, tt_sprintf("relay_%p", &relay),
>>  			      relay_subscribe_f, &relay);
>> @@ -583,10 +589,15 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
>>  	/*
>>  	 * We're feeding a WAL, thus responding to SUBSCRIBE request.
>>  	 * In that case, only send a row if it is not from the same replica
>> -	 * (i.e. don't send replica's own rows back).
>> +	 * (i.e. don't send replica's own rows back) or if this row is
>> +	 * missing on the other side (i.e. in case of sudden power-loss,
>> +	 * data was not written to WAL, so remote replica can't recover
>> +	 * it). In this case packet's LSN is lower or equal then local
>> +	 * replica's LSN at the moment it has issued 'SUBSCRIBE' request.
>>  	 */
>
>To be honest, I don't understand why we should fix this at all. If WAL
>is written without syncing, there's no guarantee that there will be no
>data loss in case of power outage.
>
>Anyway, the fix seems to be inherently racy, because the master that
>experienced a data loss might start writing new records to WAL right
>after local recovery, even before it receives its own lost records from
>the replica. What will happen then?
Fixed with @locker's proposal, based on 'read-only' mode.
>
>>  	if (relay->replica == NULL ||
>> -	    packet->replica_id != relay->replica->id) {
>> +	    packet->replica_id != relay->replica->id ||
>> +	    packet->lsn <= relay->local_lsn_before_subscribe) {
>>  		relay_send(relay, packet);
>>  	}
>>  }
>> diff --git a/src/box/wal.cc b/src/box/wal.cc
>> index 4576cfe0..c321b2de 100644
>> --- a/src/box/wal.cc
>> +++ b/src/box/wal.cc
>> @@ -768,8 +768,16 @@ wal_write(struct journal *journal, struct journal_entry *entry)
>>  			/*
>>  			 * Find last row from local instance id
>>  			 * and promote vclock.
>> +			 * In master-master configuration, during sudden
>> +			 * power-loss if data was not written to WAL but
>> +			 * already sent to others they will send back.
>> +			 * In this case we should update only local
>> +			 * vclock but not the replicaset one. Could be
>> +			 * checked by simple lsn comparison.
>
>I'm afraid I don't understand the comment. Is this a counter-measure
>against the race condition I wrote about above? 
I'm trying to explain here that if we receive the master's own row back (because it was
lost locally), we should update only the local vclock, not the replicaset one.
So we should skip vclock_follow here.
>
>
>> +			 * See #3210 for details.
>
>Please don't refer to GitHub tickets in the code - comments alone should
>be enough. 
Fixed.
>
>>  			 */
>> -			if ((*last)->replica_id == instance_id) {
>> +			if ((*last)->replica_id == instance_id &&
>> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {
>
>There's vclock_get() helper for this.
>
>>  				vclock_follow(&replicaset.vclock, instance_id,
>>  					      (*last)->lsn);
>>  				break;
>
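For reference, here is a minimal sketch of the guarded promotion rewritten with the vclock_get() helper mentioned above. It assumes the surrounding loop in wal_write() from the quoted hunk and is only an illustration of the idea, not the committed code:

	/*
	 * Promote the replicaset vclock only for a genuinely new local
	 * row. If the row's LSN is not above what the replicaset vclock
	 * already holds for this instance, it is our own lost row coming
	 * back from a replica, so only the local vclock should advance
	 * (done elsewhere), not the replicaset one.
	 */
	if ((*last)->replica_id == instance_id &&
	    vclock_get(&replicaset.vclock, instance_id) < (*last)->lsn) {
		vclock_follow(&replicaset.vclock, instance_id,
			      (*last)->lsn);
		break;
	}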
>> diff --git a/test/replication/master1.lua b/test/replication/master1.lua
>> new file mode 120000
>> diff --git a/test/replication/master2.lua b/test/replication/master2.lua
>> new file mode 120000
>> diff --git a/test/replication/master_master.lua b/test/replication/master_master.lua
>> new file mode 100644
>
>Do you really need all these new files? Can't you reuse
>autobootstrap.lua?
Fixed; based on 'on_replace.lua' since it is closer than autobootstrap.lua (2 instances vs. 3).
>
>
>> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
>> new file mode 100644
>> index 00000000..2f36fc32
>> --- /dev/null
>> +++ b/test/replication/recover_missing.test.lua
>> @@ -0,0 +1,30 @@
>
>Please add a comment describing what you're testing here and the GitHub
>ticket number, as we usually do.
>
>> +env = require('test_run')
>> +test_run = env.new()
>> +
>> +SERVERS = { 'master1', 'master2' }
>> +-- Start servers
>> +test_run:create_cluster(SERVERS)
>> +-- Wait for full mesh
>> +test_run:wait_fullmesh(SERVERS)
>> +
>> +test_run:cmd("switch master1")
>> +box.space._schema:insert({'1'})
>> +box.space._schema:select('1')
>
>Please don't use system spaces for tests. 
Fixed.
>
>
>> +
>> +fiber = require('fiber')
>> +fiber.sleep(0.1)
>
>Waiting for a change to be relayed like this is racy. Either use
>wait_cluster_vclock() helper or loop on the replica until it sees the
>change. 
Fixed.
>
>
>> +
>> +test_run:cmd("switch master2")
>> +box.space._schema:select('1')
>> +test_run:cmd("stop server master1")
>> +fio = require('fio')
>> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master1/%020d.xlog', 8)))
>> +test_run:cmd("start server master1")
>> +
>> +test_run:cmd("switch master1")
>> +box.space._schema:select('1')
>> +
>> +test_run:cmd("switch default")
>> +test_run:cmd("stop server master1")
>> +test_run:cmd("stop server master2")
>
>Use drop_cluster() for cleanup.
Fixed.

Best regards,
Konstantin Belyavskiy
k.belyavskiy@tarantool.org


* Re: Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data
  2018-03-26 15:24 Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data Konstantin Belyavskiy
@ 2018-03-28  9:25 ` Vladimir Davydov
  2018-03-28 17:01   ` Re[2]: " Konstantin Belyavskiy
  0 siblings, 1 reply; 3+ messages in thread
From: Vladimir Davydov @ 2018-03-28  9:25 UTC (permalink / raw)
  To: Konstantin Belyavskiy; +Cc: georgy, tarantool-patches

On Mon, Mar 26, 2018 at 06:24:36PM +0300, Konstantin Belyavskiy wrote:
> Please check most recent version.
> branch: gh-3210-recover-missing-local-data-master-master

Please don't send or submit a patch for 1.6 until we commit it to
the trunk.

I failed to find the patch in the mailing list. Pasting it here for
review.

> From 391448a496fd769ff6724cadab4d333d37a9088e Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy@tarantool.org>
> Date: Tue, 13 Mar 2018 17:51:52 +0300
> Subject: [PATCH] [replication] [recovery] recover missing data
> 
> Recover missing local data from replica.
> In case of sudden power-loss, if data was not written to WAL but
> already sent to remote replica, local can't recover properly and
> we have different datasets.
> Fix it by using remote replica's data and LSN comparison.
> Based on @GeorgyKirichenko proposal and @locker race free check.

> Switch off replication/catch.test.lua

Why? If the test is broken, please fix it. If you find the test
pointless, delete it with a proper explanation.

> 
> Closes #3210
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 6bfe5a99..5f0b3069 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -453,7 +453,8 @@ applier_subscribe(struct applier *applier)
>  		}
>  
>  		if (applier->state == APPLIER_SYNC &&
> -		    applier->lag <= replication_sync_lag) {
> +		    applier->lag <= replication_sync_lag &&
> +		    vclock_compare(&applier->vclock, &replicaset.vclock) <= 0) {

First, you use a wrong vclock - applier->vclock is the vclock at connect
(the name is rather misleading though, true, we should probably rename
it to remote_vclock_at_connect). You should use the vclock received in
the SUBSCRIBE request.

Second, this new condition could use a comment.

Third, this is a worthwhile change as is so I think it should be
submitted in a separate patch.

>  			/* Applier is synced, switch to "follow". */
>  			applier_set_state(applier, APPLIER_FOLLOW);
>  		}
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 2bd05ad5..344a8e01 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -110,6 +110,11 @@ struct relay {
>  	struct vclock recv_vclock;
>  	/** Replicatoin slave version. */
>  	uint32_t version_id;
> +	/**
> +	 * Local master's LSN at the moment of subscribe, used to check
> +	 * dataset on the other side and send missing data rows if any.
> +	 */
> +	int64_t masters_lsn_at_subscribe;

Why did you change the member name? I only asked to update the comment.
'local_lsn_at_subscribe' is not perfect, but still a better name IMO.
Actually, I'm thinking about storing a whole vclock here instead of just
one LSN - that would help avoid confusion:

  /** Local vclock at the time of subscribe. */
  struct vclock local_vclock_at_subscribe;

>  
>  	/** Relay endpoint */
>  	struct cbus_endpoint endpoint;
> diff --git a/src/box/wal.cc b/src/box/wal.cc
> index 4576cfe0..4a43775d 100644
> --- a/src/box/wal.cc
> +++ b/src/box/wal.cc
> @@ -768,8 +768,15 @@ wal_write(struct journal *journal, struct journal_entry *entry)
>  			/*
>  			 * Find last row from local instance id
>  			 * and promote vclock.
> +			 * In master-master configuration, during sudden
> +			 * power-loss if data was not written to WAL but
> +			 * already sent to others they will send back.
> +			 * In this case we should update only local
> +			 * vclock but not the replicaset one. Could be
> +			 * checked by simple lsn comparison.

I still don't understand this comment.

>  			 */
> -			if ((*last)->replica_id == instance_id) {
> +			if ((*last)->replica_id == instance_id &&
> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {

Use vclock_get() for this.

Also, we agreed to move this to applier AFAIR.

>  				vclock_follow(&replicaset.vclock, instance_id,
>  					      (*last)->lsn);
>  				break;
> diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
> index 7e49efe1..c5855892 100644
> --- a/test/replication/on_replace.lua
> +++ b/test/replication/on_replace.lua
> @@ -22,13 +22,10 @@ box.cfg({
>      };
>  })
>  
> -env = require('test_run')
> -test_run = env.new()
> -engine = test_run:get_cfg('engine')
> -
>  box.once("bootstrap", function()
> +    local test_run = require('test_run').new()
>      box.schema.user.create(USER, { password = PASSWORD })
>      box.schema.user.grant(USER, 'replication')
> -    box.schema.space.create('test', {engine = engine})
> +    box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
>      box.space.test:create_index('primary')
>  end)
> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 00000000..d5b0e0ad
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,42 @@
> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'on_replace1', 'on_replace2' }

Please use autobootstrap.lua - I want to see how it works with multiple
masters.

> +-- Start servers
> +test_run:create_cluster(SERVERS)
> +-- Wait for full mesh
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch on_replace1")
> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end

Nit: please start counting from 1, as this is common in Lua.

> +box.space.test:count()
> +
> +test_run:cmd('switch default')
> +vclock1 = test_run:get_vclock('on_replace1')
> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
> +
> +test_run:cmd("switch on_replace2")
> +box.space.test:count()
> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.1)

Decrease the timeout to speed up test execution time.

> +test_run:cmd("stop server on_replace1")
> +fio = require('fio')
> +-- This test checks ability to recover missing local data
> +-- from remote replica. See #3210.
> +-- Delete data on first master and test that after restart,
> +-- due to difference in vclock it will be able to recover
> +-- all missing data from replica.
> +-- Also check that there is no concurrency, i.e. master is
> +-- in 'read-only' mode unless it receives all data.
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('on_replace1/%020d.xlog', 8)))
> +test_run:cmd("start server on_replace1")
> +
> +test_run:cmd("switch on_replace1")
> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
> +fiber = require('fiber')
> +fiber.sleep(1)
> +box.space.test:count()

Use 'select' here to make sure the data received are correct.

> +
> +-- Cleanup.
> +test_run:cmd('switch default')
> +test_run:drop_cluster(SERVERS)
> +

Nit: extra line at the end of the file.


* Re[2]: Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data
  2018-03-28  9:25 ` Vladimir Davydov
@ 2018-03-28 17:01   ` Konstantin Belyavskiy
  0 siblings, 0 replies; 3+ messages in thread
From: Konstantin Belyavskiy @ 2018-03-28 17:01 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: georgy, tarantool-patches


Please look at the new version.

>Wednesday, March 28, 2018, 12:25 +03:00 from Vladimir Davydov <vdavydov.dev@gmail.com>:
>
>On Mon, Mar 26, 2018 at 06:24:36PM +0300, Konstantin Belyavskiy wrote:
>> Please check most recent version.
>> branch: gh-3210-recover-missing-local-data-master-master
>
>Please don't send or submit a patch for 1.6 until we commit it to
>the trunk. 
Ok, let's finish with 1.9 first, then I will send you the patch for 1.6.
>
>
>I failed to find the patch in the mailing list. Pasting it here for
>review.
>
>> From 391448a496fd769ff6724cadab4d333d37a9088e Mon Sep 17 00:00:00 2001
>> From: Konstantin Belyavskiy < k.belyavskiy@tarantool.org >
>> Date: Tue, 13 Mar 2018 17:51:52 +0300
>> Subject: [PATCH] [replication] [recovery] recover missing data
>> 
>> Recover missing local data from replica.
>> In case of sudden power-loss, if data was not written to WAL but
>> already sent to remote replica, local can't recover properly and
>> we have different datasets.
>> Fix it by using remote replica's data and LSN comparison.
>> Based on @GeorgyKirichenko proposal and @locker race free check.
>
>> Switch off replication/catch.test.lua
>
>Why? If the test is broken, please fix it. If you find the test
>pointless, delete it with a proper explanation. 
Actually this test was broken. From its description:
-- Check that replica doesn't enter read-write mode before
-- catching up with the master: to check that we inject sleep into
-- the master relay_send function and attempt a data modifying
-- statement in replica while it's still fetching data from the
-- master.
the replica should be in read-only mode, but only now does it work as expected. And I don't understand the second part:
-- case #2: delete tuple by net.box
But my new test also checks this behaviour (remember, you mentioned the concurrency issue and suggested
using read-only mode to fix it), so it is rather a duplicate.
Ok, for now I have decided to update the test and enable it back.
Let's discuss this test separately.
>
>
>> 
>> Closes #3210
>> 
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 6bfe5a99..5f0b3069 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -453,7 +453,8 @@ applier_subscribe(struct applier *applier)
>>  		}
>> 
>>  		if (applier->state == APPLIER_SYNC &&
>> -		    applier->lag <= replication_sync_lag) {
>> +		    applier->lag <= replication_sync_lag &&
>> +		    vclock_compare(&applier->vclock, &replicaset.vclock) <= 0) {
>
>First, you use a wrong vclock - applier->vclock is the vclock at connect
>(the name is rather misleading though, true, we should probably rename
>it to remote_vclock_at_connect). You should use the vclock received in
>the SUBSCRIBE request. 
Initially I thought that applier->vclock is the same, but under certain conditions it
can have a different value; thanks for pointing that out. It also makes the new code
easier to understand.
>
>
>Second, this new condition could use a comment. 
Added a comment.
>
>
>Third, this is a worthwhile change as is so I think it should be
>submitted in a separate patch. 
Maybe. So first submit the new sync condition, then the other parts?
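For concreteness, a rough sketch of the sync gate being discussed. The field name remote_vclock_at_subscribe is only a placeholder for "the master's vclock received in reply to SUBSCRIBE" and may not be the real member name; this is an illustration, not the committed code:

	if (applier->state == APPLIER_SYNC &&
	    applier->lag <= replication_sync_lag &&
	    /* remote_vclock_at_subscribe: placeholder name, see above. */
	    vclock_compare(&applier->remote_vclock_at_subscribe,
			   &replicaset.vclock) <= 0) {
		/*
		 * The instance has caught up with everything the master
		 * reported at subscribe time, so it is safe to leave the
		 * read-only sync phase and switch to "follow".
		 */
		applier_set_state(applier, APPLIER_FOLLOW);
	}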
>
>
>>  			/* Applier is synced, switch to "follow". */
>>  			applier_set_state(applier, APPLIER_FOLLOW);
>>  		}
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 2bd05ad5..344a8e01 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -110,6 +110,11 @@ struct relay {
>>  	struct vclock recv_vclock;
>>  	/** Replicatoin slave version. */
>>  	uint32_t version_id;
>> +	/**
>> +	 * Local master's LSN at the moment of subscribe, used to check
>> +	 * dataset on the other side and send missing data rows if any.
>> +	 */
>> +	int64_t masters_lsn_at_subscribe;
>
>Why did you change the member name? I only asked to update the comment.
>'local_lsn_at_subscribe' is not perfect, but still a better name IMO.
>Actually, I'm thinking about storing a whole vclock here instead of just
>one LSN - that would help avoid confusion:
>
>  /** Local vclock at the time of subscribe. */
>  struct vclock local_vclock_at_subscribe; 
Done.
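Roughly, a sketch of what that change could look like, assuming a vclock_copy() helper and the relay fields from the quoted hunks; it is only an illustration of the suggestion, not the committed code:

	/* In struct relay (relay.cc): */
	/** Local vclock at the moment of subscribe. */
	struct vclock local_vclock_at_subscribe;

	/* In relay_subscribe(), before starting the relay cord
	 * (vclock_copy() is assumed to exist): */
	vclock_copy(&relay.local_vclock_at_subscribe, &replicaset.vclock);

	/* In relay_send_row(): send the replica's own row back only when
	 * the master already had it at subscribe time (its LSN is not
	 * above the master's vclock component for that replica), which
	 * means the replica lost it; newer rows are still filtered out. */
	if (relay->replica == NULL ||
	    packet->replica_id != relay->replica->id ||
	    packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
				      packet->replica_id))
		relay_send(relay, packet);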
>
>
>> 
>>  	/** Relay endpoint */
>>  	struct cbus_endpoint endpoint;
>> diff --git a/src/box/wal.cc b/src/box/wal.cc
>> index 4576cfe0..4a43775d 100644
>> --- a/src/box/wal.cc
>> +++ b/src/box/wal.cc
>> @@ -768,8 +768,15 @@ wal_write(struct journal *journal, struct journal_entry *entry)
>>  			/*
>>  			 * Find last row from local instance id
>>  			 * and promote vclock.
>> +			 * In master-master configuration, during sudden
>> +			 * power-loss if data was not written to WAL but
>> +			 * already sent to others they will send back.
>> +			 * In this case we should update only local
>> +			 * vclock but not the replicaset one. Could be
>> +			 * checked by simple lsn comparison.
>
>I still don't understand this comment.
>
>>  			 */
>> -			if ((*last)->replica_id == instance_id) {
>> +			if ((*last)->replica_id == instance_id &&
>> +			    replicaset.vclock.lsn[instance_id] < (*last)->lsn) {
>
>Use vclock_get() for this.
>
>Also, we agreed to move this to applier AFAIR. 
It's not so simple: an attempt to check instance_id != row.replica_id to avoid double promotion
of the replicaset vclock (just before xstream_write_xc) fails, since this is not the only place
where the vclock is promoted, and it leads to leaving read-only mode incorrectly. I will try to
explain it better, but this does not work:
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -503,8 +503,9 @@ applier_subscribe(struct applier *applier)
-                       vclock_follow(&replicaset.vclock, row.replica_id,
-                                     row.lsn);
+                        if (row.replica_id != instance_id)
+                                vclock_follow(&replicaset.vclock, row.replica_id,
+                                              row.lsn);
                        xstream_write_xc(applier->subscribe_stream, &row);
>>  				vclock_follow(&replicaset.vclock, instance_id,
>>  					      (*last)->lsn);
>>  				break;
>> diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
>> index 7e49efe1..c5855892 100644
>> --- a/test/replication/on_replace.lua
>> +++ b/test/replication/on_replace.lua
>> @@ -22,13 +22,10 @@ box.cfg({
>>      };
>>  })
>> 
>> -env = require('test_run')
>> -test_run = env.new()
>> -engine = test_run:get_cfg('engine')
>> -
>>  box.once("bootstrap", function()
>> +    local test_run = require('test_run').new()
>>      box.schema.user.create(USER, { password = PASSWORD })
>>      box.schema.user.grant(USER, 'replication')
>> -    box.schema.space.create('test', {engine = engine})
>> +    box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
>>      box.space.test:create_index('primary')
>>  end)
>> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
>> new file mode 100644
>> index 00000000..d5b0e0ad
>> --- /dev/null
>> +++ b/test/replication/recover_missing.test.lua
>> @@ -0,0 +1,42 @@
>> +env = require('test_run')
>> +test_run = env.new()
>> +
>> +SERVERS = { 'on_replace1', 'on_replace2' }
>
>Please use autobootstrap.lua - I want to see how it works with multiple
>masters. 
Ok, now 3 instances.
>> +-- Start servers
>> +test_run:create_cluster(SERVERS)
>> +-- Wait for full mesh
>> +test_run:wait_fullmesh(SERVERS)
>> +
>> +test_run:cmd("switch on_replace1")
>> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end
>
>Nit: please start counting from 1, as this is common in Lua. 
I hope this is not necessary; I just insert some data and check it later.
>> +box.space.test:count()
>> +
>> +test_run:cmd('switch default')
>> +vclock1 = test_run:get_vclock('on_replace1')
>> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
>> +
>> +test_run:cmd("switch on_replace2")
>> +box.space.test:count()
>> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.1)
>
>Decrease the timeout to speed up test execution time. 
Ok
>> +test_run:cmd("stop server on_replace1")
>> +fio = require('fio')
>> +-- This test checks ability to recover missing local data
>> +-- from remote replica. See #3210.
>> +-- Delete data on first master and test that after restart,
>> +-- due to difference in vclock it will be able to recover
>> +-- all missing data from replica.
>> +-- Also check that there is no concurrency, i.e. master is
>> +-- in 'read-only' mode unless it receives all data.
>> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('on_replace1/%020d.xlog', 8)))
>> +test_run:cmd("start server on_replace1")
>> +
>> +test_run:cmd("switch on_replace1")
>> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
>> +fiber = require('fiber')
>> +fiber.sleep(1)
>> +box.space.test:count()
>
>Use 'select' here to make sure the data received are correct.
Ok
>
>> +
>> +-- Cleanup.
>> +test_run:cmd('switch default')
>> +test_run:drop_cluster(SERVERS)
>> +
>
>Nit: extra line at the end of the file.


Best regards,
Konstantin Belyavskiy
k.belyavskiy@tarantool.org


Thread overview: 3+ messages
2018-03-26 15:24 Fwd: Re[2]: [patches] [PATCH] [replication] [recovery] recover missing data Konstantin Belyavskiy
2018-03-28  9:25 ` Vladimir Davydov
2018-03-28 17:01   ` Re[2]: " Konstantin Belyavskiy
