Please look at the new version.

Среда, 28 марта 2018, 12:25 +03:00 от Vladimir Davydov <vdavydov.dev@gmail.com>:

On Mon, Mar 26, 2018 at 06:24:36PM +0300, Konstantin Belyavskiy wrote:
> Please check most recent version.
> branch: gh-3210-recover-missing-local-data-master-master

Please don't send or submit a patch for 1.6 until we commit it to
the trunk.
Ok, let's finish with 1.9 first, then I will send you patch for 1.6.


I failed to find the patch in the mailing list. Pasting it here for
review.

> From 391448a496fd769ff6724cadab4d333d37a9088e Mon Sep 17 00:00:00 2001
> From: Konstantin Belyavskiy <k.belyavskiy@tarantool.org>
> Date: Tue, 13 Mar 2018 17:51:52 +0300
> Subject: [PATCH] [replication] [recovery] recover missing data
>
> Recover missing local data from replica.
> In case of sudden power-loss, if data was not written to WAL but
> already sent to remote replica, local can't recover properly and
> we have different datasets.
> Fix it by using remote replica's data and LSN comparison.
> Based on @GeorgyKirichenko proposal and @locker race free check.

> Switch off replication/catch.test.lua

Why? If the test is broken, please fix it. If you find the test
pointless, delete it with a proper explanation.
Actually this test was broken, since from description 

-- Check that replica doesn't enter read-write mode before

-- catching up with the master: to check that we inject sleep into

-- the master relay_send function and attempt a data modifying

-- statement in replica while it's still fetching data from the

-- master.

it should be in read-only mode, but only now it works as expected. And I don't understand second part: 

-- case #2: delete tuple by net.box

But my new test also checks this behaviour (remember you mention concurrency issue and suggest to
use read-only mode to fix it). So it's rather duplicated.
Ok, as for now, decided to update test and to enable it back.
Let's discuss about this test separately.


>
> Closes #3210
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 6bfe5a99..5f0b3069 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -453,7 +453,8 @@ applier_subscribe(struct applier *applier)
> }
>
> if (applier->state == APPLIER_SYNC &&
> - applier->lag <= replication_sync_lag) {
> + applier->lag <= replication_sync_lag &&
> + vclock_compare(&applier->vclock, &replicaset.vclock) <= 0) {

First, you use a wrong vclock - applier->vclock is the vclock at connect
(the name is rather misleading though, true, we should probably rename
it to remote_vclock_at_connect). You should use the vclock received in
the SUBSCRIBE request.
Initially I thought that applier->vclock is the same, but under certain condition it 
could have different value, thanks for find it out. Also it makes new code easier
to understand.


Second, this new condition could use a comment.
Add comment.


Third, this is a worthwhile change as is so I think it should be
submitted in a separate patch.
May be, so first submit new sync condition, then other parts?


> /* Applier is synced, switch to "follow". */
> applier_set_state(applier, APPLIER_FOLLOW);
> }
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 2bd05ad5..344a8e01 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -110,6 +110,11 @@ struct relay {
> struct vclock recv_vclock;
> /** Replicatoin slave version. */
> uint32_t version_id;
> + /**
> + * Local master's LSN at the moment of subscribe, used to check
> + * dataset on the other side and send missing data rows if any.
> + */
> + int64_t masters_lsn_at_subscribe;

Why did you change the member name? I only asked to update the comment.
'local_lsn_at_subscribe' is not perfect, but still a better name IMO.
Actually, I'm thinking about storing a whole vclock here instead of just
one LSN - that would help avoid confusion:

  /** Local vclock at the time of subscribe. */
  struct vclock local_vclock_at_subscribe;
Done.


>
> /** Relay endpoint */
> struct cbus_endpoint endpoint;
> diff --git a/src/box/wal.cc b/src/box/wal.cc
> index 4576cfe0..4a43775d 100644
> --- a/src/box/wal.cc
> +++ b/src/box/wal.cc
> @@ -768,8 +768,15 @@ wal_write(struct journal *journal, struct journal_entry *entry)
> /*
> * Find last row from local instance id
> * and promote vclock.
> + * In master-master configuration, during sudden
> + * power-loss if data was not written to WAL but
> + * already sent to others they will send back.
> + * In this case we should update only local
> + * vclock but not the replicaset one. Could be
> + * checked by simple lsn comparison.

I still don't understand this comment.

> */
> - if ((*last)->replica_id == instance_id) {
> + if ((*last)->replica_id == instance_id &&
> + replicaset.vclock.lsn[instance_id] < (*last)->lsn) {

Use vclock_get() for this.

Also, we agreed to move this to applier AFAIR.
It's not so simple, attempt to check instance_id != row.replica_id to avoid double replicaset's
vclock promotion (just before xstream_write_xc) fails, since it's not the only case of vclock
promotion and lead to wrong leaving read-only mode. I will try to explain it better.. but this
not works:

--- a/src/box/applier.cc

+++ b/src/box/applier.cc

@@ -503,8 +503,9 @@ applier_subscribe(struct applier *applier)


-                       vclock_follow(&replicaset.vclock, row.replica_id,

-                                     row.lsn);

+                       if (row.replica_id != instance_id)

+                               vclock_follow(&replicaset.vclock, row.replica_id,

+                                             row.lsn);

                        xstream_write_xc(applier->subscribe_stream, &row);


> vclock_follow(&replicaset.vclock, instance_id,
> (*last)->lsn);
> break;
> diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
> index 7e49efe1..c5855892 100644
> --- a/test/replication/on_replace.lua
> +++ b/test/replication/on_replace.lua
> @@ -22,13 +22,10 @@ box.cfg({
> };
> })
>
> -env = require('test_run')
> -test_run = env.new()
> -engine = test_run:get_cfg('engine')
> -
> box.once("bootstrap", function()
> + local test_run = require('test_run').new()
> box.schema.user.create(USER, { password = PASSWORD })
> box.schema.user.grant(USER, 'replication')
> - box.schema.space.create('test', {engine = engine})
> + box.schema.space.create('test', {engine = test_run:get_cfg('engine')})
> box.space.test:create_index('primary')
> end)
> diff --git a/test/replication/recover_missing.test.lua b/test/replication/recover_missing.test.lua
> new file mode 100644
> index 00000000..d5b0e0ad
> --- /dev/null
> +++ b/test/replication/recover_missing.test.lua
> @@ -0,0 +1,42 @@
> +env = require('test_run')
> +test_run = env.new()
> +
> +SERVERS = { 'on_replace1', 'on_replace2' }

Please use autobootstrap.lua - I want to see how it works with multiple
masters.
Ok, now 3 instances.
> +-- Start servers
> +test_run:create_cluster(SERVERS)
> +-- Wait for full mesh
> +test_run:wait_fullmesh(SERVERS)
> +
> +test_run:cmd("switch on_replace1")
> +for i = 0, 9 do box.space.test:insert{i, 'test' .. i} end

Nit: please start counting from 1, as this is common in Lua.
I hope this is not necessary, just insert some random data to check it later )
> +box.space.test:count()
> +
> +test_run:cmd('switch default')
> +vclock1 = test_run:get_vclock('on_replace1')
> +vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock1)
> +
> +test_run:cmd("switch on_replace2")
> +box.space.test:count()
> +box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.1)

Decrease the timeout to speed up test execution time.
Ok
> +test_run:cmd("stop server on_replace1")
> +fio = require('fio')
> +-- This test checks ability to recover missing local data
> +-- from remote replica. See #3210.
> +-- Delete data on first master and test that after restart,
> +-- due to difference in vclock it will be able to recover
> +-- all missing data from replica.
> +-- Also check that there is no concurrency, i.e. master is
> +-- in 'read-only' mode unless it receives all data.
> +fio.unlink(fio.pathjoin(fio.abspath("."), string.format('on_replace1/%020d.xlog', 8)))
> +test_run:cmd("start server on_replace1")
> +
> +test_run:cmd("switch on_replace1")
> +for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
> +fiber = require('fiber')
> +fiber.sleep(1)
> +box.space.test:count()

Use 'select' here to make sure the data received are correct.
Ok

> +
> +-- Cleanup.
> +test_run:cmd('switch default')
> +test_run:drop_cluster(SERVERS)
> +

Nit: extra line at the end of the file.


Best regards,
Konstantin Belyavskiy
k.belyavskiy@tarantool.org