Tarantool development patches archive
* [PATCH 1/2] Revert "replication: update replica gc state on subscribe"
@ 2019-04-09 17:07 Vladimir Davydov
  2019-04-09 17:07 ` [PATCH 2/2] replication: fix garbage collection logic Vladimir Davydov
  2019-04-11 10:25 ` [tarantool-patches] Re: [PATCH 1/2] Revert "replication: update replica gc state on subscribe" Konstantin Osipov
  0 siblings, 2 replies; 6+ messages in thread
From: Vladimir Davydov @ 2019-04-09 17:07 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This reverts commit b5b4809cf2e6d48230eb9e4301eac188b080e0f4.

The commit reverted by this patch made the relay advance the consumer
associated with a replica right on subscribe. This is wrong: the
current implementation of the garbage collector operates on vclock
signatures, so if a replica reconnects with a greater signature than
it had when it last disconnected (e.g. due to local changes on the
replica or changes pulled from other members of the cluster), the
garbage collector may delete WAL files still needed by the replica,
breaking replication.

There are two ways to fix this problem. The first and more difficult
way is to teach the garbage collector to work with whole vclocks, i.e.
rather than simply sorting all consumers by signature and using the
smallest signature for garbage collection, maintain a vclock each
component of which is the minimum among the corresponding components
of all registered consumers.
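
A minimal sketch of that bookkeeping, assuming hypothetical
self-contained types rather than Tarantool's actual gc/vclock API:

#include <stdint.h>

enum { VCLOCK_MAX = 32 };		/* hypothetical component limit */

struct vclock {
	int64_t lsn[VCLOCK_MAX];	/* one LSN per replica id */
};

/*
 * Recompute the gc vclock as the component-wise minimum over all
 * registered consumers: a row (id, lsn) may only be collected once
 * every consumer has acknowledged it.
 */
static void
gc_vclock_recompute(struct vclock *gc,
		    const struct vclock *consumers, int n)
{
	for (int id = 0; id < VCLOCK_MAX; id++) {
		int64_t min = n > 0 ? consumers[0].lsn[id] : 0;
		for (int i = 1; i < n; i++) {
			if (consumers[i].lsn[id] < min)
				min = consumers[i].lsn[id];
		}
		gc->lsn[id] = min;
	}
}

Presumably the hard part is keeping such a vclock up to date
incrementally as consumers register, advance and unregister, without
rescanning all of them on every ack.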

The second (easy) way is to advance a consumer only if its
acknowledged vclock is greater than or equal to the vclock of a WAL
file fed to it. This way the garbage collector still works with vclock
signatures, and it's the responsibility of the caller (i.e. relay) to
ensure that consumers are advanced correctly.

I went with the second way for now, because I couldn't figure out an
efficient way to implement the first. This implies reverting the
above-mentioned commit and reopening #4034 - sporadic
replication/gc.test.lua failure - which will have to be fixed some
other way.

See the next patch for the rest of the fix and the test.

Needed for #4106
---
https://github.com/tarantool/tarantool/issues/4034
https://github.com/tarantool/tarantool/issues/4106
https://github.com/tarantool/tarantool/commits/dv/gh-4106-replica-gc-fix

 src/box/relay.cc | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 45cd780a..29570323 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -643,16 +643,13 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 	/*
 	 * Register the replica with the garbage collector
 	 * unless it has already been registered by initial
-	 * join or subscribe. Otherwise update the consumer
-	 * state with the current replica vclock.
+	 * join.
 	 */
 	if (replica->gc == NULL) {
 		replica->gc = gc_consumer_register(replica_clock, "replica %s",
 						   tt_uuid_str(&replica->uuid));
 		if (replica->gc == NULL)
 			diag_raise();
-	} else {
-		gc_consumer_advance(replica->gc, replica_clock);
 	}
 
 	relay_start(relay, fd, sync, relay_send_row);
-- 
2.11.0

* [PATCH 2/2] replication: fix garbage collection logic
  2019-04-09 17:07 [PATCH 1/2] Revert "replication: update replica gc state on subscribe" Vladimir Davydov
@ 2019-04-09 17:07 ` Vladimir Davydov
  2019-04-11  7:32   ` [tarantool-patches] " Konstantin Osipov
  2019-04-11 10:25 ` [tarantool-patches] Re: [PATCH 1/2] Revert "replication: update replica gc state on subscribe" Konstantin Osipov
  1 sibling, 1 reply; 6+ messages in thread
From: Vladimir Davydov @ 2019-04-09 17:07 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, the garbage collector works with vclock signatures and
doesn't take into account vclock components. This works as long as
the caller (i.e. relay) makes sure that it doesn't advance a consumer
associated with a replica unless its acknowledged vclock is greater
than or equal to the vclock of a WAL file fed to it. The bug is that
it does not - it only compares vclock signatures. As a result, if
a replica has some local changes or changes pulled from other members
of the cluster, which render its signature greater, the master may
remove files that are still needed by the replica, permanently breaking
replication and requiring rebootstrap.

I guess the proper fix would be teaching the garbage collector to
operate on vclock components rather than signatures, but it's rather
difficult to implement. This patch is a quick fix, which simply
replaces vclock signature comparison in relay with vclock_compare.
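
To illustrate (a self-contained sketch with hypothetical types, not
Tarantool's actual vclock API): take a WAL covering vclock {3, 0} and
a replica that acknowledged {0, 5}. The ack's signature (5) exceeds
the WAL's (3), so a signature check would allow removing the file,
even though rows 1..3 of component 0 were never applied:

#include <stdbool.h>
#include <stdint.h>

enum { VCLOCK_MAX = 2 };	/* two components for the example */

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

static int64_t
vclock_sum(const struct vclock *v)
{
	int64_t sum = 0;
	for (int i = 0; i < VCLOCK_MAX; i++)
		sum += v->lsn[i];
	return sum;
}

/* True iff every component of a is >= the matching component of b. */
static bool
vclock_ge(const struct vclock *a, const struct vclock *b)
{
	for (int i = 0; i < VCLOCK_MAX; i++)
		if (a->lsn[i] < b->lsn[i])
			return false;
	return true;
}

int main(void)
{
	struct vclock wal = {{3, 0}};	/* rows the WAL file covers */
	struct vclock ack = {{0, 5}};	/* acknowledged by the replica */
	/* The signature check says the WAL may be removed... */
	bool by_sum = vclock_sum(&ack) >= vclock_sum(&wal);	/* true */
	/* ...while the component-wise check correctly says no. */
	bool by_cmp = vclock_ge(&ack, &wal);			/* false */
	return by_sum && !by_cmp ? 0 : 1;
}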

Closes #4106
---
https://github.com/tarantool/tarantool/issues/4106
https://github.com/tarantool/tarantool/commits/dv/gh-4106-replica-gc-fix

 src/box/relay.cc             |  12 ++++-
 test/replication/gc.result   | 124 +++++++++++++++++++++++++++++++++++++++++--
 test/replication/gc.test.lua |  59 +++++++++++++++++++-
 3 files changed, 189 insertions(+), 6 deletions(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 29570323..e3d7f0ed 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -422,7 +422,17 @@ relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
 {
 	struct relay_gc_msg *curr, *next, *gc_msg = NULL;
 	stailq_foreach_entry_safe(curr, next, &relay->pending_gc, in_pending) {
-		if (vclock_sum(&curr->vclock) > vclock_sum(vclock))
+		/*
+		 * We may delete a WAL file only if its vclock is
+		 * less than or equal to the vclock acknowledged by
+		 * the replica. Even if the replica's signature
+		 * is greater, the vclocks may be incomparable, in
+		 * which case we must not delete the WAL: it may
+		 * still contain rows not applied by the replica,
+		 * the greater signature being due to changes
+		 * pulled from other members of the cluster.
+		 */
+		if (vclock_compare(&curr->vclock, vclock) > 0)
 			break;
 		stailq_shift(&relay->pending_gc);
 		free(gc_msg);
diff --git a/test/replication/gc.result b/test/replication/gc.result
index 5b7057ad..65785f47 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -404,10 +404,6 @@ test_run:cmd("cleanup server replica")
 ---
 - true
 ...
-test_run:cmd("delete server replica")
----
-- true
-...
 _ = s:auto_increment{}
 ---
 ...
@@ -450,7 +446,127 @@ wait_xlog(0, 10) or fio.listdir('./master')
 box.cfg{replication = {}}
 ---
 ...
+s:truncate()
+---
+...
+--
+-- Check that a master doesn't remove WAL files needed by a replica
+-- in case the replica has local changes due to which its vclock has
+-- greater signature than the master's (gh-4106).
+--
+test_run:cmd("start server replica")
+---
+- true
+...
+-- Temporarily disable replication.
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {}}
+---
+...
+-- Generate some WALs on the master.
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for i = 1, 4 do
+    for j = 1, 100 do
+        s:replace{1, i, j}
+    end
+    box.snapshot()
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+-- Stall WAL writes on the replica and reestablish replication,
+-- then bump local LSN and break replication on WAL error so that
+-- the master sends all rows and receives ack with the replica's
+-- vclock, but not all rows are actually applied on the replica.
+-- The master must not delete unconfirmed WALs even though the
+-- replica's vclock has a greater signature (checked later on).
+test_run:cmd("switch replica")
+---
+- true
+...
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+---
+- ok
+...
+box.cfg{replication_connect_quorum = 0} -- disable sync
+---
+...
+box.cfg{replication = replication}
+---
+...
+fiber = require('fiber')
+---
+...
+test_run:cmd("setopt delimiter ';'");
+---
+- true
+...
+_ = fiber.create(function()
+    box.begin()
+    for i = 1, 1000 do
+        box.space.test:replace{1, i}
+    end
+    box.commit()
+    box.error.injection.set('ERRINJ_WAL_WRITE_DISK', true)
+end);
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+---
+- ok
+...
+-- Restart the replica and wait for it to sync.
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:wait_lsn('replica', 'default')
+---
+...
 -- Cleanup.
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+test_run:cmd("delete server replica")
+---
+- true
+...
+test_run:cleanup_cluster()
+---
+...
 s:drop()
 ---
 ...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index a92a6ed7..890fe29a 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -189,7 +189,6 @@ box.cfg{replication = replica_port}
 -- Stop the replica and write a few WALs.
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
-test_run:cmd("delete server replica")
 _ = s:auto_increment{}
 box.snapshot()
 _ = s:auto_increment{}
@@ -207,7 +206,65 @@ wait_xlog(0, 10) or fio.listdir('./master')
 -- Restore the config.
 box.cfg{replication = {}}
 
+s:truncate()
+
+--
+-- Check that a master doesn't remove WAL files needed by a replica
+-- in case the replica has local changes due to which its vclock has
+-- greater signature than the master's (gh-4106).
+--
+test_run:cmd("start server replica")
+
+-- Temporarily disable replication.
+test_run:cmd("switch replica")
+replication = box.cfg.replication
+box.cfg{replication = {}}
+
+-- Generate some WALs on the master.
+test_run:cmd("switch default")
+test_run:cmd("setopt delimiter ';'")
+for i = 1, 4 do
+    for j = 1, 100 do
+        s:replace{1, i, j}
+    end
+    box.snapshot()
+end;
+test_run:cmd("setopt delimiter ''");
+
+-- Stall WAL writes on the replica and reestablish replication,
+-- then bump local LSN and break replication on WAL error so that
+-- the master sends all rows and receives ack with the replica's
+-- vclock, but not all rows are actually applied on the replica.
+-- The master must not delete unconfirmed WALs even though the
+-- replica's vclock has a greater signature (checked later on).
+test_run:cmd("switch replica")
+box.error.injection.set('ERRINJ_WAL_DELAY', true)
+box.cfg{replication_connect_quorum = 0} -- disable sync
+box.cfg{replication = replication}
+fiber = require('fiber')
+test_run:cmd("setopt delimiter ';'");
+_ = fiber.create(function()
+    box.begin()
+    for i = 1, 1000 do
+        box.space.test:replace{1, i}
+    end
+    box.commit()
+    box.error.injection.set('ERRINJ_WAL_WRITE_DISK', true)
+end);
+test_run:cmd("setopt delimiter ''");
+box.error.injection.set('ERRINJ_WAL_DELAY', false)
+
+-- Restart the replica and wait for it to sync.
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+test_run:cmd("start server replica")
+test_run:wait_lsn('replica', 'default')
+
 -- Cleanup.
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+test_run:cmd("delete server replica")
+test_run:cleanup_cluster()
 s:drop()
 box.error.injection.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 box.schema.user.revoke('guest', 'replication')
-- 
2.11.0

* [tarantool-patches] Re: [PATCH 2/2] replication: fix garbage collection logic
  2019-04-09 17:07 ` [PATCH 2/2] replication: fix garbage collection logic Vladimir Davydov
@ 2019-04-11  7:32   ` Konstantin Osipov
  2019-04-11  8:25     ` Vladimir Davydov
  2019-04-11 11:01     ` Vladimir Davydov
  0 siblings, 2 replies; 6+ messages in thread
From: Konstantin Osipov @ 2019-04-11  7:32 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/04/09 20:09]:
> Currently, the garbage collector works with vclock signatures and
> doesn't take into account vclock components. This works as long as
> the caller (i.e. relay) makes sure that it doesn't advance a consumer
> associated with a replica unless its acknowledged vclock is greater
> than or equal to the vclock of a WAL file fed to it. The bug is that
> it does not - it only compares vclock signatures. As a result, if
> a replica has some local changes or changes pulled from other members
> of the cluster, which render its signature greater, the master may
> remove files that are still needed by the replica, permanently breaking
> replication and requiring rebootstrap.
> 
> I guess the proper fix would be teaching the garbage collector to
> operate on vclock components rather than signatures, but it's rather
> difficult to implement. This patch is a quick fix, which simply
> replaces vclock signature comparison in relay with vclock_compare.

This patch is OK to push. I still think we need a special compare
function which ignores one specified dimension, and we should switch
both the vclock_compare in recover_remaining_wals and this
vclock_compare to use it.

This dimension is the server id of the replica we're feeding WALs
to. The logic is that we should not bother feeding the replica its
own changes, nor depend on having these changes. This will make
vclocks comparable even if the replica has local changes, the master
has local changes, and some of the xlogs which predate these changes
are already missing.
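
Something like this, as a sketch (self-contained hypothetical types;
vclock_compare_ignore is not an existing Tarantool function):

#include <stdint.h>

enum { VCLOCK_MAX = 32 };

struct vclock {
	int64_t lsn[VCLOCK_MAX];
};

/*
 * Compare two vclocks while ignoring one component (e.g. the server
 * id of the replica the relay is feeding). Returns -1, 0 or 1 when
 * the vclocks are ordered, or a large positive value when they are
 * incomparable.
 */
static int
vclock_compare_ignore(const struct vclock *a, const struct vclock *b,
		      uint32_t ignore_id)
{
	int le = 1, ge = 1;
	for (uint32_t id = 0; id < VCLOCK_MAX; id++) {
		if (id == ignore_id)
			continue;
		if (a->lsn[id] < b->lsn[id])
			ge = 0;
		if (a->lsn[id] > b->lsn[id])
			le = 0;
	}
	if (le && ge)
		return 0;
	if (le)
		return -1;
	if (ge)
		return 1;
	return INT32_MAX;	/* incomparable */
}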

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [tarantool-patches] Re: [PATCH 2/2] replication: fix garbage collection logic
  2019-04-11  7:32   ` [tarantool-patches] " Konstantin Osipov
@ 2019-04-11  8:25     ` Vladimir Davydov
  2019-04-11 11:01     ` Vladimir Davydov
  1 sibling, 0 replies; 6+ messages in thread
From: Vladimir Davydov @ 2019-04-11  8:25 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Apr 11, 2019 at 10:32:55AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/04/09 20:09]:
> > Currently, the garbage collector works with vclock signatures and
> > doesn't take into account vclock components. This works as long as
> > the caller (i.e. relay) makes sure that it doesn't advance a consumer
> > associated with a replica unless its acknowledged vclock is greater
> > than or equal to the vclock of a WAL file fed to it. The bug is that
> > it does not - it only compares vclock signatures. As a result, if
> > a replica has some local changes or changes pulled from other members
> > of the cluster, which render its signature greater, the master may
> > remove files that are still needed by the replica, permanently breaking
> > replication and requiring rebootstrap.
> > 
> > I guess the proper fix would be teaching the garbage collector to
> > operate on vclock components rather than signatures, but it's rather
> > difficult to implement. This patch is a quick fix, which simply
> > replaces vclock signature comparison in relay with vclock_compare.
> 
> This patch is OK to push. I still think we need a special compare
> function, which ignores one specified dimension, and we should
> change vclock_compare in recover_remaining_wals and this
> vclock_compare to use it.
> 
> This dimension is the server id of replica we're feeding wals to.
> The logic is that we should not bother with feeding replica its
> own changes, and depend on having these changes. This will make
> vclocks comparable even if replica has local changes, and master
> has local changes, and some of the xlogs which predate these
> changes are already missing.

The problem is the replica may have changes from other members of
the cluster, which haven't been pulled by the master yet. In other
words, it isn't as simple as ignoring just one vclock component.
In the case of GC we should maintain a vclock that consists of the
minimal components among all replicas' vclocks, e.g. if we have three
replicas with vclocks {1, 100, 200, 10}, {10, 50, 150, 30} and
{1, 100, 100, 50}, then the gc vclock on the master (i.e. the vclock
of the newest row we can collect) must equal {1, 50, 100, 10}. This
is kinda difficult to implement - it's much easier to compare vclocks
in relay upon feeding a WAL file, which is what I did. However, I
guess we'll have to figure out how to do that for sync replication
anyway.
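
For the record, the arithmetic above as a self-contained check (plain
C, hypothetical fixed layout of four components):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* The three replica vclocks from the example above. */
	int64_t v[3][4] = {
		{1, 100, 200, 10},
		{10, 50, 150, 30},
		{1, 100, 100, 50},
	};
	int64_t gc[4];
	for (int i = 0; i < 4; i++) {
		gc[i] = v[0][i];
		for (int r = 1; r < 3; r++) {
			if (v[r][i] < gc[i])
				gc[i] = v[r][i];
		}
	}
	/* Prints "1 50 100 10" - the newest state we may collect. */
	printf("%lld %lld %lld %lld\n", (long long)gc[0],
	       (long long)gc[1], (long long)gc[2], (long long)gc[3]);
	return 0;
}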

* [tarantool-patches] Re: [PATCH 1/2] Revert "replication: update replica gc state on subscribe"
  2019-04-09 17:07 [PATCH 1/2] Revert "replication: update replica gc state on subscribe" Vladimir Davydov
  2019-04-09 17:07 ` [PATCH 2/2] replication: fix garbage collection logic Vladimir Davydov
@ 2019-04-11 10:25 ` Konstantin Osipov
  1 sibling, 0 replies; 6+ messages in thread
From: Konstantin Osipov @ 2019-04-11 10:25 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/04/09 20:09]:
> This reverts commit b5b4809cf2e6d48230eb9e4301eac188b080e0f4.

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [tarantool-patches] Re: [PATCH 2/2] replication: fix garbage collection logic
  2019-04-11  7:32   ` [tarantool-patches] " Konstantin Osipov
  2019-04-11  8:25     ` Vladimir Davydov
@ 2019-04-11 11:01     ` Vladimir Davydov
  1 sibling, 0 replies; 6+ messages in thread
From: Vladimir Davydov @ 2019-04-11 11:01 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Apr 11, 2019 at 10:32:55AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/04/09 20:09]:
> > Currently, the garbage collector works with vclock signatures and
> > doesn't take into account vclock components. This works as long as
> > the caller (i.e. relay) makes sure that it doesn't advance a consumer
> > associated with a replica unless its acknowledged vclock is greater
> > than or equal to the vclock of a WAL file fed to it. The bug is that
> > it does not - it only compares vclock signatures. As a result, if
> > a replica has some local changes or changes pulled from other members
> > of the cluster, which render its signature greater, the master may
> > remove files that are still needed by the replica, permanently breaking
> > replication and requiring rebootstrap.
> > 
> > I guess the proper fix would be teaching the garbage collector to
> > operate on vclock components rather than signatures, but it's rather
> > difficult to implement. This patch is a quick fix, which simply
> > replaces vclock signature comparison in relay with vclock_compare.
> 
> This patch is OK to push.

Pushed to master, 2.1, 1.10.
