Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko <sergepetrenko@tarantool.org>
To: v.shpilevoy@tarantool.org
Cc: tarantool-patches@dev.tarantool.org
Subject: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
Date: Tue,  7 Apr 2020 18:49:55 +0300	[thread overview]
Message-ID: <47c0ec5d7a4de7d47351e31211c6eb3a3491f844.1586273440.git.sergepetrenko@tarantool.org> (raw)
In-Reply-To: <cover.1586273440.git.sergepetrenko@tarantool.org>

The current WAL GC implementation tracks consumers (i.e. remote replicas) by
their vclock signature, which is the sum of all vclock components.
This approach is wrong, and this can be shown by a little example.
The example will be a little synthetic, but it'll illustrate the problem.
Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
with id 3. Say, С replicates from both A and B, and there is no replication
between A and B (say, the instances were reconfigured to not replicate from each
other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
snapshot and creates a new xlog with signature 10. A’s directory will look like:
00…000.xlog 00…010.snap 00….010.xlog
Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
corresponding GC consumer. Since signatures are used, GC consumer is assigned a
signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
assumes it can be). Actually, replica still needs 00…00.xlog, because it
contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
replicated yet.

If instead of using vclock signatures, gc consumers used vclocks, such a problem
wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
on.

Rewrite gc to perform cleanup based on finding minimal vclock
components present in at least one of the consumer vclocks instead of
just comparing vclock signatures.

Prerequisite #4114
---
 src/box/gc.c     | 41 ++++++++++++++---------
 src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+), 16 deletions(-)

diff --git a/src/box/gc.c b/src/box/gc.c
index f5c387f9d..a2d0a515c 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -66,16 +66,15 @@ static int
 gc_checkpoint_fiber_f(va_list);
 
 /**
- * Comparator used for ordering gc_consumer objects by signature
- * in a binary tree.
+ * Comparator used for ordering gc_consumer objects
+ * lexicographically by their vclock in a binary tree.
  */
 static inline int
 gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
 {
-	if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
-		return -1;
-	if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
-		return 1;
+	int rc = vclock_lex_compare(&a->vclock, &b->vclock);
+	if (rc != 0)
+		return rc;
 	if ((intptr_t)a < (intptr_t)b)
 		return -1;
 	if ((intptr_t)a > (intptr_t)b)
@@ -187,19 +186,29 @@ gc_run_cleanup(void)
 	/* At least one checkpoint must always be available. */
 	assert(checkpoint != NULL);
 
+	/* Find the vclock of the oldest WAL row to keep. */
+	struct vclock min_vclock;
+	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
 	/*
-	 * Find the vclock of the oldest WAL row to keep.
+	 * Vclock of the oldest WAL row to keep is a by-component
+	 * minimum of all consumer vclocks and the oldest
+	 * checkpoint vclock. This ensures that all rows needed by
+	 * at least one consumer are kept.
 	 * Note, we must keep all WALs created after the
 	 * oldest checkpoint, even if no consumer needs them.
 	 */
-	const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
-				       &gc_tree_first(&gc.consumers)->vclock);
-	if (vclock == NULL ||
-	    vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
-		vclock = &checkpoint->vclock;
-
-	if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
-		vclock_copy(&gc.vclock, vclock);
+	vclock_copy(&min_vclock, &checkpoint->vclock);
+	while (consumer != NULL) {
+		/*
+		 * Consumers will never need rows signed
+		 * with a zero instance id (local rows).
+		 */
+		vclock_min_ignore0(&min_vclock, &consumer->vclock);
+		consumer = gc_tree_next(&gc.consumers, consumer);
+	}
+
+	if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
+		vclock_copy(&gc.vclock, &min_vclock);
 		run_wal_gc = true;
 	}
 
@@ -222,7 +231,7 @@ gc_run_cleanup(void)
 	if (run_engine_gc)
 		engine_collect_garbage(&checkpoint->vclock);
 	if (run_wal_gc)
-		wal_collect_garbage(vclock);
+		wal_collect_garbage(&min_vclock);
 }
 
 static int
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 5c0525b00..5865f7443 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -182,6 +182,31 @@ vclock_inc(struct vclock *vclock, uint32_t replica_id)
 	return ++vclock->lsn[replica_id];
 }
 
+/**
+ * Set vclock component represented by replica id to the desired
+ * value. Can be used to decrease stored LSN value for the given
+ * replica id while maintaining a valid signature or in the same
+ * manner as vclock_follow.
+ *
+ * @param vclock Vector clock.
+ * @param replica_id Replica identifier.
+ * @param lsn Lsn to set
+ */
+static inline void
+vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
+{
+	assert(lsn >= 0);
+	assert(replica_id < VCLOCK_MAX);
+	vclock->signature -= vclock_get(vclock, replica_id);
+	if (lsn == 0) {
+		vclock->map &= ~(1 << replica_id);
+		return;
+	}
+	vclock->lsn[replica_id] = lsn;
+	vclock->map |= 1 << replica_id;
+	vclock->signature += lsn;
+}
+
 static inline void
 vclock_copy(struct vclock *dst, const struct vclock *src)
 {
@@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
 	return vclock_compare_generic(a, b, true);
 }
 
+/**
+ * Compare vclocks lexicographically.
+ * The following affirmation holds: if a < b in terms of
+ * vclock_compare, then a < b in terms of vclock_lex_compare.
+ * However, a < b in terms of vclock_lex_compare doesn't mean
+ * than a < b in terms of vclock_compare. Example:
+ * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
+ * are incomparable in terms of vclock_compare.
+ * All vclocks are comparable lexicographically.
+ *
+ * @param a Vclock.
+ * @param b Vclock.
+ * @retval 1 If @a a is lexicographically greater than @a b.
+ * @retval -1 If @a a is lexicographically less than @a b.
+ * @retval 0 If vclocks are equal.
+ */
+static inline int
+vclock_lex_compare(const struct vclock *a, const struct vclock *b)
+{
+	vclock_map_t map = a->map | b->map;
+	struct bit_iterator it;
+	bit_iterator_init(&it, &map, sizeof(map), true);
+	for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
+	    replica_id = bit_iterator_next(&it)) {
+		int64_t lsn_a = vclock_get(a, replica_id);
+		int64_t lsn_b = vclock_get(b, replica_id);
+		if (lsn_a < lsn_b)
+			return -1;
+		if (lsn_a > lsn_b)
+			return 1;
+	}
+	return 0;
+}
+
+/**
+ * Return a vclock, which is a componentwise minimum between
+ * vclocks @a a and @a b. Do not take vclock[0] into account.
+ *
+ * @param[in,out] a Vclock containing the minimum components.
+ * @param b Vclock to compare with.
+ */
+static inline void
+vclock_min_ignore0(struct vclock *a, const struct vclock *b)
+{
+	vclock_map_t map = a->map | b->map;
+	struct bit_iterator it;
+	bit_iterator_init(&it, &map, sizeof(map), true);
+	size_t replica_id = bit_iterator_next(&it);
+	if (replica_id == 0)
+		replica_id = bit_iterator_next(&it);
+
+	for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
+		int64_t lsn_a = vclock_get(a, replica_id);
+		int64_t lsn_b = vclock_get(b, replica_id);
+		if (lsn_a <= lsn_b)
+			continue;
+		vclock_reset(a, replica_id, lsn_b);
+	}
+}
+
 /**
  * @brief vclockset - a set of vclocks
  */
-- 
2.21.1 (Apple Git-122.3)

  parent reply	other threads:[~2020-04-07 15:50 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
2020-04-09 23:08   ` Vladislav Shpilevoy
2020-04-10 12:25     ` Serge Petrenko
2020-04-11 16:05       ` Vladislav Shpilevoy
2020-04-13 10:12         ` Serge Petrenko
2020-04-07 15:49 ` Serge Petrenko [this message]
2020-04-09 23:08   ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Vladislav Shpilevoy
2020-04-10 12:25     ` Serge Petrenko
2020-04-11 16:04       ` Vladislav Shpilevoy
2020-04-13 10:12         ` Serge Petrenko
2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately Serge Petrenko
2020-04-13 14:38 ` [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Vladislav Shpilevoy
2020-04-16 13:49 ` Kirill Yukhin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=47c0ec5d7a4de7d47351e31211c6eb3a3491f844.1586273440.git.sergepetrenko@tarantool.org \
    --to=sergepetrenko@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox