Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking
@ 2020-04-07 15:49 Serge Petrenko
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Serge Petrenko @ 2020-04-07 15:49 UTC (permalink / raw)
  To: v.shpilevoy; +Cc: tarantool-patches

https://github.com/tarantool/tarantool/issues/4114
https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication

Changes in v6:
 - squash commit 1 into 3
 - get rid of vclock_copy_ignore0,
   make xrow_encode_vclock ignore 0th
   clock component instead.
 - review fixes as per review from Vlad

Changes in v5:
 - review fixes as per review from Kostja

Changes in v4:
 - review fixes as per review from Kostja

Changes in v3:
 - rewrite the patches re wal gc rework
   to avoid matrix clock.

Changes in v2:
 - used Georgy's patches re introduction of
   matrix clock and wal gc rework to fix the
   problem with ordering gc consumers by
   vclock signature

Serge Petrenko (3):
  replication: omit 0-th vclock component in replication responses
  gc: rely on minimal vclock components instead of signatures
  box: start counting local space requests separately

 src/box/box.cc                                |  17 +++
 src/box/gc.c                                  |  41 +++---
 src/box/recovery.cc                           |   8 +-
 src/box/relay.cc                              |  17 ++-
 src/box/replication.cc                        |   4 +-
 src/box/vclock.h                              |  91 +++++++++++++
 src/box/wal.c                                 |  16 ++-
 src/box/xrow.c                                |  10 +-
 test/replication/anon.result                  |   5 +
 test/replication/anon.test.lua                |   2 +
 test/replication/autobootstrap.result         |   6 +
 test/replication/autobootstrap.test.lua       |   2 +
 test/replication/before_replace.result        |   8 +-
 test/replication/before_replace.test.lua      |   4 +-
 .../gh-4114-local-space-replication.result    | 121 ++++++++++++++++++
 .../gh-4114-local-space-replication.test.lua  |  47 +++++++
 test/replication/local_spaces.result          |   4 +
 test/replication/local_spaces.test.lua        |   3 +
 test/replication/misc.result                  |   6 +
 test/replication/misc.test.lua                |   2 +
 test/replication/quorum.result                |   6 +
 test/replication/quorum.test.lua              |   2 +
 test/replication/replica_rejoin.result        |   9 ++
 test/replication/replica_rejoin.test.lua      |   3 +
 test/replication/skip_conflict_row.result     |   3 +
 test/replication/skip_conflict_row.test.lua   |   1 +
 test/replication/suite.cfg                    |   1 +
 test/vinyl/errinj.result                      |   5 +
 test/vinyl/errinj.test.lua                    |   4 +
 29 files changed, 406 insertions(+), 42 deletions(-)
 create mode 100644 test/replication/gh-4114-local-space-replication.result
 create mode 100644 test/replication/gh-4114-local-space-replication.test.lua

-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses
  2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
@ 2020-04-07 15:49 ` Serge Petrenko
  2020-04-09 23:08   ` Vladislav Shpilevoy
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Serge Petrenko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Serge Petrenko @ 2020-04-07 15:49 UTC (permalink / raw)
  To: v.shpilevoy; +Cc: tarantool-patches

If an anonymous replica is promoted to a normal one and becomes
replication master later, its vclock contains a non-empty zero
component, tracking local changes on this replica from the time when it
had been anonymous. No need to pollute joining instance's vclock with
our non-empty 0 component.
When an anonymous replica reports its status to a remote instance it
should also hide its 0-th vclock component.

This is needed for backward compatibility with old instances, which
don't ignore 0th vclock component coming from a remote instance by
default.

In order to do so, introduce a new function - vclock_size_ignore0(),
which doesn't count 0th clock component, and patch xrow_encode_vclock()
to skip 0th clock component if it's present.

Also make sure that new instances ignore 0th vclock component coming
from an unpatched remote instance.

Follow-up #3186
Prerequisite #4114
---
 src/box/relay.cc               |  2 +-
 src/box/replication.cc         |  4 ++--
 src/box/vclock.h               |  6 ++++++
 src/box/xrow.c                 | 10 +++++++---
 test/replication/anon.result   |  5 +++++
 test/replication/anon.test.lua |  2 ++
 6 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index c634348a4..fec9f07d1 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -464,7 +464,7 @@ relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
 		 * the greater signatures is due to changes pulled
 		 * from other members of the cluster.
 		 */
-		if (vclock_compare(&curr->vclock, vclock) > 0)
+		if (vclock_compare_ignore0(&curr->vclock, vclock) > 0)
 			break;
 		stailq_shift(&relay->pending_gc);
 		free(gc_msg);
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1345f189b..c833041a3 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -777,8 +777,8 @@ replicaset_needs_rejoin(struct replica **master)
 			continue;
 
 		const struct ballot *ballot = &applier->ballot;
-		if (vclock_compare(&ballot->gc_vclock,
-				   &replicaset.vclock) <= 0) {
+		if (vclock_compare_ignore0(&ballot->gc_vclock,
+					   &replicaset.vclock) <= 0) {
 			/*
 			 * There's at least one master that still stores
 			 * WALs needed by this instance. Proceed to local
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 79e5a1bc0..5c0525b00 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -200,6 +200,12 @@ vclock_size(const struct vclock *vclock)
 	return bit_count_u32(vclock->map);
 }
 
+static inline uint32_t
+vclock_size_ignore0(const struct vclock *vclock)
+{
+	return bit_count_u32(vclock->map & ~1);
+}
+
 static inline int64_t
 vclock_calc_sum(const struct vclock *vclock)
 {
diff --git a/src/box/xrow.c b/src/box/xrow.c
index be026a43c..21a68220a 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -51,7 +51,7 @@ static_assert(IPROTO_DATA < 0x7f && IPROTO_METADATA < 0x7f &&
 static inline uint32_t
 mp_sizeof_vclock(const struct vclock *vclock)
 {
-	uint32_t size = vclock_size(vclock);
+	uint32_t size = vclock_size_ignore0(vclock);
 	return mp_sizeof_map(size) + size * (mp_sizeof_uint(UINT32_MAX) +
 					     mp_sizeof_uint(UINT64_MAX));
 }
@@ -59,10 +59,14 @@ mp_sizeof_vclock(const struct vclock *vclock)
 static inline char *
 mp_encode_vclock(char *data, const struct vclock *vclock)
 {
-	data = mp_encode_map(data, vclock_size(vclock));
+	data = mp_encode_map(data, vclock_size_ignore0(vclock));
 	struct vclock_iterator it;
 	vclock_iterator_init(&it, vclock);
-	vclock_foreach(&it, replica) {
+	struct vclock_c replica;
+	replica = vclock_iterator_next(&it);
+	if (replica.id == 0)
+		replica = vclock_iterator_next(&it);
+	for ( ; replica.id < VCLOCK_MAX; replica = vclock_iterator_next(&it)) {
 		data = mp_encode_uint(data, replica.id);
 		data = mp_encode_uint(data, replica.lsn);
 	}
diff --git a/test/replication/anon.result b/test/replication/anon.result
index 88061569f..cbbeeef09 100644
--- a/test/replication/anon.result
+++ b/test/replication/anon.result
@@ -187,6 +187,11 @@ a > 0
  | ---
  | - true
  | ...
+-- 0-th vclock component isn't propagated across the cluster.
+box.info.vclock[0]
+ | ---
+ | - null
+ | ...
 test_run:cmd('switch default')
  | ---
  | - true
diff --git a/test/replication/anon.test.lua b/test/replication/anon.test.lua
index 8a8d15c18..627dc5c8e 100644
--- a/test/replication/anon.test.lua
+++ b/test/replication/anon.test.lua
@@ -66,6 +66,8 @@ test_run:cmd('switch replica_anon2')
 a = box.info.vclock[1]
 -- The instance did fetch a snapshot.
 a > 0
+-- 0-th vclock component isn't propagated across the cluster.
+box.info.vclock[0]
 test_run:cmd('switch default')
 box.space.test:insert{2}
 test_run:cmd("switch replica_anon2")
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
  2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
@ 2020-04-07 15:49 ` Serge Petrenko
  2020-04-09 23:08   ` Vladislav Shpilevoy
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately Serge Petrenko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Serge Petrenko @ 2020-04-07 15:49 UTC (permalink / raw)
  To: v.shpilevoy; +Cc: tarantool-patches

The current WAL GC implementation tracks consumers (i.e. remote replicas) by
their vclock signature, which is the sum of all vclock components.
This approach is wrong, and this can be shown by a little example.
The example will be a little synthetic, but it'll illustrate the problem.
Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
with id 3. Say, С replicates from both A and B, and there is no replication
between A and B (say, the instances were reconfigured to not replicate from each
other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
snapshot and creates a new xlog with signature 10. A’s directory will look like:
00…000.xlog 00…010.snap 00….010.xlog
Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
corresponding GC consumer. Since signatures are used, GC consumer is assigned a
signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
assumes it can be). Actually, replica still needs 00…00.xlog, because it
contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
replicated yet.

If instead of using vclock signatures, gc consumers used vclocks, such a problem
wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
on.

Rewrite gc to perform cleanup based on finding minimal vclock
components present in at least one of the consumer vclocks instead of
just comparing vclock signatures.

Prerequisite #4114
---
 src/box/gc.c     | 41 ++++++++++++++---------
 src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+), 16 deletions(-)

diff --git a/src/box/gc.c b/src/box/gc.c
index f5c387f9d..a2d0a515c 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -66,16 +66,15 @@ static int
 gc_checkpoint_fiber_f(va_list);
 
 /**
- * Comparator used for ordering gc_consumer objects by signature
- * in a binary tree.
+ * Comparator used for ordering gc_consumer objects
+ * lexicographically by their vclock in a binary tree.
  */
 static inline int
 gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
 {
-	if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
-		return -1;
-	if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
-		return 1;
+	int rc = vclock_lex_compare(&a->vclock, &b->vclock);
+	if (rc != 0)
+		return rc;
 	if ((intptr_t)a < (intptr_t)b)
 		return -1;
 	if ((intptr_t)a > (intptr_t)b)
@@ -187,19 +186,29 @@ gc_run_cleanup(void)
 	/* At least one checkpoint must always be available. */
 	assert(checkpoint != NULL);
 
+	/* Find the vclock of the oldest WAL row to keep. */
+	struct vclock min_vclock;
+	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
 	/*
-	 * Find the vclock of the oldest WAL row to keep.
+	 * Vclock of the oldest WAL row to keep is a by-component
+	 * minimum of all consumer vclocks and the oldest
+	 * checkpoint vclock. This ensures that all rows needed by
+	 * at least one consumer are kept.
 	 * Note, we must keep all WALs created after the
 	 * oldest checkpoint, even if no consumer needs them.
 	 */
-	const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
-				       &gc_tree_first(&gc.consumers)->vclock);
-	if (vclock == NULL ||
-	    vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
-		vclock = &checkpoint->vclock;
-
-	if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
-		vclock_copy(&gc.vclock, vclock);
+	vclock_copy(&min_vclock, &checkpoint->vclock);
+	while (consumer != NULL) {
+		/*
+		 * Consumers will never need rows signed
+		 * with a zero instance id (local rows).
+		 */
+		vclock_min_ignore0(&min_vclock, &consumer->vclock);
+		consumer = gc_tree_next(&gc.consumers, consumer);
+	}
+
+	if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
+		vclock_copy(&gc.vclock, &min_vclock);
 		run_wal_gc = true;
 	}
 
@@ -222,7 +231,7 @@ gc_run_cleanup(void)
 	if (run_engine_gc)
 		engine_collect_garbage(&checkpoint->vclock);
 	if (run_wal_gc)
-		wal_collect_garbage(vclock);
+		wal_collect_garbage(&min_vclock);
 }
 
 static int
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 5c0525b00..5865f7443 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -182,6 +182,31 @@ vclock_inc(struct vclock *vclock, uint32_t replica_id)
 	return ++vclock->lsn[replica_id];
 }
 
+/**
+ * Set vclock component represented by replica id to the desired
+ * value. Can be used to decrease stored LSN value for the given
+ * replica id while maintaining a valid signature or in the same
+ * manner as vclock_follow.
+ *
+ * @param vclock Vector clock.
+ * @param replica_id Replica identifier.
+ * @param lsn Lsn to set
+ */
+static inline void
+vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
+{
+	assert(lsn >= 0);
+	assert(replica_id < VCLOCK_MAX);
+	vclock->signature -= vclock_get(vclock, replica_id);
+	if (lsn == 0) {
+		vclock->map &= ~(1 << replica_id);
+		return;
+	}
+	vclock->lsn[replica_id] = lsn;
+	vclock->map |= 1 << replica_id;
+	vclock->signature += lsn;
+}
+
 static inline void
 vclock_copy(struct vclock *dst, const struct vclock *src)
 {
@@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
 	return vclock_compare_generic(a, b, true);
 }
 
+/**
+ * Compare vclocks lexicographically.
+ * The following affirmation holds: if a < b in terms of
+ * vclock_compare, then a < b in terms of vclock_lex_compare.
+ * However, a < b in terms of vclock_lex_compare doesn't mean
+ * than a < b in terms of vclock_compare. Example:
+ * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
+ * are incomparable in terms of vclock_compare.
+ * All vclocks are comparable lexicographically.
+ *
+ * @param a Vclock.
+ * @param b Vclock.
+ * @retval 1 If @a a is lexicographically greater than @a b.
+ * @retval -1 If @a a is lexicographically less than @a b.
+ * @retval 0 If vclocks are equal.
+ */
+static inline int
+vclock_lex_compare(const struct vclock *a, const struct vclock *b)
+{
+	vclock_map_t map = a->map | b->map;
+	struct bit_iterator it;
+	bit_iterator_init(&it, &map, sizeof(map), true);
+	for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
+	    replica_id = bit_iterator_next(&it)) {
+		int64_t lsn_a = vclock_get(a, replica_id);
+		int64_t lsn_b = vclock_get(b, replica_id);
+		if (lsn_a < lsn_b)
+			return -1;
+		if (lsn_a > lsn_b)
+			return 1;
+	}
+	return 0;
+}
+
+/**
+ * Return a vclock, which is a componentwise minimum between
+ * vclocks @a a and @a b. Do not take vclock[0] into account.
+ *
+ * @param[in,out] a Vclock containing the minimum components.
+ * @param b Vclock to compare with.
+ */
+static inline void
+vclock_min_ignore0(struct vclock *a, const struct vclock *b)
+{
+	vclock_map_t map = a->map | b->map;
+	struct bit_iterator it;
+	bit_iterator_init(&it, &map, sizeof(map), true);
+	size_t replica_id = bit_iterator_next(&it);
+	if (replica_id == 0)
+		replica_id = bit_iterator_next(&it);
+
+	for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
+		int64_t lsn_a = vclock_get(a, replica_id);
+		int64_t lsn_b = vclock_get(b, replica_id);
+		if (lsn_a <= lsn_b)
+			continue;
+		vclock_reset(a, replica_id, lsn_b);
+	}
+}
+
 /**
  * @brief vclockset - a set of vclocks
  */
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately
  2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Serge Petrenko
@ 2020-04-07 15:49 ` Serge Petrenko
  2020-04-13 14:38 ` [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Vladislav Shpilevoy
  2020-04-16 13:49 ` Kirill Yukhin
  4 siblings, 0 replies; 14+ messages in thread
From: Serge Petrenko @ 2020-04-07 15:49 UTC (permalink / raw)
  To: v.shpilevoy; +Cc: tarantool-patches

Sign local space requests with a zero instance id. This allows to split
local changes aside from the changes, which should be visible to the whole
cluster, and stop sending NOPs to replicas to follow local vclock.

Moreover, it fixes the following bug with local spaces and replication.
In a situation when there are a master and a replica set up, replica may
still write to local spaces even if it's read-only. Local space
operations used to promote instance's lsn before this patch. Eventually,
master would have vclock {1:x} and replica'd have vclock {1:x, 2:y},
where y > 0, due to local space requests performed by the replica.
If a snapshot happens on replica side, replica will delete it's .xlog
files prior to the snapshot, since no one replicates from it and thus it
doesn't have any registered GC consumers.
From this point, every attempt to configure replication from replica to
master will fail, since master will try to fetch records which account
for the difference in master's and replica's vclocks: {1:x} vs {1:x,2:y},
even though master will never need the rows in range {2:1} - {2:y},
since they'll be turned to NOPs during replication.

Starting from this patch, in the situation described above, replica's
clock will be {0:y, 1:x}, and, since local space requests are now not
replicated at all, master will be able to follow replica, turning the
configuration to master-master.

Closes #4114
---
 src/box/box.cc                                |  17 +++
 src/box/recovery.cc                           |   8 +-
 src/box/relay.cc                              |  15 +--
 src/box/wal.c                                 |  16 ++-
 test/replication/autobootstrap.result         |   6 +
 test/replication/autobootstrap.test.lua       |   2 +
 test/replication/before_replace.result        |   8 +-
 test/replication/before_replace.test.lua      |   4 +-
 .../gh-4114-local-space-replication.result    | 121 ++++++++++++++++++
 .../gh-4114-local-space-replication.test.lua  |  47 +++++++
 test/replication/local_spaces.result          |   4 +
 test/replication/local_spaces.test.lua        |   3 +
 test/replication/misc.result                  |   6 +
 test/replication/misc.test.lua                |   2 +
 test/replication/quorum.result                |   6 +
 test/replication/quorum.test.lua              |   2 +
 test/replication/replica_rejoin.result        |   9 ++
 test/replication/replica_rejoin.test.lua      |   3 +
 test/replication/skip_conflict_row.result     |   3 +
 test/replication/skip_conflict_row.test.lua   |   1 +
 test/replication/suite.cfg                    |   1 +
 test/vinyl/errinj.result                      |   5 +
 test/vinyl/errinj.test.lua                    |   4 +
 23 files changed, 273 insertions(+), 20 deletions(-)
 create mode 100644 test/replication/gh-4114-local-space-replication.result
 create mode 100644 test/replication/gh-4114-local-space-replication.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 765d64678..f696d594a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1875,6 +1875,23 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 	say_info("remote vclock %s local vclock %s",
 		 vclock_to_string(&replica_clock), vclock_to_string(&vclock));
 
+	/*
+	 * Replica clock is used in gc state and recovery
+	 * initialization, so we need to replace the remote 0-th
+	 * component with our own one. This doesn't break
+	 * recovery: it finds the WAL with a vclock strictly less
+	 * than replia clock in all components except the 0th one.
+	 * This leads to finding the correct WAL, if it exists,
+	 * since we do not need to recover local rows (the ones,
+	 * that contribute to the 0-th vclock component).
+	 * Note, that it would be bad to set 0-th vclock component
+	 * to a smaller value, since it would unnecessarily
+	 * require additional WALs, which may have already been
+	 * deleted.
+	 * Speaking of gc, remote instances' local vclock
+	 * components are not used by consumers at all.
+	 */
+	vclock_reset(&replica_clock, 0, vclock_get(&replicaset.vclock, 0));
 	/*
 	 * Process SUBSCRIBE request via replication relay
 	 * Send current recovery vector clock as a marker
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 64aa467b1..d1a503cfc 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -261,11 +261,9 @@ recover_xlog(struct recovery *r, struct xstream *stream,
 			continue; /* already applied, skip */
 
 		/*
-		 * All rows in xlog files have an assigned
-		 * replica id. The only exception is anonymous
-		 * replica, which has a zero instance id.
-		 * In this case the only rows from such an instance
-		 * can be for the local spaces.
+		 * All rows in xlog files have an assigned replica
+		 * id. The only exception are local rows, which
+		 * are signed with a zero replica id.
 		 */
 		assert(row.replica_id != 0 || row.group_id == GROUP_LOCAL);
 		/*
diff --git a/src/box/relay.cc b/src/box/relay.cc
index fec9f07d1..3fa5465d3 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -756,16 +756,15 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
 {
 	struct relay *relay = container_of(stream, struct relay, stream);
 	assert(iproto_type_is_dml(packet->type));
-	/*
-	 * Transform replica local requests to IPROTO_NOP so as to
-	 * promote vclock on the replica without actually modifying
-	 * any data.
-	 */
 	if (packet->group_id == GROUP_LOCAL) {
 		/*
-		 * Replica-local requests generated while replica
-		 * was anonymous have a zero instance id. Just
-		 * skip all these rows.
+		 * We do not relay replica-local rows to other
+		 * instances, since we started signing them with
+		 * a zero instance id. However, if replica-local
+		 * rows, signed with a non-zero id are present in
+		 * our WAL, we still need to relay them as NOPs in
+		 * order to correctly promote the vclock on the
+		 * replica.
 		 */
 		if (packet->replica_id == REPLICA_ID_NIL)
 			return;
diff --git a/src/box/wal.c b/src/box/wal.c
index 3b094b0e8..1eb20272c 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -953,13 +953,19 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	/** Assign LSN to all local rows. */
 	for ( ; row < end; row++) {
 		if ((*row)->replica_id == 0) {
-			(*row)->lsn = vclock_inc(vclock_diff, instance_id) +
-				      vclock_get(base, instance_id);
 			/*
-			 * Note, an anonymous replica signs local
-			 * rows whith a zero instance id.
+			 * All rows representing local space data
+			 * manipulations are signed wth a zero
+			 * instance id. This is also true for
+			 * anonymous replicas, since they are
+			 * only capable of writing to local and
+			 * temporary spaces.
 			 */
-			(*row)->replica_id = instance_id;
+			if ((*row)->group_id != GROUP_LOCAL)
+				(*row)->replica_id = instance_id;
+
+			(*row)->lsn = vclock_inc(vclock_diff, (*row)->replica_id) +
+				      vclock_get(base, (*row)->replica_id);
 			/* Use lsn of the first local row as transaction id. */
 			tsn = tsn == 0 ? (*row)->lsn : tsn;
 			(*row)->tsn = tsn;
diff --git a/test/replication/autobootstrap.result b/test/replication/autobootstrap.result
index ec9667663..af87db536 100644
--- a/test/replication/autobootstrap.result
+++ b/test/replication/autobootstrap.result
@@ -162,6 +162,9 @@ box.schema.user.revoke('test_u', 'create', 'space')
 vclock = test_run:get_vclock('autobootstrap1')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 ---
 ...
@@ -206,6 +209,9 @@ test_run:wait_fullmesh(SERVERS)
 vclock = test_run:get_vclock("autobootstrap1")
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 ---
 ...
diff --git a/test/replication/autobootstrap.test.lua b/test/replication/autobootstrap.test.lua
index 8c75d1322..0be537a16 100644
--- a/test/replication/autobootstrap.test.lua
+++ b/test/replication/autobootstrap.test.lua
@@ -74,6 +74,7 @@ box.schema.user.revoke('test_u', 'create', 'space')
 
 -- Synchronize
 vclock = test_run:get_vclock('autobootstrap1')
+vclock[0] = nil
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 _ = test_run:wait_vclock("autobootstrap3", vclock)
 
@@ -95,6 +96,7 @@ _ = test_run:cmd("switch default")
 test_run:wait_fullmesh(SERVERS)
 
 vclock = test_run:get_vclock("autobootstrap1")
+vclock[0] = nil
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 _ = test_run:wait_vclock("autobootstrap3", vclock)
 
diff --git a/test/replication/before_replace.result b/test/replication/before_replace.result
index ce66718d1..e8bb982ae 100644
--- a/test/replication/before_replace.result
+++ b/test/replication/before_replace.result
@@ -266,7 +266,13 @@ box.space.test:replace{1, 1}
 ---
 - [1, 1]
 ...
-_ = test_run:wait_vclock('replica', test_run:get_vclock('default'))
+vclock = test_run:get_vclock('default')
+---
+...
+vclock[0] = nil
+---
+...
+_ = test_run:wait_vclock('replica', vclock)
 ---
 ...
 -- Check that replace{1, 2} coming from the master was suppressed
diff --git a/test/replication/before_replace.test.lua b/test/replication/before_replace.test.lua
index 3b9910cbc..8e54d2507 100644
--- a/test/replication/before_replace.test.lua
+++ b/test/replication/before_replace.test.lua
@@ -101,7 +101,9 @@ _ = box.space.test:before_replace(function(old, new) return new:update{{'+', 2,
 test_run:cmd("switch default")
 box.space.test:replace{1, 1}
 
-_ = test_run:wait_vclock('replica', test_run:get_vclock('default'))
+vclock = test_run:get_vclock('default')
+vclock[0] = nil
+_ = test_run:wait_vclock('replica', vclock)
 
 -- Check that replace{1, 2} coming from the master was suppressed
 -- by the before_replace trigger on the replica.
diff --git a/test/replication/gh-4114-local-space-replication.result b/test/replication/gh-4114-local-space-replication.result
new file mode 100644
index 000000000..9b63a4b99
--- /dev/null
+++ b/test/replication/gh-4114-local-space-replication.result
@@ -0,0 +1,121 @@
+-- test-run result file version 2
+env = require('test_run')
+ | ---
+ | ...
+test_run = env.new()
+ | ---
+ | ...
+
+--
+-- gh-4114. Account local space changes in a separate vclock
+-- component. Do not replicate local space changes, even as NOPs.
+--
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_local=true})
+ | ---
+ | ...
+_ = box.space.test:create_index("pk")
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, script "replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+a = box.info.vclock[0] or 0
+ | ---
+ | ...
+for i = 1,10 do box.space.test:insert{i} end
+ | ---
+ | ...
+box.info.vclock[0] == a + 10 or box.info.vclock[0] - a
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('switch replica')
+ | ---
+ | - true
+ | ...
+box.info.vclock[0]
+ | ---
+ | - null
+ | ...
+box.cfg{checkpoint_count=1}
+ | ---
+ | ...
+box.space.test:select{}
+ | ---
+ | - []
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+box.space.test:insert{2}
+ | ---
+ | - [2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+box.space.test:insert{3}
+ | ---
+ | - [3]
+ | ...
+
+box.info.vclock[0]
+ | ---
+ | - 3
+ | ...
+
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('set variable repl_source to "replica.listen"')
+ | ---
+ | - true
+ | ...
+
+box.cfg{replication=repl_source}
+ | ---
+ | ...
+test_run:wait_cond(function()\
+                       return box.info.replication[2].upstream and\
+                       box.info.replication[2].upstream.status == 'follow'\
+                   end,\
+                   10)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
diff --git a/test/replication/gh-4114-local-space-replication.test.lua b/test/replication/gh-4114-local-space-replication.test.lua
new file mode 100644
index 000000000..c18fb3b10
--- /dev/null
+++ b/test/replication/gh-4114-local-space-replication.test.lua
@@ -0,0 +1,47 @@
+env = require('test_run')
+test_run = env.new()
+
+--
+-- gh-4114. Account local space changes in a separate vclock
+-- component. Do not replicate local space changes, even as NOPs.
+--
+
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test', {is_local=true})
+_ = box.space.test:create_index("pk")
+
+test_run:cmd('create server replica with rpl_master=default, script "replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+a = box.info.vclock[0] or 0
+for i = 1,10 do box.space.test:insert{i} end
+box.info.vclock[0] == a + 10 or box.info.vclock[0] - a
+
+test_run:cmd('switch replica')
+box.info.vclock[0]
+box.cfg{checkpoint_count=1}
+box.space.test:select{}
+box.space.test:insert{1}
+box.snapshot()
+box.space.test:insert{2}
+box.snapshot()
+box.space.test:insert{3}
+
+box.info.vclock[0]
+
+test_run:cmd('switch default')
+
+test_run:cmd('set variable repl_source to "replica.listen"')
+
+box.cfg{replication=repl_source}
+test_run:wait_cond(function()\
+                       return box.info.replication[2].upstream and\
+                       box.info.replication[2].upstream.status == 'follow'\
+                   end,\
+                   10)
+
+-- Cleanup.
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/local_spaces.result b/test/replication/local_spaces.result
index cf2c52010..4855d8a88 100644
--- a/test/replication/local_spaces.result
+++ b/test/replication/local_spaces.result
@@ -288,6 +288,10 @@ _ = s3:insert{3}
 vclock = test_run:get_vclock('default')
 ---
 ...
+-- Ignore 0-th component when waiting. They don't match.
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock('replica', vclock)
 ---
 ...
diff --git a/test/replication/local_spaces.test.lua b/test/replication/local_spaces.test.lua
index 373e2cd20..c5e224030 100644
--- a/test/replication/local_spaces.test.lua
+++ b/test/replication/local_spaces.test.lua
@@ -112,6 +112,9 @@ _ = s1:insert{3}
 _ = s2:insert{3}
 _ = s3:insert{3}
 vclock = test_run:get_vclock('default')
+
+-- Ignore 0-th component when waiting. They don't match.
+vclock[0] = nil
 _ = test_run:wait_vclock('replica', vclock)
 
 test_run:cmd("switch replica")
diff --git a/test/replication/misc.result b/test/replication/misc.result
index b63d72846..e5d1f560e 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -214,6 +214,9 @@ box.space.space1:select{}
 vclock = test_run:get_vclock("autobootstrap1")
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 ---
 ...
@@ -414,6 +417,9 @@ while box.info.replication[2] == nil do fiber.sleep(0.01) end
 vclock = test_run:get_vclock('default')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock('replica_auth', vclock)
 ---
 ...
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index c454a0992..d285b014a 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -88,6 +88,7 @@ c.space.space1:insert{box.NULL, "data"} -- fails, but bumps sequence value
 c.space.space1:insert{box.NULL, 1, "data"}
 box.space.space1:select{}
 vclock = test_run:get_vclock("autobootstrap1")
+vclock[0] = nil
 _ = test_run:wait_vclock("autobootstrap2", vclock)
 test_run:cmd("switch autobootstrap2")
 box.space.space1:select{}
@@ -172,6 +173,7 @@ box.schema.user.grant('cluster', 'replication')
 
 while box.info.replication[2] == nil do fiber.sleep(0.01) end
 vclock = test_run:get_vclock('default')
+vclock[0] = nil
 _ = test_run:wait_vclock('replica_auth', vclock)
 
 test_run:cmd("stop server replica_auth")
diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index 8ea9fa22e..4b90d13cb 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -325,6 +325,9 @@ space:insert{2}
 vclock = test_run:get_vclock("default")
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("replica", vclock)
 ---
 ...
@@ -390,6 +393,9 @@ box.cfg{replication = repl}
 vclock = test_run:get_vclock("master_quorum1")
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("master_quorum2", vclock)
 ---
 ...
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 31be48fc2..7c4b76671 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -125,6 +125,7 @@ test_run:cmd("switch default")
 box.cfg{listen = listen}
 space:insert{2}
 vclock = test_run:get_vclock("default")
+vclock[0] = nil
 _ = test_run:wait_vclock("replica", vclock)
 test_run:cmd("switch replica")
 box.info.status -- running
@@ -145,6 +146,7 @@ box.cfg{replication = ""}
 box.space.test:insert{1}
 box.cfg{replication = repl}
 vclock = test_run:get_vclock("master_quorum1")
+vclock[0] = nil
 _ = test_run:wait_vclock("master_quorum2", vclock)
 test_run:cmd("switch master_quorum2")
 box.space.test:select()
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index b8ed79f14..dd04ae297 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -144,6 +144,9 @@ for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
 vclock = test_run:get_vclock('default')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock('replica', vclock)
 ---
 ...
@@ -295,6 +298,9 @@ for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock('default', vclock)
 ---
 ...
@@ -345,6 +351,9 @@ for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock('default', vclock)
 ---
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 25c0849ec..410ef44d7 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -55,6 +55,7 @@ test_run:cmd("switch default")
 -- Make sure the replica follows new changes.
 for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
 vclock = test_run:get_vclock('default')
+vclock[0] = nil
 _ = test_run:wait_vclock('replica', vclock)
 test_run:cmd("switch replica")
 box.space.test:select()
@@ -109,6 +110,7 @@ box.space.test:replace{1}
 test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
+vclock[0] = nil
 _ = test_run:wait_vclock('default', vclock)
 -- Restart the master and force garbage collection.
 test_run:cmd("switch default")
@@ -126,6 +128,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
 test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
 vclock = test_run:get_vclock('replica')
+vclock[0] = nil
 _ = test_run:wait_vclock('default', vclock)
 -- Restart the replica. It should successfully rebootstrap.
 test_run:cmd("restart server replica with args='true'")
diff --git a/test/replication/skip_conflict_row.result b/test/replication/skip_conflict_row.result
index 9b2777872..d70ac8e2a 100644
--- a/test/replication/skip_conflict_row.result
+++ b/test/replication/skip_conflict_row.result
@@ -54,6 +54,9 @@ box.info.status
 vclock = test_run:get_vclock('default')
 ---
 ...
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("replica", vclock)
 ---
 ...
diff --git a/test/replication/skip_conflict_row.test.lua b/test/replication/skip_conflict_row.test.lua
index 2982c730a..04fd08136 100644
--- a/test/replication/skip_conflict_row.test.lua
+++ b/test/replication/skip_conflict_row.test.lua
@@ -19,6 +19,7 @@ space:insert{2}
 box.info.status
 
 vclock = test_run:get_vclock('default')
+vclock[0] = nil
 _ = test_run:wait_vclock("replica", vclock)
 test_run:cmd("switch replica")
 box.info.replication[1].upstream.message
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 90fd53ca6..0907ac17f 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -12,6 +12,7 @@
     "on_schema_init.test.lua": {},
     "long_row_timeout.test.lua": {},
     "join_without_snap.test.lua": {},
+    "gh-4114-local-space-replication.test.lua": {},
     "gh-4402-info-errno.test.lua": {},
     "gh-4605-empty-password.test.lua": {},
     "gh-4606-admin-creds.test.lua": {},
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index 2635da265..2bd701f70 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1241,6 +1241,11 @@ test_run:cmd("switch default")
 vclock = test_run:get_vclock("default")
 ---
 ...
+-- Ignore 0-th vclock component. They don't match between
+-- replicas.
+vclock[0] = nil
+---
+...
 _ = test_run:wait_vclock("replica", vclock)
 ---
 ...
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 4230cfae3..750d3bfe8 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -455,6 +455,10 @@ box.cfg{read_only = true}
 
 test_run:cmd("switch default")
 vclock = test_run:get_vclock("default")
+
+-- Ignore 0-th vclock component. They don't match between
+-- replicas.
+vclock[0] = nil
 _ = test_run:wait_vclock("replica", vclock)
 
 test_run:cmd("stop server replica")
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
@ 2020-04-09 23:08   ` Vladislav Shpilevoy
  2020-04-10 12:25     ` Serge Petrenko
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-09 23:08 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches

Hi! Thanks for the patch!

> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index be026a43c..21a68220a 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -51,7 +51,7 @@ static_assert(IPROTO_DATA < 0x7f && IPROTO_METADATA < 0x7f &&
>  static inline uint32_t
>  mp_sizeof_vclock(const struct vclock *vclock)
>  {
> -	uint32_t size = vclock_size(vclock);
> +	uint32_t size = vclock_size_ignore0(vclock);

This doesn't look right to silently ignore 0 component in a
function, which says nothing about that: mp_sizeof_vclock().
Not in the function name, nor in a comment. Maybe worth renaming
it to mp_sizeof_vclock_ignore0(). Along with mp_encode_vclock().

I am picky to the names, because we still actually have places,
where 0 component *is* needed (we have, right?). And we need to
say explicitly, which places ignore 0, which don't.

Also seems mp_decode_vclock() does not ignore 0. Is it possible
that we will get a connection from tarantool, which still does not
ignore 0 component, and will send it to us? For example, it was
upgraded recently on the latest branch, right before we pushed this
patchset.

>  	return mp_sizeof_map(size) + size * (mp_sizeof_uint(UINT32_MAX) +
>  					     mp_sizeof_uint(UINT64_MAX));
>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Serge Petrenko
@ 2020-04-09 23:08   ` Vladislav Shpilevoy
  2020-04-10 12:25     ` Serge Petrenko
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-09 23:08 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches

Hi! Thanks for the patch!

See 4 questions below.

On 07/04/2020 17:49, Serge Petrenko wrote:
> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
> their vclock signature, which is the sum of all vclock components.
> This approach is wrong, and this can be shown by a little example.
> The example will be a little synthetic, but it'll illustrate the problem.
> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
> with id 3. Say, С replicates from both A and B, and there is no replication
> between A and B (say, the instances were reconfigured to not replicate from each
> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
> snapshot and creates a new xlog with signature 10. A’s directory will look like:
> 00…000.xlog 00…010.snap 00….010.xlog
> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
> assumes it can be). Actually, replica still needs 00…00.xlog, because it
> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
> replicated yet.
> 
> If instead of using vclock signatures, gc consumers used vclocks, such a problem
> wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
> on.

1.
"
    It looks like B will always delete its snaps immediately, because C will
    send vclock lexicographically always bigger than B's.
"

I wrote the comment above and only then realized that vclock_min() is not really
min between 2 values. It is min for every of 32 components of vclock. So the
problem above won't happen, right?

B's "min" vclock in gc consumers will be

    {0, min(B last snap, C last received row), 0, 0, ...}

Correct? It won't see A's part of C's vclock.

> Rewrite gc to perform cleanup based on finding minimal vclock
> components present in at least one of the consumer vclocks instead of
> just comparing vclock signatures.
> 
> Prerequisite #4114
> ---
>  src/box/gc.c     | 41 ++++++++++++++---------
>  src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 110 insertions(+), 16 deletions(-)
> 
> diff --git a/src/box/gc.c b/src/box/gc.c
> index f5c387f9d..a2d0a515c 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
>  	/* At least one checkpoint must always be available. */
>  	assert(checkpoint != NULL);
>  
> +	/* Find the vclock of the oldest WAL row to keep. */
> +	struct vclock min_vclock;
> +	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
>  	/*
> -	 * Find the vclock of the oldest WAL row to keep.
> +	 * Vclock of the oldest WAL row to keep is a by-component
> +	 * minimum of all consumer vclocks and the oldest
> +	 * checkpoint vclock. This ensures that all rows needed by
> +	 * at least one consumer are kept.
>  	 * Note, we must keep all WALs created after the
>  	 * oldest checkpoint, even if no consumer needs them.
>  	 */
> -	const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
> -				       &gc_tree_first(&gc.consumers)->vclock);
> -	if (vclock == NULL ||
> -	    vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
> -		vclock = &checkpoint->vclock;
> -
> -	if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
> -		vclock_copy(&gc.vclock, vclock);
> +	vclock_copy(&min_vclock, &checkpoint->vclock);
> +	while (consumer != NULL) {
> +		/*
> +		 * Consumers will never need rows signed
> +		 * with a zero instance id (local rows).
> +		 */
> +		vclock_min_ignore0(&min_vclock, &consumer->vclock);
> +		consumer = gc_tree_next(&gc.consumers, consumer);

2.
"
    Can you use gc_tree_last()? Without vclock_min_ignore0 and manual scan.

    Also since vclock[0] is never sent (mp_encode_vclock() now ignores it),
    what are the consumers having vclock[0] != 0 here? If there are only local,
    does it matter? They will be ordered always ok.
"

I wrote the comment above also before realized that vclock_min() is not a
normal min(). So it should be irrelevant, yes?


3. Should not we patch gc_advance() too? It still uses sum().

> +	}
> +
> +	if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
> +		vclock_copy(&gc.vclock, &min_vclock);
>  		run_wal_gc = true;
>  	}
> diff --git a/src/box/vclock.h b/src/box/vclock.h
> index 5c0525b00..5865f7443 100644
> --- a/src/box/vclock.h
> +++ b/src/box/vclock.h
> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
>  	return vclock_compare_generic(a, b, true);
>  }
>  
> +/**
> + * Compare vclocks lexicographically.
> + * The following affirmation holds: if a < b in terms of
> + * vclock_compare, then a < b in terms of vclock_lex_compare.
> + * However, a < b in terms of vclock_lex_compare doesn't mean
> + * than a < b in terms of vclock_compare. Example:
> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
> + * are incomparable in terms of vclock_compare.
> + * All vclocks are comparable lexicographically.
> + *
> + * @param a Vclock.
> + * @param b Vclock.
> + * @retval 1 If @a a is lexicographically greater than @a b.
> + * @retval -1 If @a a is lexicographically less than @a b.
> + * @retval 0 If vclocks are equal.
> + */
> +static inline int
> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)

4. What are cases, when vclock_lex_compare() can't replace vclock_compare()?

> +{
> +	vclock_map_t map = a->map | b->map;
> +	struct bit_iterator it;
> +	bit_iterator_init(&it, &map, sizeof(map), true);
> +	for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
> +	    replica_id = bit_iterator_next(&it)) {
> +		int64_t lsn_a = vclock_get(a, replica_id);
> +		int64_t lsn_b = vclock_get(b, replica_id);
> +		if (lsn_a < lsn_b)
> +			return -1;
> +		if (lsn_a > lsn_b)
> +			return 1;
> +	}
> +	return 0;
> +}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses
  2020-04-09 23:08   ` Vladislav Shpilevoy
@ 2020-04-10 12:25     ` Serge Petrenko
  2020-04-11 16:05       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 14+ messages in thread
From: Serge Petrenko @ 2020-04-10 12:25 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches


> 10 апр. 2020 г., в 02:08, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
> 
> Hi! Thanks for the patch!

Hi! Thanks for your review!

> 
>> diff --git a/src/box/xrow.c b/src/box/xrow.c
>> index be026a43c..21a68220a 100644
>> --- a/src/box/xrow.c
>> +++ b/src/box/xrow.c
>> @@ -51,7 +51,7 @@ static_assert(IPROTO_DATA < 0x7f && IPROTO_METADATA < 0x7f &&
>> static inline uint32_t
>> mp_sizeof_vclock(const struct vclock *vclock)
>> {
>> -	uint32_t size = vclock_size(vclock);
>> +	uint32_t size = vclock_size_ignore0(vclock);
> 
> This doesn't look right to silently ignore 0 component in a
> function, which says nothing about that: mp_sizeof_vclock().
> Not in the function name, nor in a comment. Maybe worth renaming
> it to mp_sizeof_vclock_ignore0(). Along with mp_encode_vclock().
> 
> I am picky to the names, because we still actually have places,
> where 0 component *is* needed (we have, right?). And we need to
> say explicitly, which places ignore 0, which don't.
> 
> Also seems mp_decode_vclock() does not ignore 0. Is it possible
> that we will get a connection from tarantool, which still does not
> ignore 0 component, and will send it to us? For example, it was
> upgraded recently on the latest branch, right before we pushed this
> patchset.

Ok, renamed mp_decode_vclock -> mp_decode_vclock_ignore0
            mp_encode_vclock -> mp_encode_vclock_ignore0

I also replaced back the now unnecessary vclock_compare_ignore0 to
vclock_compare.

The diff’s below.

diff --git a/src/box/relay.cc b/src/box/relay.cc
index fec9f07d1..c634348a4 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -464,7 +464,7 @@ relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
 		 * the greater signatures is due to changes pulled
 		 * from other members of the cluster.
 		 */
-		if (vclock_compare_ignore0(&curr->vclock, vclock) > 0)
+		if (vclock_compare(&curr->vclock, vclock) > 0)
 			break;
 		stailq_shift(&relay->pending_gc);
 		free(gc_msg);
diff --git a/src/box/replication.cc b/src/box/replication.cc
index ac71e0d89..7c10fb6f2 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -775,8 +775,8 @@ replicaset_needs_rejoin(struct replica **master)
 			continue;
 
 		const struct ballot *ballot = &applier->ballot;
-		if (vclock_compare_ignore0(&ballot->gc_vclock,
-					   &replicaset.vclock) <= 0) {
+		if (vclock_compare(&ballot->gc_vclock,
+				   &replicaset.vclock) <= 0) {
 			/*
 			 * There's at least one master that still stores
 			 * WALs needed by this instance. Proceed to local
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 21a68220a..51de49387 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -49,7 +49,7 @@ static_assert(IPROTO_DATA < 0x7f && IPROTO_METADATA < 0x7f &&
 	      "one byte");
 
 static inline uint32_t
-mp_sizeof_vclock(const struct vclock *vclock)
+mp_sizeof_vclock_ignore0(const struct vclock *vclock)
 {
 	uint32_t size = vclock_size_ignore0(vclock);
 	return mp_sizeof_map(size) + size * (mp_sizeof_uint(UINT32_MAX) +
@@ -57,7 +57,7 @@ mp_sizeof_vclock(const struct vclock *vclock)
 }
 
 static inline char *
-mp_encode_vclock(char *data, const struct vclock *vclock)
+mp_encode_vclock_ignore0(char *data, const struct vclock *vclock)
 {
 	data = mp_encode_map(data, vclock_size_ignore0(vclock));
 	struct vclock_iterator it;
@@ -74,7 +74,7 @@ mp_encode_vclock(char *data, const struct vclock *vclock)
 }
 
 static int
-mp_decode_vclock(const char **data, struct vclock *vclock)
+mp_decode_vclock_ignore_0(const char **data, struct vclock *vclock)
 {
 	vclock_create(vclock);
 	if (mp_typeof(**data) != MP_MAP)
@@ -86,6 +86,14 @@ mp_decode_vclock(const char **data, struct vclock *vclock)
 		uint32_t id = mp_decode_uint(data);
 		if (mp_typeof(**data) != MP_UINT)
 			return -1;
+		/*
+		 * Skip vclock[0] coming from the remote
+		 * instances.
+		 */
+		if (id == 0) {
+			mp_next(data);
+			continue;
+		}
 		int64_t lsn = mp_decode_uint(data);
 		if (lsn > 0)
 			vclock_follow(vclock, id, lsn);
@@ -412,7 +420,7 @@ iproto_reply_vclock(struct obuf *out, const struct vclock *vclock,
 		    uint64_t sync, uint32_t schema_version)
 {
 	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(vclock);
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock_ignore0(vclock);
 
 	char *buf = obuf_reserve(out, max_size);
 	if (buf == NULL) {
@@ -424,7 +432,7 @@ iproto_reply_vclock(struct obuf *out, const struct vclock *vclock,
 	char *data = buf + IPROTO_HEADER_LEN;
 	data = mp_encode_map(data, 1);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
-	data = mp_encode_vclock(data, vclock);
+	data = mp_encode_vclock_ignore0(data, vclock);
 	size_t size = data - buf;
 	assert(size <= max_size);
 
@@ -445,8 +453,10 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
 		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(4) +
 		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_ro) +
 		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(ballot->is_loading) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->vclock) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&ballot->gc_vclock);
+		mp_sizeof_uint(UINT32_MAX) +
+		mp_sizeof_vclock_ignore0(&ballot->vclock) +
+		mp_sizeof_uint(UINT32_MAX) +
+		mp_sizeof_vclock_ignore0(&ballot->gc_vclock);
 
 	char *buf = obuf_reserve(out, max_size);
 	if (buf == NULL) {
@@ -464,9 +474,9 @@ iproto_reply_vote(struct obuf *out, const struct ballot *ballot,
 	data = mp_encode_uint(data, IPROTO_BALLOT_IS_LOADING);
 	data = mp_encode_bool(data, ballot->is_loading);
 	data = mp_encode_uint(data, IPROTO_BALLOT_VCLOCK);
-	data = mp_encode_vclock(data, &ballot->vclock);
+	data = mp_encode_vclock_ignore0(data, &ballot->vclock);
 	data = mp_encode_uint(data, IPROTO_BALLOT_GC_VCLOCK);
-	data = mp_encode_vclock(data, &ballot->gc_vclock);
+	data = mp_encode_vclock_ignore0(data, &ballot->gc_vclock);
 	size_t size = data - buf;
 	assert(size <= max_size);
 
@@ -1244,11 +1254,13 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
 			ballot->is_loading = mp_decode_bool(&data);
 			break;
 		case IPROTO_BALLOT_VCLOCK:
-			if (mp_decode_vclock(&data, &ballot->vclock) != 0)
+			if (mp_decode_vclock_ignore_0(&data,
+						      &ballot->vclock) != 0)
 				goto err;
 			break;
 		case IPROTO_BALLOT_GC_VCLOCK:
-			if (mp_decode_vclock(&data, &ballot->gc_vclock) != 0)
+			if (mp_decode_vclock_ignore_0(&data,
+						      &ballot->gc_vclock) != 0)
 				goto err;
 			break;
 		default:
@@ -1270,7 +1282,8 @@ xrow_encode_register(struct xrow_header *row,
 	size_t size = mp_sizeof_map(2) +
 		      mp_sizeof_uint(IPROTO_INSTANCE_UUID) +
 		      mp_sizeof_str(UUID_STR_LEN) +
-		      mp_sizeof_uint(IPROTO_VCLOCK) + mp_sizeof_vclock(vclock);
+		      mp_sizeof_uint(IPROTO_VCLOCK) +
+		      mp_sizeof_vclock_ignore0(vclock);
 	char *buf = (char *) region_alloc(&fiber()->gc, size);
 	if (buf == NULL) {
 		diag_set(OutOfMemory, size, "region_alloc", "buf");
@@ -1281,7 +1294,7 @@ xrow_encode_register(struct xrow_header *row,
 	data = mp_encode_uint(data, IPROTO_INSTANCE_UUID);
 	data = xrow_encode_uuid(data, instance_uuid);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
-	data = mp_encode_vclock(data, vclock);
+	data = mp_encode_vclock_ignore0(data, vclock);
 	assert(data <= buf + size);
 	row->body[0].iov_base = buf;
 	row->body[0].iov_len = (data - buf);
@@ -1298,7 +1311,8 @@ xrow_encode_subscribe(struct xrow_header *row,
 		      uint32_t id_filter)
 {
 	memset(row, 0, sizeof(*row));
-	size_t size = XROW_BODY_LEN_MAX + mp_sizeof_vclock(vclock);
+	size_t size = XROW_BODY_LEN_MAX +
+		      mp_sizeof_vclock_ignore0(vclock);
 	char *buf = (char *) region_alloc(&fiber()->gc, size);
 	if (buf == NULL) {
 		diag_set(OutOfMemory, size, "region_alloc", "buf");
@@ -1312,7 +1326,7 @@ xrow_encode_subscribe(struct xrow_header *row,
 	data = mp_encode_uint(data, IPROTO_INSTANCE_UUID);
 	data = xrow_encode_uuid(data, instance_uuid);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
-	data = mp_encode_vclock(data, vclock);
+	data = mp_encode_vclock_ignore0(data, vclock);
 	data = mp_encode_uint(data, IPROTO_SERVER_VERSION);
 	data = mp_encode_uint(data, tarantool_version_id());
 	data = mp_encode_uint(data, IPROTO_REPLICA_ANON);
@@ -1391,7 +1405,7 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		case IPROTO_VCLOCK:
 			if (vclock == NULL)
 				goto skip;
-			if (mp_decode_vclock(&d, vclock) != 0) {
+			if (mp_decode_vclock_ignore_0(&d, vclock) != 0) {
 				xrow_on_decode_err(data, end, ER_INVALID_MSGPACK,
 						   "invalid VCLOCK");
 				return -1;
@@ -1473,7 +1487,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock)
 	memset(row, 0, sizeof(*row));
 
 	/* Add vclock to response body */
-	size_t size = 8 + mp_sizeof_vclock(vclock);
+	size_t size = 8 + mp_sizeof_vclock_ignore0(vclock);
 	char *buf = (char *) region_alloc(&fiber()->gc, size);
 	if (buf == NULL) {
 		diag_set(OutOfMemory, size, "region_alloc", "buf");
@@ -1482,7 +1496,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock)
 	char *data = buf;
 	data = mp_encode_map(data, 1);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
-	data = mp_encode_vclock(data, vclock);
+	data = mp_encode_vclock_ignore0(data, vclock);
 	assert(data <= buf + size);
 	row->body[0].iov_base = buf;
 	row->body[0].iov_len = (data - buf);
@@ -1498,7 +1512,8 @@ xrow_encode_subscribe_response(struct xrow_header *row,
 {
 	memset(row, 0, sizeof(*row));
 	size_t size = mp_sizeof_map(2) +
-		      mp_sizeof_uint(IPROTO_VCLOCK) + mp_sizeof_vclock(vclock) +
+		      mp_sizeof_uint(IPROTO_VCLOCK) +
+		      mp_sizeof_vclock_ignore0(vclock) +
 		      mp_sizeof_uint(IPROTO_CLUSTER_UUID) +
 		      mp_sizeof_str(UUID_STR_LEN);
 	char *buf = (char *) region_alloc(&fiber()->gc, size);
@@ -1509,7 +1524,7 @@ xrow_encode_subscribe_response(struct xrow_header *row,
 	char *data = buf;
 	data = mp_encode_map(data, 2);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
-	data = mp_encode_vclock(data, vclock);
+	data = mp_encode_vclock_ignore0(data, vclock);
 	data = mp_encode_uint(data, IPROTO_CLUSTER_UUID);
 	data = xrow_encode_uuid(data, replicaset_uuid);
 	assert(data <= buf + size);

> 
>> 	return mp_sizeof_map(size) + size * (mp_sizeof_uint(UINT32_MAX) +
>> 					     mp_sizeof_uint(UINT64_MAX));
>> }



--
Serge Petrenko
sergepetrenko@tarantool.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
  2020-04-09 23:08   ` Vladislav Shpilevoy
@ 2020-04-10 12:25     ` Serge Petrenko
  2020-04-11 16:04       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 14+ messages in thread
From: Serge Petrenko @ 2020-04-10 12:25 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml


> 10 апр. 2020 г., в 02:08, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
> 
> Hi! Thanks for the patch!

Hi! Thanks for your review!

> 
> See 4 questions below.
> 
> On 07/04/2020 17:49, Serge Petrenko wrote:
>> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
>> their vclock signature, which is the sum of all vclock components.
>> This approach is wrong, and this can be shown by a little example.
>> The example will be a little synthetic, but it'll illustrate the problem.
>> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
>> with id 3. Say, С replicates from both A and B, and there is no replication
>> between A and B (say, the instances were reconfigured to not replicate from each
>> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
>> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
>> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
>> snapshot and creates a new xlog with signature 10. A’s directory will look like:
>> 00…000.xlog 00…010.snap 00….010.xlog
>> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
>> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
>> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
>> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
>> assumes it can be). Actually, replica still needs 00…00.xlog, because it
>> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
>> replicated yet.
>> 
>> If instead of using vclock signatures, gc consumers used vclocks, such a problem
>> wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
>> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
>> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
>> on.
> 
> 1.
> "
>    It looks like B will always delete its snaps immediately, because C will
>    send vclock lexicographically always bigger than B's.
> "

I think you meant xlogs.  Snaps are deleted only when a new snap is created and
total snap count exceeds `box.cfg.checkpoint_count`
Anyway, your remark below is correct.

> 
> I wrote the comment above and only then realized that vclock_min() is not really
> min between 2 values. It is min for every of 32 components of vclock. So the
> problem above won't happen, right?
> 
> B's "min" vclock in gc consumers will be
> 
>    {0, min(B last snap, C last received row), 0, 0, ...}
> 
> Correct? It won't see A's part of C's vclock.

Yes, {1:0, 2:min(BB last snap, C last received row), 3:0, ….}.

> 
>> Rewrite gc to perform cleanup based on finding minimal vclock
>> components present in at least one of the consumer vclocks instead of
>> just comparing vclock signatures.
>> 
>> Prerequisite #4114
>> ---
>> src/box/gc.c     | 41 ++++++++++++++---------
>> src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 110 insertions(+), 16 deletions(-)
>> 
>> diff --git a/src/box/gc.c b/src/box/gc.c
>> index f5c387f9d..a2d0a515c 100644
>> --- a/src/box/gc.c
>> +++ b/src/box/gc.c
>> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
>> 	/* At least one checkpoint must always be available. */
>> 	assert(checkpoint != NULL);
>> 
>> +	/* Find the vclock of the oldest WAL row to keep. */
>> +	struct vclock min_vclock;
>> +	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
>> 	/*
>> -	 * Find the vclock of the oldest WAL row to keep.
>> +	 * Vclock of the oldest WAL row to keep is a by-component
>> +	 * minimum of all consumer vclocks and the oldest
>> +	 * checkpoint vclock. This ensures that all rows needed by
>> +	 * at least one consumer are kept.
>> 	 * Note, we must keep all WALs created after the
>> 	 * oldest checkpoint, even if no consumer needs them.
>> 	 */
>> -	const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
>> -				       &gc_tree_first(&gc.consumers)->vclock);
>> -	if (vclock == NULL ||
>> -	    vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
>> -		vclock = &checkpoint->vclock;
>> -
>> -	if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
>> -		vclock_copy(&gc.vclock, vclock);
>> +	vclock_copy(&min_vclock, &checkpoint->vclock);
>> +	while (consumer != NULL) {
>> +		/*
>> +		 * Consumers will never need rows signed
>> +		 * with a zero instance id (local rows).
>> +		 */
>> +		vclock_min_ignore0(&min_vclock, &consumer->vclock);
>> +		consumer = gc_tree_next(&gc.consumers, consumer);
> 
> 2.
> "
>    Can you use gc_tree_last()? Without vclock_min_ignore0 and manual scan.
> 
>    Also since vclock[0] is never sent (mp_encode_vclock() now ignores it),
>    what are the consumers having vclock[0] != 0 here? If there are only local,
>    does it matter? They will be ordered always ok.
> "

consumers have the master instance’s vclock[0] component. It is just a leftover
from recovery initialization, and I didn’t bother clearing it.
So, the consumer, who is the last to report its status, will have the biggest
vclock[0], if master writes to local spaces at all.

No, gc_tree_last() (did you mean gc_tree_first() ? ) cannot be used here.
It’ll return the consumer with the biggest (smallest) vclock[0].
Which is the last (first) one to report its status and not necessarily the
most (least) up to date one.

> 
> I wrote the comment above also before realized that vclock_min() is not a
> normal min(). So it should be irrelevant, yes?

Yep.

> 
> 
> 3. Should not we patch gc_advance() too? It still uses sum().

Yes! Thanks for noticing.

diff --git a/src/box/gc.c b/src/box/gc.c
index a2d0a515c..36b2d7f4e 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -295,10 +295,18 @@ gc_advance(const struct vclock *vclock)
 	vclock_copy(&gc.vclock, vclock);
 
 	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
-	while (consumer != NULL &&
-	       vclock_sum(&consumer->vclock) < vclock_sum(vclock)) {
+	while (consumer != NULL) {
 		struct gc_consumer *next = gc_tree_next(&gc.consumers,
 							consumer);
+		/*
+		 * Remove all the consumers whose vclocks are
+		 * either less than or incomparable with the wal
+		 * gc vclock.
+		 */
+		if (vclock_compare_ignore0(vclock, &consumer->vclock) <= 0) {
+			consumer = next;
+			continue;
+		}
 		assert(!consumer->is_inactive);
 		consumer->is_inactive = true;
 		gc_tree_remove(&gc.consumers, consumer);

> 
>> +	}
>> +
>> +	if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
>> +		vclock_copy(&gc.vclock, &min_vclock);
>> 		run_wal_gc = true;
>> 	}
>> diff --git a/src/box/vclock.h b/src/box/vclock.h
>> index 5c0525b00..5865f7443 100644
>> --- a/src/box/vclock.h
>> +++ b/src/box/vclock.h
>> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
>> 	return vclock_compare_generic(a, b, true);
>> }
>> 
>> +/**
>> + * Compare vclocks lexicographically.
>> + * The following affirmation holds: if a < b in terms of
>> + * vclock_compare, then a < b in terms of vclock_lex_compare.
>> + * However, a < b in terms of vclock_lex_compare doesn't mean
>> + * than a < b in terms of vclock_compare. Example:
>> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
>> + * are incomparable in terms of vclock_compare.
>> + * All vclocks are comparable lexicographically.
>> + *
>> + * @param a Vclock.
>> + * @param b Vclock.
>> + * @retval 1 If @a a is lexicographically greater than @a b.
>> + * @retval -1 If @a a is lexicographically less than @a b.
>> + * @retval 0 If vclocks are equal.
>> + */
>> +static inline int
>> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
> 
> 4. What are cases, when vclock_lex_compare() can't replace vclock_compare()?

replicaset_needs_rejoin() comes to mind. Forget about vclock[0] for now.
Let’s say we have 2 instances with id’s 2 and 3. (Instance with id 1 died, for
example). And 3 is restarting and defining whether it needs a rebootstrap from 2.
Say, 2’s vclock is {1:10, 2:15, 3:30} and 3’s normal vclock is {1:10, 2:10, 3:50}.
Even though 3’s vclock is lexicographically smaller than 2’s vclock
(2nd vclock[2] = 15, 3rd vclock[2] = 10), 3 cannot reboostrap from 2, because 3 has data,
that will be destroyed by rebootstrap {3:31} - {3:50}.
So only the usual vclock_compare() can be used here.
3 can rebootstrap from 2 iff 3’s vclock is less than 2’s vclock in EVERY component.

> 
>> +{
>> +	vclock_map_t map = a->map | b->map;
>> +	struct bit_iterator it;
>> +	bit_iterator_init(&it, &map, sizeof(map), true);
>> +	for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
>> +	    replica_id = bit_iterator_next(&it)) {
>> +		int64_t lsn_a = vclock_get(a, replica_id);
>> +		int64_t lsn_b = vclock_get(b, replica_id);
>> +		if (lsn_a < lsn_b)
>> +			return -1;
>> +		if (lsn_a > lsn_b)
>> +			return 1;
>> +	}
>> +	return 0;
>> +}



--
Serge Petrenko
sergepetrenko@tarantool.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
  2020-04-10 12:25     ` Serge Petrenko
@ 2020-04-11 16:04       ` Vladislav Shpilevoy
  2020-04-13 10:12         ` Serge Petrenko
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-11 16:04 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tml

Hi! Thanks for the answers!

gc_add_checkpoint() and gc_consumer_advance() still use sum().
Is it ok?

I see gc_consumer_advance() compares vclocks of two neighbour
consumers using sum() to check tree's invariant, even though
the invariant is not related to sum() anymore.

Please, check all the other vclock_sum() invocations too,
just in case.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses
  2020-04-10 12:25     ` Serge Petrenko
@ 2020-04-11 16:05       ` Vladislav Shpilevoy
  2020-04-13 10:12         ` Serge Petrenko
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-11 16:05 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches

Hi! Thanks for the fixes!

Consider more below and on the branch in a separate commit.

====================
diff --git a/src/box/xrow.c b/src/box/xrow.c
index b22390a14..70aeb6a3c 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -74,7 +74,7 @@ mp_encode_vclock_ignore0(char *data, const struct vclock *vclock)
 }
 
 static int
-mp_decode_vclock_ignore_0(const char **data, struct vclock *vclock)
+mp_decode_vclock_ignore0(const char **data, struct vclock *vclock)
 {
 	vclock_create(vclock);
 	if (mp_typeof(**data) != MP_MAP)
@@ -86,16 +86,12 @@ mp_decode_vclock_ignore_0(const char **data, struct vclock *vclock)
 		uint32_t id = mp_decode_uint(data);
 		if (mp_typeof(**data) != MP_UINT)
 			return -1;
+		int64_t lsn = mp_decode_uint(data);
 		/*
 		 * Skip vclock[0] coming from the remote
 		 * instances.
 		 */
-		if (id == 0) {
-			mp_next(data);
-			continue;
-		}
-		int64_t lsn = mp_decode_uint(data);
-		if (lsn > 0)
+		if (lsn > 0 && id != 0)
 			vclock_follow(vclock, id, lsn);
 	}
 	return 0;
@@ -1254,13 +1250,13 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
 			ballot->is_loading = mp_decode_bool(&data);
 			break;
 		case IPROTO_BALLOT_VCLOCK:
-			if (mp_decode_vclock_ignore_0(&data,
-						      &ballot->vclock) != 0)
+			if (mp_decode_vclock_ignore0(&data,
+						     &ballot->vclock) != 0)
 				goto err;
 			break;
 		case IPROTO_BALLOT_GC_VCLOCK:
-			if (mp_decode_vclock_ignore_0(&data,
-						      &ballot->gc_vclock) != 0)
+			if (mp_decode_vclock_ignore0(&data,
+						     &ballot->gc_vclock) != 0)
 				goto err;
 			break;
 		default:
@@ -1405,7 +1401,7 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		case IPROTO_VCLOCK:
 			if (vclock == NULL)
 				goto skip;
-			if (mp_decode_vclock_ignore_0(&d, vclock) != 0) {
+			if (mp_decode_vclock_ignore0(&d, vclock) != 0) {
 				xrow_on_decode_err(data, end, ER_INVALID_MSGPACK,
 						   "invalid VCLOCK");
 				return -1;

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses
  2020-04-11 16:05       ` Vladislav Shpilevoy
@ 2020-04-13 10:12         ` Serge Petrenko
  0 siblings, 0 replies; 14+ messages in thread
From: Serge Petrenko @ 2020-04-13 10:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml



> 11 апр. 2020 г., в 19:05, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
> 
> Hi! Thanks for the fixes!
> 
> Consider more below and on the branch in a separate commit.

Thanks! Squashed.

> 
> ====================
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index b22390a14..70aeb6a3c 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -74,7 +74,7 @@ mp_encode_vclock_ignore0(char *data, const struct vclock *vclock)
> }
> 
> static int
> -mp_decode_vclock_ignore_0(const char **data, struct vclock *vclock)
> +mp_decode_vclock_ignore0(const char **data, struct vclock *vclock)
> {
> 	vclock_create(vclock);
> 	if (mp_typeof(**data) != MP_MAP)
> @@ -86,16 +86,12 @@ mp_decode_vclock_ignore_0(const char **data, struct vclock *vclock)
> 		uint32_t id = mp_decode_uint(data);
> 		if (mp_typeof(**data) != MP_UINT)
> 			return -1;
> +		int64_t lsn = mp_decode_uint(data);
> 		/*
> 		 * Skip vclock[0] coming from the remote
> 		 * instances.
> 		 */
> -		if (id == 0) {
> -			mp_next(data);
> -			continue;
> -		}
> -		int64_t lsn = mp_decode_uint(data);
> -		if (lsn > 0)
> +		if (lsn > 0 && id != 0)
> 			vclock_follow(vclock, id, lsn);
> 	}
> 	return 0;
> @@ -1254,13 +1250,13 @@ xrow_decode_ballot(struct xrow_header *row, struct ballot *ballot)
> 			ballot->is_loading = mp_decode_bool(&data);
> 			break;
> 		case IPROTO_BALLOT_VCLOCK:
> -			if (mp_decode_vclock_ignore_0(&data,
> -						      &ballot->vclock) != 0)
> +			if (mp_decode_vclock_ignore0(&data,
> +						     &ballot->vclock) != 0)
> 				goto err;
> 			break;
> 		case IPROTO_BALLOT_GC_VCLOCK:
> -			if (mp_decode_vclock_ignore_0(&data,
> -						      &ballot->gc_vclock) != 0)
> +			if (mp_decode_vclock_ignore0(&data,
> +						     &ballot->gc_vclock) != 0)
> 				goto err;
> 			break;
> 		default:
> @@ -1405,7 +1401,7 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
> 		case IPROTO_VCLOCK:
> 			if (vclock == NULL)
> 				goto skip;
> -			if (mp_decode_vclock_ignore_0(&d, vclock) != 0) {
> +			if (mp_decode_vclock_ignore0(&d, vclock) != 0) {
> 				xrow_on_decode_err(data, end, ER_INVALID_MSGPACK,
> 						   "invalid VCLOCK");
> 	
> 			return -1;



--
Serge Petrenko
sergepetrenko@tarantool.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
  2020-04-11 16:04       ` Vladislav Shpilevoy
@ 2020-04-13 10:12         ` Serge Petrenko
  0 siblings, 0 replies; 14+ messages in thread
From: Serge Petrenko @ 2020-04-13 10:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml


> 11 апр. 2020 г., в 19:04, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
> 
> Hi! Thanks for the answers!

Hi! Thanks for the review!

> 
> gc_add_checkpoint() and gc_consumer_advance() still use sum().
> Is it ok?

gc_add_checkpoint() is fine. It’s ok to compare checkpoints by
signature, they’re in a list in the order of creation, which is the
same as order by vclock signature.

> 
> I see gc_consumer_advance() compares vclocks of two neighbour
> consumers using sum() to check tree's invariant, even though
> the invariant is not related to sum() anymore.

Missed that one too, thanks!

diff --git a/src/box/gc.c b/src/box/gc.c
index 36b2d7f4e..cbcdf7d12 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -580,7 +580,7 @@ gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock)
         */
        struct gc_consumer *next = gc_tree_next(&gc.consumers, consumer);
        bool update_tree = (next != NULL &&
-                           signature >= vclock_sum(&next->vclock));
+                           vclock_lex_compare(vclock, &next->vclock) >= 0);
 
        if (update_tree)
                gc_tree_remove(&gc.consumers, consumer);


> 
> Please, check all the other vclock_sum() invocations too,
> just in case.

Other cases are fine.

--
Serge Petrenko
sergepetrenko@tarantool.org

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking
  2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
                   ` (2 preceding siblings ...)
  2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately Serge Petrenko
@ 2020-04-13 14:38 ` Vladislav Shpilevoy
  2020-04-16 13:49 ` Kirill Yukhin
  4 siblings, 0 replies; 14+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-13 14:38 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches

Hi! Thanks for the fixes!

LGTM.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking
  2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
                   ` (3 preceding siblings ...)
  2020-04-13 14:38 ` [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Vladislav Shpilevoy
@ 2020-04-16 13:49 ` Kirill Yukhin
  4 siblings, 0 replies; 14+ messages in thread
From: Kirill Yukhin @ 2020-04-16 13:49 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy

Hello,

On 07 апр 18:49, Serge Petrenko wrote:
> https://github.com/tarantool/tarantool/issues/4114
> https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication

I've checked your patchset into 1.10, 2.2, 2.3 and master.

--
Regards, Kirill Yukhin

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2020-04-16 13:49 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko
2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko
2020-04-09 23:08   ` Vladislav Shpilevoy
2020-04-10 12:25     ` Serge Petrenko
2020-04-11 16:05       ` Vladislav Shpilevoy
2020-04-13 10:12         ` Serge Petrenko
2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Serge Petrenko
2020-04-09 23:08   ` Vladislav Shpilevoy
2020-04-10 12:25     ` Serge Petrenko
2020-04-11 16:04       ` Vladislav Shpilevoy
2020-04-13 10:12         ` Serge Petrenko
2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately Serge Petrenko
2020-04-13 14:38 ` [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Vladislav Shpilevoy
2020-04-16 13:49 ` Kirill Yukhin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox