* [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking
@ 2020-03-30 11:04 Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
` (5 more replies)
0 siblings, 6 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-30 11:04 UTC (permalink / raw)
To: kostja.osipov, v.shpilevoy; +Cc: tarantool-patches
https://github.com/tarantool/tarantool/issues/4114
https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication
Changes in v5:
- review fixes as per review from Kostja
Changes in v4:
- review fixes as per review from Kostja
Changes in v3:
- rewrite the patches re wal gc rework
to avoid matrix clock.
Changes in v2:
- used Georgy's patches re introduction of
matrix clock and wal gc rework to fix the
problem with ordering gc consumers by
vclock signature
Serge Petrenko (4):
vclock: add an ability to reset individual clock components
replication: hide 0-th vclock components in replication responses
gc: rely on minimal vclock components instead of signatures
box: start counting local space requests separately
src/box/applier.cc | 4 +-
src/box/box.cc | 29 +++-
src/box/gc.c | 41 +++---
src/box/recovery.cc | 8 +-
src/box/relay.cc | 21 +--
src/box/vclock.c | 15 +++
src/box/vclock.h | 102 ++++++++++++++
src/box/wal.c | 16 ++-
test/replication/anon.result | 5 +
test/replication/anon.test.lua | 2 +
test/replication/autobootstrap.result | 6 +
test/replication/autobootstrap.test.lua | 2 +
test/replication/before_replace.result | 8 +-
test/replication/before_replace.test.lua | 4 +-
.../gh-4114-local-space-replication.result | 125 ++++++++++++++++++
.../gh-4114-local-space-replication.test.lua | 48 +++++++
test/replication/local_spaces.result | 4 +
test/replication/local_spaces.test.lua | 3 +
test/replication/misc.result | 6 +
test/replication/misc.test.lua | 2 +
test/replication/quorum.result | 6 +
test/replication/quorum.test.lua | 2 +
test/replication/replica_rejoin.result | 9 ++
test/replication/replica_rejoin.test.lua | 3 +
test/replication/skip_conflict_row.result | 3 +
test/replication/skip_conflict_row.test.lua | 1 +
test/replication/suite.cfg | 1 +
test/vinyl/errinj.result | 5 +
test/vinyl/errinj.test.lua | 4 +
29 files changed, 442 insertions(+), 43 deletions(-)
create mode 100644 test/replication/gh-4114-local-space-replication.result
create mode 100644 test/replication/gh-4114-local-space-replication.test.lua
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
@ 2020-03-30 11:04 ` Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses Serge Petrenko
` (4 subsequent siblings)
5 siblings, 2 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-30 11:04 UTC (permalink / raw)
To: kostja.osipov, v.shpilevoy; +Cc: tarantool-patches
Anonymous replicas use 0th vclock component to sign local rows.
vclock_reset will allow to zero-out 0th vclock component when an
anonymous replica is promoted to a normal one and sends out its vclock
to other joining instances, as to not pollute their own 0th vclock
component.
Also add a shortcut for vclcok_copy() + vclock_reset() for 0th clock
component: vclock_copy_ignore0()
Follow-up #3186
Prerequisite #4114
---
src/box/vclock.c | 15 +++++++++++++++
src/box/vclock.h | 24 ++++++++++++++++++++++++
2 files changed, 39 insertions(+)
diff --git a/src/box/vclock.c b/src/box/vclock.c
index 90ae27591..ac3e9fccd 100644
--- a/src/box/vclock.c
+++ b/src/box/vclock.c
@@ -37,6 +37,21 @@
#include "diag.h"
#include "tt_static.h"
+void
+vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
+{
+ assert(lsn >= 0);
+ assert(replica_id < VCLOCK_MAX);
+ vclock->signature -= vclock_get(vclock, replica_id);
+ if (lsn == 0) {
+ vclock->map &= ~(1 << replica_id);
+ return;
+ }
+ vclock->lsn[replica_id] = lsn;
+ vclock->map |= 1 << replica_id;
+ vclock->signature += lsn;
+}
+
int64_t
vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
{
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 79e5a1bc0..2a3a29020 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -182,6 +182,19 @@ vclock_inc(struct vclock *vclock, uint32_t replica_id)
return ++vclock->lsn[replica_id];
}
+/**
+ * Set vclock component represented by replica id to the desired
+ * value. Can be used to decrease stored LSN value for the given
+ * replica id while maintaining a valid signature or in the same
+ * manner as vclock_follow.
+ *
+ * @param vclock Vector clock.
+ * @param replica_id Replica identifier.
+ * @param lsn Lsn to set
+ */
+void
+vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn);
+
static inline void
vclock_copy(struct vclock *dst, const struct vclock *src)
{
@@ -194,6 +207,17 @@ vclock_copy(struct vclock *dst, const struct vclock *src)
sizeof(*dst->lsn) * max_pos);
}
+/**
+ * A shortcut for vclock_copy() + vclock_reset() for 0th clock
+ * component.
+ */
+static inline void
+vclock_copy_ignore0(struct vclock *dst, const struct vclock *src)
+{
+ vclock_copy(dst, src);
+ vclock_reset(dst, 0, 0);
+}
+
static inline uint32_t
vclock_size(const struct vclock *vclock)
{
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
@ 2020-03-30 11:04 ` Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures Serge Petrenko
` (3 subsequent siblings)
5 siblings, 2 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-30 11:04 UTC (permalink / raw)
To: kostja.osipov, v.shpilevoy; +Cc: tarantool-patches
If an anonymous replica is promoted to a normal one and becomes
replication master later, its vclock contains a non-empty zero
component, tracking local changes on this replica from the time when it
had been anonymous. No need to pollute joining instance's vclock with
our non-empty 0 component.
When an anonymous replica reports its status to a remote instance it
should also hide its 0-th vclock component.
This is needed for backward compatibility with old instances, which
don't ignore 0th vclock component coming from a remote instance by
default.
Also make sure that new instances ignore 0th vclock component.
Follow-up #3186
Prerequisite #4114
---
src/box/applier.cc | 4 +++-
src/box/box.cc | 12 ++++++++----
src/box/relay.cc | 6 ++++--
test/replication/anon.result | 5 +++++
test/replication/anon.test.lua | 2 ++
5 files changed, 22 insertions(+), 7 deletions(-)
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 47a26c366..f5f67b6a9 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -173,7 +173,9 @@ applier_writer_f(va_list ap)
continue;
try {
struct xrow_header xrow;
- xrow_encode_vclock(&xrow, &replicaset.vclock);
+ struct vclock vclock;
+ vclock_copy_ignore0(&vclock, &replicaset.vclock);
+ xrow_encode_vclock(&xrow, &vclock);
coio_write_xrow(&io, &xrow);
} catch (SocketError *e) {
/*
diff --git a/src/box/box.cc b/src/box/box.cc
index 765d64678..bf95d1b5e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1539,7 +1539,7 @@ box_process_fetch_snapshot(struct ev_io *io, struct xrow_header *header)
/* Remember master's vclock after the last request */
struct vclock stop_vclock;
- vclock_copy(&stop_vclock, &replicaset.vclock);
+ vclock_copy_ignore0(&stop_vclock, &replicaset.vclock);
/* Send end of snapshot data marker */
struct xrow_header row;
@@ -1621,7 +1621,8 @@ box_process_register(struct ev_io *io, struct xrow_header *header)
struct xrow_header row;
/* Send end of WAL stream marker */
- xrow_encode_vclock_xc(&row, &replicaset.vclock);
+ vclock_copy_ignore0(&vclock, &replicaset.vclock);
+ xrow_encode_vclock_xc(&row, &vclock);
row.sync = header->sync;
coio_write_xrow(io, &row);
@@ -1773,7 +1774,10 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
say_info("final data sent.");
/* Send end of WAL stream marker */
- xrow_encode_vclock_xc(&row, &replicaset.vclock);
+ struct vclock vclock;
+ vclock_copy_ignore0(&vclock, &replicaset.vclock);
+ xrow_encode_vclock_xc(&row, &vclock);
+
row.sync = header->sync;
coio_write_xrow(io, &row);
@@ -1904,7 +1908,7 @@ box_process_vote(struct ballot *ballot)
*/
ballot->is_loading = is_ro;
vclock_copy(&ballot->vclock, &replicaset.vclock);
- vclock_copy(&ballot->gc_vclock, &gc.vclock);
+ vclock_copy_ignore0(&ballot->gc_vclock, &gc.vclock);
}
/** Insert a new cluster into _schema */
diff --git a/src/box/relay.cc b/src/box/relay.cc
index c634348a4..5ed8f5dae 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -322,7 +322,9 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
/* Respond to the JOIN request with the current vclock. */
struct xrow_header row;
- xrow_encode_vclock_xc(&row, vclock);
+ struct vclock subst_vclock;
+ vclock_copy_ignore0(&subst_vclock, vclock);
+ xrow_encode_vclock_xc(&row, &subst_vclock);
row.sync = sync;
coio_write_xrow(&relay->io, &row);
@@ -464,7 +466,7 @@ relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
* the greater signatures is due to changes pulled
* from other members of the cluster.
*/
- if (vclock_compare(&curr->vclock, vclock) > 0)
+ if (vclock_compare_ignore0(&curr->vclock, vclock) > 0)
break;
stailq_shift(&relay->pending_gc);
free(gc_msg);
diff --git a/test/replication/anon.result b/test/replication/anon.result
index 88061569f..cbbeeef09 100644
--- a/test/replication/anon.result
+++ b/test/replication/anon.result
@@ -187,6 +187,11 @@ a > 0
| ---
| - true
| ...
+-- 0-th vclock component isn't propagated across the cluster.
+box.info.vclock[0]
+ | ---
+ | - null
+ | ...
test_run:cmd('switch default')
| ---
| - true
diff --git a/test/replication/anon.test.lua b/test/replication/anon.test.lua
index 8a8d15c18..627dc5c8e 100644
--- a/test/replication/anon.test.lua
+++ b/test/replication/anon.test.lua
@@ -66,6 +66,8 @@ test_run:cmd('switch replica_anon2')
a = box.info.vclock[1]
-- The instance did fetch a snapshot.
a > 0
+-- 0-th vclock component isn't propagated across the cluster.
+box.info.vclock[0]
test_run:cmd('switch default')
box.space.test:insert{2}
test_run:cmd("switch replica_anon2")
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses Serge Petrenko
@ 2020-03-30 11:04 ` Serge Petrenko
2020-03-30 12:57 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately Serge Petrenko
` (2 subsequent siblings)
5 siblings, 2 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-30 11:04 UTC (permalink / raw)
To: kostja.osipov, v.shpilevoy; +Cc: tarantool-patches
The current WAL GC implementation tracks consumers (i.e. remote replicas) by
their vclock signature, which is the sum of all vclock components.
This approach is wrong, and this can be shown by a little example.
The example will be a little synthetic, but it'll illustrate the problem.
Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
with id 3. Say, С replicates from both A and B, and there is no replication
between A and B (say, the instances were reconfigured to not replicate from each
other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
snapshot and creates a new xlog with signature 10. A’s directory will look like:
00…000.xlog 00…010.snap 00….010.xlog
Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
corresponding GC consumer. Since signatures are used, GC consumer is assigned a
signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
assumes it can be). Actually, replica still needs 00…00.xlog, because it
contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
replicated yet.
If instead of using vclock signatures, gc consumers used vclocks, such a problem
wouldn’t arise. Replia would report its vclock {1:5, 2:13}. The vclock is NOT
strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
on.
Rewrite gc to perform cleanup based on finding minimal vclock
components present in at least one of the consumer vclocks instead of
just comparing vclock signatures.
Prerequisite #4114
---
src/box/gc.c | 41 +++++++++++++++----------
src/box/vclock.h | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 103 insertions(+), 16 deletions(-)
diff --git a/src/box/gc.c b/src/box/gc.c
index f5c387f9d..a2d0a515c 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -66,16 +66,15 @@ static int
gc_checkpoint_fiber_f(va_list);
/**
- * Comparator used for ordering gc_consumer objects by signature
- * in a binary tree.
+ * Comparator used for ordering gc_consumer objects
+ * lexicographically by their vclock in a binary tree.
*/
static inline int
gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
{
- if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
- return -1;
- if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
- return 1;
+ int rc = vclock_lex_compare(&a->vclock, &b->vclock);
+ if (rc != 0)
+ return rc;
if ((intptr_t)a < (intptr_t)b)
return -1;
if ((intptr_t)a > (intptr_t)b)
@@ -187,19 +186,29 @@ gc_run_cleanup(void)
/* At least one checkpoint must always be available. */
assert(checkpoint != NULL);
+ /* Find the vclock of the oldest WAL row to keep. */
+ struct vclock min_vclock;
+ struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
/*
- * Find the vclock of the oldest WAL row to keep.
+ * Vclock of the oldest WAL row to keep is a by-component
+ * minimum of all consumer vclocks and the oldest
+ * checkpoint vclock. This ensures that all rows needed by
+ * at least one consumer are kept.
* Note, we must keep all WALs created after the
* oldest checkpoint, even if no consumer needs them.
*/
- const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
- &gc_tree_first(&gc.consumers)->vclock);
- if (vclock == NULL ||
- vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
- vclock = &checkpoint->vclock;
-
- if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
- vclock_copy(&gc.vclock, vclock);
+ vclock_copy(&min_vclock, &checkpoint->vclock);
+ while (consumer != NULL) {
+ /*
+ * Consumers will never need rows signed
+ * with a zero instance id (local rows).
+ */
+ vclock_min_ignore0(&min_vclock, &consumer->vclock);
+ consumer = gc_tree_next(&gc.consumers, consumer);
+ }
+
+ if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
+ vclock_copy(&gc.vclock, &min_vclock);
run_wal_gc = true;
}
@@ -222,7 +231,7 @@ gc_run_cleanup(void)
if (run_engine_gc)
engine_collect_garbage(&checkpoint->vclock);
if (run_wal_gc)
- wal_collect_garbage(vclock);
+ wal_collect_garbage(&min_vclock);
}
static int
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 2a3a29020..36fcac966 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -348,6 +348,84 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
return vclock_compare_generic(a, b, true);
}
+/**
+ * Compare vclocks lexicographically.
+ * The following affirmation holds: if a < b in terms of
+ * vclock_compare, then a < b in terms of vclock_lex_compare.
+ * However, a < b in terms of vclock_lex_compare doesn't mean
+ * than a < b in terms of vclock_compare. Example:
+ * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
+ * are incomparable in terms of vclock_compare.
+ * All vclocks are comparable lexicographically.
+ *
+ * \param a vclock
+ * \param b vclock
+ * \retval 1 if \a a is lexicographically greater than \a b
+ * \retval -1 if \a a is lexicographically less than \a b
+ * \retval 0 if vclocks are equal
+ */
+static inline int
+vclock_lex_compare(const struct vclock *a, const struct vclock *b)
+{
+ vclock_map_t map = a->map | b->map;
+ struct bit_iterator it;
+ bit_iterator_init(&it, &map, sizeof(map), true);
+ for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
+ replica_id = bit_iterator_next(&it)) {
+ int64_t lsn_a = vclock_get(a, replica_id);
+ int64_t lsn_b = vclock_get(b, replica_id);
+ if (lsn_a < lsn_b) return -1;
+ if (lsn_a > lsn_b) return 1;
+ }
+ return 0;
+}
+
+/**
+ * Return a vclock, which is a componentwise minimum between
+ * vclocks \a a and \a b.
+ *
+ * \param[in,out] a vclock containing the minimum components.
+ * \param b vclock to compare with.
+ * \param ignore_zero whether to ignore 0th vclock component
+ * or not.
+ */
+static inline void
+vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
+{
+ vclock_map_t map = a->map | b->map;
+ struct bit_iterator it;
+ bit_iterator_init(&it, &map, sizeof(map), true);
+ size_t replica_id = bit_iterator_next(&it);
+ if (replica_id == 0 && ignore_zero)
+ replica_id = bit_iterator_next(&it);
+
+ for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
+ int64_t lsn_a = vclock_get(a, replica_id);
+ int64_t lsn_b = vclock_get(b, replica_id);
+ if (lsn_a <= lsn_b)
+ continue;
+ vclock_reset(a, replica_id, lsn_b);
+ }
+}
+
+/**
+ * \sa vclock_min_generic
+ */
+static inline void
+vclock_min(struct vclock *a, const struct vclock *b)
+{
+ return vclock_min_generic(a, b, false);
+}
+
+/**
+ * \sa vclock_min_generic
+ */
+static inline void
+vclock_min_ignore0(struct vclock *a, const struct vclock *b)
+{
+ return vclock_min_generic(a, b, true);
+}
+
/**
* @brief vclockset - a set of vclocks
*/
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
` (2 preceding siblings ...)
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures Serge Petrenko
@ 2020-03-30 11:04 ` Serge Petrenko
2020-03-30 12:58 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-03-31 11:24 ` [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
2020-04-04 20:51 ` Vladislav Shpilevoy
5 siblings, 2 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-30 11:04 UTC (permalink / raw)
To: kostja.osipov, v.shpilevoy; +Cc: tarantool-patches
Sign local space requests with a zero instance id. This allows to split
local changes aside from the changes, which should be visible to the whole
cluster, and stop sending NOPs to replicas to follow local vclock.
Moreover, it fixes the following bug with local spaces and replication.
In a situation when there are a master and a replica set up, replica may
still write to local spaces even if it's read-only. Local space
operations used to promote instance's lsn before this patch. Eventually,
master would have vclock {1:x} and replica'd have vclock {1:x, 2:y},
where y > 0, due to local space requests performed by the replica.
If a snapshot happens on replica side, replica will delete it's .xlog
files prior to the snapshot, since no one replicates from it and thus it
doesn't have any registered GC consumers.
From this point, every attempt to configure replication from replica to
master will fail, since master will try to fetch records which account
for the difference in master's and replica's vclocks: {1:x} vs {1:x,2:y},
even though master will never need the rows in range {2:1} - {2:y},
since they'll be turned to NOPs during replication.
Starting from this patch, in the situation described above, replica's
clock will be {0:y, 1:x}, and, since local space requests are now not
replicated at all, master will be able to follow replica, turning the
configuration to master-master.
Closes #4114
---
src/box/box.cc | 17 +++
src/box/recovery.cc | 8 +-
src/box/relay.cc | 15 +--
src/box/wal.c | 16 ++-
test/replication/autobootstrap.result | 6 +
test/replication/autobootstrap.test.lua | 2 +
test/replication/before_replace.result | 8 +-
test/replication/before_replace.test.lua | 4 +-
.../gh-4114-local-space-replication.result | 125 ++++++++++++++++++
.../gh-4114-local-space-replication.test.lua | 48 +++++++
test/replication/local_spaces.result | 4 +
test/replication/local_spaces.test.lua | 3 +
test/replication/misc.result | 6 +
test/replication/misc.test.lua | 2 +
test/replication/quorum.result | 6 +
test/replication/quorum.test.lua | 2 +
test/replication/replica_rejoin.result | 9 ++
test/replication/replica_rejoin.test.lua | 3 +
test/replication/skip_conflict_row.result | 3 +
test/replication/skip_conflict_row.test.lua | 1 +
test/replication/suite.cfg | 1 +
test/vinyl/errinj.result | 5 +
test/vinyl/errinj.test.lua | 4 +
23 files changed, 278 insertions(+), 20 deletions(-)
create mode 100644 test/replication/gh-4114-local-space-replication.result
create mode 100644 test/replication/gh-4114-local-space-replication.test.lua
diff --git a/src/box/box.cc b/src/box/box.cc
index bf95d1b5e..0762266b0 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1879,6 +1879,23 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
say_info("remote vclock %s local vclock %s",
vclock_to_string(&replica_clock), vclock_to_string(&vclock));
+ /*
+ * Replica clock is used in gc state and recovery
+ * initialization, so we need to replace the remote 0-th
+ * component with our own one. This doesn't break
+ * recovery: it finds the WAL with a vclock strictly less
+ * than replia clock in all components except the 0th one.
+ * This leads to finding the correct WAL, if it exists,
+ * since we do not need to recover local rows (the ones,
+ * that contribute to the 0-th vclock component).
+ * Note, that it would be bad to set 0-th vclock component
+ * to a smaller value, since it would unnecessarily
+ * require additional WALs, which may have already been
+ * deleted.
+ * Speaking of gc, remote instances' local vclock
+ * components are not used by consumers at all.
+ */
+ vclock_reset(&replica_clock, 0, vclock_get(&replicaset.vclock, 0));
/*
* Process SUBSCRIBE request via replication relay
* Send current recovery vector clock as a marker
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 64aa467b1..d1a503cfc 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -261,11 +261,9 @@ recover_xlog(struct recovery *r, struct xstream *stream,
continue; /* already applied, skip */
/*
- * All rows in xlog files have an assigned
- * replica id. The only exception is anonymous
- * replica, which has a zero instance id.
- * In this case the only rows from such an instance
- * can be for the local spaces.
+ * All rows in xlog files have an assigned replica
+ * id. The only exception are local rows, which
+ * are signed with a zero replica id.
*/
assert(row.replica_id != 0 || row.group_id == GROUP_LOCAL);
/*
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 5ed8f5dae..870d0e5fc 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -758,16 +758,15 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
{
struct relay *relay = container_of(stream, struct relay, stream);
assert(iproto_type_is_dml(packet->type));
- /*
- * Transform replica local requests to IPROTO_NOP so as to
- * promote vclock on the replica without actually modifying
- * any data.
- */
if (packet->group_id == GROUP_LOCAL) {
/*
- * Replica-local requests generated while replica
- * was anonymous have a zero instance id. Just
- * skip all these rows.
+ * We do not relay replica-local rows to other
+ * instances, since we started signing them with
+ * a zero instance id. However, if replica-local
+ * rows, signed with a non-zero id are present in
+ * our WAL, we still need to relay them as NOPs in
+ * order to correctly promote the vclock on the
+ * replica.
*/
if (packet->replica_id == REPLICA_ID_NIL)
return;
diff --git a/src/box/wal.c b/src/box/wal.c
index 3b094b0e8..a74bdecd9 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -953,13 +953,19 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
/** Assign LSN to all local rows. */
for ( ; row < end; row++) {
if ((*row)->replica_id == 0) {
- (*row)->lsn = vclock_inc(vclock_diff, instance_id) +
- vclock_get(base, instance_id);
/*
- * Note, an anonymous replica signs local
- * rows whith a zero instance id.
+ * All rows representing local space data
+ * manipulations are signed wth a zero
+ * instance id. This is also true for
+ * anonymous replicas, since they are
+ * only capable of writing to local and
+ * temporary spaces.
*/
- (*row)->replica_id = instance_id;
+ if ((*row)->group_id != GROUP_LOCAL)
+ (*row)->replica_id = instance_id;
+
+ (*row)->lsn = vclock_inc(vclock_diff, (*row)->replica_id) +
+ vclock_get(base, (*row)->replica_id);
/* Use lsn of the first local row as transaction id. */
tsn = tsn == 0 ? (*row)->lsn : tsn;
(*row)->tsn = tsn;
diff --git a/test/replication/autobootstrap.result b/test/replication/autobootstrap.result
index 743982d47..6918e23ea 100644
--- a/test/replication/autobootstrap.result
+++ b/test/replication/autobootstrap.result
@@ -162,6 +162,9 @@ box.schema.user.revoke('test_u', 'create', 'space')
vclock = test_run:get_vclock('autobootstrap1')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("autobootstrap2", vclock)
---
...
@@ -206,6 +209,9 @@ test_run:wait_fullmesh(SERVERS)
vclock = test_run:get_vclock("autobootstrap1")
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("autobootstrap2", vclock)
---
...
diff --git a/test/replication/autobootstrap.test.lua b/test/replication/autobootstrap.test.lua
index 055ea4277..f8bb1c74a 100644
--- a/test/replication/autobootstrap.test.lua
+++ b/test/replication/autobootstrap.test.lua
@@ -74,6 +74,7 @@ box.schema.user.revoke('test_u', 'create', 'space')
-- Synchronize
vclock = test_run:get_vclock('autobootstrap1')
+vclock[0] = nil
_ = test_run:wait_vclock("autobootstrap2", vclock)
_ = test_run:wait_vclock("autobootstrap3", vclock)
@@ -95,6 +96,7 @@ _ = test_run:cmd("switch default")
test_run:wait_fullmesh(SERVERS)
vclock = test_run:get_vclock("autobootstrap1")
+vclock[0] = nil
_ = test_run:wait_vclock("autobootstrap2", vclock)
_ = test_run:wait_vclock("autobootstrap3", vclock)
diff --git a/test/replication/before_replace.result b/test/replication/before_replace.result
index ced40547e..61a552e84 100644
--- a/test/replication/before_replace.result
+++ b/test/replication/before_replace.result
@@ -266,7 +266,13 @@ box.space.test:replace{1, 1}
---
- [1, 1]
...
-_ = test_run:wait_vclock('replica', test_run:get_vclock('default'))
+vclock = test_run:get_vclock('default')
+---
+...
+vclock[0] = nil
+---
+...
+_ = test_run:wait_vclock('replica', vclock)
---
...
-- Check that replace{1, 2} coming from the master was suppressed
diff --git a/test/replication/before_replace.test.lua b/test/replication/before_replace.test.lua
index bcc6dc00d..ecd8ff044 100644
--- a/test/replication/before_replace.test.lua
+++ b/test/replication/before_replace.test.lua
@@ -101,7 +101,9 @@ _ = box.space.test:before_replace(function(old, new) return new:update{{'+', 2,
test_run:cmd("switch default")
box.space.test:replace{1, 1}
-_ = test_run:wait_vclock('replica', test_run:get_vclock('default'))
+vclock = test_run:get_vclock('default')
+vclock[0] = nil
+_ = test_run:wait_vclock('replica', vclock)
-- Check that replace{1, 2} coming from the master was suppressed
-- by the before_replace trigger on the replica.
diff --git a/test/replication/gh-4114-local-space-replication.result b/test/replication/gh-4114-local-space-replication.result
new file mode 100644
index 000000000..e524c9a1b
--- /dev/null
+++ b/test/replication/gh-4114-local-space-replication.result
@@ -0,0 +1,125 @@
+-- test-run result file version 2
+env = require('test_run')
+ | ---
+ | ...
+test_run = env.new()
+ | ---
+ | ...
+
+--
+-- gh-4114. Account local space changes in a separate vclock
+-- component. Do not replicate local space changes, even as NOPs.
+--
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_local=true})
+ | ---
+ | ...
+_ = box.space.test:create_index("pk")
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, script "replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+a = box.info.vclock[0] or 0
+ | ---
+ | ...
+for i = 1,10 do box.space.test:insert{i} end
+ | ---
+ | ...
+box.info.vclock[0] == a + 10 or box.info.vclock[0] - a
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('switch replica')
+ | ---
+ | - true
+ | ...
+box.info.vclock[0]
+ | ---
+ | - null
+ | ...
+box.cfg{checkpoint_count=1}
+ | ---
+ | ...
+box.space.test:select{}
+ | ---
+ | - []
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+box.space.test:insert{2}
+ | ---
+ | - [2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+box.space.test:insert{3}
+ | ---
+ | - [3]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+box.info.vclock[0]
+ | ---
+ | - 3
+ | ...
+
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('set variable repl_source to "replica.listen"')
+ | ---
+ | - true
+ | ...
+
+box.cfg{replication=repl_source}
+ | ---
+ | ...
+test_run:wait_cond(function()\
+ return box.info.replication[2].upstream and\
+ box.info.replication[2].upstream.status == 'follow'\
+ end,\
+ 10)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
diff --git a/test/replication/gh-4114-local-space-replication.test.lua b/test/replication/gh-4114-local-space-replication.test.lua
new file mode 100644
index 000000000..26dccee68
--- /dev/null
+++ b/test/replication/gh-4114-local-space-replication.test.lua
@@ -0,0 +1,48 @@
+env = require('test_run')
+test_run = env.new()
+
+--
+-- gh-4114. Account local space changes in a separate vclock
+-- component. Do not replicate local space changes, even as NOPs.
+--
+
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test', {is_local=true})
+_ = box.space.test:create_index("pk")
+
+test_run:cmd('create server replica with rpl_master=default, script "replication/replica.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+a = box.info.vclock[0] or 0
+for i = 1,10 do box.space.test:insert{i} end
+box.info.vclock[0] == a + 10 or box.info.vclock[0] - a
+
+test_run:cmd('switch replica')
+box.info.vclock[0]
+box.cfg{checkpoint_count=1}
+box.space.test:select{}
+box.space.test:insert{1}
+box.snapshot()
+box.space.test:insert{2}
+box.snapshot()
+box.space.test:insert{3}
+box.snapshot()
+
+box.info.vclock[0]
+
+test_run:cmd('switch default')
+
+test_run:cmd('set variable repl_source to "replica.listen"')
+
+box.cfg{replication=repl_source}
+test_run:wait_cond(function()\
+ return box.info.replication[2].upstream and\
+ box.info.replication[2].upstream.status == 'follow'\
+ end,\
+ 10)
+
+-- Cleanup.
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/local_spaces.result b/test/replication/local_spaces.result
index cf2c52010..4855d8a88 100644
--- a/test/replication/local_spaces.result
+++ b/test/replication/local_spaces.result
@@ -288,6 +288,10 @@ _ = s3:insert{3}
vclock = test_run:get_vclock('default')
---
...
+-- Ignore 0-th component when waiting. They don't match.
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock('replica', vclock)
---
...
diff --git a/test/replication/local_spaces.test.lua b/test/replication/local_spaces.test.lua
index 373e2cd20..c5e224030 100644
--- a/test/replication/local_spaces.test.lua
+++ b/test/replication/local_spaces.test.lua
@@ -112,6 +112,9 @@ _ = s1:insert{3}
_ = s2:insert{3}
_ = s3:insert{3}
vclock = test_run:get_vclock('default')
+
+-- Ignore 0-th component when waiting. They don't match.
+vclock[0] = nil
_ = test_run:wait_vclock('replica', vclock)
test_run:cmd("switch replica")
diff --git a/test/replication/misc.result b/test/replication/misc.result
index b63d72846..e5d1f560e 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -214,6 +214,9 @@ box.space.space1:select{}
vclock = test_run:get_vclock("autobootstrap1")
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("autobootstrap2", vclock)
---
...
@@ -414,6 +417,9 @@ while box.info.replication[2] == nil do fiber.sleep(0.01) end
vclock = test_run:get_vclock('default')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock('replica_auth', vclock)
---
...
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index c454a0992..d285b014a 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -88,6 +88,7 @@ c.space.space1:insert{box.NULL, "data"} -- fails, but bumps sequence value
c.space.space1:insert{box.NULL, 1, "data"}
box.space.space1:select{}
vclock = test_run:get_vclock("autobootstrap1")
+vclock[0] = nil
_ = test_run:wait_vclock("autobootstrap2", vclock)
test_run:cmd("switch autobootstrap2")
box.space.space1:select{}
@@ -172,6 +173,7 @@ box.schema.user.grant('cluster', 'replication')
while box.info.replication[2] == nil do fiber.sleep(0.01) end
vclock = test_run:get_vclock('default')
+vclock[0] = nil
_ = test_run:wait_vclock('replica_auth', vclock)
test_run:cmd("stop server replica_auth")
diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index 07abe7f2a..5ef66bf0a 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -325,6 +325,9 @@ space:insert{2}
vclock = test_run:get_vclock("default")
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("replica", vclock)
---
...
@@ -390,6 +393,9 @@ box.cfg{replication = repl}
vclock = test_run:get_vclock("master_quorum1")
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("master_quorum2", vclock)
---
...
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 5f2872675..da24f34a0 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -125,6 +125,7 @@ test_run:cmd("switch default")
box.cfg{listen = listen}
space:insert{2}
vclock = test_run:get_vclock("default")
+vclock[0] = nil
_ = test_run:wait_vclock("replica", vclock)
test_run:cmd("switch replica")
box.info.status -- running
@@ -145,6 +146,7 @@ box.cfg{replication = ""}
box.space.test:insert{1}
box.cfg{replication = repl}
vclock = test_run:get_vclock("master_quorum1")
+vclock[0] = nil
_ = test_run:wait_vclock("master_quorum2", vclock)
test_run:cmd("switch master_quorum2")
box.space.test:select()
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index b8ed79f14..dd04ae297 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -144,6 +144,9 @@ for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
vclock = test_run:get_vclock('default')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock('replica', vclock)
---
...
@@ -295,6 +298,9 @@ for i = 1, 10 do box.space.test:replace{2} end
vclock = test_run:get_vclock('replica')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock('default', vclock)
---
...
@@ -345,6 +351,9 @@ for i = 1, 10 do box.space.test:replace{2} end
vclock = test_run:get_vclock('replica')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock('default', vclock)
---
...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 25c0849ec..410ef44d7 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -55,6 +55,7 @@ test_run:cmd("switch default")
-- Make sure the replica follows new changes.
for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
vclock = test_run:get_vclock('default')
+vclock[0] = nil
_ = test_run:wait_vclock('replica', vclock)
test_run:cmd("switch replica")
box.space.test:select()
@@ -109,6 +110,7 @@ box.space.test:replace{1}
test_run:cmd("switch replica")
for i = 1, 10 do box.space.test:replace{2} end
vclock = test_run:get_vclock('replica')
+vclock[0] = nil
_ = test_run:wait_vclock('default', vclock)
-- Restart the master and force garbage collection.
test_run:cmd("switch default")
@@ -126,6 +128,7 @@ test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.
test_run:cmd("switch replica")
for i = 1, 10 do box.space.test:replace{2} end
vclock = test_run:get_vclock('replica')
+vclock[0] = nil
_ = test_run:wait_vclock('default', vclock)
-- Restart the replica. It should successfully rebootstrap.
test_run:cmd("restart server replica with args='true'")
diff --git a/test/replication/skip_conflict_row.result b/test/replication/skip_conflict_row.result
index 9b2777872..d70ac8e2a 100644
--- a/test/replication/skip_conflict_row.result
+++ b/test/replication/skip_conflict_row.result
@@ -54,6 +54,9 @@ box.info.status
vclock = test_run:get_vclock('default')
---
...
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("replica", vclock)
---
...
diff --git a/test/replication/skip_conflict_row.test.lua b/test/replication/skip_conflict_row.test.lua
index 2982c730a..04fd08136 100644
--- a/test/replication/skip_conflict_row.test.lua
+++ b/test/replication/skip_conflict_row.test.lua
@@ -19,6 +19,7 @@ space:insert{2}
box.info.status
vclock = test_run:get_vclock('default')
+vclock[0] = nil
_ = test_run:wait_vclock("replica", vclock)
test_run:cmd("switch replica")
box.info.replication[1].upstream.message
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 90fd53ca6..0907ac17f 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -12,6 +12,7 @@
"on_schema_init.test.lua": {},
"long_row_timeout.test.lua": {},
"join_without_snap.test.lua": {},
+ "gh-4114-local-space-replication.test.lua": {},
"gh-4402-info-errno.test.lua": {},
"gh-4605-empty-password.test.lua": {},
"gh-4606-admin-creds.test.lua": {},
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index 2635da265..2bd701f70 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1241,6 +1241,11 @@ test_run:cmd("switch default")
vclock = test_run:get_vclock("default")
---
...
+-- Ignore 0-th vclock component. They don't match between
+-- replicas.
+vclock[0] = nil
+---
+...
_ = test_run:wait_vclock("replica", vclock)
---
...
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 4230cfae3..750d3bfe8 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -455,6 +455,10 @@ box.cfg{read_only = true}
test_run:cmd("switch default")
vclock = test_run:get_vclock("default")
+
+-- Ignore 0-th vclock component. They don't match between
+-- replicas.
+vclock[0] = nil
_ = test_run:wait_vclock("replica", vclock)
test_run:cmd("stop server replica")
--
2.21.1 (Apple Git-122.3)
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
@ 2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2020-03-30 12:56 UTC (permalink / raw)
To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy
* Serge Petrenko <sergepetrenko@tarantool.org> [20/03/30 14:08]:
> +/**
> + * A shortcut for vclock_copy() + vclock_reset() for 0th clock
> + * component.
> + */
> +static inline void
> +vclock_copy_ignore0(struct vclock *dst, const struct vclock *src)
> +{
> + vclock_copy(dst, src);
> + vclock_reset(dst, 0, 0);
> +}
> +
This can be made a tad faster by merging copy + reset and removing
unnecessary branching, but is lgtm.
--
Konstantin Osipov, Moscow, Russia
https://scylladb.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses Serge Petrenko
@ 2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2020-03-30 12:56 UTC (permalink / raw)
To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy
* Serge Petrenko <sergepetrenko@tarantool.org> [20/03/30 14:08]:
> If an anonymous replica is promoted to a normal one and becomes
> replication master later, its vclock contains a non-empty zero
> component, tracking local changes on this replica from the time when it
> had been anonymous. No need to pollute joining instance's vclock with
> our non-empty 0 component.
> When an anonymous replica reports its status to a remote instance it
> should also hide its 0-th vclock component.
>
> This is needed for backward compatibility with old instances, which
> don't ignore 0th vclock component coming from a remote instance by
> default.
> Also make sure that new instances ignore 0th vclock component.
lgtm
--
Konstantin Osipov, Moscow, Russia
https://scylladb.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures Serge Petrenko
@ 2020-03-30 12:57 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2020-03-30 12:57 UTC (permalink / raw)
To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy
* Serge Petrenko <sergepetrenko@tarantool.org> [20/03/30 14:08]:
lgtm
> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
> their vclock signature, which is the sum of all vclock components.
> This approach is wrong, and this can be shown by a little example.
> The example will be a little synthetic, but it'll illustrate the problem.
> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
> with id 3. Say, С replicates from both A and B, and there is no replication
> between A and B (say, the instances were reconfigured to not replicate from each
> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
> snapshot and creates a new xlog with signature 10. A’s directory will look like:
> 00…000.xlog 00…010.snap 00….010.xlog
> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
> assumes it can be). Actually, replica still needs 00…00.xlog, because it
> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
> replicated yet.
>
> If instead of using vclock signatures, gc consumers used vclocks, such a problem
> wouldn’t arise. Replia would report its vclock {1:5, 2:13}. The vclock is NOT
> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
> on.
>
> Rewrite gc to perform cleanup based on finding minimal vclock
> components present in at least one of the consumer vclocks instead of
> just comparing vclock signatures.
>
> Prerequisite #4114
> ---
> src/box/gc.c | 41 +++++++++++++++----------
> src/box/vclock.h | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 103 insertions(+), 16 deletions(-)
>
> diff --git a/src/box/gc.c b/src/box/gc.c
> index f5c387f9d..a2d0a515c 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -66,16 +66,15 @@ static int
> gc_checkpoint_fiber_f(va_list);
>
> /**
> - * Comparator used for ordering gc_consumer objects by signature
> - * in a binary tree.
> + * Comparator used for ordering gc_consumer objects
> + * lexicographically by their vclock in a binary tree.
> */
> static inline int
> gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
> {
> - if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
> - return -1;
> - if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
> - return 1;
> + int rc = vclock_lex_compare(&a->vclock, &b->vclock);
> + if (rc != 0)
> + return rc;
> if ((intptr_t)a < (intptr_t)b)
> return -1;
> if ((intptr_t)a > (intptr_t)b)
> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
> /* At least one checkpoint must always be available. */
> assert(checkpoint != NULL);
>
> + /* Find the vclock of the oldest WAL row to keep. */
> + struct vclock min_vclock;
> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
> /*
> - * Find the vclock of the oldest WAL row to keep.
> + * Vclock of the oldest WAL row to keep is a by-component
> + * minimum of all consumer vclocks and the oldest
> + * checkpoint vclock. This ensures that all rows needed by
> + * at least one consumer are kept.
> * Note, we must keep all WALs created after the
> * oldest checkpoint, even if no consumer needs them.
> */
> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
> - &gc_tree_first(&gc.consumers)->vclock);
> - if (vclock == NULL ||
> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
> - vclock = &checkpoint->vclock;
> -
> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
> - vclock_copy(&gc.vclock, vclock);
> + vclock_copy(&min_vclock, &checkpoint->vclock);
> + while (consumer != NULL) {
> + /*
> + * Consumers will never need rows signed
> + * with a zero instance id (local rows).
> + */
> + vclock_min_ignore0(&min_vclock, &consumer->vclock);
> + consumer = gc_tree_next(&gc.consumers, consumer);
> + }
> +
> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
> + vclock_copy(&gc.vclock, &min_vclock);
> run_wal_gc = true;
> }
>
> @@ -222,7 +231,7 @@ gc_run_cleanup(void)
> if (run_engine_gc)
> engine_collect_garbage(&checkpoint->vclock);
> if (run_wal_gc)
> - wal_collect_garbage(vclock);
> + wal_collect_garbage(&min_vclock);
> }
>
> static int
> diff --git a/src/box/vclock.h b/src/box/vclock.h
> index 2a3a29020..36fcac966 100644
> --- a/src/box/vclock.h
> +++ b/src/box/vclock.h
> @@ -348,6 +348,84 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
> return vclock_compare_generic(a, b, true);
> }
>
> +/**
> + * Compare vclocks lexicographically.
> + * The following affirmation holds: if a < b in terms of
> + * vclock_compare, then a < b in terms of vclock_lex_compare.
> + * However, a < b in terms of vclock_lex_compare doesn't mean
> + * than a < b in terms of vclock_compare. Example:
> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
> + * are incomparable in terms of vclock_compare.
> + * All vclocks are comparable lexicographically.
> + *
> + * \param a vclock
> + * \param b vclock
> + * \retval 1 if \a a is lexicographically greater than \a b
> + * \retval -1 if \a a is lexicographically less than \a b
> + * \retval 0 if vclocks are equal
> + */
> +static inline int
> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
> +{
> + vclock_map_t map = a->map | b->map;
> + struct bit_iterator it;
> + bit_iterator_init(&it, &map, sizeof(map), true);
> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
> + replica_id = bit_iterator_next(&it)) {
> + int64_t lsn_a = vclock_get(a, replica_id);
> + int64_t lsn_b = vclock_get(b, replica_id);
> + if (lsn_a < lsn_b) return -1;
> + if (lsn_a > lsn_b) return 1;
> + }
> + return 0;
> +}
> +
> +/**
> + * Return a vclock, which is a componentwise minimum between
> + * vclocks \a a and \a b.
> + *
> + * \param[in,out] a vclock containing the minimum components.
> + * \param b vclock to compare with.
> + * \param ignore_zero whether to ignore 0th vclock component
> + * or not.
> + */
> +static inline void
> +vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
> +{
> + vclock_map_t map = a->map | b->map;
> + struct bit_iterator it;
> + bit_iterator_init(&it, &map, sizeof(map), true);
> + size_t replica_id = bit_iterator_next(&it);
> + if (replica_id == 0 && ignore_zero)
> + replica_id = bit_iterator_next(&it);
> +
> + for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
> + int64_t lsn_a = vclock_get(a, replica_id);
> + int64_t lsn_b = vclock_get(b, replica_id);
> + if (lsn_a <= lsn_b)
> + continue;
> + vclock_reset(a, replica_id, lsn_b);
> + }
> +}
> +
> +/**
> + * \sa vclock_min_generic
> + */
> +static inline void
> +vclock_min(struct vclock *a, const struct vclock *b)
> +{
> + return vclock_min_generic(a, b, false);
> +}
> +
> +/**
> + * \sa vclock_min_generic
> + */
> +static inline void
> +vclock_min_ignore0(struct vclock *a, const struct vclock *b)
> +{
> + return vclock_min_generic(a, b, true);
> +}
> +
> /**
> * @brief vclockset - a set of vclocks
> */
> --
> 2.21.1 (Apple Git-122.3)
>
--
Konstantin Osipov, Moscow, Russia
https://scylladb.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately Serge Petrenko
@ 2020-03-30 12:58 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2020-03-30 12:58 UTC (permalink / raw)
To: Serge Petrenko; +Cc: tarantool-patches, v.shpilevoy
* Serge Petrenko <sergepetrenko@tarantool.org> [20/03/30 14:08]:
> diff --git a/src/box/box.cc b/src/box/box.cc
> index bf95d1b5e..0762266b0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1879,6 +1879,23 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
> say_info("remote vclock %s local vclock %s",
> vclock_to_string(&replica_clock), vclock_to_string(&vclock));
>
> + /*
> + * Replica clock is used in gc state and recovery
> + * initialization, so we need to replace the remote 0-th
> + * component with our own one. This doesn't break
> + * recovery: it finds the WAL with a vclock strictly less
> + * than replia clock in all components except the 0th one.
> + * This leads to finding the correct WAL, if it exists,
> + * since we do not need to recover local rows (the ones,
> + * that contribute to the 0-th vclock component).
> + * Note, that it would be bad to set 0-th vclock component
> + * to a smaller value, since it would unnecessarily
> + * require additional WALs, which may have already been
> + * deleted.
> + * Speaking of gc, remote instances' local vclock
> + * components are not used by consumers at all.
> + */
> + vclock_reset(&replica_clock, 0, vclock_get(&replicaset.vclock, 0));
I would love to get rid of any use of vclock_reset() outside
vclock.c, but it seems I can't.
lgtm.
--
Konstantin Osipov, Moscow, Russia
https://scylladb.com
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
` (3 preceding siblings ...)
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately Serge Petrenko
@ 2020-03-31 11:24 ` Serge Petrenko
2020-04-04 20:51 ` Vladislav Shpilevoy
5 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-03-31 11:24 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
> 30 марта 2020 г., в 14:04, Serge Petrenko <sergepetrenko@tarantool.org> написал(а):
>
> https://github.com/tarantool/tarantool/issues/4114
> https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication
>
> Changes in v5:
> - review fixes as per review from Kostja
>
> Changes in v4:
> - review fixes as per review from Kostja
>
> Changes in v3:
> - rewrite the patches re wal gc rework
> to avoid matrix clock.
>
> Changes in v2:
> - used Georgy's patches re introduction of
> matrix clock and wal gc rework to fix the
> problem with ordering gc consumers by
> vclock signature
>
> Serge Petrenko (4):
> vclock: add an ability to reset individual clock components
> replication: hide 0-th vclock components in replication responses
> gc: rely on minimal vclock components instead of signatures
> box: start counting local space requests separately
>
> src/box/applier.cc | 4 +-
> src/box/box.cc | 29 +++-
> src/box/gc.c | 41 +++---
> src/box/recovery.cc | 8 +-
> src/box/relay.cc | 21 +--
> src/box/vclock.c | 15 +++
> src/box/vclock.h | 102 ++++++++++++++
> src/box/wal.c | 16 ++-
> test/replication/anon.result | 5 +
> test/replication/anon.test.lua | 2 +
> test/replication/autobootstrap.result | 6 +
> test/replication/autobootstrap.test.lua | 2 +
> test/replication/before_replace.result | 8 +-
> test/replication/before_replace.test.lua | 4 +-
> .../gh-4114-local-space-replication.result | 125 ++++++++++++++++++
> .../gh-4114-local-space-replication.test.lua | 48 +++++++
> test/replication/local_spaces.result | 4 +
> test/replication/local_spaces.test.lua | 3 +
> test/replication/misc.result | 6 +
> test/replication/misc.test.lua | 2 +
> test/replication/quorum.result | 6 +
> test/replication/quorum.test.lua | 2 +
> test/replication/replica_rejoin.result | 9 ++
> test/replication/replica_rejoin.test.lua | 3 +
> test/replication/skip_conflict_row.result | 3 +
> test/replication/skip_conflict_row.test.lua | 1 +
> test/replication/suite.cfg | 1 +
> test/vinyl/errinj.result | 5 +
> test/vinyl/errinj.test.lua | 4 +
> 29 files changed, 442 insertions(+), 43 deletions(-)
> create mode 100644 test/replication/gh-4114-local-space-replication.result
> create mode 100644 test/replication/gh-4114-local-space-replication.test.lua
>
> --
> 2.21.1 (Apple Git-122.3)
>
The patchset needs a changelog entry.
Sorry for late posting, here’s one:
@ChangeLog:
- local space operations are now counted in 0th vclock
component. Every instance may have its own 0-th vclock
component not matching others’. Local space operations
are not replicated at all, even as NOPs. (gh-4114)
- gc consumers are now ordered by their vclocks and not
by vclock signatures. Only the WALS that contain no
entries needed by any of the consumers are deleted. (gh-4114)
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
` (4 preceding siblings ...)
2020-03-31 11:24 ` [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
@ 2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 11:15 ` Serge Petrenko
5 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-04 20:51 UTC (permalink / raw)
To: Serge Petrenko, kostja.osipov; +Cc: tarantool-patches
Hi! Thanks for the patchset!
The branch has one failed job in Travis:
https://travis-ci.org/github/tarantool/tarantool/jobs/668688927?utm_medium=notification&utm_source=github_status
in replication-py/conflict.test.py. Can it be caused
by these changes?
On 30/03/2020 13:04, Serge Petrenko wrote:
> https://github.com/tarantool/tarantool/issues/4114
> https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication
>
> Changes in v5:
> - review fixes as per review from Kostja
>
> Changes in v4:
> - review fixes as per review from Kostja
>
> Changes in v3:
> - rewrite the patches re wal gc rework
> to avoid matrix clock.
>
> Changes in v2:
> - used Georgy's patches re introduction of
> matrix clock and wal gc rework to fix the
> problem with ordering gc consumers by
> vclock signature
>
> Serge Petrenko (4):
> vclock: add an ability to reset individual clock components
> replication: hide 0-th vclock components in replication responses
> gc: rely on minimal vclock components instead of signatures
> box: start counting local space requests separately
>
> src/box/applier.cc | 4 +-
> src/box/box.cc | 29 +++-
> src/box/gc.c | 41 +++---
> src/box/recovery.cc | 8 +-
> src/box/relay.cc | 21 +--
> src/box/vclock.c | 15 +++
> src/box/vclock.h | 102 ++++++++++++++
> src/box/wal.c | 16 ++-
> test/replication/anon.result | 5 +
> test/replication/anon.test.lua | 2 +
> test/replication/autobootstrap.result | 6 +
> test/replication/autobootstrap.test.lua | 2 +
> test/replication/before_replace.result | 8 +-
> test/replication/before_replace.test.lua | 4 +-
> .../gh-4114-local-space-replication.result | 125 ++++++++++++++++++
> .../gh-4114-local-space-replication.test.lua | 48 +++++++
> test/replication/local_spaces.result | 4 +
> test/replication/local_spaces.test.lua | 3 +
> test/replication/misc.result | 6 +
> test/replication/misc.test.lua | 2 +
> test/replication/quorum.result | 6 +
> test/replication/quorum.test.lua | 2 +
> test/replication/replica_rejoin.result | 9 ++
> test/replication/replica_rejoin.test.lua | 3 +
> test/replication/skip_conflict_row.result | 3 +
> test/replication/skip_conflict_row.test.lua | 1 +
> test/replication/suite.cfg | 1 +
> test/vinyl/errinj.result | 5 +
> test/vinyl/errinj.test.lua | 4 +
> 29 files changed, 442 insertions(+), 43 deletions(-)
> create mode 100644 test/replication/gh-4114-local-space-replication.result
> create mode 100644 test/replication/gh-4114-local-space-replication.test.lua
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures Serge Petrenko
2020-03-30 12:57 ` Konstantin Osipov
@ 2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 12:40 ` Serge Petrenko
1 sibling, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-04 20:51 UTC (permalink / raw)
To: Serge Petrenko, kostja.osipov; +Cc: tarantool-patches
Thanks for the patch!
See 5 comments below.
On 30/03/2020 13:04, Serge Petrenko wrote:
> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
> their vclock signature, which is the sum of all vclock components.
> This approach is wrong, and this can be shown by a little example.
> The example will be a little synthetic, but it'll illustrate the problem.
> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
> with id 3. Say, С replicates from both A and B, and there is no replication
> between A and B (say, the instances were reconfigured to not replicate from each
> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
> snapshot and creates a new xlog with signature 10. A’s directory will look like:
> 00…000.xlog 00…010.snap 00….010.xlog
> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
> assumes it can be). Actually, replica still needs 00…00.xlog, because it
> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
> replicated yet.
>
> If instead of using vclock signatures, gc consumers used vclocks, such a problem
> wouldn’t arise. Replia would report its vclock {1:5, 2:13}. The vclock is NOT
1. Replia -> Replica.
> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
> on.
>
> Rewrite gc to perform cleanup based on finding minimal vclock
> components present in at least one of the consumer vclocks instead of
> just comparing vclock signatures.
>
> Prerequisite #4114
> ---
> src/box/gc.c | 41 +++++++++++++++----------
> src/box/vclock.h | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 103 insertions(+), 16 deletions(-)
>
> diff --git a/src/box/gc.c b/src/box/gc.c
> index f5c387f9d..a2d0a515c 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -66,16 +66,15 @@ static int
> gc_checkpoint_fiber_f(va_list);
>
> /**
> - * Comparator used for ordering gc_consumer objects by signature
> - * in a binary tree.
> + * Comparator used for ordering gc_consumer objects
> + * lexicographically by their vclock in a binary tree.
> */
> static inline int
> gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
> {
> - if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
> - return -1;
> - if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
> - return 1;
> + int rc = vclock_lex_compare(&a->vclock, &b->vclock);
> + if (rc != 0)
> + return rc;
> if ((intptr_t)a < (intptr_t)b)
> return -1;
> if ((intptr_t)a > (intptr_t)b)
> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
> /* At least one checkpoint must always be available. */
> assert(checkpoint != NULL);
>
> + /* Find the vclock of the oldest WAL row to keep. */
> + struct vclock min_vclock;
> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
> /*
> - * Find the vclock of the oldest WAL row to keep.
> + * Vclock of the oldest WAL row to keep is a by-component
> + * minimum of all consumer vclocks and the oldest
> + * checkpoint vclock. This ensures that all rows needed by
> + * at least one consumer are kept.
> * Note, we must keep all WALs created after the
> * oldest checkpoint, even if no consumer needs them.
> */
> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
> - &gc_tree_first(&gc.consumers)->vclock);
> - if (vclock == NULL ||
> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
> - vclock = &checkpoint->vclock;
> -
> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
> - vclock_copy(&gc.vclock, vclock);
> + vclock_copy(&min_vclock, &checkpoint->vclock);
> + while (consumer != NULL) {
> + /*
> + * Consumers will never need rows signed
> + * with a zero instance id (local rows).
> + */
> + vclock_min_ignore0(&min_vclock, &consumer->vclock);
> + consumer = gc_tree_next(&gc.consumers, consumer);
> + }
> +
> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
2. Why do we still use sum here?
> + vclock_copy(&gc.vclock, &min_vclock);
> run_wal_gc = true;
> }
>
> @@ -222,7 +231,7 @@ gc_run_cleanup(void)
> if (run_engine_gc)
> engine_collect_garbage(&checkpoint->vclock);
> if (run_wal_gc)
> - wal_collect_garbage(vclock);
> + wal_collect_garbage(&min_vclock);
> }
>
> static int
> diff --git a/src/box/vclock.h b/src/box/vclock.h
> index 2a3a29020..36fcac966 100644
> --- a/src/box/vclock.h
> +++ b/src/box/vclock.h
> @@ -348,6 +348,84 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
> return vclock_compare_generic(a, b, true);
> }
>
> +/**
> + * Compare vclocks lexicographically.
> + * The following affirmation holds: if a < b in terms of
> + * vclock_compare, then a < b in terms of vclock_lex_compare.
> + * However, a < b in terms of vclock_lex_compare doesn't mean
> + * than a < b in terms of vclock_compare. Example:
> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
> + * are incomparable in terms of vclock_compare.
> + * All vclocks are comparable lexicographically.
> + *
> + * \param a vclock
> + * \param b vclock
> + * \retval 1 if \a a is lexicographically greater than \a b
> + * \retval -1 if \a a is lexicographically less than \a b
> + * \retval 0 if vclocks are equal
3. Please, be consistent with other comments' syntax. I.e. use @
instead of \, start sentences with capital letters, end with dots.
> + */
> +static inline int
> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
> +{
> + vclock_map_t map = a->map | b->map;
> + struct bit_iterator it;
> + bit_iterator_init(&it, &map, sizeof(map), true);
> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
> + replica_id = bit_iterator_next(&it)) {
> + int64_t lsn_a = vclock_get(a, replica_id);
> + int64_t lsn_b = vclock_get(b, replica_id);
> + if (lsn_a < lsn_b) return -1;
> + if (lsn_a > lsn_b) return 1;
4. Could you please put if's body on separate lines?
> + }
> + return 0;
> +}
> +
> +/**
> + * Return a vclock, which is a componentwise minimum between
> + * vclocks \a a and \a b.
> + *
> + * \param[in,out] a vclock containing the minimum components.
> + * \param b vclock to compare with.
> + * \param ignore_zero whether to ignore 0th vclock component
> + * or not.
> + */
> +static inline void
> +vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
> +{
> + vclock_map_t map = a->map | b->map;
> + struct bit_iterator it;
> + bit_iterator_init(&it, &map, sizeof(map), true);
> + size_t replica_id = bit_iterator_next(&it);
> + if (replica_id == 0 && ignore_zero)
> + replica_id = bit_iterator_next(&it);
> +
> + for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
> + int64_t lsn_a = vclock_get(a, replica_id);
> + int64_t lsn_b = vclock_get(b, replica_id);
> + if (lsn_a <= lsn_b)
> + continue;
> + vclock_reset(a, replica_id, lsn_b);
> + }
> +}
> +
> +/**
> + * \sa vclock_min_generic
> + */
> +static inline void
> +vclock_min(struct vclock *a, const struct vclock *b)
5. This function is never used, so you can drop it,
rename vclock_min_generic to vclock_min_ignore0, and
inline ignore_zero to true.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
@ 2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-06 8:38 ` Konstantin Osipov
2020-04-07 12:22 ` Serge Petrenko
1 sibling, 2 replies; 22+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-04 20:51 UTC (permalink / raw)
To: Serge Petrenko, kostja.osipov; +Cc: tarantool-patches
Thanks for the patch!
Was it considered to ignore 0 component on receiver's
side rather than on sender's?
I see this:
> This is needed for backward compatibility with old instances, which
> don't ignore 0th vclock component coming from a remote instance by
> default.
But can anon replicas connect to old versions?
I am not saying that it would be better, but I don't
see why technically not.
On 30/03/2020 13:04, Serge Petrenko wrote:
> If an anonymous replica is promoted to a normal one and becomes
> replication master later, its vclock contains a non-empty zero
> component, tracking local changes on this replica from the time when it
> had been anonymous. No need to pollute joining instance's vclock with
> our non-empty 0 component.
> When an anonymous replica reports its status to a remote instance it
> should also hide its 0-th vclock component.
>
> This is needed for backward compatibility with old instances, which
> don't ignore 0th vclock component coming from a remote instance by
> default.
> Also make sure that new instances ignore 0th vclock component.
>
> Follow-up #3186
> Prerequisite #4114
> ---
> src/box/applier.cc | 4 +++-
> src/box/box.cc | 12 ++++++++----
> src/box/relay.cc | 6 ++++--
> test/replication/anon.result | 5 +++++
> test/replication/anon.test.lua | 2 ++
> 5 files changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 47a26c366..f5f67b6a9 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -173,7 +173,9 @@ applier_writer_f(va_list ap)
> continue;
> try {
> struct xrow_header xrow;
> - xrow_encode_vclock(&xrow, &replicaset.vclock);
> + struct vclock vclock;
> + vclock_copy_ignore0(&vclock, &replicaset.vclock);
> + xrow_encode_vclock(&xrow, &vclock);
xrow_encode_vclock without 0 component is needed 4 times.
With 0 it is encoded 2 times. Maybe better add a function
xrow_encode_vclock_ignore0 or like that. Because copy_ignore0
is copying of ~290 bytes. This is several cache lines.
Probably even the original xrow_encode_vclock can appear to be
not needed anywhere.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
@ 2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-06 8:39 ` Konstantin Osipov
1 sibling, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-04 20:51 UTC (permalink / raw)
To: Serge Petrenko, kostja.osipov; +Cc: tarantool-patches
Thanks for the patch!
On 30/03/2020 13:04, Serge Petrenko wrote:
> Anonymous replicas use 0th vclock component to sign local rows.
> vclock_reset will allow to zero-out 0th vclock component when an
> anonymous replica is promoted to a normal one and sends out its vclock
> to other joining instances, as to not pollute their own 0th vclock
> component.
> Also add a shortcut for vclcok_copy() + vclock_reset() for 0th clock
> component: vclock_copy_ignore0()
>
> Follow-up #3186
> Prerequisite #4114
> ---
> src/box/vclock.c | 15 +++++++++++++++
> src/box/vclock.h | 24 ++++++++++++++++++++++++
> 2 files changed, 39 insertions(+)
>
> diff --git a/src/box/vclock.c b/src/box/vclock.c
> index 90ae27591..ac3e9fccd 100644
> --- a/src/box/vclock.c
> +++ b/src/box/vclock.c
> @@ -37,6 +37,21 @@
> #include "diag.h"
> #include "tt_static.h"
>
> +void
> +vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
This is probably already discussed, but why do we have general
reset for arbitrary replica_id+lsn, and yet we have specialized
copy_ignore0? Why this function was not called vclock_reset0() with
replica_id == 0 and lsn == 0 always?
> +{
> + assert(lsn >= 0);
> + assert(replica_id < VCLOCK_MAX);
> + vclock->signature -= vclock_get(vclock, replica_id);
> + if (lsn == 0) {
> + vclock->map &= ~(1 << replica_id);
> + return;
> + }
> + vclock->lsn[replica_id] = lsn;
> + vclock->map |= 1 << replica_id;
> + vclock->signature += lsn;
> +}
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately Serge Petrenko
2020-03-30 12:58 ` Konstantin Osipov
@ 2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 15:48 ` Serge Petrenko
1 sibling, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-04 20:51 UTC (permalink / raw)
To: Serge Petrenko, kostja.osipov; +Cc: tarantool-patches
Thanks for the patch!
See 3 comments below.
> diff --git a/src/box/box.cc b/src/box/box.cc
> index bf95d1b5e..0762266b0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1879,6 +1879,23 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
> say_info("remote vclock %s local vclock %s",
> vclock_to_string(&replica_clock), vclock_to_string(&vclock));
>
> + /*
> + * Replica clock is used in gc state and recovery
> + * initialization, so we need to replace the remote 0-th
> + * component with our own one. This doesn't break
> + * recovery: it finds the WAL with a vclock strictly less
> + * than replia clock in all components except the 0th one.
> + * This leads to finding the correct WAL, if it exists,
> + * since we do not need to recover local rows (the ones,
> + * that contribute to the 0-th vclock component).
> + * Note, that it would be bad to set 0-th vclock component
> + * to a smaller value, since it would unnecessarily
> + * require additional WALs, which may have already been
> + * deleted.
1. But you just said above, that 0 component is ignored in all
comparators anyway. So what is a problem of setting 0 component
to 0?
> + * Speaking of gc, remote instances' local vclock
> + * components are not used by consumers at all.
> + */
> + vclock_reset(&replica_clock, 0, vclock_get(&replicaset.vclock, 0));
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 3b094b0e8..a74bdecd9 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -953,13 +953,19 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
> /** Assign LSN to all local rows. */
> for ( ; row < end; row++) {
> if ((*row)->replica_id == 0) {
> - (*row)->lsn = vclock_inc(vclock_diff, instance_id) +
> - vclock_get(base, instance_id);
> /*
> - * Note, an anonymous replica signs local
> - * rows whith a zero instance id.
> + * All rows representing local space data
> + * manipulations are signed wth a zero
> + * instance id. This is also true for
> + * anonymous replicas, since they are
> + * only capable of writing to local and
> + * temporary spaces.
> */
> - (*row)->replica_id = instance_id;
> + if ((*row)->group_id != GROUP_LOCAL)
> + (*row)->replica_id = instance_id;
> +
> + (*row)->lsn = vclock_inc(vclock_diff, (*row)->replica_id) +
> + vclock_get(base, (*row)->replica_id);
2. Bad indentation. vlock_get() should be aligned under vclock_inc().
> /* Use lsn of the first local row as transaction id. */
> tsn = tsn == 0 ? (*row)->lsn : tsn;
> (*row)->tsn = tsn;
3. I run the new test without your patch, and it just hangs. No
xloggaperror like in the ticket. Did it disappear thanks to the
previous commit, about GC?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses
2020-04-04 20:51 ` Vladislav Shpilevoy
@ 2020-04-06 8:38 ` Konstantin Osipov
2020-04-07 12:22 ` Serge Petrenko
1 sibling, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2020-04-06 8:38 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/05 05:06]:
> xrow_encode_vclock without 0 component is needed 4 times.
> With 0 it is encoded 2 times. Maybe better add a function
> xrow_encode_vclock_ignore0 or like that. Because copy_ignore0
> is copying of ~290 bytes. This is several cache lines.
+
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components
2020-04-04 20:51 ` Vladislav Shpilevoy
@ 2020-04-06 8:39 ` Konstantin Osipov
2020-04-07 11:48 ` Serge Petrenko
0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2020-04-06 8:39 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/05 05:06]:
> > +void
> > +vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
>
> This is probably already discussed, but why do we have general
> reset for arbitrary replica_id+lsn, and yet we have specialized
> copy_ignore0? Why this function was not called vclock_reset0() with
> replica_id == 0 and lsn == 0 always?
It's historical. I also suggest the function should be inlined and
optimized into callers.
>
> > +{
> > + assert(lsn >= 0);
> > + assert(replica_id < VCLOCK_MAX);
> > + vclock->signature -= vclock_get(vclock, replica_id);
> > + if (lsn == 0) {
> > + vclock->map &= ~(1 << replica_id);
> > + return;
> > + }
> > + vclock->lsn[replica_id] = lsn;
> > + vclock->map |= 1 << replica_id;
> > + vclock->signature += lsn;
> > +}
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking
2020-04-04 20:51 ` Vladislav Shpilevoy
@ 2020-04-07 11:15 ` Serge Petrenko
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-04-07 11:15 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
> 4 апр. 2020 г., в 23:51, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
>
> Hi! Thanks for the patchset!
>
> The branch has one failed job in Travis:
> https://travis-ci.org/github/tarantool/tarantool/jobs/668688927?utm_medium=notification&utm_source=github_status
> in replication-py/conflict.test.py. Can it be caused
> by these changes?
Hi! Don’t think so. I couldn’t reproduce the failure on my machine.
>
> On 30/03/2020 13:04, Serge Petrenko wrote:
>> https://github.com/tarantool/tarantool/issues/4114
>> https://github.com/tarantool/tarantool/tree/sp/gh-4114-local-space-replication
>>
>> Changes in v5:
>> - review fixes as per review from Kostja
>>
>> Changes in v4:
>> - review fixes as per review from Kostja
>>
>> Changes in v3:
>> - rewrite the patches re wal gc rework
>> to avoid matrix clock.
>>
>> Changes in v2:
>> - used Georgy's patches re introduction of
>> matrix clock and wal gc rework to fix the
>> problem with ordering gc consumers by
>> vclock signature
>>
>> Serge Petrenko (4):
>> vclock: add an ability to reset individual clock components
>> replication: hide 0-th vclock components in replication responses
>> gc: rely on minimal vclock components instead of signatures
>> box: start counting local space requests separately
>>
>> src/box/applier.cc | 4 +-
>> src/box/box.cc | 29 +++-
>> src/box/gc.c | 41 +++---
>> src/box/recovery.cc | 8 +-
>> src/box/relay.cc | 21 +--
>> src/box/vclock.c | 15 +++
>> src/box/vclock.h | 102 ++++++++++++++
>> src/box/wal.c | 16 ++-
>> test/replication/anon.result | 5 +
>> test/replication/anon.test.lua | 2 +
>> test/replication/autobootstrap.result | 6 +
>> test/replication/autobootstrap.test.lua | 2 +
>> test/replication/before_replace.result | 8 +-
>> test/replication/before_replace.test.lua | 4 +-
>> .../gh-4114-local-space-replication.result | 125 ++++++++++++++++++
>> .../gh-4114-local-space-replication.test.lua | 48 +++++++
>> test/replication/local_spaces.result | 4 +
>> test/replication/local_spaces.test.lua | 3 +
>> test/replication/misc.result | 6 +
>> test/replication/misc.test.lua | 2 +
>> test/replication/quorum.result | 6 +
>> test/replication/quorum.test.lua | 2 +
>> test/replication/replica_rejoin.result | 9 ++
>> test/replication/replica_rejoin.test.lua | 3 +
>> test/replication/skip_conflict_row.result | 3 +
>> test/replication/skip_conflict_row.test.lua | 1 +
>> test/replication/suite.cfg | 1 +
>> test/vinyl/errinj.result | 5 +
>> test/vinyl/errinj.test.lua | 4 +
>> 29 files changed, 442 insertions(+), 43 deletions(-)
>> create mode 100644 test/replication/gh-4114-local-space-replication.result
>> create mode 100644 test/replication/gh-4114-local-space-replication.test.lua
>>
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components
2020-04-06 8:39 ` Konstantin Osipov
@ 2020-04-07 11:48 ` Serge Petrenko
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-04-07 11:48 UTC (permalink / raw)
To: Konstantin Osipov, Vladislav Shpilevoy; +Cc: tml
Hi! Thanks for the review!
> 6 апр. 2020 г., в 11:39, Konstantin Osipov <kostja.osipov@gmail.com> написал(а):
>
> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/04/05 05:06]:
>>> +void
>>> +vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
>>
>> This is probably already discussed, but why do we have general
>> reset for arbitrary replica_id+lsn, and yet we have specialized
>> copy_ignore0? Why this function was not called vclock_reset0() with
>> replica_id == 0 and lsn == 0 always?
vclock_reset() is used in vclock_min() later. For arbitrary replica_id and
arbitrary lsn.
>
> It's historical. I also suggest the function should be inlined and
> optimized into callers.
Okay.
>>
>>> +{
>>> + assert(lsn >= 0);
>>> + assert(replica_id < VCLOCK_MAX);
>>> + vclock->signature -= vclock_get(vclock, replica_id);
>>> + if (lsn == 0) {
>>> + vclock->map &= ~(1 << replica_id);
>>> + return;
>>> + }
>>> + vclock->lsn[replica_id] = lsn;
>>> + vclock->map |= 1 << replica_id;
>>> + vclock->signature += lsn;
>>> +}
>
> --
> Konstantin Osipov, Moscow, Russia
diff --git a/src/box/vclock.c b/src/box/vclock.c
index ac3e9fccd..90ae27591 100644
--- a/src/box/vclock.c
+++ b/src/box/vclock.c
@@ -37,21 +37,6 @@
#include "diag.h"
#include "tt_static.h"
-void
-vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
-{
- assert(lsn >= 0);
- assert(replica_id < VCLOCK_MAX);
- vclock->signature -= vclock_get(vclock, replica_id);
- if (lsn == 0) {
- vclock->map &= ~(1 << replica_id);
- return;
- }
- vclock->lsn[replica_id] = lsn;
- vclock->map |= 1 << replica_id;
- vclock->signature += lsn;
-}
-
int64_t
vclock_follow(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
{
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 2a3a29020..6438e63e4 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -192,8 +192,20 @@ vclock_inc(struct vclock *vclock, uint32_t replica_id)
* @param replica_id Replica identifier.
* @param lsn Lsn to set
*/
-void
-vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn);
+static inline void
+vclock_reset(struct vclock *vclock, uint32_t replica_id, int64_t lsn)
+{
+ assert(lsn >= 0);
+ assert(replica_id < VCLOCK_MAX);
+ vclock->signature -= vclock_get(vclock, replica_id);
+ if (lsn == 0) {
+ vclock->map &= ~(1 << replica_id);
+ return;
+ }
+ vclock->lsn[replica_id] = lsn;
+ vclock->map |= 1 << replica_id;
+ vclock->signature += lsn;
+}
static inline void
vclock_copy(struct vclock *dst, const struct vclock *src)
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-06 8:38 ` Konstantin Osipov
@ 2020-04-07 12:22 ` Serge Petrenko
1 sibling, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-04-07 12:22 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
Hi! Thanks for the review!
> 4 апр. 2020 г., в 23:51, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
>
> Thanks for the patch!
>
> Was it considered to ignore 0 component on receiver's
> side rather than on sender's?
>
> I see this:
>
>> This is needed for backward compatibility with old instances, which
>> don't ignore 0th vclock component coming from a remote instance by
>> default.
>
> But can anon replicas connect to old versions?
Anon replicas can be promoted to normal ones, and then connect
to old versions, while still having a non-zero vclock[0].
>
> I am not saying that it would be better, but I don't
> see why technically not.
>
> On 30/03/2020 13:04, Serge Petrenko wrote:
>> If an anonymous replica is promoted to a normal one and becomes
>> replication master later, its vclock contains a non-empty zero
>> component, tracking local changes on this replica from the time when it
>> had been anonymous. No need to pollute joining instance's vclock with
>> our non-empty 0 component.
>> When an anonymous replica reports its status to a remote instance it
>> should also hide its 0-th vclock component.
>>
>> This is needed for backward compatibility with old instances, which
>> don't ignore 0th vclock component coming from a remote instance by
>> default.
>> Also make sure that new instances ignore 0th vclock component.
>>
>> Follow-up #3186
>> Prerequisite #4114
>> ---
>> src/box/applier.cc | 4 +++-
>> src/box/box.cc | 12 ++++++++----
>> src/box/relay.cc | 6 ++++--
>> test/replication/anon.result | 5 +++++
>> test/replication/anon.test.lua | 2 ++
>> 5 files changed, 22 insertions(+), 7 deletions(-)
>>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 47a26c366..f5f67b6a9 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -173,7 +173,9 @@ applier_writer_f(va_list ap)
>> continue;
>> try {
>> struct xrow_header xrow;
>> - xrow_encode_vclock(&xrow, &replicaset.vclock);
>> + struct vclock vclock;
>> + vclock_copy_ignore0(&vclock, &replicaset.vclock);
>> + xrow_encode_vclock(&xrow, &vclock);
>
> xrow_encode_vclock without 0 component is needed 4 times.
> With 0 it is encoded 2 times. Maybe better add a function
> xrow_encode_vclock_ignore0 or like that. Because copy_ignore0
> is copying of ~290 bytes. This is several cache lines.
Okay, will do.
>
> Probably even the original xrow_encode_vclock can appear to be
> not needed anywhere.
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures
2020-04-04 20:51 ` Vladislav Shpilevoy
@ 2020-04-07 12:40 ` Serge Petrenko
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-04-07 12:40 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
> 4 апр. 2020 г., в 23:51, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
>
> Thanks for the patch!
>
Thanks for the review!
> See 5 comments below.
>
> On 30/03/2020 13:04, Serge Petrenko wrote:
>> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
>> their vclock signature, which is the sum of all vclock components.
>> This approach is wrong, and this can be shown by a little example.
>> The example will be a little synthetic, but it'll illustrate the problem.
>> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
>> with id 3. Say, С replicates from both A and B, and there is no replication
>> between A and B (say, the instances were reconfigured to not replicate from each
>> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
>> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
>> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
>> snapshot and creates a new xlog with signature 10. A’s directory will look like:
>> 00…000.xlog 00…010.snap 00….010.xlog
>> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
>> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
>> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
>> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
>> assumes it can be). Actually, replica still needs 00…00.xlog, because it
>> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
>> replicated yet.
>>
>> If instead of using vclock signatures, gc consumers used vclocks, such a problem
>> wouldn’t arise. Replia would report its vclock {1:5, 2:13}. The vclock is NOT
>
> 1. Replia -> Replica.
Yep, thanks.
>
>> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
>> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
>> on.
>>
>> Rewrite gc to perform cleanup based on finding minimal vclock
>> components present in at least one of the consumer vclocks instead of
>> just comparing vclock signatures.
>>
>> Prerequisite #4114
>> ---
>> src/box/gc.c | 41 +++++++++++++++----------
>> src/box/vclock.h | 78 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 103 insertions(+), 16 deletions(-)
>>
>> diff --git a/src/box/gc.c b/src/box/gc.c
>> index f5c387f9d..a2d0a515c 100644
>> --- a/src/box/gc.c
>> +++ b/src/box/gc.c
>> @@ -66,16 +66,15 @@ static int
>> gc_checkpoint_fiber_f(va_list);
>>
>> /**
>> - * Comparator used for ordering gc_consumer objects by signature
>> - * in a binary tree.
>> + * Comparator used for ordering gc_consumer objects
>> + * lexicographically by their vclock in a binary tree.
>> */
>> static inline int
>> gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
>> {
>> - if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
>> - return -1;
>> - if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
>> - return 1;
>> + int rc = vclock_lex_compare(&a->vclock, &b->vclock);
>> + if (rc != 0)
>> + return rc;
>> if ((intptr_t)a < (intptr_t)b)
>> return -1;
>> if ((intptr_t)a > (intptr_t)b)
>> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
>> /* At least one checkpoint must always be available. */
>> assert(checkpoint != NULL);
>>
>> + /* Find the vclock of the oldest WAL row to keep. */
>> + struct vclock min_vclock;
>> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
>> /*
>> - * Find the vclock of the oldest WAL row to keep.
>> + * Vclock of the oldest WAL row to keep is a by-component
>> + * minimum of all consumer vclocks and the oldest
>> + * checkpoint vclock. This ensures that all rows needed by
>> + * at least one consumer are kept.
>> * Note, we must keep all WALs created after the
>> * oldest checkpoint, even if no consumer needs them.
>> */
>> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
>> - &gc_tree_first(&gc.consumers)->vclock);
>> - if (vclock == NULL ||
>> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
>> - vclock = &checkpoint->vclock;
>> -
>> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
>> - vclock_copy(&gc.vclock, vclock);
>> + vclock_copy(&min_vclock, &checkpoint->vclock);
>> + while (consumer != NULL) {
>> + /*
>> + * Consumers will never need rows signed
>> + * with a zero instance id (local rows).
>> + */
>> + vclock_min_ignore0(&min_vclock, &consumer->vclock);
>> + consumer = gc_tree_next(&gc.consumers, consumer);
>> + }
>> +
>> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
>
> 2. Why do we still use sum here?
Doesn’t matter. In this case min_vclock can only get bigger over time,
so comparing signatures is the same as comparing vclocks.
We’re just checking whether min_vclock has changed, and if it has,
it got bigger.
>
>> + vclock_copy(&gc.vclock, &min_vclock);
>> run_wal_gc = true;
>> }
>>
>> @@ -222,7 +231,7 @@ gc_run_cleanup(void)
>> if (run_engine_gc)
>> engine_collect_garbage(&checkpoint->vclock);
>> if (run_wal_gc)
>> - wal_collect_garbage(vclock);
>> + wal_collect_garbage(&min_vclock);
>> }
>>
>> static int
>> diff --git a/src/box/vclock.h b/src/box/vclock.h
>> index 2a3a29020..36fcac966 100644
>> --- a/src/box/vclock.h
>> +++ b/src/box/vclock.h
>> @@ -348,6 +348,84 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
>> return vclock_compare_generic(a, b, true);
>> }
>>
>> +/**
>> + * Compare vclocks lexicographically.
>> + * The following affirmation holds: if a < b in terms of
>> + * vclock_compare, then a < b in terms of vclock_lex_compare.
>> + * However, a < b in terms of vclock_lex_compare doesn't mean
>> + * than a < b in terms of vclock_compare. Example:
>> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
>> + * are incomparable in terms of vclock_compare.
>> + * All vclocks are comparable lexicographically.
>> + *
>> + * \param a vclock
>> + * \param b vclock
>> + * \retval 1 if \a a is lexicographically greater than \a b
>> + * \retval -1 if \a a is lexicographically less than \a b
>> + * \retval 0 if vclocks are equal
> 3. Please, be consistent with other comments' syntax. I.e. use @
> instead of \, start sentences with capital letters, end with dots.
Ok, will fix.
>
>> + */
>> +static inline int
>> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
>> +{
>> + vclock_map_t map = a->map | b->map;
>> + struct bit_iterator it;
>> + bit_iterator_init(&it, &map, sizeof(map), true);
>> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
>> + replica_id = bit_iterator_next(&it)) {
>> + int64_t lsn_a = vclock_get(a, replica_id);
>> + int64_t lsn_b = vclock_get(b, replica_id);
>> + if (lsn_a < lsn_b) return -1;
>> + if (lsn_a > lsn_b) return 1;
>
> 4. Could you please put if's body on separate lines?
No problem.
>
>> + }
>> + return 0;
>> +}
>> +
>> +/**
>> + * Return a vclock, which is a componentwise minimum between
>> + * vclocks \a a and \a b.
>> + *
>> + * \param[in,out] a vclock containing the minimum components.
>> + * \param b vclock to compare with.
>> + * \param ignore_zero whether to ignore 0th vclock component
>> + * or not.
>> + */
>> +static inline void
>> +vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
>> +{
>> + vclock_map_t map = a->map | b->map;
>> + struct bit_iterator it;
>> + bit_iterator_init(&it, &map, sizeof(map), true);
>> + size_t replica_id = bit_iterator_next(&it);
>> + if (replica_id == 0 && ignore_zero)
>> + replica_id = bit_iterator_next(&it);
>> +
>> + for( ; replica_id < VCLOCK_MAX; replica_id = bit_iterator_next(&it)) {
>> + int64_t lsn_a = vclock_get(a, replica_id);
>> + int64_t lsn_b = vclock_get(b, replica_id);
>> + if (lsn_a <= lsn_b)
>> + continue;
>> + vclock_reset(a, replica_id, lsn_b);
>> + }
>> +}
>> +
>> +/**
>> + * \sa vclock_min_generic
>> + */
>> +static inline void
>> +vclock_min(struct vclock *a, const struct vclock *b)
>
> 5. This function is never used, so you can drop it,
> rename vclock_min_generic to vclock_min_ignore0, and
> inline ignore_zero to true.
Ok.
diff --git a/src/box/vclock.h b/src/box/vclock.h
index b79895e31..ed519f66b 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -365,11 +365,11 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
* are incomparable in terms of vclock_compare.
* All vclocks are comparable lexicographically.
*
- * \param a vclock
- * \param b vclock
- * \retval 1 if \a a is lexicographically greater than \a b
- * \retval -1 if \a a is lexicographically less than \a b
- * \retval 0 if vclocks are equal
+ * @param a Vclock.
+ * @param b Vclock.
+ * @retval 1 If @a a is lexicographically greater than @a b.
+ * @retval -1 If @a a is lexicographically less than @a b.
+ * @retval 0 If vclocks are equal.
*/
static inline int
vclock_lex_compare(const struct vclock *a, const struct vclock *b)
@@ -381,23 +381,23 @@ vclock_lex_compare(const struct vclock *a, const struct vclock *b)
replica_id = bit_iterator_next(&it)) {
int64_t lsn_a = vclock_get(a, replica_id);
int64_t lsn_b = vclock_get(b, replica_id);
- if (lsn_a < lsn_b) return -1;
- if (lsn_a > lsn_b) return 1;
+ if (lsn_a < lsn_b)
+ return -1;
+ if (lsn_a > lsn_b)
+ return 1;
}
return 0;
}
/**
* Return a vclock, which is a componentwise minimum between
- * vclocks \a a and \a b.
+ * vclocks @a a and @a b. Do not take vclock[0] into account.
*
- * \param[in,out] a vclock containing the minimum components.
- * \param b vclock to compare with.
- * \param ignore_zero whether to ignore 0th vclock component
- * or not.
+ * @param[in,out] a Vclock containing the minimum components.
+ * @param b Vclock to compare with.
*/
static inline void
-vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
+vclock_min_ignore0(struct vclock *a, const struct vclock *b)
{
vclock_map_t map = a->map | b->map;
struct bit_iterator it;
@@ -415,24 +415,6 @@ vclock_min_generic(struct vclock *a, const struct vclock *b, bool ignore_zero)
}
}
-/**
- * \sa vclock_min_generic
- */
-static inline void
-vclock_min(struct vclock *a, const struct vclock *b)
-{
- return vclock_min_generic(a, b, false);
-}
-
-/**
- * \sa vclock_min_generic
- */
-static inline void
-vclock_min_ignore0(struct vclock *a, const struct vclock *b)
-{
- return vclock_min_generic(a, b, true);
-}
-
/**
* @brief vclockset - a set of vclocks
*/
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately
2020-04-04 20:51 ` Vladislav Shpilevoy
@ 2020-04-07 15:48 ` Serge Petrenko
0 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko @ 2020-04-07 15:48 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
> 4 апр. 2020 г., в 23:51, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а):
>
> Thanks for the patch!
Thanks for the review!
>
> See 3 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index bf95d1b5e..0762266b0 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1879,6 +1879,23 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
>> say_info("remote vclock %s local vclock %s",
>> vclock_to_string(&replica_clock), vclock_to_string(&vclock));
>>
>> + /*
>> + * Replica clock is used in gc state and recovery
>> + * initialization, so we need to replace the remote 0-th
>> + * component with our own one. This doesn't break
>> + * recovery: it finds the WAL with a vclock strictly less
>> + * than replia clock in all components except the 0th one.
>> + * This leads to finding the correct WAL, if it exists,
>> + * since we do not need to recover local rows (the ones,
>> + * that contribute to the 0-th vclock component).
>> + * Note, that it would be bad to set 0-th vclock component
>> + * to a smaller value, since it would unnecessarily
>> + * require additional WALs, which may have already been
>> + * deleted.
>
> 1. But you just said above, that 0 component is ignored in all
> comparators anyway. So what is a problem of setting 0 component
> to 0?
I didn’t. vclock[0] is ignored in gc_consumer objects now,
but when initializing recovery vclock[0] cannot be ignored.
Otherwise the instance itself would fail to recover local
rows on local recovery.
So relay recovery needs to be initialized with some valid vclock[0]
value.
Valid means «big enough to not require wals that have already been
deleted»
>
>> + * Speaking of gc, remote instances' local vclock
>> + * components are not used by consumers at all.
>> + */
>> + vclock_reset(&replica_clock, 0, vclock_get(&replicaset.vclock, 0));
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 3b094b0e8..a74bdecd9 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -953,13 +953,19 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>> /** Assign LSN to all local rows. */
>> for ( ; row < end; row++) {
>> if ((*row)->replica_id == 0) {
>> - (*row)->lsn = vclock_inc(vclock_diff, instance_id) +
>> - vclock_get(base, instance_id);
>> /*
>> - * Note, an anonymous replica signs local
>> - * rows whith a zero instance id.
>> + * All rows representing local space data
>> + * manipulations are signed wth a zero
>> + * instance id. This is also true for
>> + * anonymous replicas, since they are
>> + * only capable of writing to local and
>> + * temporary spaces.
>> */
>> - (*row)->replica_id = instance_id;
>> + if ((*row)->group_id != GROUP_LOCAL)
>> + (*row)->replica_id = instance_id;
>> +
>> + (*row)->lsn = vclock_inc(vclock_diff, (*row)->replica_id) +
>> + vclock_get(base, (*row)->replica_id);
>
> 2. Bad indentation. vlock_get() should be aligned under vclock_inc().
Sorry, my bad. Fixed.
>
>> /* Use lsn of the first local row as transaction id. */
>> tsn = tsn == 0 ? (*row)->lsn : tsn;
>> (*row)->tsn = tsn;
>
> 3. I run the new test without your patch, and it just hangs. No
> xloggaperror like in the ticket. Did it disappear thanks to the
> previous commit, about GC?
Hmm, that’s an interesting bug (or a peculiarity?)
After deleting the old xlog replica is left with only one .snap file,
and doesn’t produce an error regarding «no wal between lsn x and y».
Master is left hanging until replica inserts some new data,
and opens a WAL which cannot be recovered from.
I worked it around.
diff --git a/src/box/wal.c b/src/box/wal.c
index a74bdecd9..1eb20272c 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -965,7 +965,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
(*row)->replica_id = instance_id;
(*row)->lsn = vclock_inc(vclock_diff, (*row)->replica_id) +
- vclock_get(base, (*row)->replica_id);
+ vclock_get(base, (*row)->replica_id);
/* Use lsn of the first local row as transaction id. */
tsn = tsn == 0 ? (*row)->lsn : tsn;
(*row)->tsn = tsn;
diff --git a/test/replication/gh-4114-local-space-replication.result b/test/replication/gh-4114-local-space-replication.result
index e524c9a1b..4a71bb6dc 100644
--- a/test/replication/gh-4114-local-space-replication.result
+++ b/test/replication/gh-4114-local-space-replication.result
@@ -76,10 +76,7 @@ box.space.test:insert{3}
| ---
| - [3]
| ...
-box.snapshot()
- | ---
- | - ok
- | ...
+
box.info.vclock[0]
| ---
diff --git a/test/replication/gh-4114-local-space-replication.test.lua b/test/replication/gh-4114-local-space-replication.test.lua
index 26dccee68..114c592c5 100644
--- a/test/replication/gh-4114-local-space-replication.test.lua
+++ b/test/replication/gh-4114-local-space-replication.test.lua
@@ -26,7 +26,7 @@ box.snapshot()
box.space.test:insert{2}
box.snapshot()
box.space.test:insert{3}
-box.snapshot()
+
box.info.vclock[0]
--
Serge Petrenko
sergepetrenko@tarantool.org
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2020-04-07 15:48 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-30 11:04 [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 1/4] vclock: add an ability to reset individual clock components Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-06 8:39 ` Konstantin Osipov
2020-04-07 11:48 ` Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 2/4] replication: hide 0-th vclock components in replication responses Serge Petrenko
2020-03-30 12:56 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-06 8:38 ` Konstantin Osipov
2020-04-07 12:22 ` Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 3/4] gc: rely on minimal vclock components instead of signatures Serge Petrenko
2020-03-30 12:57 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 12:40 ` Serge Petrenko
2020-03-30 11:04 ` [Tarantool-patches] [PATCH v5 4/4] box: start counting local space requests separately Serge Petrenko
2020-03-30 12:58 ` Konstantin Osipov
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 15:48 ` Serge Petrenko
2020-03-31 11:24 ` [Tarantool-patches] [PATCH v5 0/4] replication: fix local space tracking Serge Petrenko
2020-04-04 20:51 ` Vladislav Shpilevoy
2020-04-07 11:15 ` Serge Petrenko
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox