[Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Fri Apr 10 02:08:24 MSK 2020
Hi! Thanks for the patch!
See 4 questions below.
On 07/04/2020 17:49, Serge Petrenko wrote:
> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
> their vclock signature, which is the sum of all vclock components.
> This approach is wrong, and this can be shown by a little example.
> The example will be a little synthetic, but it'll illustrate the problem.
> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
> with id 3. Say, С replicates from both A and B, and there is no replication
> between A and B (say, the instances were reconfigured to not replicate from each
> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
> snapshot and creates a new xlog with signature 10. A’s directory will look like:
> 00…000.xlog 00…010.snap 00….010.xlog
> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
> assumes it can be). Actually, replica still needs 00…00.xlog, because it
> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
> replicated yet.
>
> If instead of using vclock signatures, gc consumers used vclocks, such a problem
> wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
> on.
1.
"
It looks like B will always delete its snaps immediately, because C will
send vclock lexicographically always bigger than B's.
"
I wrote the comment above and only then realized that vclock_min() is not really
min between 2 values. It is min for every of 32 components of vclock. So the
problem above won't happen, right?
B's "min" vclock in gc consumers will be
{0, min(B last snap, C last received row), 0, 0, ...}
Correct? It won't see A's part of C's vclock.
> Rewrite gc to perform cleanup based on finding minimal vclock
> components present in at least one of the consumer vclocks instead of
> just comparing vclock signatures.
>
> Prerequisite #4114
> ---
> src/box/gc.c | 41 ++++++++++++++---------
> src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 110 insertions(+), 16 deletions(-)
>
> diff --git a/src/box/gc.c b/src/box/gc.c
> index f5c387f9d..a2d0a515c 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
> /* At least one checkpoint must always be available. */
> assert(checkpoint != NULL);
>
> + /* Find the vclock of the oldest WAL row to keep. */
> + struct vclock min_vclock;
> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
> /*
> - * Find the vclock of the oldest WAL row to keep.
> + * Vclock of the oldest WAL row to keep is a by-component
> + * minimum of all consumer vclocks and the oldest
> + * checkpoint vclock. This ensures that all rows needed by
> + * at least one consumer are kept.
> * Note, we must keep all WALs created after the
> * oldest checkpoint, even if no consumer needs them.
> */
> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
> - &gc_tree_first(&gc.consumers)->vclock);
> - if (vclock == NULL ||
> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
> - vclock = &checkpoint->vclock;
> -
> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
> - vclock_copy(&gc.vclock, vclock);
> + vclock_copy(&min_vclock, &checkpoint->vclock);
> + while (consumer != NULL) {
> + /*
> + * Consumers will never need rows signed
> + * with a zero instance id (local rows).
> + */
> + vclock_min_ignore0(&min_vclock, &consumer->vclock);
> + consumer = gc_tree_next(&gc.consumers, consumer);
2.
"
Can you use gc_tree_last()? Without vclock_min_ignore0 and manual scan.
Also since vclock[0] is never sent (mp_encode_vclock() now ignores it),
what are the consumers having vclock[0] != 0 here? If there are only local,
does it matter? They will be ordered always ok.
"
I wrote the comment above also before realized that vclock_min() is not a
normal min(). So it should be irrelevant, yes?
3. Should not we patch gc_advance() too? It still uses sum().
> + }
> +
> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
> + vclock_copy(&gc.vclock, &min_vclock);
> run_wal_gc = true;
> }
> diff --git a/src/box/vclock.h b/src/box/vclock.h
> index 5c0525b00..5865f7443 100644
> --- a/src/box/vclock.h
> +++ b/src/box/vclock.h
> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
> return vclock_compare_generic(a, b, true);
> }
>
> +/**
> + * Compare vclocks lexicographically.
> + * The following affirmation holds: if a < b in terms of
> + * vclock_compare, then a < b in terms of vclock_lex_compare.
> + * However, a < b in terms of vclock_lex_compare doesn't mean
> + * than a < b in terms of vclock_compare. Example:
> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
> + * are incomparable in terms of vclock_compare.
> + * All vclocks are comparable lexicographically.
> + *
> + * @param a Vclock.
> + * @param b Vclock.
> + * @retval 1 If @a a is lexicographically greater than @a b.
> + * @retval -1 If @a a is lexicographically less than @a b.
> + * @retval 0 If vclocks are equal.
> + */
> +static inline int
> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
4. What are cases, when vclock_lex_compare() can't replace vclock_compare()?
> +{
> + vclock_map_t map = a->map | b->map;
> + struct bit_iterator it;
> + bit_iterator_init(&it, &map, sizeof(map), true);
> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
> + replica_id = bit_iterator_next(&it)) {
> + int64_t lsn_a = vclock_get(a, replica_id);
> + int64_t lsn_b = vclock_get(b, replica_id);
> + if (lsn_a < lsn_b)
> + return -1;
> + if (lsn_a > lsn_b)
> + return 1;
> + }
> + return 0;
> +}
More information about the Tarantool-patches
mailing list