[Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures
Serge Petrenko
sergepetrenko at tarantool.org
Fri Apr 10 15:25:16 MSK 2020
> 10 апр. 2020 г., в 02:08, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> написал(а):
>
> Hi! Thanks for the patch!
Hi! Thanks for your review!
>
> See 4 questions below.
>
> On 07/04/2020 17:49, Serge Petrenko wrote:
>> The current WAL GC implementation tracks consumers (i.e. remote replicas) by
>> their vclock signature, which is the sum of all vclock components.
>> This approach is wrong, and this can be shown by a little example.
>> The example will be a little synthetic, but it'll illustrate the problem.
>> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C
>> with id 3. Say, С replicates from both A and B, and there is no replication
>> between A and B (say, the instances were reconfigured to not replicate from each
>> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the
>> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s
>> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a
>> snapshot and creates a new xlog with signature 10. A’s directory will look like:
>> 00…000.xlog 00…010.snap 00….010.xlog
>> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the
>> corresponding GC consumer. Since signatures are used, GC consumer is assigned a
>> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on
>> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A
>> assumes it can be). Actually, replica still needs 00…00.xlog, because it
>> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been
>> replicated yet.
>>
>> If instead of using vclock signatures, gc consumers used vclocks, such a problem
>> wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT
>> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log
>> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so
>> on.
>
> 1.
> "
> It looks like B will always delete its snaps immediately, because C will
> send vclock lexicographically always bigger than B's.
> "
I think you meant xlogs. Snaps are deleted only when a new snap is created and
total snap count exceeds `box.cfg.checkpoint_count`
Anyway, your remark below is correct.
>
> I wrote the comment above and only then realized that vclock_min() is not really
> min between 2 values. It is min for every of 32 components of vclock. So the
> problem above won't happen, right?
>
> B's "min" vclock in gc consumers will be
>
> {0, min(B last snap, C last received row), 0, 0, ...}
>
> Correct? It won't see A's part of C's vclock.
Yes, {1:0, 2:min(BB last snap, C last received row), 3:0, ….}.
>
>> Rewrite gc to perform cleanup based on finding minimal vclock
>> components present in at least one of the consumer vclocks instead of
>> just comparing vclock signatures.
>>
>> Prerequisite #4114
>> ---
>> src/box/gc.c | 41 ++++++++++++++---------
>> src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 110 insertions(+), 16 deletions(-)
>>
>> diff --git a/src/box/gc.c b/src/box/gc.c
>> index f5c387f9d..a2d0a515c 100644
>> --- a/src/box/gc.c
>> +++ b/src/box/gc.c
>> @@ -187,19 +186,29 @@ gc_run_cleanup(void)
>> /* At least one checkpoint must always be available. */
>> assert(checkpoint != NULL);
>>
>> + /* Find the vclock of the oldest WAL row to keep. */
>> + struct vclock min_vclock;
>> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
>> /*
>> - * Find the vclock of the oldest WAL row to keep.
>> + * Vclock of the oldest WAL row to keep is a by-component
>> + * minimum of all consumer vclocks and the oldest
>> + * checkpoint vclock. This ensures that all rows needed by
>> + * at least one consumer are kept.
>> * Note, we must keep all WALs created after the
>> * oldest checkpoint, even if no consumer needs them.
>> */
>> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
>> - &gc_tree_first(&gc.consumers)->vclock);
>> - if (vclock == NULL ||
>> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
>> - vclock = &checkpoint->vclock;
>> -
>> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
>> - vclock_copy(&gc.vclock, vclock);
>> + vclock_copy(&min_vclock, &checkpoint->vclock);
>> + while (consumer != NULL) {
>> + /*
>> + * Consumers will never need rows signed
>> + * with a zero instance id (local rows).
>> + */
>> + vclock_min_ignore0(&min_vclock, &consumer->vclock);
>> + consumer = gc_tree_next(&gc.consumers, consumer);
>
> 2.
> "
> Can you use gc_tree_last()? Without vclock_min_ignore0 and manual scan.
>
> Also since vclock[0] is never sent (mp_encode_vclock() now ignores it),
> what are the consumers having vclock[0] != 0 here? If there are only local,
> does it matter? They will be ordered always ok.
> "
consumers have the master instance’s vclock[0] component. It is just a leftover
from recovery initialization, and I didn’t bother clearing it.
So, the consumer, who is the last to report its status, will have the biggest
vclock[0], if master writes to local spaces at all.
No, gc_tree_last() (did you mean gc_tree_first() ? ) cannot be used here.
It’ll return the consumer with the biggest (smallest) vclock[0].
Which is the last (first) one to report its status and not necessarily the
most (least) up to date one.
>
> I wrote the comment above also before realized that vclock_min() is not a
> normal min(). So it should be irrelevant, yes?
Yep.
>
>
> 3. Should not we patch gc_advance() too? It still uses sum().
Yes! Thanks for noticing.
diff --git a/src/box/gc.c b/src/box/gc.c
index a2d0a515c..36b2d7f4e 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -295,10 +295,18 @@ gc_advance(const struct vclock *vclock)
vclock_copy(&gc.vclock, vclock);
struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
- while (consumer != NULL &&
- vclock_sum(&consumer->vclock) < vclock_sum(vclock)) {
+ while (consumer != NULL) {
struct gc_consumer *next = gc_tree_next(&gc.consumers,
consumer);
+ /*
+ * Remove all the consumers whose vclocks are
+ * either less than or incomparable with the wal
+ * gc vclock.
+ */
+ if (vclock_compare_ignore0(vclock, &consumer->vclock) <= 0) {
+ consumer = next;
+ continue;
+ }
assert(!consumer->is_inactive);
consumer->is_inactive = true;
gc_tree_remove(&gc.consumers, consumer);
>
>> + }
>> +
>> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) {
>> + vclock_copy(&gc.vclock, &min_vclock);
>> run_wal_gc = true;
>> }
>> diff --git a/src/box/vclock.h b/src/box/vclock.h
>> index 5c0525b00..5865f7443 100644
>> --- a/src/box/vclock.h
>> +++ b/src/box/vclock.h
>> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b)
>> return vclock_compare_generic(a, b, true);
>> }
>>
>> +/**
>> + * Compare vclocks lexicographically.
>> + * The following affirmation holds: if a < b in terms of
>> + * vclock_compare, then a < b in terms of vclock_lex_compare.
>> + * However, a < b in terms of vclock_lex_compare doesn't mean
>> + * than a < b in terms of vclock_compare. Example:
>> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks
>> + * are incomparable in terms of vclock_compare.
>> + * All vclocks are comparable lexicographically.
>> + *
>> + * @param a Vclock.
>> + * @param b Vclock.
>> + * @retval 1 If @a a is lexicographically greater than @a b.
>> + * @retval -1 If @a a is lexicographically less than @a b.
>> + * @retval 0 If vclocks are equal.
>> + */
>> +static inline int
>> +vclock_lex_compare(const struct vclock *a, const struct vclock *b)
>
> 4. What are cases, when vclock_lex_compare() can't replace vclock_compare()?
replicaset_needs_rejoin() comes to mind. Forget about vclock[0] for now.
Let’s say we have 2 instances with id’s 2 and 3. (Instance with id 1 died, for
example). And 3 is restarting and defining whether it needs a rebootstrap from 2.
Say, 2’s vclock is {1:10, 2:15, 3:30} and 3’s normal vclock is {1:10, 2:10, 3:50}.
Even though 3’s vclock is lexicographically smaller than 2’s vclock
(2nd vclock[2] = 15, 3rd vclock[2] = 10), 3 cannot reboostrap from 2, because 3 has data,
that will be destroyed by rebootstrap {3:31} - {3:50}.
So only the usual vclock_compare() can be used here.
3 can rebootstrap from 2 iff 3’s vclock is less than 2’s vclock in EVERY component.
>
>> +{
>> + vclock_map_t map = a->map | b->map;
>> + struct bit_iterator it;
>> + bit_iterator_init(&it, &map, sizeof(map), true);
>> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX;
>> + replica_id = bit_iterator_next(&it)) {
>> + int64_t lsn_a = vclock_get(a, replica_id);
>> + int64_t lsn_b = vclock_get(b, replica_id);
>> + if (lsn_a < lsn_b)
>> + return -1;
>> + if (lsn_a > lsn_b)
>> + return 1;
>> + }
>> + return 0;
>> +}
--
Serge Petrenko
sergepetrenko at tarantool.org
More information about the Tarantool-patches
mailing list