From: Serge Petrenko <sergepetrenko@tarantool.org> To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> Cc: tml <tarantool-patches@dev.tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Date: Fri, 10 Apr 2020 15:25:16 +0300 [thread overview] Message-ID: <769253CE-47DB-4FEF-BDB2-046799B15682@tarantool.org> (raw) In-Reply-To: <6336a973-02da-99c9-4942-762f02c2cb6d@tarantool.org> > 10 апр. 2020 г., в 02:08, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> написал(а): > > Hi! Thanks for the patch! Hi! Thanks for your review! > > See 4 questions below. > > On 07/04/2020 17:49, Serge Petrenko wrote: >> The current WAL GC implementation tracks consumers (i.e. remote replicas) by >> their vclock signature, which is the sum of all vclock components. >> This approach is wrong, and this can be shown by a little example. >> The example will be a little synthetic, but it'll illustrate the problem. >> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a replica C >> with id 3. Say, С replicates from both A and B, and there is no replication >> between A and B (say, the instances were reconfigured to not replicate from each >> other). Now, say replica C has followed A and B to vclock {1:5, 2:13}. At the >> same time, A has lsn 10 and B has lsn 15. A and B do not know about each other’s >> changes, so A’s vclock is {1:10} and B’s vclock is {2:15}. Now imagine A does a >> snapshot and creates a new xlog with signature 10. A’s directory will look like: >> 00…000.xlog 00…010.snap 00….010.xlog >> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to update the >> corresponding GC consumer. Since signatures are used, GC consumer is assigned a >> signature = 13 + 5 = 18. This is greater than the signature of the last xlog on >> A (which is 10), so the previous xlog (00…00.xlog) can be deleted (at least A >> assumes it can be). Actually, replica still needs 00…00.xlog, because it >> contains rows corresponding to vclocks {1:6} - {1:10}, which haven’t been >> replicated yet. >> >> If instead of using vclock signatures, gc consumers used vclocks, such a problem >> wouldn’t arise. Replica would report its vclock {1:5, 2:13}. The vclock is NOT >> strictly greater than A’s most recent xlog vclock ({1:10}), so the previous log >> is kept until replica reports a vclock {1:10, 2:something} or {1:11, …} and so >> on. > > 1. > " > It looks like B will always delete its snaps immediately, because C will > send vclock lexicographically always bigger than B's. > " I think you meant xlogs. Snaps are deleted only when a new snap is created and total snap count exceeds `box.cfg.checkpoint_count` Anyway, your remark below is correct. > > I wrote the comment above and only then realized that vclock_min() is not really > min between 2 values. It is min for every of 32 components of vclock. So the > problem above won't happen, right? > > B's "min" vclock in gc consumers will be > > {0, min(B last snap, C last received row), 0, 0, ...} > > Correct? It won't see A's part of C's vclock. Yes, {1:0, 2:min(BB last snap, C last received row), 3:0, ….}. > >> Rewrite gc to perform cleanup based on finding minimal vclock >> components present in at least one of the consumer vclocks instead of >> just comparing vclock signatures. >> >> Prerequisite #4114 >> --- >> src/box/gc.c | 41 ++++++++++++++--------- >> src/box/vclock.h | 85 ++++++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 110 insertions(+), 16 deletions(-) >> >> diff --git a/src/box/gc.c b/src/box/gc.c >> index f5c387f9d..a2d0a515c 100644 >> --- a/src/box/gc.c >> +++ b/src/box/gc.c >> @@ -187,19 +186,29 @@ gc_run_cleanup(void) >> /* At least one checkpoint must always be available. */ >> assert(checkpoint != NULL); >> >> + /* Find the vclock of the oldest WAL row to keep. */ >> + struct vclock min_vclock; >> + struct gc_consumer *consumer = gc_tree_first(&gc.consumers); >> /* >> - * Find the vclock of the oldest WAL row to keep. >> + * Vclock of the oldest WAL row to keep is a by-component >> + * minimum of all consumer vclocks and the oldest >> + * checkpoint vclock. This ensures that all rows needed by >> + * at least one consumer are kept. >> * Note, we must keep all WALs created after the >> * oldest checkpoint, even if no consumer needs them. >> */ >> - const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL : >> - &gc_tree_first(&gc.consumers)->vclock); >> - if (vclock == NULL || >> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock)) >> - vclock = &checkpoint->vclock; >> - >> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) { >> - vclock_copy(&gc.vclock, vclock); >> + vclock_copy(&min_vclock, &checkpoint->vclock); >> + while (consumer != NULL) { >> + /* >> + * Consumers will never need rows signed >> + * with a zero instance id (local rows). >> + */ >> + vclock_min_ignore0(&min_vclock, &consumer->vclock); >> + consumer = gc_tree_next(&gc.consumers, consumer); > > 2. > " > Can you use gc_tree_last()? Without vclock_min_ignore0 and manual scan. > > Also since vclock[0] is never sent (mp_encode_vclock() now ignores it), > what are the consumers having vclock[0] != 0 here? If there are only local, > does it matter? They will be ordered always ok. > " consumers have the master instance’s vclock[0] component. It is just a leftover from recovery initialization, and I didn’t bother clearing it. So, the consumer, who is the last to report its status, will have the biggest vclock[0], if master writes to local spaces at all. No, gc_tree_last() (did you mean gc_tree_first() ? ) cannot be used here. It’ll return the consumer with the biggest (smallest) vclock[0]. Which is the last (first) one to report its status and not necessarily the most (least) up to date one. > > I wrote the comment above also before realized that vclock_min() is not a > normal min(). So it should be irrelevant, yes? Yep. > > > 3. Should not we patch gc_advance() too? It still uses sum(). Yes! Thanks for noticing. diff --git a/src/box/gc.c b/src/box/gc.c index a2d0a515c..36b2d7f4e 100644 --- a/src/box/gc.c +++ b/src/box/gc.c @@ -295,10 +295,18 @@ gc_advance(const struct vclock *vclock) vclock_copy(&gc.vclock, vclock); struct gc_consumer *consumer = gc_tree_first(&gc.consumers); - while (consumer != NULL && - vclock_sum(&consumer->vclock) < vclock_sum(vclock)) { + while (consumer != NULL) { struct gc_consumer *next = gc_tree_next(&gc.consumers, consumer); + /* + * Remove all the consumers whose vclocks are + * either less than or incomparable with the wal + * gc vclock. + */ + if (vclock_compare_ignore0(vclock, &consumer->vclock) <= 0) { + consumer = next; + continue; + } assert(!consumer->is_inactive); consumer->is_inactive = true; gc_tree_remove(&gc.consumers, consumer); > >> + } >> + >> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) { >> + vclock_copy(&gc.vclock, &min_vclock); >> run_wal_gc = true; >> } >> diff --git a/src/box/vclock.h b/src/box/vclock.h >> index 5c0525b00..5865f7443 100644 >> --- a/src/box/vclock.h >> +++ b/src/box/vclock.h >> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, const struct vclock *b) >> return vclock_compare_generic(a, b, true); >> } >> >> +/** >> + * Compare vclocks lexicographically. >> + * The following affirmation holds: if a < b in terms of >> + * vclock_compare, then a < b in terms of vclock_lex_compare. >> + * However, a < b in terms of vclock_lex_compare doesn't mean >> + * than a < b in terms of vclock_compare. Example: >> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks >> + * are incomparable in terms of vclock_compare. >> + * All vclocks are comparable lexicographically. >> + * >> + * @param a Vclock. >> + * @param b Vclock. >> + * @retval 1 If @a a is lexicographically greater than @a b. >> + * @retval -1 If @a a is lexicographically less than @a b. >> + * @retval 0 If vclocks are equal. >> + */ >> +static inline int >> +vclock_lex_compare(const struct vclock *a, const struct vclock *b) > > 4. What are cases, when vclock_lex_compare() can't replace vclock_compare()? replicaset_needs_rejoin() comes to mind. Forget about vclock[0] for now. Let’s say we have 2 instances with id’s 2 and 3. (Instance with id 1 died, for example). And 3 is restarting and defining whether it needs a rebootstrap from 2. Say, 2’s vclock is {1:10, 2:15, 3:30} and 3’s normal vclock is {1:10, 2:10, 3:50}. Even though 3’s vclock is lexicographically smaller than 2’s vclock (2nd vclock[2] = 15, 3rd vclock[2] = 10), 3 cannot reboostrap from 2, because 3 has data, that will be destroyed by rebootstrap {3:31} - {3:50}. So only the usual vclock_compare() can be used here. 3 can rebootstrap from 2 iff 3’s vclock is less than 2’s vclock in EVERY component. > >> +{ >> + vclock_map_t map = a->map | b->map; >> + struct bit_iterator it; >> + bit_iterator_init(&it, &map, sizeof(map), true); >> + for(size_t replica_id = bit_iterator_next(&it); replica_id < VCLOCK_MAX; >> + replica_id = bit_iterator_next(&it)) { >> + int64_t lsn_a = vclock_get(a, replica_id); >> + int64_t lsn_b = vclock_get(b, replica_id); >> + if (lsn_a < lsn_b) >> + return -1; >> + if (lsn_a > lsn_b) >> + return 1; >> + } >> + return 0; >> +} -- Serge Petrenko sergepetrenko@tarantool.org
next prev parent reply other threads:[~2020-04-10 12:25 UTC|newest] Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-04-07 15:49 [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Serge Petrenko 2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 1/3] replication: omit 0-th vclock component in replication responses Serge Petrenko 2020-04-09 23:08 ` Vladislav Shpilevoy 2020-04-10 12:25 ` Serge Petrenko 2020-04-11 16:05 ` Vladislav Shpilevoy 2020-04-13 10:12 ` Serge Petrenko 2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures Serge Petrenko 2020-04-09 23:08 ` Vladislav Shpilevoy 2020-04-10 12:25 ` Serge Petrenko [this message] 2020-04-11 16:04 ` Vladislav Shpilevoy 2020-04-13 10:12 ` Serge Petrenko 2020-04-07 15:49 ` [Tarantool-patches] [PATCH v6 3/3] box: start counting local space requests separately Serge Petrenko 2020-04-13 14:38 ` [Tarantool-patches] [PATCH v6 0/3] replication: fix local space tracking Vladislav Shpilevoy 2020-04-16 13:49 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=769253CE-47DB-4FEF-BDB2-046799B15682@tarantool.org \ --to=sergepetrenko@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox