From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp62.i.mail.ru (smtp62.i.mail.ru [217.69.128.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 3FD7D4696C3 for ; Fri, 10 Apr 2020 15:25:17 +0300 (MSK) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.80.23.2.2\)) From: Serge Petrenko In-Reply-To: <6336a973-02da-99c9-4942-762f02c2cb6d@tarantool.org> Date: Fri, 10 Apr 2020 15:25:16 +0300 Content-Transfer-Encoding: quoted-printable Message-Id: <769253CE-47DB-4FEF-BDB2-046799B15682@tarantool.org> References: <47c0ec5d7a4de7d47351e31211c6eb3a3491f844.1586273440.git.sergepetrenko@tarantool.org> <6336a973-02da-99c9-4942-762f02c2cb6d@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH v6 2/3] gc: rely on minimal vclock components instead of signatures List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tml > 10 =D0=B0=D0=BF=D1=80. 2020 =D0=B3., =D0=B2 02:08, Vladislav Shpilevoy = =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB(=D0= =B0): >=20 > Hi! Thanks for the patch! Hi! Thanks for your review! >=20 > See 4 questions below. >=20 > On 07/04/2020 17:49, Serge Petrenko wrote: >> The current WAL GC implementation tracks consumers (i.e. remote = replicas) by >> their vclock signature, which is the sum of all vclock components. >> This approach is wrong, and this can be shown by a little example. >> The example will be a little synthetic, but it'll illustrate the = problem. >> Say, you have 2 masters, A and B with ids 1 and 2 respectively, and a = replica C >> with id 3. Say, =D0=A1 replicates from both A and B, and there is no = replication >> between A and B (say, the instances were reconfigured to not = replicate from each >> other). Now, say replica C has followed A and B to vclock {1:5, = 2:13}. At the >> same time, A has lsn 10 and B has lsn 15. A and B do not know about = each other=E2=80=99s >> changes, so A=E2=80=99s vclock is {1:10} and B=E2=80=99s vclock is = {2:15}. Now imagine A does a >> snapshot and creates a new xlog with signature 10. A=E2=80=99s = directory will look like: >> 00=E2=80=A6000.xlog 00=E2=80=A6010.snap 00=E2=80=A6.010.xlog >> Replica C reports its vclock {1:5, 2:13} to A, A uses the vclock to = update the >> corresponding GC consumer. Since signatures are used, GC consumer is = assigned a >> signature =3D 13 + 5 =3D 18. This is greater than the signature of = the last xlog on >> A (which is 10), so the previous xlog (00=E2=80=A600.xlog) can be = deleted (at least A >> assumes it can be). Actually, replica still needs 00=E2=80=A600.xlog, = because it >> contains rows corresponding to vclocks {1:6} - {1:10}, which = haven=E2=80=99t been >> replicated yet. >>=20 >> If instead of using vclock signatures, gc consumers used vclocks, = such a problem >> wouldn=E2=80=99t arise. Replica would report its vclock {1:5, 2:13}. = The vclock is NOT >> strictly greater than A=E2=80=99s most recent xlog vclock ({1:10}), = so the previous log >> is kept until replica reports a vclock {1:10, 2:something} or {1:11, = =E2=80=A6} and so >> on. >=20 > 1. > " > It looks like B will always delete its snaps immediately, because C = will > send vclock lexicographically always bigger than B's. > " I think you meant xlogs. Snaps are deleted only when a new snap is = created and total snap count exceeds `box.cfg.checkpoint_count` Anyway, your remark below is correct. >=20 > I wrote the comment above and only then realized that vclock_min() is = not really > min between 2 values. It is min for every of 32 components of vclock. = So the > problem above won't happen, right? >=20 > B's "min" vclock in gc consumers will be >=20 > {0, min(B last snap, C last received row), 0, 0, ...} >=20 > Correct? It won't see A's part of C's vclock. Yes, {1:0, 2:min(BB last snap, C last received row), 3:0, =E2=80=A6.}. >=20 >> Rewrite gc to perform cleanup based on finding minimal vclock >> components present in at least one of the consumer vclocks instead of >> just comparing vclock signatures. >>=20 >> Prerequisite #4114 >> --- >> src/box/gc.c | 41 ++++++++++++++--------- >> src/box/vclock.h | 85 = ++++++++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 110 insertions(+), 16 deletions(-) >>=20 >> diff --git a/src/box/gc.c b/src/box/gc.c >> index f5c387f9d..a2d0a515c 100644 >> --- a/src/box/gc.c >> +++ b/src/box/gc.c >> @@ -187,19 +186,29 @@ gc_run_cleanup(void) >> /* At least one checkpoint must always be available. */ >> assert(checkpoint !=3D NULL); >>=20 >> + /* Find the vclock of the oldest WAL row to keep. */ >> + struct vclock min_vclock; >> + struct gc_consumer *consumer =3D gc_tree_first(&gc.consumers); >> /* >> - * Find the vclock of the oldest WAL row to keep. >> + * Vclock of the oldest WAL row to keep is a by-component >> + * minimum of all consumer vclocks and the oldest >> + * checkpoint vclock. This ensures that all rows needed by >> + * at least one consumer are kept. >> * Note, we must keep all WALs created after the >> * oldest checkpoint, even if no consumer needs them. >> */ >> - const struct vclock *vclock =3D (gc_tree_empty(&gc.consumers) ? = NULL : >> - = &gc_tree_first(&gc.consumers)->vclock); >> - if (vclock =3D=3D NULL || >> - vclock_sum(vclock) > vclock_sum(&checkpoint->vclock)) >> - vclock =3D &checkpoint->vclock; >> - >> - if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) { >> - vclock_copy(&gc.vclock, vclock); >> + vclock_copy(&min_vclock, &checkpoint->vclock); >> + while (consumer !=3D NULL) { >> + /* >> + * Consumers will never need rows signed >> + * with a zero instance id (local rows). >> + */ >> + vclock_min_ignore0(&min_vclock, &consumer->vclock); >> + consumer =3D gc_tree_next(&gc.consumers, consumer); >=20 > 2. > " > Can you use gc_tree_last()? Without vclock_min_ignore0 and manual = scan. >=20 > Also since vclock[0] is never sent (mp_encode_vclock() now ignores = it), > what are the consumers having vclock[0] !=3D 0 here? If there are = only local, > does it matter? They will be ordered always ok. > " consumers have the master instance=E2=80=99s vclock[0] component. It is = just a leftover from recovery initialization, and I didn=E2=80=99t bother clearing it. So, the consumer, who is the last to report its status, will have the = biggest vclock[0], if master writes to local spaces at all. No, gc_tree_last() (did you mean gc_tree_first() ? ) cannot be used = here. It=E2=80=99ll return the consumer with the biggest (smallest) vclock[0]. Which is the last (first) one to report its status and not necessarily = the most (least) up to date one. >=20 > I wrote the comment above also before realized that vclock_min() is = not a > normal min(). So it should be irrelevant, yes? Yep. >=20 >=20 > 3. Should not we patch gc_advance() too? It still uses sum(). Yes! Thanks for noticing. diff --git a/src/box/gc.c b/src/box/gc.c index a2d0a515c..36b2d7f4e 100644 --- a/src/box/gc.c +++ b/src/box/gc.c @@ -295,10 +295,18 @@ gc_advance(const struct vclock *vclock) vclock_copy(&gc.vclock, vclock); =20 struct gc_consumer *consumer =3D gc_tree_first(&gc.consumers); - while (consumer !=3D NULL && - vclock_sum(&consumer->vclock) < vclock_sum(vclock)) { + while (consumer !=3D NULL) { struct gc_consumer *next =3D gc_tree_next(&gc.consumers, consumer); + /* + * Remove all the consumers whose vclocks are + * either less than or incomparable with the wal + * gc vclock. + */ + if (vclock_compare_ignore0(vclock, &consumer->vclock) <=3D= 0) { + consumer =3D next; + continue; + } assert(!consumer->is_inactive); consumer->is_inactive =3D true; gc_tree_remove(&gc.consumers, consumer); >=20 >> + } >> + >> + if (vclock_sum(&min_vclock) > vclock_sum(&gc.vclock)) { >> + vclock_copy(&gc.vclock, &min_vclock); >> run_wal_gc =3D true; >> } >> diff --git a/src/box/vclock.h b/src/box/vclock.h >> index 5c0525b00..5865f7443 100644 >> --- a/src/box/vclock.h >> +++ b/src/box/vclock.h >> @@ -330,6 +355,66 @@ vclock_compare_ignore0(const struct vclock *a, = const struct vclock *b) >> return vclock_compare_generic(a, b, true); >> } >>=20 >> +/** >> + * Compare vclocks lexicographically. >> + * The following affirmation holds: if a < b in terms of >> + * vclock_compare, then a < b in terms of vclock_lex_compare. >> + * However, a < b in terms of vclock_lex_compare doesn't mean >> + * than a < b in terms of vclock_compare. Example: >> + * {1:5, 2:10} < {1:7, 2:3} (vclock_lex_compare), but the vclocks >> + * are incomparable in terms of vclock_compare. >> + * All vclocks are comparable lexicographically. >> + * >> + * @param a Vclock. >> + * @param b Vclock. >> + * @retval 1 If @a a is lexicographically greater than @a b. >> + * @retval -1 If @a a is lexicographically less than @a b. >> + * @retval 0 If vclocks are equal. >> + */ >> +static inline int >> +vclock_lex_compare(const struct vclock *a, const struct vclock *b) >=20 > 4. What are cases, when vclock_lex_compare() can't replace = vclock_compare()? replicaset_needs_rejoin() comes to mind. Forget about vclock[0] for now. Let=E2=80=99s say we have 2 instances with id=E2=80=99s 2 and 3. = (Instance with id 1 died, for example). And 3 is restarting and defining whether it needs a = rebootstrap from 2. Say, 2=E2=80=99s vclock is {1:10, 2:15, 3:30} and 3=E2=80=99s normal = vclock is {1:10, 2:10, 3:50}. Even though 3=E2=80=99s vclock is lexicographically smaller than 2=E2=80=99= s vclock (2nd vclock[2] =3D 15, 3rd vclock[2] =3D 10), 3 cannot reboostrap from = 2, because 3 has data, that will be destroyed by rebootstrap {3:31} - {3:50}. So only the usual vclock_compare() can be used here. 3 can rebootstrap from 2 iff 3=E2=80=99s vclock is less than 2=E2=80=99s = vclock in EVERY component. >=20 >> +{ >> + vclock_map_t map =3D a->map | b->map; >> + struct bit_iterator it; >> + bit_iterator_init(&it, &map, sizeof(map), true); >> + for(size_t replica_id =3D bit_iterator_next(&it); replica_id < = VCLOCK_MAX; >> + replica_id =3D bit_iterator_next(&it)) { >> + int64_t lsn_a =3D vclock_get(a, replica_id); >> + int64_t lsn_b =3D vclock_get(b, replica_id); >> + if (lsn_a < lsn_b) >> + return -1; >> + if (lsn_a > lsn_b) >> + return 1; >> + } >> + return 0; >> +} -- Serge Petrenko sergepetrenko@tarantool.org