[Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed

Cyrill Gorcunov gorcunov at gmail.com
Fri Mar 19 14:03:52 MSK 2021


On Fri, Mar 19, 2021 at 12:04:09AM +0100, Vladislav Shpilevoy wrote:
> 
> AFAIU, we go for the timeout option for the only reason that
> there might be non-fullmesh topologies, where a replica is in
> _cluster, but does not have relays to other nodes.

Exactly.

> If we knew the topology, we could tell which nodes will
> relay to whom, and would be able to detect when a replica needs
> to keep the logs and when it doesn't. Not even the entire topology.
> Just the info from the adjacent nodes. From our own box.cfg.replication,
> and from box.cfg.replication of these nodes.
> 
> I still don't understand what was wrong with implementing some kind
> of topology discovery for this 2-level topology tree. For instance
> when applier on a replica is connected, the remote node sends us a
> flag whether it is going to connect back. If the flag is false, we don't
> keep the logs for that instance.

There is nothing wrong with it, and I think we should do it. Could you
please elaborate on the details? You mean to extend the applier protocol
data, so that it would send not only the vclock but also a flag telling
whether it is going to set up a relay?
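
If that is the direction, here is a very rough sketch of what I have
in mind on the replica side: during the applier handshake the remote
node would report whether it is itself configured to replicate from
us, and we would drop the delay reference for peers which will never
connect back. The flag, the helper name and the call site below are
all hypothetical, just to show the idea.

``` C
static void
gc_delay_account_peer(bool peer_will_connect_back)
{
	/*
	 * A peer which never opens a relay to us will never
	 * register a gc consumer either, so there is no point
	 * in pinning our xlogs on its behalf.
	 */
	if (!peer_will_connect_back)
		gc_delay_unref();
}
```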

> See 13 comments below.
> 
> > @TarantoolBot document
> > Title: Add wal_cleanup_delay configuration parameter
> 
> 1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
> with the existing 'gc' name, which is already exposed in box.info.gc.
> It is not called box.info.cleanup, so I would propose to use 'gc'
> here too.
> 
> Another issue I described in the first patchset's discussion. It is
> not really a delay. Because it is simply ignored when the replication
> feels like it. It must either have 'replication' prefix to designate
> it is not just a fixated timeout to keep the WALs and it depends on
> the replication, or at least it must mention "max". To designate it
> is not the exact strict timeout for keeping the logs.

Vlad, personally I don't mind naming this option whatever you like;
just give me a name and I'll use it.

> > The `wal_cleanup_delay` option defines a delay in seconds
> > before write ahead log files (`*.xlog`) start getting pruned
> > upon a node restart.
> > 
> > This option is isnored in case if a node is running as
> 
> 2. isnored -> ignored.

Ouch, I happened to miss this even though I was running aspell. Thanks!

> > 
> > The current state of the `*.xlog` garbage collector can be found
> > in the `box.info.gc()` output. For example:
> > 
> > ``` Lua
> >  tarantool> box.info.gc()
> >  ---
> >    ...
> >    cleanup_is_paused: false
> 
> 3. 'gc.cleanup' is a tautology. If you want to expose the
> flag, it better be simply 'is_paused' IMO.

OK

> >    delay_ref: 0
> 
> 4. Can't the user get the same by looking at
> 
> 	_cluster size - box.info.gc().consumers count
> 
> ?

I would prefer to keep this as a separate output
simply because I need an internal variable for the gc
code itself, and reading it is way more convenient
than fetching the _cluster size and the consumers count
and doing the math afterwards. Moreover, I need a way
to find out the _internal_ counter value for debugging purposes.

> > +static void
> > +box_check_wal_cleanup_delay(int tmo)
> > +{
> > +	if (tmo < 0 && tmo != -1) {
> > +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> > +			  "the value must be either >= 0 or -1");
> 
> 3. As you could notice, we try not to use C++ exceptions in the new
> code. Let's use diag_set(). For example, like the new option
> wal_max_queue_size does.

OK

> 
> 4. Can you move cfg_geti() into this function? I see no reason why
> you need to pass it as a parameter. The function is used in
> a single place.

Sure
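
Something along these lines, if I got both points right: diag_set()
instead of tnt_raise(), and cfg_geti() moved inside. The function name
is from the patch; the 0/-1 return convention is just how I would
sketch it now and may change in the final version.

``` C
static int
box_check_wal_cleanup_delay(void)
{
	int delay = cfg_geti("wal_cleanup_delay");
	if (delay < 0 && delay != -1) {
		diag_set(ClientError, ER_CFG, "wal_cleanup_delay",
			 "the value must be either >= 0 or -1");
		return -1;
	}
	return 0;
}
```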

> > +	}
> > +}
> > @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
> >  		bootstrap(&instance_uuid, &replicaset_uuid,
> >  			  &is_bootstrap_leader);
> >  	}
> > +	gc_delay_unref();
> 
> 5. Why?

This is for the case where you don't have replicas at all: you need
to unref yourself from the counting so that the counter shrinks to
zero and gc gets enabled.
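
To illustrate what "shrinks to zero" means here, this is roughly how
the unref path looks. The field names are from the patch; the exact
way the cleanup fiber is resumed (via gc.cleanup_fiber and
fiber_wakeup()) is shown just for clarity.

``` C
void
gc_delay_unref(void)
{
	if (gc.cleanup_is_paused) {
		assert(gc.delay_ref > 0);
		gc.delay_ref--;
		/*
		 * The instance itself holds one reference during
		 * bootstrap; with no replicas it is the last one,
		 * so dropping it resumes the cleanup right away.
		 */
		if (gc.delay_ref == 0) {
			gc.cleanup_is_paused = false;
			fiber_wakeup(gc.cleanup_fiber);
		}
	}
}
```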

> 
> >  	fiber_gc();
> >  
> >  	bootstrap_journal_guard.is_active = false;
> > diff --git a/src/box/gc.c b/src/box/gc.c
> > index 9af4ef958..595b98bdf 100644
> > --- a/src/box/gc.c
> > +++ b/src/box/gc.c
> > @@ -107,6 +108,25 @@ gc_init(void)
> >  	/* Don't delete any files until recovery is complete. */
> >  	gc.min_checkpoint_count = INT_MAX;
> >  
> > +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
> 
> 6. Please, pass the options as arguments of gc_init(). Having the
> global options disseminated over the sources does not simplify
> understanding of which subsystems depend on which options.
> 
> That would be consistent with not using cfg_get* functions almost
> anywhere except a few big 'main' files, and one sql file, which
> is also wrong, but is a mess in many other ways as well.

OK

> 
> > +	gc.delay_ref = 0;
> > +
> > +	if (cfg_geti("replication_anon") != 0) {
> 
> 7. GC does not really need to know if the instance is anon. This
> could be both controlled by wal_cleanup_delay cfg and the caller
> of gc_init(): box_cfg_xc(). Otherwise you made GC depend on
> replication details in the code, which is not great. We must keep
> the dependencies as low as possible.
> 
> In box_cfg_xc you can get both wal_cleanup_delay and replication_anon
> and check if the anon is set, then you call gc_init(0). If it is
> false, you call gc_init(wal_cleanup_delay). GC only knows about some
> delay obtained from its owner.

OK
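
So on the caller side it would be something like the following
fragment inside box_cfg_xc(), assuming gc_init() is changed to take
the delay as an argument per comment 6 (the exact reading of the
options is tentative):

``` C
	/*
	 * GC only knows about some delay obtained from its owner:
	 * anonymous replicas never register in _cluster, so there
	 * is nobody to wait for and the delay is simply zeroed.
	 */
	int wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
	if (cfg_geti("replication_anon") != 0)
		wal_cleanup_delay = 0;
	gc_init(wal_cleanup_delay);
```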

> > +
> > +	/*
> > +	 * Stage 1 (optional): in case if we're booting
> > +	 * up with cleanup disabled lets do wait in a
> > +	 * separate cycle to minimize branching on stage 2.
> > +	 */
> > +	if (gc.cleanup_is_paused) {
> > +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> > +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;
> 
> 8. Please, call it 'timeout'. 'tmo' looks super weird, and is never
> used anywhere except this patch.

TMO is a well-known abbreviation in the network stack, but sure, I will rename it.

> 
> 9. What if I see the timeout was too big, and I call
> box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it on
> restart. Can you support a dynamic update? It should be easy. You
> need to store gc deadline instead of the delay. So when you change
> the timeout, you can see if with the new timeout the deadline is
> already reached, and then can start GC earlier.

Thanks, will make it dynamic.
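
Roughly the deadline-based idea from your comment; everything below
(the setter name, the gc.delay_since field remembering when the pause
started, the way the fiber is woken up) is tentative:

``` C
void
gc_set_wal_cleanup_delay(double delay)
{
	gc.wal_cleanup_delay = delay;
	if (!gc.cleanup_is_paused)
		return;
	/*
	 * With a smaller delay the deadline may be already in the
	 * past, in which case the cleanup is resumed immediately.
	 */
	double deadline = gc.delay_since + delay;
	if (delay != -1 && fiber_clock() >= deadline) {
		gc.cleanup_is_paused = false;
		fiber_wakeup(gc.cleanup_fiber);
	}
}
```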

> If I increase the timeout, GC could be paused again, or simply
> ignored, or banned. But the decrease is more important, I think.
> 
> > +		while (!fiber_is_cancelled()) {
> > +			if (fiber_yield_timeout(tmo)) {
> > +				say_info("gc: wal/engine cleanup is resumed "
> > +					 "due to timeout expiration");
> 
> 10. Other gc logs don't have 'gc' prefix. I would propose to follow
> it in the new logs too.

This makes grepping the logs easier, but sure, I will drop the prefix.

> > +void
> > +gc_delay_unref(void)
> > +{
> > +	if (gc.cleanup_is_paused) {
> > +		assert(gc.delay_ref > 0);
> > +		gc.delay_ref--;
> 
> 11. If it is not paused and GC started earlier due to a
> timeout, the user will see 'delay_ref' in box.info even
> after all the replicas are connected.

Yes, and this is going to be a sign that we exited due to the
timeout. Such output should not be treated as an error, but if
you prefer I can zap the counter for this case.

> > +	if (!gc.cleanup_is_paused) {
> > +		int64_t scheduled = gc.cleanup_scheduled;
> > +		while (gc.cleanup_completed < scheduled)
> > +			fiber_cond_wait(&gc.cleanup_cond);
> > +	}
> 
> 12. The function is called 'wait_cleanup', but it does not really
> wait in case GC is paused. It looks wrong.

Hard to say. The paused state is rather an internal state, and
when someone is waiting for the cleanup to complete while cleanup
is turned off, I consider it pretty natural to exit
immediately. And I moved this "if" into the function itself
simply because we may use this helper in the future, and we should
not get stuck forever in "waiting" if cleanup is disabled.

I can move this "if" to the caller side, though, if you prefer.

> >  
> > +static void
> > +relay_gc_delay_unref(struct cmsg *msg)
> > +{
> > +	(void)msg;
> > +	gc_delay_unref();
> 
> 13. You don't need a separate callback, and don't need to call it
> from the relay thread. Relay already works with GC - it calls
> gc_consumer_register() before starting the thread. You can do the
> unref in the same place. Starting from the consumer registration
> the logs are going to be kept anyway.
> 
> Looks like it would be simpler if it works.

Actually, initially I did it this way, right before creating
the relay thread. But I think this is not good, and here is why:
when the relay is starting, it may fail in a number of places
(the thread itself is not created; the thread is created but
then fiber creation fails with an exception), and I think we
should decrement the reference only when we are pretty sure
there won't be new errors inside the relay cycle.

What would happen if, say, the relay fiber triggers an
error? iproto will write an error and, as far as I understand,
the replica will try to reconnect. Thus we should keep the
logs until the relay is subscribed for sure.

	Cyrill

