[Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Fri Mar 19 02:04:09 MSK 2021


Hi! Thanks for the patch!

Generally it looks fine, except for a few details.


AFAIU, we go for the timeout option for the only reason that
there might be non-full-mesh topologies, when a replica is in
_cluster, but does not have relays to other nodes.

If we knew the topology, we could tell which nodes will relay
to whom, and would be able to detect when a replica needs
to keep the logs and when it doesn't. Not even the entire topology.
Just the info from the adjacent nodes: from our own box.cfg.replication,
and from box.cfg.replication of these nodes.

I still don't understand what was wrong with implementing some kind
of topology discovery for this 2-level topology tree. For instance,
when the applier on a replica is connected, the remote node sends us
a flag saying whether it is going to connect back. If the flag is false,
we don't keep the logs for that instance.

See 15 comments below.

> @TarantoolBot document
> Title: Add wal_cleanup_delay configuration parameter

1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
with the existing 'gc' name, which is already exposed in box.info.gc.
It is not called box.info.cleanup, so I would propose to use 'gc'
here too.

The other issue I described in the first patchset's discussion: it is
not really a delay, because it is simply ignored when the replication
feels like it. It must either have a 'replication' prefix to designate
that it is not just a fixed timeout to keep the WALs and that it depends
on the replication, or at least it must mention "max", to designate
that it is not an exact, strict timeout for keeping the logs.

> The `wal_cleanup_delay` option defines a delay in seconds
> before write ahead log files (`*.xlog`) are getting started
> to prune upon a node restart.
> 
> This option is isnored in case if a node is running as

2. isnored -> ignored.

> an anonymous replica (`replication_anon = true`). Similarly
> if replication is unused or there is no plans to use
> replication at all then this option should not be considered.
> 
> An initial problem to solve is the case where a node is operating
> so fast that its replicas do not manage to reach the node state
> and in case if the node is restarted at this moment (for various
> reasons, for example due to power outage) then unfetched `*.xlog`
> files might be pruned during restart. In result replicas will not
> find these files on the main node and have to reread all data back
> which is a very expensive procedure.
> 
> Since replicas are tracked via `_cluster` system space this we use
> its content to count subscribed replicas and when all of them are
> up and running the cleanup procedure is automatically enabled even
> if `wal_cleanup_delay` is not expired.
> 
> The `wal_cleanup_delay` should be set to:
> 
>  - `-1` to wait infinitely until all existing replicas are subscribed;
>  - `0` to disable the cleanup delay;
>  - `>0` to wait for specified number of seconds.
> 
> By default it is set to `14400` seconds (ie `4` hours).
> 
> In case if registered replica is lost forever and timeout is set to
> infinity then a preferred way to enable cleanup procedure is not setting
> up a small timeout value but rather to delete this replica from `_cluster`
> space manually.
> 
> Note that the option does *not* prevent WAL engine from removing
> old `*.xlog` files if there is no space left on a storage device,
> WAL engine can remove them in a force way.
> 
> Current state of `*.xlog` garbage collector can be found in
> `box.info.gc()` output. For example
> 
> ``` Lua
>  tarantool> box.info.gc()
>  ---
>    ...
>    cleanup_is_paused: false

3. 'gc.cleanup' is a tautology. If you want to expose the
flag, it better be simply 'is_paused' IMO.

>    delay_ref: 0

4. Can't the user get the same by looking at

	_cluster size - box.info.gc().consumers count

?
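
For example, something like this (an untested sketch) would show how
many registered replicas have not attached a consumer yet:

-- Count of registered replicas minus attached GC consumers,
-- computed from what box.info already exposes.
local pending = box.space._cluster:count() - #box.info.gc().consumers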

> ```
> 
> The `cleanup_is_paused` shows if cleanup fiber is paused or not,
> and `delay_ref` represents number of replicas to be subscribed
> if the cleanup fiber is in `pause` mode.
> ---
>  src/box/box.cc                  | 11 +++++
>  src/box/gc.c                    | 75 +++++++++++++++++++++++++++++++--
>  src/box/gc.h                    | 30 +++++++++++++
>  src/box/lua/info.c              |  8 ++++
>  src/box/lua/load_cfg.lua        |  3 ++
>  src/box/relay.cc                | 20 +++++++++
>  src/box/replication.cc          |  2 +
>  test/app-tap/init_script.result |  1 +
>  test/box/admin.result           |  2 +
>  test/box/cfg.result             |  4 ++
>  10 files changed, 153 insertions(+), 3 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index b4a1a5e07..ee017c2dc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
>  	return (enum wal_mode) mode;
>  }
>  
> +static void
> +box_check_wal_cleanup_delay(int tmo)
> +{
> +	if (tmo < 0 && tmo != -1) {
> +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> +			  "the value must be either >= 0 or -1");

5. As you could notice, we try not to use C++ exceptions in new
code. Let's use diag_set() instead. For example, like the new option
wal_max_queue_size does.

6. Can you move cfg_geti() into this function? I see no reason why
you need to pass the value as a parameter. The function is used in
a single place.
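
Something along these lines (just a sketch; the out-parameter
signature is my assumption, since -1 is a valid value and can not
double as an error code):

static int
box_check_wal_cleanup_delay(int *delay)
{
	/* Read the option right here - the only caller needs it. */
	*delay = cfg_geti("wal_cleanup_delay");
	if (*delay < 0 && *delay != -1) {
		diag_set(ClientError, ER_CFG, "wal_cleanup_delay",
			 "the value must be either >= 0 or -1");
		return -1;
	}
	return 0;
}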

> +	}
> +}
> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
>  		bootstrap(&instance_uuid, &replicaset_uuid,
>  			  &is_bootstrap_leader);
>  	}
> +	gc_delay_unref();

7. Why?

>  	fiber_gc();
>  
>  	bootstrap_journal_guard.is_active = false;
> diff --git a/src/box/gc.c b/src/box/gc.c
> index 9af4ef958..595b98bdf 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -107,6 +108,25 @@ gc_init(void)
>  	/* Don't delete any files until recovery is complete. */
>  	gc.min_checkpoint_count = INT_MAX;
>  
> +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");

8. Please, pass the option as an argument of gc_init(). Having the
global options scattered over the sources does not make it easier
to understand which subsystems depend on which options.

That would be consistent with not using cfg_get* functions almost
anywhere except a few big 'main' files, and one SQL file, which
is also wrong, but is a mess in many other ways as well.

> +	gc.delay_ref = 0;
> +
> +	if (cfg_geti("replication_anon") != 0) {

9. GC does not really need to know if the instance is anon. This
could be controlled both by the wal_cleanup_delay cfg and by the
caller of gc_init(): box_cfg_xc(). Otherwise you make GC depend on
replication details in the code, which is not great. We must keep
the dependencies as low as possible.

In box_cfg_xc() you can get both wal_cleanup_delay and replication_anon
and check whether anon is set; if it is, you call gc_init(0). If it is
false, you call gc_init(wal_cleanup_delay). GC only knows about some
delay obtained from its owner.
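
Roughly like this (a sketch, not tested):

	/*
	 * In box_cfg_xc(): GC gets just a delay value. Whether we
	 * are an anonymous replica is decided here, by the caller.
	 */
	int wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
	if (cfg_geti("replication_anon") != 0) {
		/* Anon replicas are read-only, no need to keep xlogs. */
		wal_cleanup_delay = 0;
	}
	gc_init(wal_cleanup_delay);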

> +		/*
> +		 * Anonymous replicas are read only
> +		 * so no need to keep XLOGs.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else if (gc.wal_cleanup_delay == 0) {
> +		/*
> +		 * The delay is disabled explicitly.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else {
> +		say_info("gc: wal/engine cleanup is paused");
> +		gc.cleanup_is_paused = true;
> +	}
> +
>  	vclock_create(&gc.vclock);
>  	rlist_create(&gc.checkpoints);
>  	gc_tree_new(&gc.consumers);
> @@ -238,6 +258,30 @@ static int
>  gc_cleanup_fiber_f(va_list ap)
>  {
>  	(void)ap;
> +
> +	/*
> +	 * Stage 1 (optional): in case if we're booting
> +	 * up with cleanup disabled lets do wait in a
> +	 * separate cycle to minimize branching on stage 2.
> +	 */
> +	if (gc.cleanup_is_paused) {
> +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;

10. Please, call it 'timeout'. 'tmo' looks super weird and is not
used anywhere else except in this patch.

11. What if I see that the timeout was too big, and I call
box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it via
restart. Can you support a dynamic update? It should be easy. You
need to store the GC deadline instead of the delay. Then, when the
timeout is changed, you can check whether the deadline is already
reached with the new value, and start GC earlier.

If I increase the timeout, GC could be paused again, or the change
could simply be ignored, or banned. But the decrease is more
important, I think.
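
A sketch of the deadline idea (all the names below are mine, not
from the patch, and the box_set_* hook is assumed):

/* Called on a dynamic box.cfg{wal_cleanup_delay = ...} update. */
void
gc_set_wal_cleanup_delay(double delay)
{
	if (!gc.cleanup_is_paused)
		return;
	/* The deadline is counted from the pause start, not "now". */
	gc.delay_deadline = delay < 0 ? TIMEOUT_INFINITY :
			    gc.pause_start + delay;
	if (fiber_clock() >= gc.delay_deadline) {
		gc.cleanup_is_paused = false;
		fiber_wakeup(gc.cleanup_fiber);
	}
}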

> +		while (!fiber_is_cancelled()) {
> +			if (fiber_yield_timeout(tmo)) {
> +				say_info("gc: wal/engine cleanup is resumed "
> +					 "due to timeout expiration");

12. Other GC logs don't have the 'gc:' prefix. I would propose to
follow that convention in the new logs too.

> +				gc.cleanup_is_paused = false;
> +				break;
> +			}
> +			if (!gc.cleanup_is_paused)
> +				break;
> +		}
> +	}
> +
> +	/*
> +	 * Stage 2: a regular cleanup cycle.
> +	 */
>  	while (!fiber_is_cancelled()) {
>  		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
>  		if (delta == 0) {
> @@ -253,6 +297,29 @@ gc_cleanup_fiber_f(va_list ap)
>  	return 0;
>  }
>  
> +void
> +gc_delay_ref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref >= 0);
> +		gc.delay_ref++;
> +	}
> +}
> +
> +void
> +gc_delay_unref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref > 0);
> +		gc.delay_ref--;

13. If it is not paused because GC started earlier due to the
timeout, the user will see a non-zero 'delay_ref' in box.info even
after all the replicas are connected.
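
One way to avoid that (a sketch based on the stage 1 loop above):
drop the remaining references together with unpausing when the
timeout fires.

		if (fiber_yield_timeout(timeout)) {
			say_info("wal/engine cleanup is resumed "
				 "due to timeout expiration");
			gc.cleanup_is_paused = false;
			gc.delay_ref = 0;
			break;
		}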

> +		if (gc.delay_ref == 0) {
> +			say_info("gc: wal/engine cleanup is resumed");
> +			gc.cleanup_is_paused = false;
> +			fiber_wakeup(gc.cleanup_fiber);
> +		}
> +	}
> +}
> +
>  /**
>   * Trigger asynchronous garbage collection.
>   */
> @@ -278,9 +345,11 @@ gc_schedule_cleanup(void)
>  static void
>  gc_wait_cleanup(void)
>  {
> -	int64_t scheduled = gc.cleanup_scheduled;
> -	while (gc.cleanup_completed < scheduled)
> -		fiber_cond_wait(&gc.cleanup_cond);
> +	if (!gc.cleanup_is_paused) {
> +		int64_t scheduled = gc.cleanup_scheduled;
> +		while (gc.cleanup_completed < scheduled)
> +			fiber_cond_wait(&gc.cleanup_cond);
> +	}

14. The function is called 'wait_cleanup', but it does not really
wait when GC is paused. It looks wrong.

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 41f949e8e..5d09690ad 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
>  	}
>  }
>  
> +static void
> +relay_gc_delay_unref(struct cmsg *msg)
> +{
> +	(void)msg;
> +	gc_delay_unref();

15. You don't need a separate callback, and you don't need to call it
from the relay thread. Relay already works with GC - it calls
gc_consumer_register() before starting the thread. You can do the
unref in the same place. Starting from the consumer registration
the logs are going to be kept anyway.

Looks like it would be simpler, if it works.
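
Roughly what I mean (a sketch only, I didn't check it against the
exact relay_subscribe() context):

	/*
	 * In relay_subscribe(), tx thread, next to the existing
	 * consumer registration. Once the consumer is registered
	 * the logs are kept anyway, so the delay reference can be
	 * dropped right here.
	 */
	if (replica->gc == NULL && !replica->anon) {
		replica->gc = gc_consumer_register(replica_clock,
						   "replica %s",
						   tt_uuid_str(&replica->uuid));
		if (replica->gc == NULL)
			diag_raise();
		gc_delay_unref();
	}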

> +}
> +
>  /**
>   * A libev callback invoked when a relay client socket is ready
>   * for read. This currently only happens when the client closes
> @@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
>  	 */
>  	relay_send_heartbeat(relay);
>  
> +	/*
> +	 * Now we can resume wal/engine gc as relay
> +	 * is up and running.
> +	 */
> +	if (!relay->replica->anon) {
> +		static const struct cmsg_hop gc_delay_route[] = {
> +			{relay_gc_delay_unref, NULL}
> +		};
> +		struct cmsg gc_delay_msg;
> +		cmsg_init(&gc_delay_msg, gc_delay_route);
> +		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
> +	}
> +
>  	/*
>  	 * Run the event loop until the connection is broken
>  	 * or an error occurs.

