[tarantool-patches] Re: [PATCH v3 2/2] replication: force gc to clean xdir on ENOSPC err

Konstantin Osipov kostja at tarantool.org
Thu Jun 28 18:25:30 MSK 2018


* Konstantin Belyavskiy <k.belyavskiy at tarantool.org> [18/06/28 18:13]:
> 
> +void
> +gc_xdir_clean_notify()
> +{
> +	/*
> +	 * Compare the current time with the time of the last run.
> +	 * This is needed in case of multiple failures to prevent
> +	 * from deleting all replicas.
> +	 */
> +	static int prev_time = 0;
> +	int cur_time = time(NULL);
> +	if (cur_time - prev_time < 1)

Please use fiber time.

> +	/**
> +	 * Mark consumer with least recent vclock as "dead" and
> +	 * invoke garbage collection. If nothing to delete find
> +	 * next alive consumer etc. Originally created for
> +	 * cases with running out of disk space because of
> +	 * disconnected replica.
> +	 */
> +	struct gc_consumer *leftmost =
> +	    gc_tree_first(&gc.consumers);
> +	/*
> +	 * Exit if no consumers left or if this consumer is
> +	 * not associated with replica (backup for example).
> +	 */

You should delete the consumer *only* if it's going to help with
garbage. Running out of space should not lead to deletion of all
replicas as consumers if they are not lagging behind.

> +	if (leftmost == NULL || leftmost->replica == NULL)
> +		return;
> +	int64_t signature = leftmost->signature;
> +	while (true) {
> +		gc_consumer_unregister(leftmost);
> +		leftmost = gc_tree_first(&gc.consumers);
> +		if (leftmost == NULL || leftmost->replica == NULL ||
> +		    leftmost->signature > signature) {
> +			gc_run();
> +			return;
> +		}
> +	}
> +}
> +

<cut>

>  const char *wal_mode_STRS[] = { "none", "write", "fsync", NULL };
> @@ -64,6 +65,8 @@ struct wal_thread {
>  	 * priority pipe and DOES NOT support yield.
>  	 */
>  	struct cpipe tx_prio_pipe;
> +	/** Return pipe from 'wal' to tx' */
> +	struct cpipe tx_pipe;

Why do you need a new pipe?

> 
> +static void
> +gc_status_update(struct cmsg *msg)
> +{
> +	gc_xdir_clean_notify();
> +	delete(msg);

delete()?!


>  {
> @@ -655,6 +665,15 @@ done:
>  		/* Until we can pass the error to tx, log it and clear. */
>  		error_log(error);
>  		diag_clear(diag_get());
> +		if (errno == ENOSPC) {
> +			struct cmsg *msg =
> +			    (struct cmsg*)calloc(1, sizeof(struct cmsg));

Please check calloc() return value and do nothing 
in case it's NULL.

> +			static const struct cmsg_hop route[] = {
> +				{gc_status_update, NULL}
> +			};
> +			cmsg_init(msg, route);
> +			cpipe_push(&wal_thread.tx_pipe, msg);
> +		}
> 
> index 895d938d5..11f1b7fdc 100644
> +...
> +-- add a little timeout so gc could finish job
> +require('fiber').sleep(0.01)

Please never sleep in a test case unconditionally. It doesn't work
reliably, especially in parallel runs.

The test case should test that replicas which are severely behind 
are abandoned, and replicas which are running well are not.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov




More information about the Tarantool-patches mailing list