From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from smtp55.i.mail.ru (smtp55.i.mail.ru [217.69.128.35])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 65318469710
 for ; Wed, 6 May 2020 18:38:44 +0300 (MSK)
Date: Wed, 6 May 2020 18:38:43 +0300
From: Sergey Ostanevich 
Message-ID: <20200506153843.GG112@tarantool.org>
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
Subject: Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
List-Id: Tarantool development patches 
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
To: Vladislav Shpilevoy 
Cc: tarantool-patches@dev.tarantool.org

Hi! Thank you for the patch! See below for some nits.

Thanks,
Sergos

On 01 May 00:50, Vladislav Shpilevoy wrote:
> From: Georgy Kirichenko 
> 
> Here is a summary on how and when rollback works in WAL.
> 
> Rollback happens, when disk write fails. In that case the failed
  ^^^ Disk write failure can cause rollback. Is it better?
> and all next transactions, sent to WAL, should be rolled back.
> Together. Following transactions should be rolled back too,
> because they could make their statements based on what they saw in
> the failed transaction. Also rollback of the failed transaction
> without rollback of the next ones can actually rewrite what they
> committed.
> 
> So when rollback is started, *all* pending transactions should be
> rolled back. However if they would keep coming, the rollback would
> be infinite.

Not quite - you start rolling back txn4...txn1 (in reverse order) and at
some moment txn5 appears. It would ruin the consistency of the data,
just as you mentioned before - it could read a transaction that is yet
to be rolled back, or write data that will be affected by the next
rollback.
> This means to complete a rollback it is necessary to
> stop sending new transactions to WAL, then rollback all already
> sent. In the end allow new transactions again.
> 
> Step-by-step:
> 
> 1) stop accepting all new transactions in WAL thread, where
> rollback is started. All new transactions don't even try to go to
> disk. They added to rollback queue immediately after arriving to
> WAL thread.
> 
> 2) tell TX thread to stop sending new transactions to WAL. So as
> the rollback queue would stop growing.
> 
> 3) rollback all transactions in reverse order.
> 
> 4) allow transactions again in WAL thread and TX thread.
> 
> The algorithm is long, but simple and understandable. However
> implementation wasn't so easy. It was done using a 4-hop cbus
> route. 2 hops of which were supposed to clear cbus channel from
> all other cbus messages. Next two hops implemented steps 3 and 4.
> Rollback state of the WAL was signaled by checking internals of a
> preallocated cbus message.
> 
> The patch makes it simpler and more straightforward. Rollback
> state is now signaled by a simple flag, and there is no a hack
> about clearing cbus channel, no touching attributes of a cbus
> message. The moment when all transactions are stopped and the last
> one has returned from WAL is visible explicitly, because the last
> sent to WAL journal entry is saved.
> 
> Also there is now a single route for commit and rollback cbus
       ^^^ move it
> messages, called tx_complete_batch(). This change will come in
       ^^^ here
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
> Issue: https://github.com/tarantool/tarantool/issues/4842
> 
> During working on 4842 I managed to extract this patch from
> Georgy's branch and make it not depending on anything else. This
> is supposed to make some things in WAL simpler before they will
> get more complex because of sync replication.
> 
>  src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 95 insertions(+), 83 deletions(-)
> 
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 1eb20272c..b979244e3 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -97,6 +97,13 @@ struct wal_writer
>  	struct cpipe wal_pipe;
>  	/** A memory pool for messages. */
>  	struct mempool msg_pool;
> +	/**
> +	 * A last journal entry submitted to write. This is a
> +	 * 'rollback border'. When rollback starts, all
> +	 * transactions keep being rolled back until this one is
> +	 * rolled back too.
> +	 */
> +	struct journal_entry *last_entry;
>  	/* ----------------- wal ------------------- */
>  	/** A setting from instance configuration - wal_max_size */
>  	int64_t wal_max_size;
> @@ -153,7 +160,7 @@ struct wal_writer
>  	 * keep adding all incoming requests to the rollback
>  	 * queue, until the tx thread has recovered.
>  	 */
> -	struct cmsg in_rollback;
> +	bool is_in_rollback;
>  	/**
>  	 * WAL watchers, i.e. threads that should be alerted
>  	 * whenever there are new records appended to the journal.
> @@ -198,11 +205,11 @@ static void
>  wal_write_to_disk(struct cmsg *msg);
>  
>  static void
> -tx_schedule_commit(struct cmsg *msg);
> +tx_complete_batch(struct cmsg *msg);
>  
>  static struct cmsg_hop wal_request_route[] = {
>  	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
> -	{tx_schedule_commit, NULL},
> +	{tx_complete_batch, NULL},
>  };
>  
>  static void
> @@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
>  		journal_async_complete(&writer->base, req);
>  }
>  
> +/**
> + * Rollback happens, when disk write fails. In that case all next
> + * transactions, sent to WAL, also should be rolled back. Because
> + * they could make their statements based on what they saw in the
> + * failed transaction. Also rollback of the failed transaction
> + * without rollback of the next ones can actually rewrite what
> + * they committed.
> + * So when rollback is started, *all* pending transactions should
> + * be rolled back. However if they would keep coming, the rollback
> + * would be infinite. This means to complete a rollback it is
> + * necessary to stop sending new transactions to WAL, then
> + * rollback all already sent. In the end allow new transactions
> + * again.
> + *
> + * First step is stop accepting all new transactions. For that WAL
> + * thread sets a global flag. No rocket science here. All new
> + * transactions, if see the flag set, are added to the rollback
> + * queue immediately.
> + *
> + * Second step - tell TX thread to stop sending new transactions
> + * to WAL. So as the rollback queue would stop growing.
> + *
> + * Third step - rollback all transactions in reverse order.
> + *
> + * Fourth step - allow transactions again. Unset the global flag
> + * in WAL thread.
> + */
> +static inline void
> +wal_begin_rollback(void)
> +{
> +	/* Signal WAL-thread stop accepting new transactions. */
> +	wal_writer_singleton.is_in_rollback = true;
> +}
> +
> +static void
> +wal_complete_rollback(struct cmsg *base)
> +{
> +	(void) base;
> +	/* WAL-thread can try writing transactions again. */
> +	wal_writer_singleton.is_in_rollback = false;
> +}
> +
> +static void
> +tx_complete_rollback(void)
> +{
> +	struct wal_writer *writer = &wal_writer_singleton;
> +	/*
> +	 * Despite records are sent in batches, the last entry to
> +	 * commit can't be in the middle of a batch. After all
> +	 * transactions to rollback are collected, the last entry
> +	 * will be exactly, well, the last entry.
> +	 */
> +	if (stailq_last_entry(&writer->rollback, struct journal_entry,
> +			      fifo) != writer->last_entry)
> +		return;

I didn't get it: can there be a batch whose last entry is not the final
one? You prematurely quit the rollback - is there a guarantee you'll
appear here again?
> +	stailq_reverse(&writer->rollback);
> +	tx_schedule_queue(&writer->rollback);
> +	/* TX-thread can try sending transactions to WAL again. */
> +	stailq_create(&writer->rollback);
> +	static struct cmsg_hop route[] = {
> +		{wal_complete_rollback, NULL}
> +	};
> +	static struct cmsg msg;
> +	cmsg_init(&msg, route);
> +	cpipe_push(&writer->wal_pipe, &msg);
> +}
> +
>  /**
>   * Complete execution of a batch of WAL write requests:
>   * schedule all committed requests, and, should there
>   * be any requests to be rolled back, append them to
> - * the rollback queue.
> + * the rollback queue. In case this is a rollback and the batch
> + * contains the last transaction to rollback, the rollback is
> + * performed and normal processing is allowed again.
>   */
>  static void
> -tx_schedule_commit(struct cmsg *msg)
> +tx_complete_batch(struct cmsg *msg)
>  {
>  	struct wal_writer *writer = &wal_writer_singleton;
>  	struct wal_msg *batch = (struct wal_msg *) msg;
> @@ -282,8 +358,8 @@ tx_schedule_commit(struct cmsg *msg)
>  	 * iteration of tx_schedule_queue loop.
>  	 */
>  	if (! stailq_empty(&batch->rollback)) {
> -		/* Closes the input valve. */
>  		stailq_concat(&writer->rollback, &batch->rollback);
> +		tx_complete_rollback();
>  	}
>  	/* Update the tx vclock to the latest written by wal. */
>  	vclock_copy(&replicaset.vclock, &batch->vclock);
> @@ -291,28 +367,6 @@ tx_schedule_commit(struct cmsg *msg)
>  	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
>  }
>  
> -static void
> -tx_schedule_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	/*
> -	 * Perform a cascading abort of all transactions which
> -	 * depend on the transaction which failed to get written
> -	 * to the write ahead log. Abort transactions
> -	 * in reverse order, performing a playback of the
> -	 * in-memory database state.
> -	 */
> -	stailq_reverse(&writer->rollback);
> -	/* Must not yield. */
> -	tx_schedule_queue(&writer->rollback);
> -	stailq_create(&writer->rollback);
> -	if (msg != &writer->in_rollback)
> -		mempool_free(&writer->msg_pool,
> -			     container_of(msg, struct wal_msg, base));
> -}
> -
> -
>  /**
>   * This message is sent from WAL to TX when the WAL thread hits
>   * ENOSPC and has to delete some backup WAL files to continue.
> @@ -374,7 +428,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
>  		writer->wal_dir.open_wflags |= O_SYNC;
>  
>  	stailq_create(&writer->rollback);
> -	cmsg_init(&writer->in_rollback, NULL);
> +	writer->is_in_rollback = false;
>  
>  	writer->checkpoint_wal_size = 0;
>  	writer->checkpoint_threshold = INT64_MAX;
> @@ -543,7 +597,7 @@ wal_sync_f(struct cbus_call_msg *data)
>  {
>  	struct wal_vclock_msg *msg = (struct wal_vclock_msg *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		diag_set(ClientError, ER_WAL_IO);
>  		return -1;
> @@ -586,7 +640,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
>  {
>  	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/*
>  		 * We're rolling back a failed write and so
>  		 * can't make a checkpoint - see the comment
> @@ -892,54 +946,6 @@ out:
>  	return rc;
>  }
>  
> -static void
> -wal_writer_clear_bus(struct cmsg *msg)
> -{
> -	(void) msg;
> -}
> -
> -static void
> -wal_writer_end_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	cmsg_init(&writer->in_rollback, NULL);
> -}
> -
> -static void
> -wal_writer_begin_rollback(struct wal_writer *writer)
> -{
> -	static struct cmsg_hop rollback_route[4] = {
> -		/*
> -		 * Step 1: clear the bus, so that it contains
> -		 * no WAL write requests. This is achieved as a
> -		 * side effect of an empty message travelling
> -		 * through both bus pipes, while writer input
> -		 * valve is closed by non-empty writer->rollback
> -		 * list.
> -		 */
> -		{ wal_writer_clear_bus, &wal_writer_singleton.wal_pipe },
> -		{ wal_writer_clear_bus, &wal_writer_singleton.tx_prio_pipe },
> -		/*
> -		 * Step 2: writer->rollback queue contains all
> -		 * messages which need to be rolled back,
> -		 * perform the rollback.
> -		 */
> -		{ tx_schedule_rollback, &wal_writer_singleton.wal_pipe },
> -		/*
> -		 * Step 3: re-open the WAL for writing.
> -		 */
> -		{ wal_writer_end_rollback, NULL }
> -	};
> -
> -	/*
> -	 * Make sure the WAL writer rolls back
> -	 * all input until rollback mode is off.
> -	 */
> -	cmsg_init(&writer->in_rollback, rollback_route);
> -	cpipe_push(&writer->tx_prio_pipe, &writer->in_rollback);
> -}
> -
>  /*
>   * Assign lsn and replica identifier for local writes and track
>   * row into vclock_diff.
> @@ -1006,7 +1012,7 @@ wal_write_to_disk(struct cmsg *msg)
>  
>  	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
>  
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> @@ -1017,14 +1023,14 @@
>  	if (wal_opt_rotate(writer) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/* Ensure there's enough disk space before writing anything. */
>  	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/*
> @@ -1130,7 +1136,7 @@ done:
>  		entry->res = -1;
>  		/* Rollback unprocessed requests */
>  		stailq_concat(&wal_msg->rollback, &rollback);
> -		wal_writer_begin_rollback(writer);
> +		wal_begin_rollback();
>  	}
>  	fiber_gc();
>  	wal_notify_watchers(writer, WAL_EVENT_WRITE);
> @@ -1234,6 +1240,12 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
>  		stailq_add_tail_entry(&batch->commit, entry, fifo);
>  		cpipe_push(&writer->wal_pipe, &batch->base);
>  	}
> +	/*
> +	 * Remember last entry sent to WAL. In case of rollback
> +	 * WAL will use this entry as an anchor to rollback all
> +	 * transactions until and including this one.
> +	 */
> +	writer->last_entry = entry;
>  	batch->approx_len += entry->approx_len;
>  	writer->wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
>  	cpipe_flush_input(&writer->wal_pipe);
> -- 
> 2.21.1 (Apple Git-122.3)
> 