[Tarantool-patches] [PATCH 1/1] wal: simplify rollback

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Fri May 1 01:50:57 MSK 2020


From: Georgy Kirichenko <georgy at tarantool.org>

Here is a summary on how and when rollback works in WAL.

Rollback happens, when disk write fails. In that case the failed
and all next transactions, sent to WAL, should be rolled back.
Together. Following transactions should be rolled back too,
because they could make their statements based on what they saw in
the failed transaction. Also rollback of the failed transaction
without rollback of the next ones can actually rewrite what they
committed.

So when rollback is started, *all* pending transactions should be
rolled back. However if they would keep coming, the rollback would
be infinite. This means to complete a rollback it is necessary to
stop sending new transactions to WAL, then rollback all already
sent. In the end allow new transactions again.

Step-by-step:

1) stop accepting all new transactions in WAL thread, where
rollback is started. All new transactions don't even try to go to
disk. They added to rollback queue immediately after arriving to
WAL thread.

2) tell TX thread to stop sending new transactions to WAL. So as
the rollback queue would stop growing.

3) rollback all transactions in reverse order.

4) allow transactions again in WAL thread and TX thread.

The algorithm is long, but simple and understandable. However
implementation wasn't so easy. It was done using a 4-hop cbus
route. 2 hops of which were supposed to clear cbus channel from
all other cbus messages. Next two hops implemented steps 3 and 4.
Rollback state of the WAL was signaled by checking internals of a
preallocated cbus message.

The patch makes it simpler and more straightforward. Rollback
state is now signaled by a simple flag, and there is no a hack
about clearing cbus channel, no touching attributes of a cbus
message. The moment when all transactions are stopped and the last
one has returned from WAL is visible explicitly, because the last
sent to WAL journal entry is saved.

Also there is now a single route for commit and rollback cbus
messages, called tx_complete_batch(). This change will come in
hand in scope of synchronous replication, when WAL write won't be
enough for commit. And therefore 'commit' as a concept should be
washed away from WAL's code gradually. Migrate to solely txn
module.
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
Issue: https://github.com/tarantool/tarantool/issues/4842

During working on 4842 I managed to extract this patch from
Georgy's branch and make it not depending on anything else. This
is supposed to make some things in WAL simpler before they will
get more complex because of sync replication.

 src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
 1 file changed, 95 insertions(+), 83 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index 1eb20272c..b979244e3 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -97,6 +97,13 @@ struct wal_writer
 	struct cpipe wal_pipe;
 	/** A memory pool for messages. */
 	struct mempool msg_pool;
+	/**
+	 * A last journal entry submitted to write. This is a
+	 * 'rollback border'. When rollback starts, all
+	 * transactions keep being rolled back until this one is
+	 * rolled back too.
+	 */
+	struct journal_entry *last_entry;
 	/* ----------------- wal ------------------- */
 	/** A setting from instance configuration - wal_max_size */
 	int64_t wal_max_size;
@@ -153,7 +160,7 @@ struct wal_writer
 	 * keep adding all incoming requests to the rollback
 	 * queue, until the tx thread has recovered.
 	 */
-	struct cmsg in_rollback;
+	bool is_in_rollback;
 	/**
 	 * WAL watchers, i.e. threads that should be alerted
 	 * whenever there are new records appended to the journal.
@@ -198,11 +205,11 @@ static void
 wal_write_to_disk(struct cmsg *msg);
 
 static void
-tx_schedule_commit(struct cmsg *msg);
+tx_complete_batch(struct cmsg *msg);
 
 static struct cmsg_hop wal_request_route[] = {
 	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
-	{tx_schedule_commit, NULL},
+	{tx_complete_batch, NULL},
 };
 
 static void
@@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
 		journal_async_complete(&writer->base, req);
 }
 
+/**
+ * Rollback happens, when disk write fails. In that case all next
+ * transactions, sent to WAL, also should be rolled back. Because
+ * they could make their statements based on what they saw in the
+ * failed transaction. Also rollback of the failed transaction
+ * without rollback of the next ones can actually rewrite what
+ * they committed.
+ * So when rollback is started, *all* pending transactions should
+ * be rolled back. However if they would keep coming, the rollback
+ * would be infinite. This means to complete a rollback it is
+ * necessary to stop sending new transactions to WAL, then
+ * rollback all already sent. In the end allow new transactions
+ * again.
+ *
+ * First step is stop accepting all new transactions. For that WAL
+ * thread sets a global flag. No rocket science here. All new
+ * transactions, if see the flag set, are added to the rollback
+ * queue immediately.
+ *
+ * Second step - tell TX thread to stop sending new transactions
+ * to WAL. So as the rollback queue would stop growing.
+ *
+ * Third step - rollback all transactions in reverse order.
+ *
+ * Fourth step - allow transactions again. Unset the global flag
+ * in WAL thread.
+ */
+static inline void
+wal_begin_rollback(void)
+{
+	/* Signal WAL-thread stop accepting new transactions. */
+	wal_writer_singleton.is_in_rollback = true;
+}
+
+static void
+wal_complete_rollback(struct cmsg *base)
+{
+	(void) base;
+	/* WAL-thread can try writing transactions again. */
+	wal_writer_singleton.is_in_rollback = false;
+}
+
+static void
+tx_complete_rollback(void)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	/*
+	 * Despite records are sent in batches, the last entry to
+	 * commit can't be in the middle of a batch. After all
+	 * transactions to rollback are collected, the last entry
+	 * will be exactly, well, the last entry.
+	 */
+	if (stailq_last_entry(&writer->rollback, struct journal_entry,
+			      fifo) != writer->last_entry)
+		return;
+	stailq_reverse(&writer->rollback);
+	tx_schedule_queue(&writer->rollback);
+	/* TX-thread can try sending transactions to WAL again. */
+	stailq_create(&writer->rollback);
+	static struct cmsg_hop route[] = {
+		{wal_complete_rollback, NULL}
+	};
+	static struct cmsg msg;
+	cmsg_init(&msg, route);
+	cpipe_push(&writer->wal_pipe, &msg);
+}
+
 /**
  * Complete execution of a batch of WAL write requests:
  * schedule all committed requests, and, should there
  * be any requests to be rolled back, append them to
- * the rollback queue.
+ * the rollback queue. In case this is a rollback and the batch
+ * contains the last transaction to rollback, the rollback is
+ * performed and normal processing is allowed again.
  */
 static void
-tx_schedule_commit(struct cmsg *msg)
+tx_complete_batch(struct cmsg *msg)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_msg *batch = (struct wal_msg *) msg;
@@ -282,8 +358,8 @@ tx_schedule_commit(struct cmsg *msg)
 	 * iteration of tx_schedule_queue loop.
 	 */
 	if (! stailq_empty(&batch->rollback)) {
-		/* Closes the input valve. */
 		stailq_concat(&writer->rollback, &batch->rollback);
+		tx_complete_rollback();
 	}
 	/* Update the tx vclock to the latest written by wal. */
 	vclock_copy(&replicaset.vclock, &batch->vclock);
@@ -291,28 +367,6 @@ tx_schedule_commit(struct cmsg *msg)
 	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
 }
 
-static void
-tx_schedule_rollback(struct cmsg *msg)
-{
-	(void) msg;
-	struct wal_writer *writer = &wal_writer_singleton;
-	/*
-	 * Perform a cascading abort of all transactions which
-	 * depend on the transaction which failed to get written
-	 * to the write ahead log. Abort transactions
-	 * in reverse order, performing a playback of the
-	 * in-memory database state.
-	 */
-	stailq_reverse(&writer->rollback);
-	/* Must not yield. */
-	tx_schedule_queue(&writer->rollback);
-	stailq_create(&writer->rollback);
-	if (msg != &writer->in_rollback)
-		mempool_free(&writer->msg_pool,
-			     container_of(msg, struct wal_msg, base));
-}
-
-
 /**
  * This message is sent from WAL to TX when the WAL thread hits
  * ENOSPC and has to delete some backup WAL files to continue.
@@ -374,7 +428,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 		writer->wal_dir.open_wflags |= O_SYNC;
 
 	stailq_create(&writer->rollback);
-	cmsg_init(&writer->in_rollback, NULL);
+	writer->is_in_rollback = false;
 
 	writer->checkpoint_wal_size = 0;
 	writer->checkpoint_threshold = INT64_MAX;
@@ -543,7 +597,7 @@ wal_sync_f(struct cbus_call_msg *data)
 {
 	struct wal_vclock_msg *msg = (struct wal_vclock_msg *) data;
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
 		diag_set(ClientError, ER_WAL_IO);
 		return -1;
@@ -586,7 +640,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
 {
 	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/*
 		 * We're rolling back a failed write and so
 		 * can't make a checkpoint - see the comment
@@ -892,54 +946,6 @@ out:
 	return rc;
 }
 
-static void
-wal_writer_clear_bus(struct cmsg *msg)
-{
-	(void) msg;
-}
-
-static void
-wal_writer_end_rollback(struct cmsg *msg)
-{
-	(void) msg;
-	struct wal_writer *writer = &wal_writer_singleton;
-	cmsg_init(&writer->in_rollback, NULL);
-}
-
-static void
-wal_writer_begin_rollback(struct wal_writer *writer)
-{
-	static struct cmsg_hop rollback_route[4] = {
-		/*
-		 * Step 1: clear the bus, so that it contains
-		 * no WAL write requests. This is achieved as a
-		 * side effect of an empty message travelling
-		 * through both bus pipes, while writer input
-		 * valve is closed by non-empty writer->rollback
-		 * list.
-		 */
-		{ wal_writer_clear_bus, &wal_writer_singleton.wal_pipe },
-		{ wal_writer_clear_bus, &wal_writer_singleton.tx_prio_pipe },
-		/*
-		 * Step 2: writer->rollback queue contains all
-		 * messages which need to be rolled back,
-		 * perform the rollback.
-		 */
-		{ tx_schedule_rollback, &wal_writer_singleton.wal_pipe },
-		/*
-		 * Step 3: re-open the WAL for writing.
-		 */
-		{ wal_writer_end_rollback, NULL }
-	};
-
-	/*
-	 * Make sure the WAL writer rolls back
-	 * all input until rollback mode is off.
-	 */
-	cmsg_init(&writer->in_rollback, rollback_route);
-	cpipe_push(&writer->tx_prio_pipe, &writer->in_rollback);
-}
-
 /*
  * Assign lsn and replica identifier for local writes and track
  * row into vclock_diff.
@@ -1006,7 +1012,7 @@ wal_write_to_disk(struct cmsg *msg)
 
 	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
 
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
@@ -1017,14 +1023,14 @@ wal_write_to_disk(struct cmsg *msg)
 	if (wal_opt_rotate(writer) != 0) {
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_writer_begin_rollback(writer);
+		return wal_begin_rollback();
 	}
 
 	/* Ensure there's enough disk space before writing anything. */
 	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_writer_begin_rollback(writer);
+		return wal_begin_rollback();
 	}
 
 	/*
@@ -1130,7 +1136,7 @@ done:
 			entry->res = -1;
 		/* Rollback unprocessed requests */
 		stailq_concat(&wal_msg->rollback, &rollback);
-		wal_writer_begin_rollback(writer);
+		wal_begin_rollback();
 	}
 	fiber_gc();
 	wal_notify_watchers(writer, WAL_EVENT_WRITE);
@@ -1234,6 +1240,12 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
 		stailq_add_tail_entry(&batch->commit, entry, fifo);
 		cpipe_push(&writer->wal_pipe, &batch->base);
 	}
+	/*
+	 * Remember last entry sent to WAL. In case of rollback
+	 * WAL will use this entry as an anchor to rollback all
+	 * transactions until and including this one.
+	 */
+	writer->last_entry = entry;
 	batch->approx_len += entry->approx_len;
 	writer->wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
 	cpipe_flush_input(&writer->wal_pipe);
-- 
2.21.1 (Apple Git-122.3)



More information about the Tarantool-patches mailing list