Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
@ 2020-04-30 22:50 Vladislav Shpilevoy
  2020-05-03 10:46 ` Konstantin Osipov
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Vladislav Shpilevoy @ 2020-04-30 22:50 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

From: Georgy Kirichenko <georgy@tarantool.org>

Here is a summary on how and when rollback works in WAL.

Rollback happens, when disk write fails. In that case the failed
and all next transactions, sent to WAL, should be rolled back.
Together. Following transactions should be rolled back too,
because they could make their statements based on what they saw in
the failed transaction. Also rollback of the failed transaction
without rollback of the next ones can actually rewrite what they
committed.

So when rollback is started, *all* pending transactions should be
rolled back. However if they would keep coming, the rollback would
be infinite. This means to complete a rollback it is necessary to
stop sending new transactions to WAL, then rollback all already
sent. In the end allow new transactions again.

Step-by-step:

1) stop accepting all new transactions in WAL thread, where
rollback is started. All new transactions don't even try to go to
disk. They added to rollback queue immediately after arriving to
WAL thread.

2) tell TX thread to stop sending new transactions to WAL. So as
the rollback queue would stop growing.

3) rollback all transactions in reverse order.

4) allow transactions again in WAL thread and TX thread.

The algorithm is long, but simple and understandable. However
implementation wasn't so easy. It was done using a 4-hop cbus
route. 2 hops of which were supposed to clear cbus channel from
all other cbus messages. Next two hops implemented steps 3 and 4.
Rollback state of the WAL was signaled by checking internals of a
preallocated cbus message.

The patch makes it simpler and more straightforward. Rollback
state is now signaled by a simple flag, and there is no a hack
about clearing cbus channel, no touching attributes of a cbus
message. The moment when all transactions are stopped and the last
one has returned from WAL is visible explicitly, because the last
sent to WAL journal entry is saved.

Also there is now a single route for commit and rollback cbus
messages, called tx_complete_batch(). This change will come in
hand in scope of synchronous replication, when WAL write won't be
enough for commit. And therefore 'commit' as a concept should be
washed away from WAL's code gradually. Migrate to solely txn
module.
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
Issue: https://github.com/tarantool/tarantool/issues/4842

During working on 4842 I managed to extract this patch from
Georgy's branch and make it not depending on anything else. This
is supposed to make some things in WAL simpler before they will
get more complex because of sync replication.

 src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
 1 file changed, 95 insertions(+), 83 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index 1eb20272c..b979244e3 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -97,6 +97,13 @@ struct wal_writer
 	struct cpipe wal_pipe;
 	/** A memory pool for messages. */
 	struct mempool msg_pool;
+	/**
+	 * A last journal entry submitted to write. This is a
+	 * 'rollback border'. When rollback starts, all
+	 * transactions keep being rolled back until this one is
+	 * rolled back too.
+	 */
+	struct journal_entry *last_entry;
 	/* ----------------- wal ------------------- */
 	/** A setting from instance configuration - wal_max_size */
 	int64_t wal_max_size;
@@ -153,7 +160,7 @@ struct wal_writer
 	 * keep adding all incoming requests to the rollback
 	 * queue, until the tx thread has recovered.
 	 */
-	struct cmsg in_rollback;
+	bool is_in_rollback;
 	/**
 	 * WAL watchers, i.e. threads that should be alerted
 	 * whenever there are new records appended to the journal.
@@ -198,11 +205,11 @@ static void
 wal_write_to_disk(struct cmsg *msg);
 
 static void
-tx_schedule_commit(struct cmsg *msg);
+tx_complete_batch(struct cmsg *msg);
 
 static struct cmsg_hop wal_request_route[] = {
 	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
-	{tx_schedule_commit, NULL},
+	{tx_complete_batch, NULL},
 };
 
 static void
@@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
 		journal_async_complete(&writer->base, req);
 }
 
+/**
+ * Rollback happens, when disk write fails. In that case all next
+ * transactions, sent to WAL, also should be rolled back. Because
+ * they could make their statements based on what they saw in the
+ * failed transaction. Also rollback of the failed transaction
+ * without rollback of the next ones can actually rewrite what
+ * they committed.
+ * So when rollback is started, *all* pending transactions should
+ * be rolled back. However if they would keep coming, the rollback
+ * would be infinite. This means to complete a rollback it is
+ * necessary to stop sending new transactions to WAL, then
+ * rollback all already sent. In the end allow new transactions
+ * again.
+ *
+ * First step is stop accepting all new transactions. For that WAL
+ * thread sets a global flag. No rocket science here. All new
+ * transactions, if see the flag set, are added to the rollback
+ * queue immediately.
+ *
+ * Second step - tell TX thread to stop sending new transactions
+ * to WAL. So as the rollback queue would stop growing.
+ *
+ * Third step - rollback all transactions in reverse order.
+ *
+ * Fourth step - allow transactions again. Unset the global flag
+ * in WAL thread.
+ */
+static inline void
+wal_begin_rollback(void)
+{
+	/* Signal WAL-thread stop accepting new transactions. */
+	wal_writer_singleton.is_in_rollback = true;
+}
+
+static void
+wal_complete_rollback(struct cmsg *base)
+{
+	(void) base;
+	/* WAL-thread can try writing transactions again. */
+	wal_writer_singleton.is_in_rollback = false;
+}
+
+static void
+tx_complete_rollback(void)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	/*
+	 * Despite records are sent in batches, the last entry to
+	 * commit can't be in the middle of a batch. After all
+	 * transactions to rollback are collected, the last entry
+	 * will be exactly, well, the last entry.
+	 */
+	if (stailq_last_entry(&writer->rollback, struct journal_entry,
+			      fifo) != writer->last_entry)
+		return;
+	stailq_reverse(&writer->rollback);
+	tx_schedule_queue(&writer->rollback);
+	/* TX-thread can try sending transactions to WAL again. */
+	stailq_create(&writer->rollback);
+	static struct cmsg_hop route[] = {
+		{wal_complete_rollback, NULL}
+	};
+	static struct cmsg msg;
+	cmsg_init(&msg, route);
+	cpipe_push(&writer->wal_pipe, &msg);
+}
+
 /**
  * Complete execution of a batch of WAL write requests:
  * schedule all committed requests, and, should there
  * be any requests to be rolled back, append them to
- * the rollback queue.
+ * the rollback queue. In case this is a rollback and the batch
+ * contains the last transaction to rollback, the rollback is
+ * performed and normal processing is allowed again.
  */
 static void
-tx_schedule_commit(struct cmsg *msg)
+tx_complete_batch(struct cmsg *msg)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_msg *batch = (struct wal_msg *) msg;
@@ -282,8 +358,8 @@ tx_schedule_commit(struct cmsg *msg)
 	 * iteration of tx_schedule_queue loop.
 	 */
 	if (! stailq_empty(&batch->rollback)) {
-		/* Closes the input valve. */
 		stailq_concat(&writer->rollback, &batch->rollback);
+		tx_complete_rollback();
 	}
 	/* Update the tx vclock to the latest written by wal. */
 	vclock_copy(&replicaset.vclock, &batch->vclock);
@@ -291,28 +367,6 @@ tx_schedule_commit(struct cmsg *msg)
 	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
 }
 
-static void
-tx_schedule_rollback(struct cmsg *msg)
-{
-	(void) msg;
-	struct wal_writer *writer = &wal_writer_singleton;
-	/*
-	 * Perform a cascading abort of all transactions which
-	 * depend on the transaction which failed to get written
-	 * to the write ahead log. Abort transactions
-	 * in reverse order, performing a playback of the
-	 * in-memory database state.
-	 */
-	stailq_reverse(&writer->rollback);
-	/* Must not yield. */
-	tx_schedule_queue(&writer->rollback);
-	stailq_create(&writer->rollback);
-	if (msg != &writer->in_rollback)
-		mempool_free(&writer->msg_pool,
-			     container_of(msg, struct wal_msg, base));
-}
-
-
 /**
  * This message is sent from WAL to TX when the WAL thread hits
  * ENOSPC and has to delete some backup WAL files to continue.
@@ -374,7 +428,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 		writer->wal_dir.open_wflags |= O_SYNC;
 
 	stailq_create(&writer->rollback);
-	cmsg_init(&writer->in_rollback, NULL);
+	writer->is_in_rollback = false;
 
 	writer->checkpoint_wal_size = 0;
 	writer->checkpoint_threshold = INT64_MAX;
@@ -543,7 +597,7 @@ wal_sync_f(struct cbus_call_msg *data)
 {
 	struct wal_vclock_msg *msg = (struct wal_vclock_msg *) data;
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
 		diag_set(ClientError, ER_WAL_IO);
 		return -1;
@@ -586,7 +640,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
 {
 	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/*
 		 * We're rolling back a failed write and so
 		 * can't make a checkpoint - see the comment
@@ -892,54 +946,6 @@ out:
 	return rc;
 }
 
-static void
-wal_writer_clear_bus(struct cmsg *msg)
-{
-	(void) msg;
-}
-
-static void
-wal_writer_end_rollback(struct cmsg *msg)
-{
-	(void) msg;
-	struct wal_writer *writer = &wal_writer_singleton;
-	cmsg_init(&writer->in_rollback, NULL);
-}
-
-static void
-wal_writer_begin_rollback(struct wal_writer *writer)
-{
-	static struct cmsg_hop rollback_route[4] = {
-		/*
-		 * Step 1: clear the bus, so that it contains
-		 * no WAL write requests. This is achieved as a
-		 * side effect of an empty message travelling
-		 * through both bus pipes, while writer input
-		 * valve is closed by non-empty writer->rollback
-		 * list.
-		 */
-		{ wal_writer_clear_bus, &wal_writer_singleton.wal_pipe },
-		{ wal_writer_clear_bus, &wal_writer_singleton.tx_prio_pipe },
-		/*
-		 * Step 2: writer->rollback queue contains all
-		 * messages which need to be rolled back,
-		 * perform the rollback.
-		 */
-		{ tx_schedule_rollback, &wal_writer_singleton.wal_pipe },
-		/*
-		 * Step 3: re-open the WAL for writing.
-		 */
-		{ wal_writer_end_rollback, NULL }
-	};
-
-	/*
-	 * Make sure the WAL writer rolls back
-	 * all input until rollback mode is off.
-	 */
-	cmsg_init(&writer->in_rollback, rollback_route);
-	cpipe_push(&writer->tx_prio_pipe, &writer->in_rollback);
-}
-
 /*
  * Assign lsn and replica identifier for local writes and track
  * row into vclock_diff.
@@ -1006,7 +1012,7 @@ wal_write_to_disk(struct cmsg *msg)
 
 	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
 
-	if (writer->in_rollback.route != NULL) {
+	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
@@ -1017,14 +1023,14 @@ wal_write_to_disk(struct cmsg *msg)
 	if (wal_opt_rotate(writer) != 0) {
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_writer_begin_rollback(writer);
+		return wal_begin_rollback();
 	}
 
 	/* Ensure there's enough disk space before writing anything. */
 	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
 		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
 		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_writer_begin_rollback(writer);
+		return wal_begin_rollback();
 	}
 
 	/*
@@ -1130,7 +1136,7 @@ done:
 			entry->res = -1;
 		/* Rollback unprocessed requests */
 		stailq_concat(&wal_msg->rollback, &rollback);
-		wal_writer_begin_rollback(writer);
+		wal_begin_rollback();
 	}
 	fiber_gc();
 	wal_notify_watchers(writer, WAL_EVENT_WRITE);
@@ -1234,6 +1240,12 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
 		stailq_add_tail_entry(&batch->commit, entry, fifo);
 		cpipe_push(&writer->wal_pipe, &batch->base);
 	}
+	/*
+	 * Remember last entry sent to WAL. In case of rollback
+	 * WAL will use this entry as an anchor to rollback all
+	 * transactions until and including this one.
+	 */
+	writer->last_entry = entry;
 	batch->approx_len += entry->approx_len;
 	writer->wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
 	cpipe_flush_input(&writer->wal_pipe);
-- 
2.21.1 (Apple Git-122.3)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-04-30 22:50 [Tarantool-patches] [PATCH 1/1] wal: simplify rollback Vladislav Shpilevoy
@ 2020-05-03 10:46 ` Konstantin Osipov
  2020-05-03 16:45   ` Vladislav Shpilevoy
  2020-05-06  7:55 ` Cyrill Gorcunov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Konstantin Osipov @ 2020-05-03 10:46 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/01 22:16]:

This isn't a particularly bad patch. It changes one fairly trivial
implementation to another. You claim it simplifies the rollback, I
believe it makes it more convoluted with the rest of the code. 

In any case it's a matter of taste - so perhaps your taste is
better than mine.

But the reason you had to do this patch *only* because you want to manage
rollback in tx - as you claim yourself here:

> 
> Also there is now a single route for commit and rollback cbus
> messages, called tx_complete_batch(). This change will come in
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.

The reason it has to be managed in TX is that you don't want to
build synchronous replication on top of in-memory relay, which
would allow you to manage RAFT rollback in WAL.

When I called you out on the fact that you can't simply build your
implementation on top of existing code without an in-memory relay,
and it will require a bunch of hacks you got mad at me.

Now what? You will ask Nikita to review this and Kirill will
"LGTM" it and you will proceed with your own design - and even may
release it as a "stable" version over some time.

This "stable version may be even useful to some people. 

Will it be worth a dime on a worldwide scale? No.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-05-03 10:46 ` Konstantin Osipov
@ 2020-05-03 16:45   ` Vladislav Shpilevoy
  2020-05-03 19:30     ` Konstantin Osipov
  0 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-03 16:45 UTC (permalink / raw)
  To: Konstantin Osipov, tarantool-patches, gorcunov, sergepetrenko

On 03/05/2020 12:46, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/01 22:16]:
> 
> This isn't a particularly bad patch. It changes one fairly trivial
> implementation to another. You claim it simplifies the rollback, I
> believe it makes it more convoluted with the rest of the code. 
> 
> In any case it's a matter of taste - so perhaps your taste is
> better than mine.
> 
> But the reason you had to do this patch *only* because you want to manage
> rollback in tx - as you claim yourself here:

I didn't say it. All I said is this commit simplifies code, IMO.

>> Also there is now a single route for commit and rollback cbus
>> messages, called tx_complete_batch(). This change will come in
>> hand in scope of synchronous replication, when WAL write won't be
>> enough for commit. And therefore 'commit' as a concept should be
>> washed away from WAL's code gradually. Migrate to solely txn
>> module.
> 
> The reason it has to be managed in TX is that you don't want to
> build synchronous replication on top of in-memory relay, which
> would allow you to manage RAFT rollback in WAL.

Well, I already implemented some kind of base for sync replication on
a different branch. It collects quorum and all. Next things are to
implement 'confirm' and 'rollback' messages for WAL, timeouts. And it
does not depend on this commit, or on any WAL-internal things at all.
Sync replication needs only WAL API and does not care about its
implementation.

I sent this commit, because I was reading Georgy's implementation,
saw this commit, decided to finish it because it looked useful, and only
then realized I don't need it. Still decided to send it on a review,
because it makes the code better as I think. And removes 'commit' concept
from WAL.

> When I called you out on the fact that you can't simply build your
> implementation on top of existing code without an in-memory relay,
> and it will require a bunch of hacks you got mad at me.

I didn't get mad at you. You just started offending personally me out of
nothing, and I stopped the talk, that is all story.

Still sync replication *can be* built without in-memory relay; still not
necessarily should be trigger-based (even though it is completely
trigger-based in Georgy's implementation, regardless of in-memory WAL).
And still you can't explain why in-memory WAL is needed except for probably
better perf (and you can't prove it, because there is no a single bench).

I propose you to finally come up with some proofs that in-memory relay is
needed for anything, and is not just a userspace page-cache optimization.
So far you couldn't provide anything.

Luckily we have Cyrill G., who will bench this eventually, and see how
much perf really goes to reading disk and decoding.

> Now what? You will ask Nikita to review this and Kirill will
> "LGTM" it and you will proceed with your own design - and even may

This is actually funny - you don't like that we make 'our own design'?
You would rather expect us make all design decisions only if they are
authored and approved by you? Well, then you should just have stayed in
the team. Or send your own patches, or at least your own RFCs. So far you
only complain. You don't propose anything.

You can't really expect us to wait for your approval on everything we
do, including a simple refactoring patch.

> release it as a "stable" version over some time.
> 
> This "stable version may be even useful to some people. 
> 
> Will it be worth a dime on a worldwide scale? No.

It seems to me you like thinking so. From how often you appeal to
this. How the bad tarantool team wants only to find a way how to
make 'their own very bad design decisions' not approved by you to
offend personally you. You literally bring this up almost with every
email you write here about any non-trivial patch. And you wouldn't
get these toxic responses, if you wouldn't be toxic in the first place.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-05-03 16:45   ` Vladislav Shpilevoy
@ 2020-05-03 19:30     ` Konstantin Osipov
  0 siblings, 0 replies; 10+ messages in thread
From: Konstantin Osipov @ 2020-05-03 19:30 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [20/05/03 22:26]:
> This is actually funny - you don't like that we make 'our own design'?
> You would rather expect us make all design decisions only if they are
> authored and approved by you? Well, then you should just have stayed in
> the team. Or send your own patches, or at least your own RFCs. So far you
> only complain. You don't propose anything.

I love your own design. Vshard has a lot of decisions I had no
clue about and I love them. I hate it when you make up something
ugly *only* because you want to have it quick.

This is the case with not using in-memory WAL.

I will take a look at the branch and provide comments, and respond
to the rest later.

> You can't really expect us to wait for your approval on everything we
> do, including a simple refactoring patch.

I don't.

-- 
Konstantin Osipov, Moscow, Russia
https://scylladb.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-04-30 22:50 [Tarantool-patches] [PATCH 1/1] wal: simplify rollback Vladislav Shpilevoy
  2020-05-03 10:46 ` Konstantin Osipov
@ 2020-05-06  7:55 ` Cyrill Gorcunov
  2020-05-06 15:38 ` Sergey Ostanevich
  2020-05-08  8:12 ` Kirill Yukhin
  3 siblings, 0 replies; 10+ messages in thread
From: Cyrill Gorcunov @ 2020-05-06  7:55 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Fri, May 01, 2020 at 12:50:57AM +0200, Vladislav Shpilevoy wrote:
> From: Georgy Kirichenko <georgy@tarantool.org>
> 
> Here is a summary on how and when rollback works in WAL.
> 
> Rollback happens, when disk write fails. In that case the failed
> and all next transactions, sent to WAL, should be rolled back.
> Together. Following transactions should be rolled back too,
> because they could make their statements based on what they saw in
> the failed transaction. Also rollback of the failed transaction
> without rollback of the next ones can actually rewrite what they
> committed.
> 
> So when rollback is started, *all* pending transactions should be
> rolled back. However if they would keep coming, the rollback would
> be infinite. This means to complete a rollback it is necessary to
> stop sending new transactions to WAL, then rollback all already
> sent. In the end allow new transactions again.
> 
> Step-by-step:
> 
> 1) stop accepting all new transactions in WAL thread, where
> rollback is started. All new transactions don't even try to go to
> disk. They added to rollback queue immediately after arriving to
> WAL thread.
> 
> 2) tell TX thread to stop sending new transactions to WAL. So as
> the rollback queue would stop growing.
> 
> 3) rollback all transactions in reverse order.
> 
> 4) allow transactions again in WAL thread and TX thread.
> 
> The algorithm is long, but simple and understandable. However
> implementation wasn't so easy. It was done using a 4-hop cbus
> route. 2 hops of which were supposed to clear cbus channel from
> all other cbus messages. Next two hops implemented steps 3 and 4.
> Rollback state of the WAL was signaled by checking internals of a
> preallocated cbus message.
> 
> The patch makes it simpler and more straightforward. Rollback
> state is now signaled by a simple flag, and there is no a hack
> about clearing cbus channel, no touching attributes of a cbus
> message. The moment when all transactions are stopped and the last
> one has returned from WAL is visible explicitly, because the last
> sent to WAL journal entry is saved.
> 
> Also there is now a single route for commit and rollback cbus
> messages, called tx_complete_batch(). This change will come in
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
> Issue: https://github.com/tarantool/tarantool/issues/4842
> 
> During working on 4842 I managed to extract this patch from
> Georgy's branch and make it not depending on anything else. This
> is supposed to make some things in WAL simpler before they will
> get more complex because of sync replication.

I don't see any obvious errors in the patch itself. Still I'm far
from being a wal expert so might miss something in big picture, thus
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-04-30 22:50 [Tarantool-patches] [PATCH 1/1] wal: simplify rollback Vladislav Shpilevoy
  2020-05-03 10:46 ` Konstantin Osipov
  2020-05-06  7:55 ` Cyrill Gorcunov
@ 2020-05-06 15:38 ` Sergey Ostanevich
  2020-05-06 21:43   ` Vladislav Shpilevoy
  2020-05-08  8:12 ` Kirill Yukhin
  3 siblings, 1 reply; 10+ messages in thread
From: Sergey Ostanevich @ 2020-05-06 15:38 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thank you for the patch!

See below for some nits.

Thanks, 
Sergos

On 01 мая 00:50, Vladislav Shpilevoy wrote:
> From: Georgy Kirichenko <georgy@tarantool.org>
> 
> Here is a summary on how and when rollback works in WAL.
> 
> Rollback happens, when disk write fails. In that case the failed
^^^
Disk write failure can cause rollback. Is it better?

> and all next transactions, sent to WAL, should be rolled back.
> Together. Following transactions should be rolled back too,
> because they could make their statements based on what they saw in
> the failed transaction. Also rollback of the failed transaction
> without rollback of the next ones can actually rewrite what they
> committed.
> 
> So when rollback is started, *all* pending transactions should be
> rolled back. However if they would keep coming, the rollback would
> be infinite. 

Not quite - you start rolling of txn4...txn1 (in reverse order) and at
some moment the txn5 appears. It will just ruin the consistency of the
data, just as you mentioned before - read of a yet-to-be rolled back,
writing of a will-be-affected by next roll back.

> This means to complete a rollback it is necessary to
> stop sending new transactions to WAL, then rollback all already
> sent. In the end allow new transactions again.
> 
> Step-by-step:
> 
> 1) stop accepting all new transactions in WAL thread, where
> rollback is started. All new transactions don't even try to go to
> disk. They added to rollback queue immediately after arriving to
> WAL thread.
> 
> 2) tell TX thread to stop sending new transactions to WAL. So as
> the rollback queue would stop growing.
> 
> 3) rollback all transactions in reverse order.
> 
> 4) allow transactions again in WAL thread and TX thread.
> 
> The algorithm is long, but simple and understandable. However
> implementation wasn't so easy. It was done using a 4-hop cbus
> route. 2 hops of which were supposed to clear cbus channel from
> all other cbus messages. Next two hops implemented steps 3 and 4.
> Rollback state of the WAL was signaled by checking internals of a
> preallocated cbus message.
> 
> The patch makes it simpler and more straightforward. Rollback
> state is now signaled by a simple flag, and there is no a hack
> about clearing cbus channel, no touching attributes of a cbus
> message. The moment when all transactions are stopped and the last
> one has returned from WAL is visible explicitly, because the last
> sent to WAL journal entry is saved.
> 
> Also there is now a single route for commit and rollback cbus
                ^^^ move it 
> messages, called tx_complete_batch(). This change will come in
         ^^^ here 
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
> Issue: https://github.com/tarantool/tarantool/issues/4842
> 
> During working on 4842 I managed to extract this patch from
> Georgy's branch and make it not depending on anything else. This
> is supposed to make some things in WAL simpler before they will
> get more complex because of sync replication.
> 
>  src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 95 insertions(+), 83 deletions(-)
> 
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 1eb20272c..b979244e3 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -97,6 +97,13 @@ struct wal_writer
>  	struct cpipe wal_pipe;
>  	/** A memory pool for messages. */
>  	struct mempool msg_pool;
> +	/**
> +	 * A last journal entry submitted to write. This is a
> +	 * 'rollback border'. When rollback starts, all
> +	 * transactions keep being rolled back until this one is
> +	 * rolled back too.
> +	 */
> +	struct journal_entry *last_entry;
>  	/* ----------------- wal ------------------- */
>  	/** A setting from instance configuration - wal_max_size */
>  	int64_t wal_max_size;
> @@ -153,7 +160,7 @@ struct wal_writer
>  	 * keep adding all incoming requests to the rollback
>  	 * queue, until the tx thread has recovered.
>  	 */
> -	struct cmsg in_rollback;
> +	bool is_in_rollback;
>  	/**
>  	 * WAL watchers, i.e. threads that should be alerted
>  	 * whenever there are new records appended to the journal.
> @@ -198,11 +205,11 @@ static void
>  wal_write_to_disk(struct cmsg *msg);
>  
>  static void
> -tx_schedule_commit(struct cmsg *msg);
> +tx_complete_batch(struct cmsg *msg);
>  
>  static struct cmsg_hop wal_request_route[] = {
>  	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
> -	{tx_schedule_commit, NULL},
> +	{tx_complete_batch, NULL},
>  };
>  
>  static void
> @@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
>  		journal_async_complete(&writer->base, req);
>  }
>  
> +/**
> + * Rollback happens, when disk write fails. In that case all next
> + * transactions, sent to WAL, also should be rolled back. Because
> + * they could make their statements based on what they saw in the
> + * failed transaction. Also rollback of the failed transaction
> + * without rollback of the next ones can actually rewrite what
> + * they committed.
> + * So when rollback is started, *all* pending transactions should
> + * be rolled back. However if they would keep coming, the rollback
> + * would be infinite. This means to complete a rollback it is
> + * necessary to stop sending new transactions to WAL, then
> + * rollback all already sent. In the end allow new transactions
> + * again.
> + *
> + * First step is stop accepting all new transactions. For that WAL
> + * thread sets a global flag. No rocket science here. All new
> + * transactions, if see the flag set, are added to the rollback
> + * queue immediately.
> + *
> + * Second step - tell TX thread to stop sending new transactions
> + * to WAL. So as the rollback queue would stop growing.
> + *
> + * Third step - rollback all transactions in reverse order.
> + *
> + * Fourth step - allow transactions again. Unset the global flag
> + * in WAL thread.
> + */
> +static inline void
> +wal_begin_rollback(void)
> +{
> +	/* Signal WAL-thread stop accepting new transactions. */
> +	wal_writer_singleton.is_in_rollback = true;
> +}
> +
> +static void
> +wal_complete_rollback(struct cmsg *base)
> +{
> +	(void) base;
> +	/* WAL-thread can try writing transactions again. */
> +	wal_writer_singleton.is_in_rollback = false;
> +}
> +
> +static void
> +tx_complete_rollback(void)
> +{
> +	struct wal_writer *writer = &wal_writer_singleton;
> +	/*
> +	 * Despite records are sent in batches, the last entry to
> +	 * commit can't be in the middle of a batch. After all
> +	 * transactions to rollback are collected, the last entry
> +	 * will be exactly, well, the last entry.
> +	 */
> +	if (stailq_last_entry(&writer->rollback, struct journal_entry,
> +			      fifo) != writer->last_entry)
> +		return;

I didn't get it: is there can be a batch whose last entry us not
the final one? You prematurely quit the rollback - is there a guarantee
you'll appeare here again?

> +	stailq_reverse(&writer->rollback);
> +	tx_schedule_queue(&writer->rollback);
> +	/* TX-thread can try sending transactions to WAL again. */
> +	stailq_create(&writer->rollback);
> +	static struct cmsg_hop route[] = {
> +		{wal_complete_rollback, NULL}
> +	};
> +	static struct cmsg msg;
> +	cmsg_init(&msg, route);
> +	cpipe_push(&writer->wal_pipe, &msg);
> +}
> +
>  /**
>   * Complete execution of a batch of WAL write requests:
>   * schedule all committed requests, and, should there
>   * be any requests to be rolled back, append them to
> - * the rollback queue.
> + * the rollback queue. In case this is a rollback and the batch
> + * contains the last transaction to rollback, the rollback is
> + * performed and normal processing is allowed again.
>   */
>  static void
> -tx_schedule_commit(struct cmsg *msg)
> +tx_complete_batch(struct cmsg *msg)
>  {
>  	struct wal_writer *writer = &wal_writer_singleton;
>  	struct wal_msg *batch = (struct wal_msg *) msg;
> @@ -282,8 +358,8 @@ tx_schedule_commit(struct cmsg *msg)
>  	 * iteration of tx_schedule_queue loop.
>  	 */
>  	if (! stailq_empty(&batch->rollback)) {
> -		/* Closes the input valve. */
>  		stailq_concat(&writer->rollback, &batch->rollback);
> +		tx_complete_rollback();
>  	}
>  	/* Update the tx vclock to the latest written by wal. */
>  	vclock_copy(&replicaset.vclock, &batch->vclock);
> @@ -291,28 +367,6 @@ tx_schedule_commit(struct cmsg *msg)
>  	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
>  }
>  
> -static void
> -tx_schedule_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	/*
> -	 * Perform a cascading abort of all transactions which
> -	 * depend on the transaction which failed to get written
> -	 * to the write ahead log. Abort transactions
> -	 * in reverse order, performing a playback of the
> -	 * in-memory database state.
> -	 */
> -	stailq_reverse(&writer->rollback);
> -	/* Must not yield. */
> -	tx_schedule_queue(&writer->rollback);
> -	stailq_create(&writer->rollback);
> -	if (msg != &writer->in_rollback)
> -		mempool_free(&writer->msg_pool,
> -			     container_of(msg, struct wal_msg, base));
> -}
> -
> -
>  /**
>   * This message is sent from WAL to TX when the WAL thread hits
>   * ENOSPC and has to delete some backup WAL files to continue.
> @@ -374,7 +428,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
>  		writer->wal_dir.open_wflags |= O_SYNC;
>  
>  	stailq_create(&writer->rollback);
> -	cmsg_init(&writer->in_rollback, NULL);
> +	writer->is_in_rollback = false;
>  
>  	writer->checkpoint_wal_size = 0;
>  	writer->checkpoint_threshold = INT64_MAX;
> @@ -543,7 +597,7 @@ wal_sync_f(struct cbus_call_msg *data)
>  {
>  	struct wal_vclock_msg *msg = (struct wal_vclock_msg *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		diag_set(ClientError, ER_WAL_IO);
>  		return -1;
> @@ -586,7 +640,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
>  {
>  	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/*
>  		 * We're rolling back a failed write and so
>  		 * can't make a checkpoint - see the comment
> @@ -892,54 +946,6 @@ out:
>  	return rc;
>  }
>  
> -static void
> -wal_writer_clear_bus(struct cmsg *msg)
> -{
> -	(void) msg;
> -}
> -
> -static void
> -wal_writer_end_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	cmsg_init(&writer->in_rollback, NULL);
> -}
> -
> -static void
> -wal_writer_begin_rollback(struct wal_writer *writer)
> -{
> -	static struct cmsg_hop rollback_route[4] = {
> -		/*
> -		 * Step 1: clear the bus, so that it contains
> -		 * no WAL write requests. This is achieved as a
> -		 * side effect of an empty message travelling
> -		 * through both bus pipes, while writer input
> -		 * valve is closed by non-empty writer->rollback
> -		 * list.
> -		 */
> -		{ wal_writer_clear_bus, &wal_writer_singleton.wal_pipe },
> -		{ wal_writer_clear_bus, &wal_writer_singleton.tx_prio_pipe },
> -		/*
> -		 * Step 2: writer->rollback queue contains all
> -		 * messages which need to be rolled back,
> -		 * perform the rollback.
> -		 */
> -		{ tx_schedule_rollback, &wal_writer_singleton.wal_pipe },
> -		/*
> -		 * Step 3: re-open the WAL for writing.
> -		 */
> -		{ wal_writer_end_rollback, NULL }
> -	};
> -
> -	/*
> -	 * Make sure the WAL writer rolls back
> -	 * all input until rollback mode is off.
> -	 */
> -	cmsg_init(&writer->in_rollback, rollback_route);
> -	cpipe_push(&writer->tx_prio_pipe, &writer->in_rollback);
> -}
> -
>  /*
>   * Assign lsn and replica identifier for local writes and track
>   * row into vclock_diff.
> @@ -1006,7 +1012,7 @@ wal_write_to_disk(struct cmsg *msg)
>  
>  	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
>  
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> @@ -1017,14 +1023,14 @@ wal_write_to_disk(struct cmsg *msg)
>  	if (wal_opt_rotate(writer) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/* Ensure there's enough disk space before writing anything. */
>  	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/*
> @@ -1130,7 +1136,7 @@ done:
>  			entry->res = -1;
>  		/* Rollback unprocessed requests */
>  		stailq_concat(&wal_msg->rollback, &rollback);
> -		wal_writer_begin_rollback(writer);
> +		wal_begin_rollback();
>  	}
>  	fiber_gc();
>  	wal_notify_watchers(writer, WAL_EVENT_WRITE);
> @@ -1234,6 +1240,12 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
>  		stailq_add_tail_entry(&batch->commit, entry, fifo);
>  		cpipe_push(&writer->wal_pipe, &batch->base);
>  	}
> +	/*
> +	 * Remember last entry sent to WAL. In case of rollback
> +	 * WAL will use this entry as an anchor to rollback all
> +	 * transactions until and including this one.
> +	 */
> +	writer->last_entry = entry;
>  	batch->approx_len += entry->approx_len;
>  	writer->wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
>  	cpipe_flush_input(&writer->wal_pipe);
> -- 
> 2.21.1 (Apple Git-122.3)
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-05-06 15:38 ` Sergey Ostanevich
@ 2020-05-06 21:43   ` Vladislav Shpilevoy
  2020-05-07 10:28     ` Sergey Ostanevich
  0 siblings, 1 reply; 10+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-06 21:43 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

Hi! Thanks for the review!

> On 01 мая 00:50, Vladislav Shpilevoy wrote:
>> From: Georgy Kirichenko <georgy@tarantool.org>
>>
>> Here is a summary on how and when rollback works in WAL.
>>
>> Rollback happens, when disk write fails. In that case the failed
> ^^^
> Disk write failure can cause rollback. Is it better?

To me both are equal, so whatever. Applied your version.

>> and all next transactions, sent to WAL, should be rolled back.
>> Together. Following transactions should be rolled back too,
>> because they could make their statements based on what they saw in
>> the failed transaction. Also rollback of the failed transaction
>> without rollback of the next ones can actually rewrite what they
>> committed.
>>
>> So when rollback is started, *all* pending transactions should be
>> rolled back. However if they would keep coming, the rollback would
>> be infinite. 
> 
> Not quite - you start rolling of txn4...txn1 (in reverse order) and at
> some moment the txn5 appears. It will just ruin the consistency of the
> data, just as you mentioned before - read of a yet-to-be rolled back,
> writing of a will-be-affected by next roll back.

Well, it will not. txn5 appearance means it is after txn4. And it will be
rolled back before txn4, in case it tries to commit before the whole
rollback procedure ends. So the reversed order is fine.

In case you worry txn5 can appear right during rollback (between rolling back
txn4 and txn3, for example) - it is not possible. Rollback of the whole batch
does not yield. So while TX thread waits all rolled back transactions from
WAL thread, it is legal to rollback all newer transactions immediately. When
all rolled back transactions finally return back to TX thread, they are uncommitted
without yields.

Therefore my statement still seems to be correct. We rollback all transactions,
in reversed order, and if rollback is started, new transactions are rolled
back immediately without even trying to go to WAL, until all already sent
transactions are rolled back.

>> This means to complete a rollback it is necessary to
>> stop sending new transactions to WAL, then rollback all already
>> sent. In the end allow new transactions again.
>>
>> Step-by-step:
>>
>> 1) stop accepting all new transactions in WAL thread, where
>> rollback is started. All new transactions don't even try to go to
>> disk. They added to rollback queue immediately after arriving to
>> WAL thread.
>>
>> 2) tell TX thread to stop sending new transactions to WAL. So as
>> the rollback queue would stop growing.
>>
>> 3) rollback all transactions in reverse order.
>>
>> 4) allow transactions again in WAL thread and TX thread.
>>
>> The algorithm is long, but simple and understandable. However
>> implementation wasn't so easy. It was done using a 4-hop cbus
>> route. 2 hops of which were supposed to clear cbus channel from
>> all other cbus messages. Next two hops implemented steps 3 and 4.
>> Rollback state of the WAL was signaled by checking internals of a
>> preallocated cbus message.
>>
>> The patch makes it simpler and more straightforward. Rollback
>> state is now signaled by a simple flag, and there is no a hack
>> about clearing cbus channel, no touching attributes of a cbus
>> message. The moment when all transactions are stopped and the last
>> one has returned from WAL is visible explicitly, because the last
>> sent to WAL journal entry is saved.
>>
>> Also there is now a single route for commit and rollback cbus
>                 ^^^ move it 
>> messages, called tx_complete_batch(). This change will come in
>          ^^^ here 

Nope. Beforehand there was no a single route, and *now* there is a
single route. Which happened to be called tx_complete_batch(). The
accent is on *now there is a single route*. Not on *now it is called*.

>> hand in scope of synchronous replication, when WAL write won't be
>> enough for commit. And therefore 'commit' as a concept should be
>> washed away from WAL's code gradually. Migrate to solely txn
>> module.
>> ---
>> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
>> Issue: https://github.com/tarantool/tarantool/issues/4842
>>
>> During working on 4842 I managed to extract this patch from
>> Georgy's branch and make it not depending on anything else. This
>> is supposed to make some things in WAL simpler before they will
>> get more complex because of sync replication.
>>
>>  src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
>>  1 file changed, 95 insertions(+), 83 deletions(-)
>>
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 1eb20272c..b979244e3 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -97,6 +97,13 @@ struct wal_writer
>>  	struct cpipe wal_pipe;
>>  	/** A memory pool for messages. */
>>  	struct mempool msg_pool;
>> +	/**
>> +	 * A last journal entry submitted to write. This is a
>> +	 * 'rollback border'. When rollback starts, all
>> +	 * transactions keep being rolled back until this one is
>> +	 * rolled back too.
>> +	 */
>> +	struct journal_entry *last_entry;
>>  	/* ----------------- wal ------------------- */
>>  	/** A setting from instance configuration - wal_max_size */
>>  	int64_t wal_max_size;
>> @@ -153,7 +160,7 @@ struct wal_writer
>>  	 * keep adding all incoming requests to the rollback
>>  	 * queue, until the tx thread has recovered.
>>  	 */
>> -	struct cmsg in_rollback;
>> +	bool is_in_rollback;
>>  	/**
>>  	 * WAL watchers, i.e. threads that should be alerted
>>  	 * whenever there are new records appended to the journal.
>> @@ -198,11 +205,11 @@ static void
>>  wal_write_to_disk(struct cmsg *msg);
>>  
>>  static void
>> -tx_schedule_commit(struct cmsg *msg);
>> +tx_complete_batch(struct cmsg *msg);
>>  
>>  static struct cmsg_hop wal_request_route[] = {
>>  	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
>> -	{tx_schedule_commit, NULL},
>> +	{tx_complete_batch, NULL},
>>  };
>>  
>>  static void
>> @@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
>>  		journal_async_complete(&writer->base, req);
>>  }
>>  
>> +/**
>> + * Rollback happens, when disk write fails. In that case all next
>> + * transactions, sent to WAL, also should be rolled back. Because
>> + * they could make their statements based on what they saw in the
>> + * failed transaction. Also rollback of the failed transaction
>> + * without rollback of the next ones can actually rewrite what
>> + * they committed.
>> + * So when rollback is started, *all* pending transactions should
>> + * be rolled back. However if they would keep coming, the rollback
>> + * would be infinite. This means to complete a rollback it is
>> + * necessary to stop sending new transactions to WAL, then
>> + * rollback all already sent. In the end allow new transactions
>> + * again.
>> + *
>> + * First step is stop accepting all new transactions. For that WAL
>> + * thread sets a global flag. No rocket science here. All new
>> + * transactions, if see the flag set, are added to the rollback
>> + * queue immediately.
>> + *
>> + * Second step - tell TX thread to stop sending new transactions
>> + * to WAL. So as the rollback queue would stop growing.
>> + *
>> + * Third step - rollback all transactions in reverse order.
>> + *
>> + * Fourth step - allow transactions again. Unset the global flag
>> + * in WAL thread.
>> + */
>> +static inline void
>> +wal_begin_rollback(void)
>> +{
>> +	/* Signal WAL-thread stop accepting new transactions. */
>> +	wal_writer_singleton.is_in_rollback = true;
>> +}
>> +
>> +static void
>> +wal_complete_rollback(struct cmsg *base)
>> +{
>> +	(void) base;
>> +	/* WAL-thread can try writing transactions again. */
>> +	wal_writer_singleton.is_in_rollback = false;
>> +}
>> +
>> +static void
>> +tx_complete_rollback(void)
>> +{
>> +	struct wal_writer *writer = &wal_writer_singleton;
>> +	/*
>> +	 * Despite records are sent in batches, the last entry to
>> +	 * commit can't be in the middle of a batch. After all
>> +	 * transactions to rollback are collected, the last entry
>> +	 * will be exactly, well, the last entry.
>> +	 */
>> +	if (stailq_last_entry(&writer->rollback, struct journal_entry,
>> +			      fifo) != writer->last_entry)
>> +		return;
> 
> I didn't get it: is there can be a batch whose last entry us not
> the final one?

Nope. Last entry is exactly last entry. If there is a rollback in
progress, and there is a batch of transactions returned from WAL
thread, it means the last transaction which was sent to WAL is in
the end of the batch. If it is not in the end, then this batch is
not the last, and there will be more.

Since TX thread enters rollback state, it won't send other transactions
to WAL thread, and therefore 'last_entry' stop changing. Eventually
the batch, which contains the last entry, will return back to TX thread
from WAL thread. And the last entry will match the last transaction in
the batch. Because if something else would be sent to WAL afterwards,
the last_entry member would change again.

> You prematurely quit the rollback - is there a guarantee
> you'll appeare here again?

If the batch does not end with the last_entry, it is not the last
batch. So I can't start rollback now. Not all transactions to rollback
have returned from WAL thread.

There is a guarantee, that if last_entry didn't arrive back from WAL
in the current batch, there will be more batches.

>> +	stailq_reverse(&writer->rollback);
>> +	tx_schedule_queue(&writer->rollback);
>> +	/* TX-thread can try sending transactions to WAL again. */
>> +	stailq_create(&writer->rollback);
>> +	static struct cmsg_hop route[] = {
>> +		{wal_complete_rollback, NULL}
>> +	};
>> +	static struct cmsg msg;
>> +	cmsg_init(&msg, route);
>> +	cpipe_push(&writer->wal_pipe, &msg);
>> +}
I decided to help you with an example of how a typical rollback may
look. There are TX thread and WAL thread. Their states in the beginning
of the example:


            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {}
      last_entry: null

Assume there is transaction txn1. txn_commit(txn1) is called. TX thread
sends it to WAL.


            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {txn1}
      last_entry: txn1


Then txn2, txn3 are committed, they go to WAL in a second batch.

            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {txn2, txn3} -> {txn1}
      last_entry: txn3

Now WAL thread pops {txn1} (batch of a single transaction), and tries to
write it. But it fails. Then WAL enters rollback mode and sends {txn1}
back as rolled back.

            TX thread                               WAL thread

            mode: rollback                         mode: rollback
  rollback_queue: {txn1}                     cbus_queue: {txn2, txn3}
      last_entry: txn3

TX thread receives {txn1} as rolled back, so it enters 'rollback' state
too. Also TX thread sees that {txn1} does not end with last_entry, which
is txn3, so there is at least one more batch to wait from WAL before
rollback can be done. It waits.

Assume now arrives transaction txn4. TX thread is in rollback mode, so
an attempt to commit txn4 makes it fail immediately. It is totally legal.
Rollback order is respected. txn4 is just rolled back and removed. No
need to add it to any queue.

Now WAL thread processes next batch. WAL thread is still in rollback
state, so it returns {txn2, txn3} back to TX thread right away.

            TX thread                               WAL thread

            mode: rollback                         mode: rollback
  rollback_queue: {txn1} -> {txn2, txn3}     cbus_queue: {}
      last_entry: txn3

TX thread sees, that this batch ({txn2, txn3}) ends with last_entry,
and now it is sure there are no more batches in WAL thread. Indeed,
all transactions after last_entry were rolled back immediately, without
going to WAL. So it rolls back transactions in the queue in order txn3,
txn2, txn1. Without yields. And enters normal state.

            TX thread                               WAL thread

            mode: normal                           mode: rollback
  rollback_queue: {}                         cbus_queue: {}
      last_entry: txn3

Note, WAL thread is still in rollback state. But this is ok, because
right after rolling back the queue, TX thread sends a message to WAL
thread saying "all is ok now, you can try writing to disk again".

Newer transactions won't be able to arrive to WAL thread earlier,
because in cbus there is strict order of messages.

            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {}
      last_entry: txn3

Now rollback is finished. I hope this example helps.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-05-06 21:43   ` Vladislav Shpilevoy
@ 2020-05-07 10:28     ` Sergey Ostanevich
  2020-05-07 21:37       ` Vladislav Shpilevoy
  0 siblings, 1 reply; 10+ messages in thread
From: Sergey Ostanevich @ 2020-05-07 10:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hi!

Thank you for the very detailed explanation of rollback activities
between WAL and TX threads! Now it's pretty clear how you made it work.
Neato!

> >>
> >> Also there is now a single route for commit and rollback cbus
> >                 ^^^ move it 
> >> messages, called tx_complete_batch(). This change will come in
> >          ^^^ here 
> 
> Nope. Beforehand there was no a single route, and *now* there is a
> single route. Which happened to be called tx_complete_batch(). The
> accent is on *now there is a single route*. Not on *now it is called*.

What I mean is actually "Also there is a single route for commit and
rollback cbus messages now, called tx_complete_batch(). 

Sorry for misleading. 

LGTM.

Sergos

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-05-07 10:28     ` Sergey Ostanevich
@ 2020-05-07 21:37       ` Vladislav Shpilevoy
  0 siblings, 0 replies; 10+ messages in thread
From: Vladislav Shpilevoy @ 2020-05-07 21:37 UTC (permalink / raw)
  To: Sergey Ostanevich; +Cc: tarantool-patches

>>>>
>>>> Also there is now a single route for commit and rollback cbus
>>>                 ^^^ move it 
>>>> messages, called tx_complete_batch(). This change will come in
>>>          ^^^ here 
>>
>> Nope. Beforehand there was no a single route, and *now* there is a
>> single route. Which happened to be called tx_complete_batch(). The
>> accent is on *now there is a single route*. Not on *now it is called*.
> 
> What I mean is actually "Also there is a single route for commit and
> rollback cbus messages now, called tx_complete_batch(). 

Aha, ok. Changed to your version.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
  2020-04-30 22:50 [Tarantool-patches] [PATCH 1/1] wal: simplify rollback Vladislav Shpilevoy
                   ` (2 preceding siblings ...)
  2020-05-06 15:38 ` Sergey Ostanevich
@ 2020-05-08  8:12 ` Kirill Yukhin
  3 siblings, 0 replies; 10+ messages in thread
From: Kirill Yukhin @ 2020-05-08  8:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

Hello,

On 01 май 00:50, Vladislav Shpilevoy wrote:
> From: Georgy Kirichenko <georgy@tarantool.org>
> 
> Here is a summary on how and when rollback works in WAL.
> 
> Rollback happens, when disk write fails. In that case the failed
> and all next transactions, sent to WAL, should be rolled back.
> Together. Following transactions should be rolled back too,
> because they could make their statements based on what they saw in
> the failed transaction. Also rollback of the failed transaction
> without rollback of the next ones can actually rewrite what they
> committed.
> 
> So when rollback is started, *all* pending transactions should be
> rolled back. However if they would keep coming, the rollback would
> be infinite. This means to complete a rollback it is necessary to
> stop sending new transactions to WAL, then rollback all already
> sent. In the end allow new transactions again.
> 
> Step-by-step:
> 
> 1) stop accepting all new transactions in WAL thread, where
> rollback is started. All new transactions don't even try to go to
> disk. They added to rollback queue immediately after arriving to
> WAL thread.
> 
> 2) tell TX thread to stop sending new transactions to WAL. So as
> the rollback queue would stop growing.
> 
> 3) rollback all transactions in reverse order.
> 
> 4) allow transactions again in WAL thread and TX thread.
> 
> The algorithm is long, but simple and understandable. However
> implementation wasn't so easy. It was done using a 4-hop cbus
> route. 2 hops of which were supposed to clear cbus channel from
> all other cbus messages. Next two hops implemented steps 3 and 4.
> Rollback state of the WAL was signaled by checking internals of a
> preallocated cbus message.
> 
> The patch makes it simpler and more straightforward. Rollback
> state is now signaled by a simple flag, and there is no a hack
> about clearing cbus channel, no touching attributes of a cbus
> message. The moment when all transactions are stopped and the last
> one has returned from WAL is visible explicitly, because the last
> sent to WAL journal entry is saved.
> 
> Also there is now a single route for commit and rollback cbus
> messages, called tx_complete_batch(). This change will come in
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
> Issue: https://github.com/tarantool/tarantool/issues/4842
> 
> During working on 4842 I managed to extract this patch from
> Georgy's branch and make it not depending on anything else. This
> is supposed to make some things in WAL simpler before they will
> get more complex because of sync replication.

I've checked your patch into master.

--
Rehards, Kirill Yukhin

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-05-08  8:12 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-30 22:50 [Tarantool-patches] [PATCH 1/1] wal: simplify rollback Vladislav Shpilevoy
2020-05-03 10:46 ` Konstantin Osipov
2020-05-03 16:45   ` Vladislav Shpilevoy
2020-05-03 19:30     ` Konstantin Osipov
2020-05-06  7:55 ` Cyrill Gorcunov
2020-05-06 15:38 ` Sergey Ostanevich
2020-05-06 21:43   ` Vladislav Shpilevoy
2020-05-07 10:28     ` Sergey Ostanevich
2020-05-07 21:37       ` Vladislav Shpilevoy
2020-05-08  8:12 ` Kirill Yukhin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox