[Tarantool-patches] [PATCH 1/1] wal: simplify rollback

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu May 7 00:43:54 MSK 2020


Hi! Thanks for the review!

> On 01 мая 00:50, Vladislav Shpilevoy wrote:
>> From: Georgy Kirichenko <georgy at tarantool.org>
>>
>> Here is a summary on how and when rollback works in WAL.
>>
>> Rollback happens, when disk write fails. In that case the failed
> ^^^
> Disk write failure can cause rollback. Is it better?

To me both are equal, so whatever. Applied your version.

>> and all next transactions, sent to WAL, should be rolled back.
>> Together. Following transactions should be rolled back too,
>> because they could make their statements based on what they saw in
>> the failed transaction. Also rollback of the failed transaction
>> without rollback of the next ones can actually rewrite what they
>> committed.
>>
>> So when rollback is started, *all* pending transactions should be
>> rolled back. However if they would keep coming, the rollback would
>> be infinite. 
> 
> Not quite - you start rolling of txn4...txn1 (in reverse order) and at
> some moment the txn5 appears. It will just ruin the consistency of the
> data, just as you mentioned before - read of a yet-to-be rolled back,
> writing of a will-be-affected by next roll back.

Well, it will not. The appearance of txn5 means it comes after txn4. And it will be
rolled back before txn4, in case it tries to commit before the whole
rollback procedure ends. So the reverse order is fine.

In case you worry that txn5 can appear right in the middle of the rollback (between
rolling back txn4 and txn3, for example) - that is not possible. Rollback of the
whole batch does not yield. So while the TX thread waits for all rolled back
transactions from the WAL thread, it is legal to roll back all newer transactions
immediately. When all rolled back transactions finally return to the TX thread,
they are uncommitted without yields.

Therefore my statement still seems to be correct. We roll back all transactions
in reverse order, and once a rollback is started, new transactions are rolled
back immediately, without even trying to go to WAL, until all already sent
transactions are rolled back.
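
To illustrate the idea (this is only a sketch, not the literal patch hunk - the
failure path details are omitted): the TX-side submission path can simply refuse
new entries while the rollback queue is not yet drained.

/* Sketch: at the top of the TX-thread submission path. */
if (!stailq_empty(&writer->rollback)) {
	/*
	 * A cascading rollback is in progress. Fail the new entry
	 * right away, without sending it to the WAL thread, so the
	 * rollback queue cannot grow forever.
	 */
	return -1;
}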

>> This means to complete a rollback it is necessary to
>> stop sending new transactions to WAL, then rollback all already
>> sent. In the end allow new transactions again.
>>
>> Step-by-step:
>>
>> 1) stop accepting all new transactions in WAL thread, where
>> rollback is started. All new transactions don't even try to go to
>> disk. They added to rollback queue immediately after arriving to
>> WAL thread.
>>
>> 2) tell TX thread to stop sending new transactions to WAL. So as
>> the rollback queue would stop growing.
>>
>> 3) rollback all transactions in reverse order.
>>
>> 4) allow transactions again in WAL thread and TX thread.
>>
>> The algorithm is long, but simple and understandable. However
>> implementation wasn't so easy. It was done using a 4-hop cbus
>> route. 2 hops of which were supposed to clear cbus channel from
>> all other cbus messages. Next two hops implemented steps 3 and 4.
>> Rollback state of the WAL was signaled by checking internals of a
>> preallocated cbus message.
>>
>> The patch makes it simpler and more straightforward. Rollback
>> state is now signaled by a simple flag, and there is no a hack
>> about clearing cbus channel, no touching attributes of a cbus
>> message. The moment when all transactions are stopped and the last
>> one has returned from WAL is visible explicitly, because the last
>> sent to WAL journal entry is saved.
>>
>> Also there is now a single route for commit and rollback cbus
>                 ^^^ move it 
>> messages, called tx_complete_batch(). This change will come in
>          ^^^ here 

Nope. Before the patch there was no single route, and *now* there is a single
route, which happens to be called tx_complete_batch(). The emphasis is on *now
there is a single route*, not on *now it is called*.
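
To make "single route" concrete, the TX-side completion of a WAL batch now handles
both outcomes in one callback. Roughly like this (a simplified sketch of the idea,
not a verbatim copy of the patch; the vclock bookkeeping is left out):

static void
tx_complete_batch(struct cmsg *msg)
{
	struct wal_writer *writer = &wal_writer_singleton;
	struct wal_msg *batch = (struct wal_msg *) msg;
	/* Collect the entries the WAL thread failed to write. */
	stailq_concat(&writer->rollback, &batch->rollback);
	/* Finish the rollback if the last sent entry has returned. */
	tx_complete_rollback();
	/* Complete the entries which were written successfully. */
	tx_schedule_queue(&batch->commit);
	mempool_free(&writer->msg_pool,
		     container_of(msg, struct wal_msg, base));
}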

>> hand in scope of synchronous replication, when WAL write won't be
>> enough for commit. And therefore 'commit' as a concept should be
>> washed away from WAL's code gradually. Migrate to solely txn
>> module.
>> ---
>> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
>> Issue: https://github.com/tarantool/tarantool/issues/4842
>>
>> During working on 4842 I managed to extract this patch from
>> Georgy's branch and make it not depending on anything else. This
>> is supposed to make some things in WAL simpler before they will
>> get more complex because of sync replication.
>>
>>  src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
>>  1 file changed, 95 insertions(+), 83 deletions(-)
>>
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 1eb20272c..b979244e3 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -97,6 +97,13 @@ struct wal_writer
>>  	struct cpipe wal_pipe;
>>  	/** A memory pool for messages. */
>>  	struct mempool msg_pool;
>> +	/**
>> +	 * A last journal entry submitted to write. This is a
>> +	 * 'rollback border'. When rollback starts, all
>> +	 * transactions keep being rolled back until this one is
>> +	 * rolled back too.
>> +	 */
>> +	struct journal_entry *last_entry;
>>  	/* ----------------- wal ------------------- */
>>  	/** A setting from instance configuration - wal_max_size */
>>  	int64_t wal_max_size;
>> @@ -153,7 +160,7 @@ struct wal_writer
>>  	 * keep adding all incoming requests to the rollback
>>  	 * queue, until the tx thread has recovered.
>>  	 */
>> -	struct cmsg in_rollback;
>> +	bool is_in_rollback;
>>  	/**
>>  	 * WAL watchers, i.e. threads that should be alerted
>>  	 * whenever there are new records appended to the journal.
>> @@ -198,11 +205,11 @@ static void
>>  wal_write_to_disk(struct cmsg *msg);
>>  
>>  static void
>> -tx_schedule_commit(struct cmsg *msg);
>> +tx_complete_batch(struct cmsg *msg);
>>  
>>  static struct cmsg_hop wal_request_route[] = {
>>  	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
>> -	{tx_schedule_commit, NULL},
>> +	{tx_complete_batch, NULL},
>>  };
>>  
>>  static void
>> @@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
>>  		journal_async_complete(&writer->base, req);
>>  }
>>  
>> +/**
>> + * Rollback happens, when disk write fails. In that case all next
>> + * transactions, sent to WAL, also should be rolled back. Because
>> + * they could make their statements based on what they saw in the
>> + * failed transaction. Also rollback of the failed transaction
>> + * without rollback of the next ones can actually rewrite what
>> + * they committed.
>> + * So when rollback is started, *all* pending transactions should
>> + * be rolled back. However if they would keep coming, the rollback
>> + * would be infinite. This means to complete a rollback it is
>> + * necessary to stop sending new transactions to WAL, then
>> + * rollback all already sent. In the end allow new transactions
>> + * again.
>> + *
>> + * First step is stop accepting all new transactions. For that WAL
>> + * thread sets a global flag. No rocket science here. All new
>> + * transactions, if see the flag set, are added to the rollback
>> + * queue immediately.
>> + *
>> + * Second step - tell TX thread to stop sending new transactions
>> + * to WAL. So as the rollback queue would stop growing.
>> + *
>> + * Third step - rollback all transactions in reverse order.
>> + *
>> + * Fourth step - allow transactions again. Unset the global flag
>> + * in WAL thread.
>> + */
>> +static inline void
>> +wal_begin_rollback(void)
>> +{
>> +	/* Signal WAL-thread stop accepting new transactions. */
>> +	wal_writer_singleton.is_in_rollback = true;
>> +}
>> +
>> +static void
>> +wal_complete_rollback(struct cmsg *base)
>> +{
>> +	(void) base;
>> +	/* WAL-thread can try writing transactions again. */
>> +	wal_writer_singleton.is_in_rollback = false;
>> +}
>> +
>> +static void
>> +tx_complete_rollback(void)
>> +{
>> +	struct wal_writer *writer = &wal_writer_singleton;
>> +	/*
>> +	 * Despite records are sent in batches, the last entry to
>> +	 * commit can't be in the middle of a batch. After all
>> +	 * transactions to rollback are collected, the last entry
>> +	 * will be exactly, well, the last entry.
>> +	 */
>> +	if (stailq_last_entry(&writer->rollback, struct journal_entry,
>> +			      fifo) != writer->last_entry)
>> +		return;
> 
> I didn't get it: is there can be a batch whose last entry us not
> the final one?

Nope. The last entry is exactly the last entry. If there is a rollback in
progress, and a batch of transactions has returned from the WAL thread, then
the last transaction which was sent to WAL is at the end of some batch. If it
is not at the end of this one, then this batch is not the last one, and there
will be more.

Since the TX thread enters the rollback state, it won't send other transactions
to the WAL thread, and therefore 'last_entry' stops changing. Eventually the
batch which contains the last entry will return to the TX thread from the WAL
thread, and the last entry will match the last transaction in that batch.
Because if something else were sent to WAL afterwards, the last_entry member
would change again.

> You prematurely quit the rollback - is there a guarantee
> you'll appear here again?

If the batch does not end with last_entry, it is not the last batch. So I
can't start the rollback now: not all transactions to roll back have returned
from the WAL thread.

There is a guarantee that if last_entry didn't arrive back from WAL in the
current batch, there will be more batches.
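
For completeness, here is how the 'rollback border' gets set up (a sketch of the
idea, not the exact patch hunk): every time the TX thread hands an entry over to
the WAL thread, it remembers it, so during a rollback last_entry always names the
newest entry that actually reached WAL.

/* Sketch: in the TX-thread submission path, right after the
 * entry is appended to the outgoing batch. */
stailq_add_tail_entry(&batch->commit, entry, fifo);
writer->last_entry = entry;
cpipe_flush_input(&writer->wal_pipe);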

>> +	stailq_reverse(&writer->rollback);
>> +	tx_schedule_queue(&writer->rollback);
>> +	/* TX-thread can try sending transactions to WAL again. */
>> +	stailq_create(&writer->rollback);
>> +	static struct cmsg_hop route[] = {
>> +		{wal_complete_rollback, NULL}
>> +	};
>> +	static struct cmsg msg;
>> +	cmsg_init(&msg, route);
>> +	cpipe_push(&writer->wal_pipe, &msg);
>> +}
I decided to help you with an example of how a typical rollback may look.
There are the TX thread and the WAL thread. Their states at the beginning of
the example:


            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {}
      last_entry: null

Assume there is transaction txn1. txn_commit(txn1) is called. TX thread
sends it to WAL.


            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {txn1}
      last_entry: txn1


Then txn2 and txn3 are committed; they go to WAL in a second batch.

            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {txn2, txn3} -> {txn1}
      last_entry: txn3

Now the WAL thread pops {txn1} (a batch of a single transaction) and tries to
write it. But the write fails. Then WAL enters rollback mode and sends {txn1}
back as rolled back.

            TX thread                               WAL thread

            mode: rollback                         mode: rollback
  rollback_queue: {txn1}                     cbus_queue: {txn2, txn3}
      last_entry: txn3

The TX thread receives {txn1} as rolled back, so it enters the 'rollback' state
too. Also the TX thread sees that {txn1} does not end with last_entry, which
is txn3, so there is at least one more batch to wait for from WAL before the
rollback can be done. It waits.

Assume transaction txn4 now arrives. The TX thread is in rollback mode, so an
attempt to commit txn4 fails immediately. It is totally legal: the rollback
order is respected. txn4 is just rolled back and removed. There is no need to
add it to any queue.

Now the WAL thread processes the next batch. It is still in the rollback
state, so it returns {txn2, txn3} back to the TX thread right away.

            TX thread                               WAL thread

            mode: rollback                         mode: rollback
  rollback_queue: {txn1} -> {txn2, txn3}     cbus_queue: {}
      last_entry: txn3

The TX thread sees that this batch ({txn2, txn3}) ends with last_entry, and now
it is sure there are no more batches in the WAL thread. Indeed, all transactions
after last_entry were rolled back immediately, without going to WAL. So it rolls
back the transactions in the queue in the order txn3, txn2, txn1, without yields,
and enters the normal state.

            TX thread                               WAL thread

            mode: normal                           mode: rollback
  rollback_queue: {}                         cbus_queue: {}
      last_entry: txn3

Note that the WAL thread is still in the rollback state. But this is ok, because
right after rolling back the queue, the TX thread sends a message to the WAL
thread saying "all is ok now, you can try writing to disk again".

Newer transactions won't be able to arrive at the WAL thread earlier, because
cbus keeps a strict order of messages.

            TX thread                               WAL thread

            mode: normal                           mode: normal
  rollback_queue: {}                         cbus_queue: {}
      last_entry: txn3

Now rollback is finished. I hope this example helps.
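
If it helps to map the example onto the code: the WAL-thread half of it boils
down to an early return at the top of wal_write_to_disk(). This is a rough
sketch under the assumption that struct wal_msg keeps separate 'commit' and
'rollback' lists, as in wal.c; the actual write loop is omitted.

static void
wal_write_to_disk(struct cmsg *msg)
{
	struct wal_writer *writer = &wal_writer_singleton;
	struct wal_msg *wal_msg = (struct wal_msg *) msg;
	if (writer->is_in_rollback) {
		/* Rollback mode: bounce the whole batch back to TX
		 * unwritten, it goes straight to the rollback queue. */
		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
		return;
	}
	/*
	 * ... try to write the batch to disk. On a write error move
	 * the failed tail of wal_msg->commit to wal_msg->rollback and
	 * call wal_begin_rollback(), so the following batches are
	 * bounced too ...
	 */
}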

