From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: from smtp55.i.mail.ru (smtp55.i.mail.ru [217.69.128.35])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 65318469710
 for ; Wed, 6 May 2020 18:38:44 +0300 (MSK)
Date: Wed, 6 May 2020 18:38:43 +0300
From: Sergey Ostanevich 
Message-ID: <20200506153843.GG112@tarantool.org>
References: 
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
Subject: Re: [Tarantool-patches] [PATCH 1/1] wal: simplify rollback
List-Id: Tarantool development patches 
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
To: Vladislav Shpilevoy 
Cc: tarantool-patches@dev.tarantool.org

Hi! Thank you for the patch! See below for some nits.

Thanks,
Sergos

On 01 May 00:50, Vladislav Shpilevoy wrote:
> From: Georgy Kirichenko 
> 
> Here is a summary on how and when rollback works in WAL.
> 
> Rollback happens, when disk write fails. In that case the failed
  ^^^ Disk write failure can cause rollback. Is it better?
> and all next transactions, sent to WAL, should be rolled back.
> Together. Following transactions should be rolled back too,
> because they could make their statements based on what they saw in
> the failed transaction. Also rollback of the failed transaction
> without rollback of the next ones can actually rewrite what they
> committed.
> 
> So when rollback is started, *all* pending transactions should be
> rolled back. However if they would keep coming, the rollback would
> be infinite.

Not quite - you start rolling back txn4...txn1 (in reverse order) and at
some moment txn5 appears. It would ruin the consistency of the data,
just as you mentioned before - it could read a transaction that is yet
to be rolled back, or write data that will be affected by the next
rollback.
> This means to complete a rollback it is necessary to
> stop sending new transactions to WAL, then rollback all already
> sent. In the end allow new transactions again.
> 
> Step-by-step:
> 
> 1) stop accepting all new transactions in WAL thread, where
> rollback is started. All new transactions don't even try to go to
> disk. They added to rollback queue immediately after arriving to
> WAL thread.
> 
> 2) tell TX thread to stop sending new transactions to WAL. So as
> the rollback queue would stop growing.
> 
> 3) rollback all transactions in reverse order.
> 
> 4) allow transactions again in WAL thread and TX thread.
> 
> The algorithm is long, but simple and understandable. However
> implementation wasn't so easy. It was done using a 4-hop cbus
> route. 2 hops of which were supposed to clear cbus channel from
> all other cbus messages. Next two hops implemented steps 3 and 4.
> Rollback state of the WAL was signaled by checking internals of a
> preallocated cbus message.
> 
> The patch makes it simpler and more straightforward. Rollback
> state is now signaled by a simple flag, and there is no a hack
> about clearing cbus channel, no touching attributes of a cbus
> message. The moment when all transactions are stopped and the last
> one has returned from WAL is visible explicitly, because the last
> sent to WAL journal entry is saved.
> 
> Also there is now a single route for commit and rollback cbus
       ^^^ move it
> messages, called tx_complete_batch(). This change will come in
       ^^^ here
> hand in scope of synchronous replication, when WAL write won't be
> enough for commit. And therefore 'commit' as a concept should be
> washed away from WAL's code gradually. Migrate to solely txn
> module.
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4842-simplify-wal-rollback
> Issue: https://github.com/tarantool/tarantool/issues/4842
> 
> During working on 4842 I managed to extract this patch from
> Georgy's branch and make it not depending on anything else. This
> is supposed to make some things in WAL simpler before they will
> get more complex because of sync replication.
> 
>  src/box/wal.c | 178 +++++++++++++++++++++++++++-----------------------
>  1 file changed, 95 insertions(+), 83 deletions(-)
> 
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 1eb20272c..b979244e3 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -97,6 +97,13 @@ struct wal_writer
>  	struct cpipe wal_pipe;
>  	/** A memory pool for messages. */
>  	struct mempool msg_pool;
> +	/**
> +	 * A last journal entry submitted to write. This is a
> +	 * 'rollback border'. When rollback starts, all
> +	 * transactions keep being rolled back until this one is
> +	 * rolled back too.
> +	 */
> +	struct journal_entry *last_entry;
>  	/* ----------------- wal ------------------- */
>  	/** A setting from instance configuration - wal_max_size */
>  	int64_t wal_max_size;
> @@ -153,7 +160,7 @@ struct wal_writer
>  	 * keep adding all incoming requests to the rollback
>  	 * queue, until the tx thread has recovered.
>  	 */
> -	struct cmsg in_rollback;
> +	bool is_in_rollback;
>  	/**
>  	 * WAL watchers, i.e. threads that should be alerted
>  	 * whenever there are new records appended to the journal.
> @@ -198,11 +205,11 @@ static void
>  wal_write_to_disk(struct cmsg *msg);
>  
>  static void
> -tx_schedule_commit(struct cmsg *msg);
> +tx_complete_batch(struct cmsg *msg);
>  
>  static struct cmsg_hop wal_request_route[] = {
>  	{wal_write_to_disk, &wal_writer_singleton.tx_prio_pipe},
> -	{tx_schedule_commit, NULL},
> +	{tx_complete_batch, NULL},
>  };
>  
>  static void
> @@ -265,14 +272,83 @@ tx_schedule_queue(struct stailq *queue)
>  		journal_async_complete(&writer->base, req);
>  }
>  
> +/**
> + * Rollback happens, when disk write fails. In that case all next
> + * transactions, sent to WAL, also should be rolled back. Because
> + * they could make their statements based on what they saw in the
> + * failed transaction. Also rollback of the failed transaction
> + * without rollback of the next ones can actually rewrite what
> + * they committed.
> + * So when rollback is started, *all* pending transactions should
> + * be rolled back. However if they would keep coming, the rollback
> + * would be infinite. This means to complete a rollback it is
> + * necessary to stop sending new transactions to WAL, then
> + * rollback all already sent. In the end allow new transactions
> + * again.
> + *
> + * First step is stop accepting all new transactions. For that WAL
> + * thread sets a global flag. No rocket science here. All new
> + * transactions, if see the flag set, are added to the rollback
> + * queue immediately.
> + *
> + * Second step - tell TX thread to stop sending new transactions
> + * to WAL. So as the rollback queue would stop growing.
> + *
> + * Third step - rollback all transactions in reverse order.
> + *
> + * Fourth step - allow transactions again. Unset the global flag
> + * in WAL thread.
> + */
> +static inline void
> +wal_begin_rollback(void)
> +{
> +	/* Signal WAL-thread stop accepting new transactions. */
> +	wal_writer_singleton.is_in_rollback = true;
> +}
> +
> +static void
> +wal_complete_rollback(struct cmsg *base)
> +{
> +	(void) base;
> +	/* WAL-thread can try writing transactions again. */
> +	wal_writer_singleton.is_in_rollback = false;
> +}
> +
> +static void
> +tx_complete_rollback(void)
> +{
> +	struct wal_writer *writer = &wal_writer_singleton;
> +	/*
> +	 * Despite records are sent in batches, the last entry to
> +	 * commit can't be in the middle of a batch. After all
> +	 * transactions to rollback are collected, the last entry
> +	 * will be exactly, well, the last entry.
> +	 */
> +	if (stailq_last_entry(&writer->rollback, struct journal_entry,
> +			      fifo) != writer->last_entry)
> +		return;

I didn't get it: can there be a batch whose last entry is not the final
one? You prematurely quit the rollback - is there a guarantee you'll
appear here again?
> +	stailq_reverse(&writer->rollback);
> +	tx_schedule_queue(&writer->rollback);
> +	/* TX-thread can try sending transactions to WAL again. */
> +	stailq_create(&writer->rollback);
> +	static struct cmsg_hop route[] = {
> +		{wal_complete_rollback, NULL}
> +	};
> +	static struct cmsg msg;
> +	cmsg_init(&msg, route);
> +	cpipe_push(&writer->wal_pipe, &msg);
> +}
> +
>  /**
>   * Complete execution of a batch of WAL write requests:
>   * schedule all committed requests, and, should there
>   * be any requests to be rolled back, append them to
> - * the rollback queue.
> + * the rollback queue. In case this is a rollback and the batch
> + * contains the last transaction to rollback, the rollback is
> + * performed and normal processing is allowed again.
>   */
>  static void
> -tx_schedule_commit(struct cmsg *msg)
> +tx_complete_batch(struct cmsg *msg)
>  {
>  	struct wal_writer *writer = &wal_writer_singleton;
>  	struct wal_msg *batch = (struct wal_msg *) msg;
> @@ -282,8 +358,8 @@ tx_schedule_commit(struct cmsg *msg)
>  	 * iteration of tx_schedule_queue loop.
>  	 */
>  	if (! stailq_empty(&batch->rollback)) {
> -		/* Closes the input valve. */
>  		stailq_concat(&writer->rollback, &batch->rollback);
> +		tx_complete_rollback();
>  	}
>  	/* Update the tx vclock to the latest written by wal. */
>  	vclock_copy(&replicaset.vclock, &batch->vclock);
> @@ -291,28 +367,6 @@ tx_schedule_commit(struct cmsg *msg)
>  	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
>  }
>  
> -static void
> -tx_schedule_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	/*
> -	 * Perform a cascading abort of all transactions which
> -	 * depend on the transaction which failed to get written
> -	 * to the write ahead log. Abort transactions
> -	 * in reverse order, performing a playback of the
> -	 * in-memory database state.
> -	 */
> -	stailq_reverse(&writer->rollback);
> -	/* Must not yield. */
> -	tx_schedule_queue(&writer->rollback);
> -	stailq_create(&writer->rollback);
> -	if (msg != &writer->in_rollback)
> -		mempool_free(&writer->msg_pool,
> -			     container_of(msg, struct wal_msg, base));
> -}
> -
> -
>  /**
>   * This message is sent from WAL to TX when the WAL thread hits
>   * ENOSPC and has to delete some backup WAL files to continue.
> @@ -374,7 +428,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
>  		writer->wal_dir.open_wflags |= O_SYNC;
>  
>  	stailq_create(&writer->rollback);
> -	cmsg_init(&writer->in_rollback, NULL);
> +	writer->is_in_rollback = false;
>  
>  	writer->checkpoint_wal_size = 0;
>  	writer->checkpoint_threshold = INT64_MAX;
> @@ -543,7 +597,7 @@ wal_sync_f(struct cbus_call_msg *data)
>  {
>  	struct wal_vclock_msg *msg = (struct wal_vclock_msg *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		diag_set(ClientError, ER_WAL_IO);
>  		return -1;
> @@ -586,7 +640,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
>  {
>  	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
>  	struct wal_writer *writer = &wal_writer_singleton;
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/*
>  		 * We're rolling back a failed write and so
>  		 * can't make a checkpoint - see the comment
> @@ -892,54 +946,6 @@ out:
>  	return rc;
>  }
>  
> -static void
> -wal_writer_clear_bus(struct cmsg *msg)
> -{
> -	(void) msg;
> -}
> -
> -static void
> -wal_writer_end_rollback(struct cmsg *msg)
> -{
> -	(void) msg;
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	cmsg_init(&writer->in_rollback, NULL);
> -}
> -
> -static void
> -wal_writer_begin_rollback(struct wal_writer *writer)
> -{
> -	static struct cmsg_hop rollback_route[4] = {
> -		/*
> -		 * Step 1: clear the bus, so that it contains
> -		 * no WAL write requests. This is achieved as a
> -		 * side effect of an empty message travelling
> -		 * through both bus pipes, while writer input
> -		 * valve is closed by non-empty writer->rollback
> -		 * list.
> -		 */
> -		{ wal_writer_clear_bus, &wal_writer_singleton.wal_pipe },
> -		{ wal_writer_clear_bus, &wal_writer_singleton.tx_prio_pipe },
> -		/*
> -		 * Step 2: writer->rollback queue contains all
> -		 * messages which need to be rolled back,
> -		 * perform the rollback.
> -		 */
> -		{ tx_schedule_rollback, &wal_writer_singleton.wal_pipe },
> -		/*
> -		 * Step 3: re-open the WAL for writing.
> -		 */
> -		{ wal_writer_end_rollback, NULL }
> -	};
> -
> -	/*
> -	 * Make sure the WAL writer rolls back
> -	 * all input until rollback mode is off.
> -	 */
> -	cmsg_init(&writer->in_rollback, rollback_route);
> -	cpipe_push(&writer->tx_prio_pipe, &writer->in_rollback);
> -}
> -
>  /*
>   * Assign lsn and replica identifier for local writes and track
>   * row into vclock_diff.
> @@ -1006,7 +1012,7 @@ wal_write_to_disk(struct cmsg *msg)
>  
>  	ERROR_INJECT_SLEEP(ERRINJ_WAL_DELAY);
>  
> -	if (writer->in_rollback.route != NULL) {
> +	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> @@ -1017,14 +1023,14 @@
>  	if (wal_opt_rotate(writer) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/* Ensure there's enough disk space before writing anything. */
>  	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
>  		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>  		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return wal_writer_begin_rollback(writer);
> +		return wal_begin_rollback();
>  	}
>  
>  	/*
> @@ -1130,7 +1136,7 @@ done:
>  		entry->res = -1;
>  		/* Rollback unprocessed requests */
>  		stailq_concat(&wal_msg->rollback, &rollback);
> -		wal_writer_begin_rollback(writer);
> +		wal_begin_rollback();
>  	}
>  	fiber_gc();
>  	wal_notify_watchers(writer, WAL_EVENT_WRITE);
> @@ -1234,6 +1240,12 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
>  		stailq_add_tail_entry(&batch->commit, entry, fifo);
>  		cpipe_push(&writer->wal_pipe, &batch->base);
>  	}
> +	/*
> +	 * Remember last entry sent to WAL. In case of rollback
> +	 * WAL will use this entry as an anchor to rollback all
> +	 * transactions until and including this one.
> +	 */
> +	writer->last_entry = entry;
>  	batch->approx_len += entry->approx_len;
>  	writer->wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
>  	cpipe_flush_input(&writer->wal_pipe);
> -- 
> 2.21.1 (Apple Git-122.3)
> 