From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, tarantool-patches@dev.tarantool.org, gorcunov@gmail.com
Subject: Re: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM
Date: Fri, 28 May 2021 12:01:17 +0300
Message-ID: <3dc7734c-a572-7f69-a6b8-ad5aa7536873@tarantool.org>
In-Reply-To: <c668a9e1ebedf79023c92b12cd53d069e0f79b04.1622150822.git.v.shpilevoy@tarantool.org>

On 28.05.2021 00:28, Vladislav Shpilevoy wrote:
> It is possible that a new async transaction is added to the limbo
> when there is an in-progress CONFIRM WAL write for all the pending
> sync transactions.
>
> Then when CONFIRM WAL write is done, it might see that the limbo
> now in the first place contains an async transaction not yet
> written to WAL. A suspicious situation - on one hand the async
> transaction does not have any blocking sync txns before it and
> can be considered complete, on the other hand its WAL write is not
> done and it is not complete.
>
> Before this patch it resulted in a crash - limbo didn't consider
> the situation possible at all.
>
> Now when CONFIRM covers not yet written async transactions, they
> are removed from the limbo and are turned to plain transactions.
>
> When their WAL write is done, they see they no more have
> TXN_WAIT_SYNC flag and don't even need to interact with the limbo.
>
> It is important to remove them from the limbo right when the
> CONFIRM is done. Because otherwise their limbo entry may be not
> removed at all when it is done on a replica. On a replica the
> limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If
> there would be an async transaction in the first position in the
> limbo queue, it wouldn't be deleted until next sync transaction
> appears.
>
> This replica case is not possible now though. Because all synchro
> entries on the applier are written in a blocking way. Nonetheless
> if it ever becomes non-blocking, the code should handle it ok.
>
> Closes #6057

Hi! Thanks for working on this and for the fast fix!

Please find one comment below.

> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6057-confirm-async-no-wal
> Issue: https://github.com/tarantool/tarantool/issues/6057
>
>  .../gh-6057-qsync-confirm-async-no-wal.md     |   5 +
>  src/box/txn.c                                 |  14 +-
>  src/box/txn_limbo.c                           |  21 +++
>  .../gh-6057-qsync-confirm-async-no-wal.result | 163 ++++++++++++++++++
>  ...h-6057-qsync-confirm-async-no-wal.test.lua |  88 ++++++++++
>  test/replication/suite.cfg                    |   1 +
>  test/replication/suite.ini                    |   2 +-
>  7 files changed, 289 insertions(+), 5 deletions(-)
>  create mode 100644 changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
>  create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.result
>  create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua
>
> diff --git a/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
> new file mode 100644
> index 000000000..1005389d8
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md
> @@ -0,0 +1,5 @@
> +## bugfix/replication
> +
> +* Fixed a possible crash when a synchronous transaction was followed by an
> +  asynchronous transaction right when its confirmation was being written
> +  (gh-6057).
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 1d42c9113..3d4d5c397 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -880,8 +880,14 @@ txn_commit(struct txn *txn)
>  	if (req == NULL)
>  		goto rollback;
>
> -	bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC);
> -	if (is_sync) {
> +	/*
> +	 * Do not cache the flag value in a variable. The flag might be deleted
> +	 * during WAL write. This can happen for async transactions created
> +	 * during CONFIRM write, whose all blocking sync transactions get
> +	 * confirmed. Then they turn the async transaction into just a plain
> +	 * txn not waiting for anything.
> +	 */
> +	if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
>  		/*
>  		 * Remote rows, if any, come before local rows, so
>  		 * check for originating instance id here.
> @@ -900,13 +906,13 @@ txn_commit(struct txn *txn)
>
>  	fiber_set_txn(fiber(), NULL);
>  	if (journal_write(req) != 0 || req->res < 0) {
> -		if (is_sync)
> +		if (txn_has_flag(txn, TXN_WAIT_SYNC))
>  			txn_limbo_abort(&txn_limbo, limbo_entry);
>  		diag_set(ClientError, ER_WAL_IO);
>  		diag_log();
>  		goto rollback;
>  	}
> -	if (is_sync) {
> +	if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
>  		if (txn_has_flag(txn, TXN_WAIT_ACK)) {
>  			int64_t lsn = req->rows[req->n_rows - 1]->lsn;
>  			/*
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index f287369a2..05f0bf30a 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>  			 */
>  			if (e->lsn == -1)
>  				break;
> +		} else if (e->txn->signature < 0) {
> +			/*
> +			 * A transaction might be covered by the CONFIRM even if
> +			 * it is not written to WAL yet when it is an async
> +			 * transaction. It could be created just when the
> +			 * CONFIRM was being written to WAL.
> +			 */
> +			assert(e->txn->status == TXN_PREPARED);
> +			/*
> +			 * Let it complete normally as a plain transaction.
> +			 */
> +			txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);

AFAICS it's enough to clear WAIT_SYNC here.
Asynchronous transactions never have WAIT_ACK set, do they?

> +			txn_limbo_remove(limbo, e);
> +			/*
> +			 * The limbo entry now should not be used by the owner
> +			 * transaction since it just became a plain one. Nullify
> +			 * the txn to get a crash on any usage attempt instead
> +			 * of potential undefined behaviour.
> +			 */
> +			e->txn = NULL;
> +			continue;
>  		}
>  		e->is_commit = true;
>  		txn_limbo_remove(limbo, e);

-- 
Serge Petrenko
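
For readers following the thread, the txn.c part of the patch can be illustrated with a small standalone sketch. It is not Tarantool code: the `sketch_` names, the flag values and the fake WAL write are all assumptions made up for the illustration. It only shows why a flag value read before a yielding WAL write can be stale afterwards, and that clearing a flag which was never set is a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits; the real values in Tarantool may differ. */
enum {
	SKETCH_WAIT_SYNC = 1 << 0,
	SKETCH_WAIT_ACK = 1 << 1,
};

/* Hypothetical stand-in for struct txn, reduced to its flags. */
struct sketch_txn {
	uint32_t flags;
};

static inline bool
sketch_has_flag(const struct sketch_txn *txn, uint32_t flag)
{
	return (txn->flags & flag) != 0;
}

static inline void
sketch_clear_flags(struct sketch_txn *txn, uint32_t mask)
{
	/* Bits in the mask which are not set simply stay unset. */
	txn->flags &= ~mask;
}

/*
 * Stands in for a WAL write during which a CONFIRM is processed and
 * turns the waiting async transaction into a plain one.
 */
static inline void
sketch_wal_write_with_confirm(struct sketch_txn *txn)
{
	assert(txn != NULL);
	sketch_clear_flags(txn, SKETCH_WAIT_SYNC | SKETCH_WAIT_ACK);
}

/* Old pattern: the flag is read once before the write. */
static inline bool
sketch_touches_limbo_cached(struct sketch_txn *txn)
{
	bool is_sync = sketch_has_flag(txn, SKETCH_WAIT_SYNC);
	sketch_wal_write_with_confirm(txn);
	/* Stale: still true although the flag is already cleared. */
	return is_sync;
}

/* Patched pattern: the flag is re-read after the write. */
static inline bool
sketch_touches_limbo_reread(struct sketch_txn *txn)
{
	sketch_wal_write_with_confirm(txn);
	return sketch_has_flag(txn, SKETCH_WAIT_SYNC);
}
```

With a transaction that starts with only `SKETCH_WAIT_SYNC` set, the cached variant still reports that the limbo has to be touched after the write, while the re-reading variant does not.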