From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com Subject: [Tarantool-patches] [PATCH 1/3] txn_limbo: handle ROLLBACK during CONFIRM Date: Fri, 31 Jul 2020 00:37:43 +0200 [thread overview] Message-ID: <849ba7dc73a0f8372375a4cc6e0ab45548361a75.1596148501.git.v.shpilevoy@tarantool.org> (raw) In-Reply-To: <cover.1596148501.git.v.shpilevoy@tarantool.org> Limbo could try to ROLLBACK LSN whose CONFIRM is in progress. This is how it could happen: - A synchronous transaction is created, written to WAL; - The fiber sleeps in the limbo waiting for CONFIRM or timeout; - Replica receives the transaction, sends ACK; - Master receives ACK, starts writing CONFIRM; - The first fiber times out and tries to write ROLLBACK for the LSN, whose CONFIRM is in progress right now. The patch adds more checks to the 'timed out' code path to see if it isn't too late to write ROLLBACK. If CONFIRM is in progress, the fiber will wait for its finish. Part of #5185 --- src/box/txn_limbo.c | 31 +++++++--- test/replication/qsync_errinj.result | 85 ++++++++++++++++++++++++++ test/replication/qsync_errinj.test.lua | 40 ++++++++++++ 3 files changed, 147 insertions(+), 9 deletions(-) diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index 27a7302c0..1602c5d16 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -200,25 +200,33 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry) if (txn_limbo_entry_is_complete(entry)) goto complete; if (rc != 0) - goto do_rollback; + break; } -do_rollback: assert(!txn_limbo_is_empty(limbo)); if (txn_limbo_first_entry(limbo) != entry) { - assert(limbo->is_in_rollback); /* * If this is not a first entry in the limbo, it * is definitely not a first timed out entry. And * since it managed to time out too, it means * there is currently another fiber writing - * rollback. Wait when it will finish and wake us - * up. + * rollback, or waiting for confirmation WAL + * write. Wait when it will finish and wake us up. */ - do { - fiber_yield(); - } while (!txn_limbo_entry_is_complete(entry)); - goto complete; + goto wait; + } + + /* First in the queue is always a synchronous transaction. */ + assert(entry->lsn > 0); + if (entry->lsn <= limbo->confirmed_lsn) { + /* + * Yes, the wait timed out, but there is an on-going CONFIRM WAL + * write in another fiber covering this LSN. Can't rollback it + * already. All what can be done is waiting. The CONFIRM writer + * will wakeup all the confirmed txns when WAL write will be + * finished. + */ + goto wait; } txn_limbo_write_rollback(limbo, entry->lsn); @@ -238,6 +246,11 @@ do_rollback: diag_set(ClientError, ER_SYNC_QUORUM_TIMEOUT); return -1; +wait: + do { + fiber_yield(); + } while (!txn_limbo_entry_is_complete(entry)); + complete: assert(txn_limbo_entry_is_complete(entry)); /* diff --git a/test/replication/qsync_errinj.result b/test/replication/qsync_errinj.result index 1c6e357f7..4b8977810 100644 --- a/test/replication/qsync_errinj.result +++ b/test/replication/qsync_errinj.result @@ -283,6 +283,91 @@ box.space.sync:select{} | - - [2] | ... +-- +-- See if synchro timeout during CONFIRM WAL write won't try to rollback the +-- confirmed transaction. +-- +test_run:switch('default') + | --- + | - true + | ... +box.space.sync:truncate() + | --- + | ... +-- Write something to flush the master's state to the replica. +_ = box.space.sync:insert({1}) + | --- + | ... +_ = box.space.sync:delete({1}) + | --- + | ... + +test_run:switch('default') + | --- + | - true + | ... +-- Set a trap for CONFIRM write so as the current txn won't hang, but following +-- CONFIRM will. +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) + | --- + | - ok + | ... +ok, err = nil + | --- + | ... +f = fiber.create(function() \ + ok, err = pcall(box.space.sync.replace, box.space.sync, {1}) \ +end) + | --- + | ... + +-- Wait when got ACK from replica and started writing CONFIRM. +test_run:wait_cond(function() \ + return box.error.injection.get("ERRINJ_WAL_DELAY") \ +end) + | --- + | - true + | ... + +-- Let the pending synchro transaction go. It should timeout, notice the +-- on-going confirmation, and go to sleep instead of doing rollback. +box.cfg{replication_synchro_timeout = 0.001} + | --- + | ... +fiber.yield() + | --- + | ... + +-- Let the confirmation finish. +box.error.injection.set("ERRINJ_WAL_DELAY", false) + | --- + | - ok + | ... + +-- Now the transaction sees the CONFIRM is ok, and it is also finished. +test_run:wait_cond(function() return f:status() == 'dead' end) + | --- + | - true + | ... +ok, err + | --- + | - true + | - [1] + | ... +box.space.sync:select{} + | --- + | - - [1] + | ... + +test_run:switch('replica') + | --- + | - true + | ... +box.space.sync:select{} + | --- + | - - [1] + | ... + test_run:cmd('switch default') | --- | - true diff --git a/test/replication/qsync_errinj.test.lua b/test/replication/qsync_errinj.test.lua index da554c775..817b08a25 100644 --- a/test/replication/qsync_errinj.test.lua +++ b/test/replication/qsync_errinj.test.lua @@ -105,6 +105,46 @@ box.space.sync:replace{2} test_run:switch('replica') box.space.sync:select{} +-- +-- See if synchro timeout during CONFIRM WAL write won't try to rollback the +-- confirmed transaction. +-- +test_run:switch('default') +box.space.sync:truncate() +-- Write something to flush the master's state to the replica. +_ = box.space.sync:insert({1}) +_ = box.space.sync:delete({1}) + +test_run:switch('default') +-- Set a trap for CONFIRM write so as the current txn won't hang, but following +-- CONFIRM will. +box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) +ok, err = nil +f = fiber.create(function() \ + ok, err = pcall(box.space.sync.replace, box.space.sync, {1}) \ +end) + +-- Wait when got ACK from replica and started writing CONFIRM. +test_run:wait_cond(function() \ + return box.error.injection.get("ERRINJ_WAL_DELAY") \ +end) + +-- Let the pending synchro transaction go. It should timeout, notice the +-- on-going confirmation, and go to sleep instead of doing rollback. +box.cfg{replication_synchro_timeout = 0.001} +fiber.yield() + +-- Let the confirmation finish. +box.error.injection.set("ERRINJ_WAL_DELAY", false) + +-- Now the transaction sees the CONFIRM is ok, and it is also finished. +test_run:wait_cond(function() return f:status() == 'dead' end) +ok, err +box.space.sync:select{} + +test_run:switch('replica') +box.space.sync:select{} + test_run:cmd('switch default') box.cfg{ \ -- 2.21.1 (Apple Git-122.3)
next prev parent reply other threads:[~2020-07-30 22:37 UTC|newest] Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-07-30 22:37 [Tarantool-patches] [PATCH 0/3] Qsync tricky crashes Vladislav Shpilevoy 2020-07-30 22:37 ` Vladislav Shpilevoy [this message] 2020-07-30 22:37 ` [Tarantool-patches] [PATCH 2/3] txn_limbo: handle CONFIRM during ROLLBACK Vladislav Shpilevoy 2020-07-30 22:37 ` [Tarantool-patches] [PATCH 3/3] txn_limbo: handle duplicate ACKs Vladislav Shpilevoy 2020-07-31 13:59 ` [Tarantool-patches] [PATCH 0/3] Qsync tricky crashes Cyrill Gorcunov 2020-07-31 22:31 ` Vladislav Shpilevoy
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=849ba7dc73a0f8372375a4cc6e0ab45548361a75.1596148501.git.v.shpilevoy@tarantool.org \ --to=v.shpilevoy@tarantool.org \ --cc=gorcunov@gmail.com \ --cc=tarantool-patches@dev.tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH 1/3] txn_limbo: handle ROLLBACK during CONFIRM' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox