From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 70CE26EC58; Fri, 28 May 2021 00:28:06 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 70CE26EC58 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1622150886; bh=3qUjGU2qgrMAFeppgox/EnPKknyvlQvdYwV7RUg1krs=; h=To:Date:Subject:List-Id:List-Unsubscribe:List-Archive:List-Post: List-Help:List-Subscribe:From:Reply-To:From; b=NykNOnH+z3JjReYx5e1flE3eYZniwuEiGkfEkuDI5j6KSpleJxUppWT55FryTn8xH DoPS7FeWW+YQH+uS9oLEDZ31wiLcUlbczoXiGkYVa2kcv46Z191+e2paUEDGPFJY7Y wggNLKKcNpPrgrQdnH7+GFU25nN6056F+rB4qJEo= Received: from smtp58.i.mail.ru (smtp58.i.mail.ru [217.69.128.38]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id C88656EC58 for ; Fri, 28 May 2021 00:28:04 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org C88656EC58 Received: by smtp58.i.mail.ru with esmtpa (envelope-from ) id 1lmNXv-0001w4-Ky; Fri, 28 May 2021 00:28:04 +0300 To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org, gorcunov@gmail.com Date: Thu, 27 May 2021 23:28:02 +0200 Message-Id: X-Mailer: git-send-email 2.24.3 (Apple Git-128) MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD9157EECD0FDB90B9ACAD49BB81F6138BA3FF1437BFDE6218300894C459B0CD1B90E76E7FEB12D20574A8C73A7BC7A648CFF59A89716354C41906DE0852CB233E4 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE77EB2E345998A721DEA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637C3BF94FB392044A18638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D886468D0FB8E70DCC00A868318635170F117882F4460429724CE54428C33FAD305F5C1EE8F4F765FC1BE95B8C87527B4BA471835C12D1D9774AD6D5ED66289B52BA9C0B312567BB23117882F4460429728776938767073520B1593CA6EC85F86DE5D25F19253116ADD2E47CDBA5A96583BA9C0B312567BB2376E601842F6C81A19E625A9149C048EE599709FD55CB46A603CEA74F0D118906D8FC6C240DEA7642DBF02ECDB25306B2B78CF848AE20165D0A6AB1C7CE11FEE365B78C30F681404D302FCEF25BFAB345C4224003CC836476EA7A3FFF5B025636E2021AF6380DFAD1A18204E546F3947CB11811A4A51E3B096D1867E19FE1407959CC434672EE6371089D37D7C0E48F6C8AA50765F7900637BBEA499411984DA1EFF80C71ABB335746BA297DBC24807EABDAD6C7F3747799A X-B7AD71C0: AC4F5C86D027EB782CDD5689AFBDA7A2368A440D3B0F6089093C9A16E5BC824A2A04A2ABAA09D25379311020FFC8D4AD66E3D843E3AAED8C527E5C1E58377C73 X-C1DE0DAB: C20DE7B7AB408E4181F030C43753B8183A4AFAF3EA6BDC44C234C8B12C006B7AFB3490FBB914F934C992C527FF05251DDE715D50FC07FF5AB1881A6453793CE9C32612AADDFBE0611E587B748EF1A2719510FB958DCE06DB6ED91DBE5ABE359A7EE5648E065588D469F8FEF10F1C2C2993EDB24507CE13387DFF0A840B692CF8 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D340B06C327CE4A70E88FA4055EE923404A7E2DC0ED690569086DB9FB4DE64937E73C654550A2DB61461D7E09C32AA3244CFA3C0343B6363451AFC05D6D8DC163E2D08D48398F32B4A6FACE5A9C96DEB163 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojywAFAsvjBJidBa5Ygql5lw== X-Mailru-Sender: 504CC1E875BF3E7D9BC0E5172ADA3110C02405120EF6EFE574AE71A01144B996BA4DC9DE077D22E107784C02288277CA03E0582D3806FB6A5317862B1921BA260ED6CFD6382C13A6112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: [Tarantool-patches] [PATCH 1/1] qsync: handle async txns right during CONFIRM X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Vladislav Shpilevoy via Tarantool-patches Reply-To: Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" It is possible that a new async transaction is added to the limbo when there is an in-progress CONFIRM WAL write for all the pending sync transactions. Then when CONFIRM WAL write is done, it might see that the limbo now in the first place contains an async transaction not yet written to WAL. A suspicious situation - on one hand the async transaction does not have any blocking sync txns before it and can be considered complete, on the other hand its WAL write is not done and it is not complete. Before this patch it resulted into a crash - limbo didn't consider the situation possible at all. Now when CONFIRM covers a not yet written async transactions, they are removed from the limbo and are turned to plain transactions. When their WAL write is done, they see they no more have TXN_WAIT_SYNC flag and don't even need to interact with the limbo. It is important to remove them from the limbo right when the CONFIRM is done. Because otherwise their limbo entry may be not removed at all when it is done on a replica. On a replica the limbo entries are removed only by CONFIRM/ROLLBACK/PROMOTE. If there would be an async transaction in the first position in the limbo queue, it wouldn't be deleted until next sync transaction appears. This replica case is not possible now though. Because all synchro entries on the applier are written in a blocking way. Nonetheless if it ever becomes non-blocking, the code should handle it ok. Closes #6057 --- Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6057-confirm-async-no-wal Issue: https://github.com/tarantool/tarantool/issues/6057 .../gh-6057-qsync-confirm-async-no-wal.md | 5 + src/box/txn.c | 14 +- src/box/txn_limbo.c | 21 +++ .../gh-6057-qsync-confirm-async-no-wal.result | 163 ++++++++++++++++++ ...h-6057-qsync-confirm-async-no-wal.test.lua | 88 ++++++++++ test/replication/suite.cfg | 1 + test/replication/suite.ini | 2 +- 7 files changed, 289 insertions(+), 5 deletions(-) create mode 100644 changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.result create mode 100644 test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua diff --git a/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md new file mode 100644 index 000000000..1005389d8 --- /dev/null +++ b/changelogs/unreleased/gh-6057-qsync-confirm-async-no-wal.md @@ -0,0 +1,5 @@ +## bugfix/replication + +* Fixed a possible crash when a synchronous transaction was followed by an + asynchronous transaction right when its confirmation was being written + (gh-6057). diff --git a/src/box/txn.c b/src/box/txn.c index 1d42c9113..3d4d5c397 100644 --- a/src/box/txn.c +++ b/src/box/txn.c @@ -880,8 +880,14 @@ txn_commit(struct txn *txn) if (req == NULL) goto rollback; - bool is_sync = txn_has_flag(txn, TXN_WAIT_SYNC); - if (is_sync) { + /* + * Do not cash the flag value in a variable. The flag might be deleted + * during WAL write. This can happen for async transactions created + * during CONFIRM write, whose all blocking sync transactions get + * confirmed. They they turn the async transaction into just a plain + * txn not waiting for anything. + */ + if (txn_has_flag(txn, TXN_WAIT_SYNC)) { /* * Remote rows, if any, come before local rows, so * check for originating instance id here. @@ -900,13 +906,13 @@ txn_commit(struct txn *txn) fiber_set_txn(fiber(), NULL); if (journal_write(req) != 0 || req->res < 0) { - if (is_sync) + if (txn_has_flag(txn, TXN_WAIT_SYNC)) txn_limbo_abort(&txn_limbo, limbo_entry); diag_set(ClientError, ER_WAL_IO); diag_log(); goto rollback; } - if (is_sync) { + if (txn_has_flag(txn, TXN_WAIT_SYNC)) { if (txn_has_flag(txn, TXN_WAIT_ACK)) { int64_t lsn = req->rows[req->n_rows - 1]->lsn; /* diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c index f287369a2..05f0bf30a 100644 --- a/src/box/txn_limbo.c +++ b/src/box/txn_limbo.c @@ -389,6 +389,27 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn) */ if (e->lsn == -1) break; + } else if (e->txn->signature < 0) { + /* + * A transaction might be covered by the CONFIRM even if + * it is not written to WAL yet when it is an async + * transaction. It could be created just when the + * CONFIRM was being written to WAL. + */ + assert(e->txn->status == TXN_PREPARED); + /* + * Let it complete normally as a plain transaction. + */ + txn_clear_flags(e->txn, TXN_WAIT_SYNC | TXN_WAIT_ACK); + txn_limbo_remove(limbo, e); + /* + * The limbo entry now should not be used by the owner + * transaction since it just became a plain one. Nullify + * the txn to get a crash on any usage attempt instead + * of potential undefined behaviour. + */ + e->txn = NULL; + continue; } e->is_commit = true; txn_limbo_remove(limbo, e); diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.result b/test/replication/gh-6057-qsync-confirm-async-no-wal.result new file mode 100644 index 000000000..23c77729b --- /dev/null +++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.result @@ -0,0 +1,163 @@ +-- test-run result file version 2 +test_run = require('test_run').new() + | --- + | ... +fiber = require('fiber') + | --- + | ... + +-- +-- gh-6057: while CONFIRM is being written, there might appear async +-- transactions not having an LSN/signature yet. When CONFIRM is done, it covers +-- these transactions too since they don't need to wait for an ACK, but it +-- should not try to complete them. Because their WAL write is not done and +-- it might even fail later. It should simply turn them into plain transactions +-- not depending on any synchronous ones. +-- + +old_synchro_quorum = box.cfg.replication_synchro_quorum + | --- + | ... +old_synchro_timeout = box.cfg.replication_synchro_timeout + | --- + | ... +box.cfg{ \ + replication_synchro_quorum = 1, \ + replication_synchro_timeout = 1000000, \ +} + | --- + | ... +s = box.schema.create_space('test', {is_sync = true}) + | --- + | ... +_ = s:create_index('pk') + | --- + | ... +s2 = box.schema.create_space('test2') + | --- + | ... +_ = s2:create_index('pk') + | --- + | ... + +errinj = box.error.injection + | --- + | ... + +function create_hanging_async_after_confirm(sync_key, async_key1, async_key2) \ +-- Let the transaction itself to WAL, but CONFIRM will be blocked. \ + errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) \ + local lsn = box.info.lsn \ + local f = fiber.create(function() s:replace{sync_key} end) \ + test_run:wait_cond(function() return box.info.lsn == lsn + 1 end) \ +-- Wait for the CONFIRM block. WAL thread is in busy loop now. \ + test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \ + \ +-- Create 2 async transactions to ensure multiple of them are handled fine. \ +-- But return only fiber of the second one. It is enough because if it is \ +-- finished, the first one is too. \ + fiber.new(function() s2:replace{async_key1} end) \ + local f2 = fiber.new(function() s2:replace{async_key2} end) \ + fiber.yield() \ +-- When WAL thread would finish CONFIRM, it should be blocked on the async \ +-- transaction so as it wouldn't be completed when CONFIRM is processed. \ + errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) \ +-- Let CONFIRM go. \ + errinj.set('ERRINJ_WAL_DELAY', false) \ +-- And WAL thread should be blocked on the async txn. \ + test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \ +-- Wait CONFIRM finish. \ + test_run:wait_cond(function() return f:status() == 'dead' end) \ + return f2 \ +end + | --- + | ... + +async_f = create_hanging_async_after_confirm(1, 2, 3) + | --- + | ... +-- Let the async transaction finish its WAL write. +errinj.set('ERRINJ_WAL_DELAY', false) + | --- + | - ok + | ... +-- It should see that even though it is in the limbo, it does not have any +-- synchronous transactions to wait for and can complete right away. +test_run:wait_cond(function() return async_f:status() == 'dead' end) + | --- + | - true + | ... + +assert(s:get({1}) ~= nil) + | --- + | - true + | ... +assert(s2:get({2}) ~= nil) + | --- + | - true + | ... +assert(s2:get({3}) ~= nil) + | --- + | - true + | ... + +-- +-- Try all the same, but the async transaction fails its WAL write. +-- +async_f = create_hanging_async_after_confirm(4, 5, 6) + | --- + | ... +-- The async transaction will fail to go to WAL when WAL thread is unblocked. +errinj.set('ERRINJ_WAL_ROTATE', true) + | --- + | - ok + | ... +errinj.set('ERRINJ_WAL_DELAY', false) + | --- + | - ok + | ... +test_run:wait_cond(function() return async_f:status() == 'dead' end) + | --- + | - true + | ... +errinj.set('ERRINJ_WAL_ROTATE', false) + | --- + | - ok + | ... + +assert(s:get({4}) ~= nil) + | --- + | - true + | ... +assert(s2:get({5}) == nil) + | --- + | - true + | ... +assert(s2:get({6}) == nil) + | --- + | - true + | ... + +-- Ensure nothing is broken, works fine. +s:replace{7} + | --- + | - [7] + | ... +s2:replace{8} + | --- + | - [8] + | ... + +s:drop() + | --- + | ... +s2:drop() + | --- + | ... + +box.cfg{ \ + replication_synchro_quorum = old_synchro_quorum, \ + replication_synchro_timeout = old_synchro_timeout, \ +} + | --- + | ... diff --git a/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua new file mode 100644 index 000000000..a11ddc042 --- /dev/null +++ b/test/replication/gh-6057-qsync-confirm-async-no-wal.test.lua @@ -0,0 +1,88 @@ +test_run = require('test_run').new() +fiber = require('fiber') + +-- +-- gh-6057: while CONFIRM is being written, there might appear async +-- transactions not having an LSN/signature yet. When CONFIRM is done, it covers +-- these transactions too since they don't need to wait for an ACK, but it +-- should not try to complete them. Because their WAL write is not done and +-- it might even fail later. It should simply turn them into plain transactions +-- not depending on any synchronous ones. +-- + +old_synchro_quorum = box.cfg.replication_synchro_quorum +old_synchro_timeout = box.cfg.replication_synchro_timeout +box.cfg{ \ + replication_synchro_quorum = 1, \ + replication_synchro_timeout = 1000000, \ +} +s = box.schema.create_space('test', {is_sync = true}) +_ = s:create_index('pk') +s2 = box.schema.create_space('test2') +_ = s2:create_index('pk') + +errinj = box.error.injection + +function create_hanging_async_after_confirm(sync_key, async_key1, async_key2) \ +-- Let the transaction itself to WAL, but CONFIRM will be blocked. \ + errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1) \ + local lsn = box.info.lsn \ + local f = fiber.create(function() s:replace{sync_key} end) \ + test_run:wait_cond(function() return box.info.lsn == lsn + 1 end) \ +-- Wait for the CONFIRM block. WAL thread is in busy loop now. \ + test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \ + \ +-- Create 2 async transactions to ensure multiple of them are handled fine. \ +-- But return only fiber of the second one. It is enough because if it is \ +-- finished, the first one is too. \ + fiber.new(function() s2:replace{async_key1} end) \ + local f2 = fiber.new(function() s2:replace{async_key2} end) \ + fiber.yield() \ +-- When WAL thread would finish CONFIRM, it should be blocked on the async \ +-- transaction so as it wouldn't be completed when CONFIRM is processed. \ + errinj.set('ERRINJ_WAL_DELAY_COUNTDOWN', 0) \ +-- Let CONFIRM go. \ + errinj.set('ERRINJ_WAL_DELAY', false) \ +-- And WAL thread should be blocked on the async txn. \ + test_run:wait_cond(function() return errinj.get('ERRINJ_WAL_DELAY') end) \ +-- Wait CONFIRM finish. \ + test_run:wait_cond(function() return f:status() == 'dead' end) \ + return f2 \ +end + +async_f = create_hanging_async_after_confirm(1, 2, 3) +-- Let the async transaction finish its WAL write. +errinj.set('ERRINJ_WAL_DELAY', false) +-- It should see that even though it is in the limbo, it does not have any +-- synchronous transactions to wait for and can complete right away. +test_run:wait_cond(function() return async_f:status() == 'dead' end) + +assert(s:get({1}) ~= nil) +assert(s2:get({2}) ~= nil) +assert(s2:get({3}) ~= nil) + +-- +-- Try all the same, but the async transaction fails its WAL write. +-- +async_f = create_hanging_async_after_confirm(4, 5, 6) +-- The async transaction will fail to go to WAL when WAL thread is unblocked. +errinj.set('ERRINJ_WAL_ROTATE', true) +errinj.set('ERRINJ_WAL_DELAY', false) +test_run:wait_cond(function() return async_f:status() == 'dead' end) +errinj.set('ERRINJ_WAL_ROTATE', false) + +assert(s:get({4}) ~= nil) +assert(s2:get({5}) == nil) +assert(s2:get({6}) == nil) + +-- Ensure nothing is broken, works fine. +s:replace{7} +s2:replace{8} + +s:drop() +s2:drop() + +box.cfg{ \ + replication_synchro_quorum = old_synchro_quorum, \ + replication_synchro_timeout = old_synchro_timeout, \ +} diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg index dc39e2f74..20337b310 100644 --- a/test/replication/suite.cfg +++ b/test/replication/suite.cfg @@ -45,6 +45,7 @@ "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {}, "gh-5536-wal-limit.test.lua": {}, "gh-5566-final-join-synchro.test.lua": {}, + "gh-6057-qsync-confirm-async-no-wal.test.lua": {}, "*": { "memtx": {"engine": "memtx"}, "vinyl": {"engine": "vinyl"} diff --git a/test/replication/suite.ini b/test/replication/suite.ini index 1d9c0a4ae..7d811d333 100644 --- a/test/replication/suite.ini +++ b/test/replication/suite.ini @@ -3,7 +3,7 @@ core = tarantool script = master.lua description = tarantool/box, replication disabled = consistent.test.lua -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua config = suite.cfg lua_libs = lua/fast_replica.lua lua/rlimit.lua use_unix_sockets = True -- 2.24.3 (Apple Git-128)