From: Nikita Pettik <korablev@tarantool.org> To: tarantool-patches@dev.tarantool.org Cc: v.shpilevoy@tarantool.org Subject: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL Date: Mon, 1 Jun 2020 19:46:52 +0300 [thread overview] Message-ID: <50a25fbae907f1b0d5406fb2e78d40d3e42a8a8d.1591029888.git.korablev@tarantool.org> (raw) Data read in vinyl is known to yield in case of disc access. So it opens a window for modifications of in-memory level. Imagine following scenario: right before data selection tuple is inserted into space. It passes first stage of commit procedure, i.e. it is prepared to be committed but still is not yet reached WAL. Meanwhile iterator is starting to read the same key. At this moment prepared statement is already inserted to in-memory tree ergo visible to read iterator. So, read iterator fetches this statement and proceeds to disk scan. In turn, disk scan yields and in this moment WAL fails to write statement on disk. Next, two cases are possible: 1. WAL thread has enough time to complete rollback procedure. 2. WAL thread fails to finish rollback in this time gap. In the first case read iterator should skip statement: version of in-memory tree has diverged from iterator's one, so we fall back into iterator restoration procedure. Mem iterator becomes invalid so the only choice is to restart whole 'advance' routine. In the second case nothing is changed to read iterator, so it simply returns prepared statement (and it is considered to be OK). Close #3395 --- Branch: https://github.com/tarantool/tarantool/tree/np/gh-3395-vy-read-prepared Issue: https://github.com/tarantool/tarantool/issues/3395 @ChangeLog: * Restart read iterator in case it has read prepared but rolled back tuple after yield (due to disk scan) (gh-3395). Patch V1: https://lists.tarantool.org/pipermail/tarantool-patches/2020-May/016861.html Changes in v2: - Instead of comparing front_ids now explicit iterator check is used; - Fixed test and split it into two cases depending on race between TX and WAL threads (i.e. whether WAL thread has time to finish its routine or not); - Added test case: tuple to be inserted is not the first in the result set. src/box/vy_read_iterator.c | 14 +- .../gh-3395-read-prepared-uncommitted.result | 234 ++++++++++++++++++ ...gh-3395-read-prepared-uncommitted.test.lua | 127 ++++++++++ test/vinyl/suite.ini | 2 +- 4 files changed, 375 insertions(+), 2 deletions(-) create mode 100644 test/vinyl/gh-3395-read-prepared-uncommitted.result create mode 100644 test/vinyl/gh-3395-read-prepared-uncommitted.test.lua diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c index dcc9a0873..62a8722d9 100644 --- a/src/box/vy_read_iterator.c +++ b/src/box/vy_read_iterator.c @@ -382,6 +382,9 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src, * Restore the position of the active in-memory tree iterator * after a yield caused by a disk read and update 'next' * if necessary. + * @retval -1 In case of error (e.g. OOM); + * @retval 0 Successful execution; + * @retval 1 Restart of advance_iterator is required. */ static NODISCARD int vy_read_iterator_restore_mem(struct vy_read_iterator *itr, @@ -398,6 +401,10 @@ vy_read_iterator_restore_mem(struct vy_read_iterator *itr, if (rc == 0) return 0; /* nothing changed */ + if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) { + assert(src->mem_iterator.curr.stmt == NULL); + return 1; + } struct vy_entry entry = vy_history_last_stmt(&src->history); cmp = vy_read_iterator_cmp_stmt(itr, entry, *next); if (cmp > 0) { @@ -529,8 +536,13 @@ rescan_disk: * as it is owned exclusively by the current fiber so the only * source to check is the active in-memory tree. */ - if (vy_read_iterator_restore_mem(itr, &next) != 0) + int rc = vy_read_iterator_restore_mem(itr, &next); + if (rc < 0) return -1; + if (rc > 0) { + vy_read_iterator_restore(itr); + goto restart; + } /* * Scan the next range in case we transgressed the current * range's boundaries. diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.result b/test/vinyl/gh-3395-read-prepared-uncommitted.result new file mode 100644 index 000000000..3c6b65b50 --- /dev/null +++ b/test/vinyl/gh-3395-read-prepared-uncommitted.result @@ -0,0 +1,234 @@ +-- test-run result file version 2 +-- Let's test following case: right before data selection tuple is +-- inserted into space. It passes first stage of commit procedure, +-- i.e. it is prepared to be committed but still not yet reached WAL. +-- Meanwhile we are starting to read the same key. At this moment +-- prepared statement is already inserted to in-memory tree. So, +-- read iterator fetches this statement and proceeds to disk scan. +-- In turn, disk scan yields and in this moment WAL fails to write +-- statement on disk, so it is rolled back. Read iterator must +-- recognize this situation and handle it properly. +-- +test_run = require('test_run').new() + | --- + | ... +fiber = require('fiber') + | --- + | ... +errinj = box.error.injection + | --- + | ... + +s = box.schema.create_space('test', {engine = 'vinyl'}) + | --- + | ... +pk = s:create_index('pk') + | --- + | ... +sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false}) + | --- + | ... +s:replace{3, 2} + | --- + | - [3, 2] + | ... +s:replace{2, 2} + | --- + | - [2, 2] + | ... +box.snapshot() + | --- + | - ok + | ... + +c = fiber.channel(1) + | --- + | ... + +function do_write() s:replace{1, 2} end + | --- + | ... +function init_read() end + | --- + | ... +function do_read() local ret = sk:select{2} c:put(ret) end + | --- + | ... + +-- Since we have tuples stored on disk, read procedure may +-- yield, opening window for WAL thread to commit or rollback +-- statements. In our case, WAL_WRITE will lead to rollback +-- of {1, 2} statement. Note that the race condition may result in +-- two possible scenarios: +-- 1. WAL thread has time to rollback the statement. In this case +-- {1, 2} will be deleted from mem (L0 lsm level) and we'll fall back +-- into read iterator restoration (since rollback bumps mem's version, +-- but iterator's version remains unchanged). +-- 2. WAL thread doesn't keep up with rollback/commit. Thus, state of +-- mem doesn't change and the statement is returned in the result set +-- (i.e. dirty read takes place). +-- +test_run:cmd("setopt delimiter ';'"); + | --- + | - true + | ... +-- is_tx_faster_than_wal determines whether wal thread has time +-- to finish its routine or not. In the first case we add extra +-- time gap to make sure that WAL thread finished work and +-- statement is rolled back. +-- +function read_prepared_with_delay(is_tx_faster_than_wal) + errinj.set("ERRINJ_WAL_DELAY", true) + fiber.create(do_write, s) + init_read() + errinj.set("ERRINJ_VY_READ_PAGE_DELAY", true) + fiber.create(do_read, sk, c) + errinj.set("ERRINJ_WAL_WRITE", true) + if is_tx_faster_than_wal then + errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true) + end + errinj.set("ERRINJ_WAL_DELAY", false) + if not is_tx_faster_than_wal then + fiber.sleep(0.1) + end + errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false) + local res = c:get() + errinj.set("ERRINJ_WAL_WRITE", false) + if is_tx_faster_than_wal then + errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", false) + end + return res +end; + | --- + | ... + +test_run:cmd("setopt delimiter ''"); + | --- + | - true + | ... + +-- 1. Prepared tuple is invisible to read iterator since WAL +-- has enough time and rollback procedure is finished. +-- +read_prepared_with_delay(false) + | --- + | - - [2, 2] + | - [3, 2] + | ... +-- 2. Tuple is not rolled back so it is visible to all transactions. +-- +read_prepared_with_delay(true) + | --- + | - - [1, 2] + | - [2, 2] + | - [3, 2] + | ... + +-- Give WAL thread time to catch up. +-- +fiber.sleep(0.1) + | --- + | ... + +sk:select{2} + | --- + | - - [2, 2] + | - [3, 2] + | ... + +s:drop() + | --- + | ... + +-- A bit more sophisticated case: tuple to be inserted is +-- not the first in result set. +-- +s = box.schema.create_space('test', {engine = 'vinyl'}) + | --- + | ... +pk = s:create_index('pk') + | --- + | ... +-- Set page_size to minimum so that accessing each tuple results +-- in page cache miss and disk access - we need it to set +-- VY_READ_PAGE_DELAY injection. +-- +sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false, page_size = 5}) + | --- + | ... + +s:replace{1, 10} + | --- + | - [1, 10] + | ... +s:replace{2, 20} + | --- + | - [2, 20] + | ... +s:replace{4, 20} + | --- + | - [4, 20] + | ... +s:replace{5, 30} + | --- + | - [5, 30] + | ... +box.snapshot() + | --- + | - ok + | ... + +gen = nil + | --- + | ... +param = nil + | --- + | ... +state = nil + | --- + | ... +function do_write() s:replace{3, 20} end + | --- + | ... +function init_read() gen, param, state = sk:pairs({20}, {iterator = box.index.EQ}) gen(param, state) end + | --- + | ... +function do_read() local _, ret = gen(param, state) c:put(ret) end + | --- + | ... + +read_prepared_with_delay(false) + | --- + | - [4, 20] + | ... +-- All the same but test second scenario (WAL thread is not finished +-- yet and tuple is visible). +-- +read_prepared_with_delay(true) + | --- + | - [3, 20] + | ... +-- Give WAL thread time to catch up. +-- +fiber.sleep(0.1) + | --- + | ... +-- Read view is aborted due to rollback of prepared statement. +-- +gen(param, state) + | --- + | - error: The read view is aborted + | ... + +fiber.sleep(0.1) + | --- + | ... +sk:select{20} + | --- + | - - [2, 20] + | - [4, 20] + | ... + +s:drop() + | --- + | ... diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua new file mode 100644 index 000000000..10b70dd85 --- /dev/null +++ b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua @@ -0,0 +1,127 @@ +-- Let's test following case: right before data selection tuple is +-- inserted into space. It passes first stage of commit procedure, +-- i.e. it is prepared to be committed but still not yet reached WAL. +-- Meanwhile we are starting to read the same key. At this moment +-- prepared statement is already inserted to in-memory tree. So, +-- read iterator fetches this statement and proceeds to disk scan. +-- In turn, disk scan yields and in this moment WAL fails to write +-- statement on disk, so it is rolled back. Read iterator must +-- recognize this situation and handle it properly. +-- +test_run = require('test_run').new() +fiber = require('fiber') +errinj = box.error.injection + +s = box.schema.create_space('test', {engine = 'vinyl'}) +pk = s:create_index('pk') +sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false}) +s:replace{3, 2} +s:replace{2, 2} +box.snapshot() + +c = fiber.channel(1) + +function do_write() s:replace{1, 2} end +function init_read() end +function do_read() local ret = sk:select{2} c:put(ret) end + +-- Since we have tuples stored on disk, read procedure may +-- yield, opening window for WAL thread to commit or rollback +-- statements. In our case, WAL_WRITE will lead to rollback +-- of {1, 2} statement. Note that the race condition may result in +-- two possible scenarios: +-- 1. WAL thread has time to rollback the statement. In this case +-- {1, 2} will be deleted from mem (L0 lsm level) and we'll fall back +-- into read iterator restoration (since rollback bumps mem's version, +-- but iterator's version remains unchanged). +-- 2. WAL thread doesn't keep up with rollback/commit. Thus, state of +-- mem doesn't change and the statement is returned in the result set +-- (i.e. dirty read takes place). +-- +test_run:cmd("setopt delimiter ';'"); +-- is_tx_faster_than_wal determines whether wal thread has time +-- to finish its routine or not. In the first case we add extra +-- time gap to make sure that WAL thread finished work and +-- statement is rolled back. +-- +function read_prepared_with_delay(is_tx_faster_than_wal) + errinj.set("ERRINJ_WAL_DELAY", true) + fiber.create(do_write, s) + init_read() + errinj.set("ERRINJ_VY_READ_PAGE_DELAY", true) + fiber.create(do_read, sk, c) + errinj.set("ERRINJ_WAL_WRITE", true) + if is_tx_faster_than_wal then + errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true) + end + errinj.set("ERRINJ_WAL_DELAY", false) + if not is_tx_faster_than_wal then + fiber.sleep(0.1) + end + errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false) + local res = c:get() + errinj.set("ERRINJ_WAL_WRITE", false) + if is_tx_faster_than_wal then + errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", false) + end + return res +end; + +test_run:cmd("setopt delimiter ''"); + +-- 1. Prepared tuple is invisible to read iterator since WAL +-- has enough time and rollback procedure is finished. +-- +read_prepared_with_delay(false) +-- 2. Tuple is not rolled back so it is visible to all transactions. +-- +read_prepared_with_delay(true) + +-- Give WAL thread time to catch up. +-- +fiber.sleep(0.1) + +sk:select{2} + +s:drop() + +-- A bit more sophisticated case: tuple to be inserted is +-- not the first in result set. +-- +s = box.schema.create_space('test', {engine = 'vinyl'}) +pk = s:create_index('pk') +-- Set page_size to minimum so that accessing each tuple results +-- in page cache miss and disk access - we need it to set +-- VY_READ_PAGE_DELAY injection. +-- +sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false, page_size = 5}) + +s:replace{1, 10} +s:replace{2, 20} +s:replace{4, 20} +s:replace{5, 30} +box.snapshot() + +gen = nil +param = nil +state = nil +function do_write() s:replace{3, 20} end +function init_read() gen, param, state = sk:pairs({20}, {iterator = box.index.EQ}) gen(param, state) end +function do_read() local _, ret = gen(param, state) c:put(ret) end + +read_prepared_with_delay(false) +-- All the same but test second scenario (WAL thread is not finished +-- yet and tuple is visible). +-- +read_prepared_with_delay(true) +-- Give WAL thread time to catch up. +-- +fiber.sleep(0.1) +-- Read view is aborted due to rollback of prepared statement. +-- +gen(param, state) + +fiber.sleep(0.1) +sk:select{20} + +s:drop() diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini index 8b7e87955..09829b975 100644 --- a/test/vinyl/suite.ini +++ b/test/vinyl/suite.ini @@ -2,7 +2,7 @@ core = tarantool description = vinyl integration tests script = vinyl.lua -release_disabled = errinj.test.lua errinj_ddl.test.lua errinj_gc.test.lua errinj_stat.test.lua errinj_tx.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua gh-4864-stmt-alloc-fail-compact.test.lua gh-4805-open-run-err-recovery.test.lua +release_disabled = errinj.test.lua errinj_ddl.test.lua errinj_gc.test.lua errinj_stat.test.lua errinj_tx.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua gh-4864-stmt-alloc-fail-compact.test.lua gh-4805-open-run-err-recovery.test.lua gh-3395-read-prepared-uncommitted.test.lua config = suite.cfg lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua use_unix_sockets = True -- 2.17.1
next reply other threads:[~2020-06-01 16:46 UTC|newest] Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top 2020-06-01 16:46 Nikita Pettik [this message] 2020-06-01 17:40 ` Vladislav Shpilevoy 2020-06-19 12:34 ` Nikita Pettik 2020-06-20 16:33 ` Vladislav Shpilevoy 2020-06-16 12:10 ` Aleksandr Lyapunov 2020-06-19 12:24 ` Nikita Pettik 2020-06-19 12:42 ` Konstantin Osipov 2020-06-23 5:15 ` Aleksandr Lyapunov 2020-06-23 11:08 ` Konstantin Osipov 2020-06-19 13:01 ` Aleksandr Lyapunov 2020-06-24 13:41 ` Nikita Pettik 2020-06-20 16:33 ` Vladislav Shpilevoy
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=50a25fbae907f1b0d5406fb2e78d40d3e42a8a8d.1591029888.git.korablev@tarantool.org \ --to=korablev@tarantool.org \ --cc=tarantool-patches@dev.tarantool.org \ --cc=v.shpilevoy@tarantool.org \ --subject='Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox