* [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
@ 2020-06-01 16:46 Nikita Pettik
2020-06-01 17:40 ` Vladislav Shpilevoy
2020-06-16 12:10 ` Aleksandr Lyapunov
0 siblings, 2 replies; 12+ messages in thread
From: Nikita Pettik @ 2020-06-01 16:46 UTC (permalink / raw)
To: tarantool-patches; +Cc: v.shpilevoy
Data read in vinyl is known to yield in case of disc access. So it opens
a window for modifications of in-memory level. Imagine following scenario:
right before data selection tuple is inserted into space. It passes first
stage of commit procedure, i.e. it is prepared to be committed but still
is not yet reached WAL. Meanwhile iterator is starting to read the same key.
At this moment prepared statement is already inserted to in-memory tree
ergo visible to read iterator. So, read iterator fetches this statement
and proceeds to disk scan. In turn, disk scan yields and in this moment
WAL fails to write statement on disk. Next, two cases are possible:
1. WAL thread has enough time to complete rollback procedure.
2. WAL thread fails to finish rollback in this time gap.
In the first case read iterator should skip statement: version of
in-memory tree has diverged from iterator's one, so we fall back into
iterator restoration procedure. Mem iterator becomes invalid so the only
choice is to restart whole 'advance' routine.
In the second case nothing is changed to read iterator, so it simply
returns prepared statement (and it is considered to be OK).
Close #3395
---
Branch: https://github.com/tarantool/tarantool/tree/np/gh-3395-vy-read-prepared
Issue: https://github.com/tarantool/tarantool/issues/3395
@ChangeLog:
* Restart read iterator in case it has read prepared but rolled back
tuple after yield (due to disk scan) (gh-3395).
Patch V1: https://lists.tarantool.org/pipermail/tarantool-patches/2020-May/016861.html
Changes in v2:
- Instead of comparing front_ids now explicit iterator check is used;
- Fixed test and split it into two cases depending on race between
TX and WAL threads (i.e. whether WAL thread has time to finish its
routine or not);
- Added test case: tuple to be inserted is not the first in the result set.
src/box/vy_read_iterator.c | 14 +-
.../gh-3395-read-prepared-uncommitted.result | 234 ++++++++++++++++++
...gh-3395-read-prepared-uncommitted.test.lua | 127 ++++++++++
test/vinyl/suite.ini | 2 +-
4 files changed, 375 insertions(+), 2 deletions(-)
create mode 100644 test/vinyl/gh-3395-read-prepared-uncommitted.result
create mode 100644 test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
index dcc9a0873..62a8722d9 100644
--- a/src/box/vy_read_iterator.c
+++ b/src/box/vy_read_iterator.c
@@ -382,6 +382,9 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
* Restore the position of the active in-memory tree iterator
* after a yield caused by a disk read and update 'next'
* if necessary.
+ * @retval -1 In case of error (e.g. OOM);
+ * @retval 0 Successful execution;
+ * @retval 1 Restart of advance_iterator is required.
*/
static NODISCARD int
vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
@@ -398,6 +401,10 @@ vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
if (rc == 0)
return 0; /* nothing changed */
+ if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
+ assert(src->mem_iterator.curr.stmt == NULL);
+ return 1;
+ }
struct vy_entry entry = vy_history_last_stmt(&src->history);
cmp = vy_read_iterator_cmp_stmt(itr, entry, *next);
if (cmp > 0) {
@@ -529,8 +536,13 @@ rescan_disk:
* as it is owned exclusively by the current fiber so the only
* source to check is the active in-memory tree.
*/
- if (vy_read_iterator_restore_mem(itr, &next) != 0)
+ int rc = vy_read_iterator_restore_mem(itr, &next);
+ if (rc < 0)
return -1;
+ if (rc > 0) {
+ vy_read_iterator_restore(itr);
+ goto restart;
+ }
/*
* Scan the next range in case we transgressed the current
* range's boundaries.
diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.result b/test/vinyl/gh-3395-read-prepared-uncommitted.result
new file mode 100644
index 000000000..3c6b65b50
--- /dev/null
+++ b/test/vinyl/gh-3395-read-prepared-uncommitted.result
@@ -0,0 +1,234 @@
+-- test-run result file version 2
+-- Let's test following case: right before data selection tuple is
+-- inserted into space. It passes first stage of commit procedure,
+-- i.e. it is prepared to be committed but still not yet reached WAL.
+-- Meanwhile we are starting to read the same key. At this moment
+-- prepared statement is already inserted to in-memory tree. So,
+-- read iterator fetches this statement and proceeds to disk scan.
+-- In turn, disk scan yields and in this moment WAL fails to write
+-- statement on disk, so it is rolled back. Read iterator must
+-- recognize this situation and handle it properly.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+fiber = require('fiber')
+ | ---
+ | ...
+errinj = box.error.injection
+ | ---
+ | ...
+
+s = box.schema.create_space('test', {engine = 'vinyl'})
+ | ---
+ | ...
+pk = s:create_index('pk')
+ | ---
+ | ...
+sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false})
+ | ---
+ | ...
+s:replace{3, 2}
+ | ---
+ | - [3, 2]
+ | ...
+s:replace{2, 2}
+ | ---
+ | - [2, 2]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+c = fiber.channel(1)
+ | ---
+ | ...
+
+function do_write() s:replace{1, 2} end
+ | ---
+ | ...
+function init_read() end
+ | ---
+ | ...
+function do_read() local ret = sk:select{2} c:put(ret) end
+ | ---
+ | ...
+
+-- Since we have tuples stored on disk, read procedure may
+-- yield, opening window for WAL thread to commit or rollback
+-- statements. In our case, WAL_WRITE will lead to rollback
+-- of {1, 2} statement. Note that the race condition may result in
+-- two possible scenarios:
+-- 1. WAL thread has time to rollback the statement. In this case
+-- {1, 2} will be deleted from mem (L0 lsm level) and we'll fall back
+-- into read iterator restoration (since rollback bumps mem's version,
+-- but iterator's version remains unchanged).
+-- 2. WAL thread doesn't keep up with rollback/commit. Thus, state of
+-- mem doesn't change and the statement is returned in the result set
+-- (i.e. dirty read takes place).
+--
+test_run:cmd("setopt delimiter ';'");
+ | ---
+ | - true
+ | ...
+-- is_tx_faster_than_wal determines whether wal thread has time
+-- to finish its routine or not. In the first case we add extra
+-- time gap to make sure that WAL thread finished work and
+-- statement is rolled back.
+--
+function read_prepared_with_delay(is_tx_faster_than_wal)
+ errinj.set("ERRINJ_WAL_DELAY", true)
+ fiber.create(do_write, s)
+ init_read()
+ errinj.set("ERRINJ_VY_READ_PAGE_DELAY", true)
+ fiber.create(do_read, sk, c)
+ errinj.set("ERRINJ_WAL_WRITE", true)
+ if is_tx_faster_than_wal then
+ errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true)
+ end
+ errinj.set("ERRINJ_WAL_DELAY", false)
+ if not is_tx_faster_than_wal then
+ fiber.sleep(0.1)
+ end
+ errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false)
+ local res = c:get()
+ errinj.set("ERRINJ_WAL_WRITE", false)
+ if is_tx_faster_than_wal then
+ errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", false)
+ end
+ return res
+end;
+ | ---
+ | ...
+
+test_run:cmd("setopt delimiter ''");
+ | ---
+ | - true
+ | ...
+
+-- 1. Prepared tuple is invisible to read iterator since WAL
+-- has enough time and rollback procedure is finished.
+--
+read_prepared_with_delay(false)
+ | ---
+ | - - [2, 2]
+ | - [3, 2]
+ | ...
+-- 2. Tuple is not rolled back so it is visible to all transactions.
+--
+read_prepared_with_delay(true)
+ | ---
+ | - - [1, 2]
+ | - [2, 2]
+ | - [3, 2]
+ | ...
+
+-- Give WAL thread time to catch up.
+--
+fiber.sleep(0.1)
+ | ---
+ | ...
+
+sk:select{2}
+ | ---
+ | - - [2, 2]
+ | - [3, 2]
+ | ...
+
+s:drop()
+ | ---
+ | ...
+
+-- A bit more sophisticated case: tuple to be inserted is
+-- not the first in result set.
+--
+s = box.schema.create_space('test', {engine = 'vinyl'})
+ | ---
+ | ...
+pk = s:create_index('pk')
+ | ---
+ | ...
+-- Set page_size to minimum so that accessing each tuple results
+-- in page cache miss and disk access - we need it to set
+-- VY_READ_PAGE_DELAY injection.
+--
+sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false, page_size = 5})
+ | ---
+ | ...
+
+s:replace{1, 10}
+ | ---
+ | - [1, 10]
+ | ...
+s:replace{2, 20}
+ | ---
+ | - [2, 20]
+ | ...
+s:replace{4, 20}
+ | ---
+ | - [4, 20]
+ | ...
+s:replace{5, 30}
+ | ---
+ | - [5, 30]
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+gen = nil
+ | ---
+ | ...
+param = nil
+ | ---
+ | ...
+state = nil
+ | ---
+ | ...
+function do_write() s:replace{3, 20} end
+ | ---
+ | ...
+function init_read() gen, param, state = sk:pairs({20}, {iterator = box.index.EQ}) gen(param, state) end
+ | ---
+ | ...
+function do_read() local _, ret = gen(param, state) c:put(ret) end
+ | ---
+ | ...
+
+read_prepared_with_delay(false)
+ | ---
+ | - [4, 20]
+ | ...
+-- All the same but test second scenario (WAL thread is not finished
+-- yet and tuple is visible).
+--
+read_prepared_with_delay(true)
+ | ---
+ | - [3, 20]
+ | ...
+-- Give WAL thread time to catch up.
+--
+fiber.sleep(0.1)
+ | ---
+ | ...
+-- Read view is aborted due to rollback of prepared statement.
+--
+gen(param, state)
+ | ---
+ | - error: The read view is aborted
+ | ...
+
+fiber.sleep(0.1)
+ | ---
+ | ...
+sk:select{20}
+ | ---
+ | - - [2, 20]
+ | - [4, 20]
+ | ...
+
+s:drop()
+ | ---
+ | ...
diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
new file mode 100644
index 000000000..10b70dd85
--- /dev/null
+++ b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
@@ -0,0 +1,127 @@
+-- Let's test following case: right before data selection tuple is
+-- inserted into space. It passes first stage of commit procedure,
+-- i.e. it is prepared to be committed but still not yet reached WAL.
+-- Meanwhile we are starting to read the same key. At this moment
+-- prepared statement is already inserted to in-memory tree. So,
+-- read iterator fetches this statement and proceeds to disk scan.
+-- In turn, disk scan yields and in this moment WAL fails to write
+-- statement on disk, so it is rolled back. Read iterator must
+-- recognize this situation and handle it properly.
+--
+test_run = require('test_run').new()
+fiber = require('fiber')
+errinj = box.error.injection
+
+s = box.schema.create_space('test', {engine = 'vinyl'})
+pk = s:create_index('pk')
+sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false})
+s:replace{3, 2}
+s:replace{2, 2}
+box.snapshot()
+
+c = fiber.channel(1)
+
+function do_write() s:replace{1, 2} end
+function init_read() end
+function do_read() local ret = sk:select{2} c:put(ret) end
+
+-- Since we have tuples stored on disk, read procedure may
+-- yield, opening window for WAL thread to commit or rollback
+-- statements. In our case, WAL_WRITE will lead to rollback
+-- of {1, 2} statement. Note that the race condition may result in
+-- two possible scenarios:
+-- 1. WAL thread has time to rollback the statement. In this case
+-- {1, 2} will be deleted from mem (L0 lsm level) and we'll fall back
+-- into read iterator restoration (since rollback bumps mem's version,
+-- but iterator's version remains unchanged).
+-- 2. WAL thread doesn't keep up with rollback/commit. Thus, state of
+-- mem doesn't change and the statement is returned in the result set
+-- (i.e. dirty read takes place).
+--
+test_run:cmd("setopt delimiter ';'");
+-- is_tx_faster_than_wal determines whether wal thread has time
+-- to finish its routine or not. In the first case we add extra
+-- time gap to make sure that WAL thread finished work and
+-- statement is rolled back.
+--
+function read_prepared_with_delay(is_tx_faster_than_wal)
+ errinj.set("ERRINJ_WAL_DELAY", true)
+ fiber.create(do_write, s)
+ init_read()
+ errinj.set("ERRINJ_VY_READ_PAGE_DELAY", true)
+ fiber.create(do_read, sk, c)
+ errinj.set("ERRINJ_WAL_WRITE", true)
+ if is_tx_faster_than_wal then
+ errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true)
+ end
+ errinj.set("ERRINJ_WAL_DELAY", false)
+ if not is_tx_faster_than_wal then
+ fiber.sleep(0.1)
+ end
+ errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false)
+ local res = c:get()
+ errinj.set("ERRINJ_WAL_WRITE", false)
+ if is_tx_faster_than_wal then
+ errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", false)
+ end
+ return res
+end;
+
+test_run:cmd("setopt delimiter ''");
+
+-- 1. Prepared tuple is invisible to read iterator since WAL
+-- has enough time and rollback procedure is finished.
+--
+read_prepared_with_delay(false)
+-- 2. Tuple is not rolled back so it is visible to all transactions.
+--
+read_prepared_with_delay(true)
+
+-- Give WAL thread time to catch up.
+--
+fiber.sleep(0.1)
+
+sk:select{2}
+
+s:drop()
+
+-- A bit more sophisticated case: tuple to be inserted is
+-- not the first in result set.
+--
+s = box.schema.create_space('test', {engine = 'vinyl'})
+pk = s:create_index('pk')
+-- Set page_size to minimum so that accessing each tuple results
+-- in page cache miss and disk access - we need it to set
+-- VY_READ_PAGE_DELAY injection.
+--
+sk = s:create_index('sk', {parts = {{2, 'unsigned'}}, unique = false, page_size = 5})
+
+s:replace{1, 10}
+s:replace{2, 20}
+s:replace{4, 20}
+s:replace{5, 30}
+box.snapshot()
+
+gen = nil
+param = nil
+state = nil
+function do_write() s:replace{3, 20} end
+function init_read() gen, param, state = sk:pairs({20}, {iterator = box.index.EQ}) gen(param, state) end
+function do_read() local _, ret = gen(param, state) c:put(ret) end
+
+read_prepared_with_delay(false)
+-- All the same but test second scenario (WAL thread is not finished
+-- yet and tuple is visible).
+--
+read_prepared_with_delay(true)
+-- Give WAL thread time to catch up.
+--
+fiber.sleep(0.1)
+-- Read view is aborted due to rollback of prepared statement.
+--
+gen(param, state)
+
+fiber.sleep(0.1)
+sk:select{20}
+
+s:drop()
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index 8b7e87955..09829b975 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -2,7 +2,7 @@
core = tarantool
description = vinyl integration tests
script = vinyl.lua
-release_disabled = errinj.test.lua errinj_ddl.test.lua errinj_gc.test.lua errinj_stat.test.lua errinj_tx.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua gh-4864-stmt-alloc-fail-compact.test.lua gh-4805-open-run-err-recovery.test.lua
+release_disabled = errinj.test.lua errinj_ddl.test.lua errinj_gc.test.lua errinj_stat.test.lua errinj_tx.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua gh-4864-stmt-alloc-fail-compact.test.lua gh-4805-open-run-err-recovery.test.lua gh-3395-read-prepared-uncommitted.test.lua
config = suite.cfg
lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua
use_unix_sockets = True
--
2.17.1
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-01 16:46 [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL Nikita Pettik
@ 2020-06-01 17:40 ` Vladislav Shpilevoy
2020-06-19 12:34 ` Nikita Pettik
2020-06-16 12:10 ` Aleksandr Lyapunov
1 sibling, 1 reply; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-06-01 17:40 UTC (permalink / raw)
To: Nikita Pettik, tarantool-patches
Hi! Thanks for the patch!
On 01/06/2020 18:46, Nikita Pettik wrote:
> Data read in vinyl is known to yield in case of disc access. So it opens
> a window for modifications of in-memory level. Imagine following scenario:
> right before data selection tuple is inserted into space. It passes first
> stage of commit procedure, i.e. it is prepared to be committed but still
> is not yet reached WAL. Meanwhile iterator is starting to read the same key.
> At this moment prepared statement is already inserted to in-memory tree
> ergo visible to read iterator. So, read iterator fetches this statement
> and proceeds to disk scan. In turn, disk scan yields and in this moment
> WAL fails to write statement on disk. Next, two cases are possible:
> 1. WAL thread has enough time to complete rollback procedure.
> 2. WAL thread fails to finish rollback in this time gap.
Seems like the second test case is not really different from no rollback
at all. Because basically the WAL thread execution is still in the same
function, wal_write_to_disk(). Both ERRINJ_WAL_DELAY and ERRINJ_RELAY_FASTER_THAN_TX
give the same effect, if the transaction is rolled back. If it is committed,
they give also the same effect from TX<->WAL interaction point of view. The
difference only appears in TX<->WAL<->Relay threads.
So probably you could drop this case from the patch. It would simplify
things. But up to you and Alexander L.
Talking of the tests, I tried running multiple instances of the new
test:
python test-run.py gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un
And it fails sometimes. So it is flaky. Better fix it before push.
> diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> index dcc9a0873..62a8722d9 100644
> --- a/src/box/vy_read_iterator.c
> +++ b/src/box/vy_read_iterator.c
> @@ -382,6 +382,9 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
> * Restore the position of the active in-memory tree iterator
> * after a yield caused by a disk read and update 'next'
> * if necessary.
> + * @retval -1 In case of error (e.g. OOM);
> + * @retval 0 Successful execution;
> + * @retval 1 Restart of advance_iterator is required.
What is advance_iterator?
> */
> static NODISCARD int
> vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-01 16:46 [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL Nikita Pettik
2020-06-01 17:40 ` Vladislav Shpilevoy
@ 2020-06-16 12:10 ` Aleksandr Lyapunov
2020-06-19 12:24 ` Nikita Pettik
1 sibling, 1 reply; 12+ messages in thread
From: Aleksandr Lyapunov @ 2020-06-16 12:10 UTC (permalink / raw)
To: Nikita Pettik, tarantool-patches; +Cc: v.shpilevoy
Thanks for the patch! See my 2 comments below:
On 6/1/20 7:46 PM, Nikita Pettik wrote:
>
> + if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
> + assert(src->mem_iterator.curr.stmt == NULL);
> + return 1;
> + }
I'm afraid that the iterator will not always be invalid in the given case.
As I see, if a mem holds any older tuple with the same key (e.g. older
version of that tuple), the restoration routine will find the older tuple
with the non-invalid iterator.
I also think that mem_restore must handle all the possible data
changes by itself without concern of read_iterator.
> - if (vy_read_iterator_restore_mem(itr, &next) != 0)
> + int rc = vy_read_iterator_restore_mem(itr, &next);
> + if (rc < 0)
> return -1;
> + if (rc > 0) {
> + vy_read_iterator_restore(itr);
> + goto restart;
> + }
The read iterator was rewritten several times and still have at least
several bugs. I think we should admit that we cannot support such a
complicated solution. How about some stupid solution: if ANY change
has been happened during yield - restart advancing?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-16 12:10 ` Aleksandr Lyapunov
@ 2020-06-19 12:24 ` Nikita Pettik
2020-06-19 12:42 ` Konstantin Osipov
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Nikita Pettik @ 2020-06-19 12:24 UTC (permalink / raw)
To: Aleksandr Lyapunov; +Cc: tarantool-patches, v.shpilevoy
On 16 Jun 15:10, Aleksandr Lyapunov wrote:
> Thanks for the patch! See my 2 comments below:
>
> On 6/1/20 7:46 PM, Nikita Pettik wrote:
> > + if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
> > + assert(src->mem_iterator.curr.stmt == NULL);
> > + return 1;
> > + }
> I'm afraid that the iterator will not always be invalid in the given case.
> As I see, if a mem holds any older tuple with the same key (e.g. older
> version of that tuple), the restoration routine will find the older tuple
> with the non-invalid iterator.
> I also think that mem_restore must handle all the possible data
> changes by itself without concern of read_iterator.
You are likely to be right, but I followed suggestion below
to simplify resotration procedure.
> > - if (vy_read_iterator_restore_mem(itr, &next) != 0)
> > + int rc = vy_read_iterator_restore_mem(itr, &next);
> > + if (rc < 0)
> > return -1;
> > + if (rc > 0) {
> > + vy_read_iterator_restore(itr);
> > + goto restart;
> > + }
> The read iterator was rewritten several times and still have at least
> several bugs. I think we should admit that we cannot support such a
> complicated solution. How about some stupid solution: if ANY change
> has been happened during yield - restart advancing?
Ok, I'm fine with it. Consider this change:
diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
index 62a8722d9..409796910 100644
--- a/src/box/vy_read_iterator.c
+++ b/src/box/vy_read_iterator.c
@@ -378,65 +378,6 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
return 0;
}
-/**
- * Restore the position of the active in-memory tree iterator
- * after a yield caused by a disk read and update 'next'
- * if necessary.
- * @retval -1 In case of error (e.g. OOM);
- * @retval 0 Successful execution;
- * @retval 1 Restart of advance_iterator is required.
- */
-static NODISCARD int
-vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
- struct vy_entry *next)
-{
- int rc;
- int cmp;
- struct vy_read_src *src = &itr->src[itr->mem_src];
-
- rc = vy_mem_iterator_restore(&src->mem_iterator,
- itr->last, &src->history);
- if (rc < 0)
- return -1; /* memory allocation error */
- if (rc == 0)
- return 0; /* nothing changed */
-
- if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
- assert(src->mem_iterator.curr.stmt == NULL);
- return 1;
- }
- struct vy_entry entry = vy_history_last_stmt(&src->history);
- cmp = vy_read_iterator_cmp_stmt(itr, entry, *next);
- if (cmp > 0) {
- /*
- * Memory trees are append-only so if the
- * source is not on top of the heap after
- * restoration, it was not before.
- */
- assert(src->front_id < itr->front_id);
- return 0;
- }
- if (cmp < 0) {
- /*
- * The new statement precedes the current
- * candidate for the next key.
- */
- *next = entry;
- itr->front_id++;
- } else {
- /*
- * The new statement updates the next key.
- * Make sure we don't read the old value
- * from the cache while applying UPSERTs.
- */
- struct vy_read_src *cache_src = &itr->src[itr->cache_src];
- if (cache_src->front_id == itr->front_id)
- vy_history_cleanup(&cache_src->history);
- }
- src->front_id = itr->front_id;
- return 0;
-}
-
static void
vy_read_iterator_restore(struct vy_read_iterator *itr);
@@ -536,10 +477,8 @@ rescan_disk:
* as it is owned exclusively by the current fiber so the only
* source to check is the active in-memory tree.
*/
- int rc = vy_read_iterator_restore_mem(itr, &next);
- if (rc < 0)
- return -1;
- if (rc > 0) {
+ struct vy_mem_iterator *mem_itr = &itr->src[itr->mem_src].mem_iterator;
+ if (mem_itr->version != mem_itr->mem->version) {
vy_read_iterator_restore(itr);
goto restart;
}
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-01 17:40 ` Vladislav Shpilevoy
@ 2020-06-19 12:34 ` Nikita Pettik
2020-06-20 16:33 ` Vladislav Shpilevoy
0 siblings, 1 reply; 12+ messages in thread
From: Nikita Pettik @ 2020-06-19 12:34 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
On 01 Jun 19:40, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patch!
>
> On 01/06/2020 18:46, Nikita Pettik wrote:
> > Data read in vinyl is known to yield in case of disc access. So it opens
> > a window for modifications of in-memory level. Imagine following scenario:
> > right before data selection tuple is inserted into space. It passes first
> > stage of commit procedure, i.e. it is prepared to be committed but still
> > is not yet reached WAL. Meanwhile iterator is starting to read the same key.
> > At this moment prepared statement is already inserted to in-memory tree
> > ergo visible to read iterator. So, read iterator fetches this statement
> > and proceeds to disk scan. In turn, disk scan yields and in this moment
> > WAL fails to write statement on disk. Next, two cases are possible:
> > 1. WAL thread has enough time to complete rollback procedure.
> > 2. WAL thread fails to finish rollback in this time gap.
> Seems like the second test case is not really different from no rollback
> at all. Because basically the WAL thread execution is still in the same
> function, wal_write_to_disk(). Both ERRINJ_WAL_DELAY and ERRINJ_RELAY_FASTER_THAN_TX
> give the same effect, if the transaction is rolled back. If it is committed,
> they give also the same effect from TX<->WAL interaction point of view. The
> difference only appears in TX<->WAL<->Relay threads.
>
> So probably you could drop this case from the patch. It would simplify
> things. But up to you and Alexander L.
>
> Talking of the tests, I tried running multiple instances of the new
> test:
>
> python test-run.py gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un
>
> And it fails sometimes. So it is flaky. Better fix it before push.
I've added a tiny fix:
diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
index 10b70dd85..70136ddff 100644
--- a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
+++ b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
@@ -55,9 +55,7 @@ function read_prepared_with_delay(is_tx_faster_than_wal)
errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true)
end
errinj.set("ERRINJ_WAL_DELAY", false)
- if not is_tx_faster_than_wal then
- fiber.sleep(0.1)
- end
+ fiber.sleep(0.1)
errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false)
local res = c:get()
errinj.set("ERRINJ_WAL_WRITE", false)
Could you please check if it still fails on your machine?
> > diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> > index dcc9a0873..62a8722d9 100644
> > --- a/src/box/vy_read_iterator.c
> > +++ b/src/box/vy_read_iterator.c
> > @@ -382,6 +382,9 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
> > * Restore the position of the active in-memory tree iterator
> > * after a yield caused by a disk read and update 'next'
> > * if necessary.
> > + * @retval -1 In case of error (e.g. OOM);
> > + * @retval 0 Successful execution;
> > + * @retval 1 Restart of advance_iterator is required.
>
> What is advance_iterator?
It is supposed to be vy_read_iterator_advance(). Nevermind, in the
new version I've gotten rid of this function at all.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 12:24 ` Nikita Pettik
@ 2020-06-19 12:42 ` Konstantin Osipov
2020-06-23 5:15 ` Aleksandr Lyapunov
2020-06-19 13:01 ` Aleksandr Lyapunov
2020-06-20 16:33 ` Vladislav Shpilevoy
2 siblings, 1 reply; 12+ messages in thread
From: Konstantin Osipov @ 2020-06-19 12:42 UTC (permalink / raw)
To: Nikita Pettik; +Cc: tarantool-patches, v.shpilevoy
* Nikita Pettik <korablev@tarantool.org> [20/06/19 15:26]:
> On 16 Jun 15:10, Aleksandr Lyapunov wrote:
> > Thanks for the patch! See my 2 comments below:
> >
> > On 6/1/20 7:46 PM, Nikita Pettik wrote:
> > > + if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
> > > + assert(src->mem_iterator.curr.stmt == NULL);
> > > + return 1;
> > > + }
> > I'm afraid that the iterator will not always be invalid in the given case.
> > As I see, if a mem holds any older tuple with the same key (e.g. older
> > version of that tuple), the restoration routine will find the older tuple
> > with the non-invalid iterator.
> > I also think that mem_restore must handle all the possible data
> > changes by itself without concern of read_iterator.
>
> You are likely to be right, but I followed suggestion below
> to simplify resotration procedure.
>
> > > - if (vy_read_iterator_restore_mem(itr, &next) != 0)
> > > + int rc = vy_read_iterator_restore_mem(itr, &next);
> > > + if (rc < 0)
> > > return -1;
> > > + if (rc > 0) {
> > > + vy_read_iterator_restore(itr);
> > > + goto restart;
> > > + }
> > The read iterator was rewritten several times and still have at least
> > several bugs. I think we should admit that we cannot support such a
> > complicated solution. How about some stupid solution: if ANY change
> > has been happened during yield - restart advancing?
Does this statement have technical merit? Is it supported by
tests? I'd gladly support the change if it was grounded in reason -
evaluation of the performance impact, for example, could serve as
a confirmation that a simple solution would be just fine.
Without it, I'd it's a regress, a signature of helplessness,
lack of courage to make things right.
>
> diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> index 62a8722d9..409796910 100644
> --- a/src/box/vy_read_iterator.c
> +++ b/src/box/vy_read_iterator.c
> @@ -378,65 +378,6 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
> return 0;
> }
>
> -/**
> - * Restore the position of the active in-memory tree iterator
> - * after a yield caused by a disk read and update 'next'
> - * if necessary.
> - * @retval -1 In case of error (e.g. OOM);
> - * @retval 0 Successful execution;
> - * @retval 1 Restart of advance_iterator is required.
> - */
> -static NODISCARD int
> -vy_read_iterator_restore_mem(struct vy_read_iterator *itr,
> - struct vy_entry *next)
> -{
> - int rc;
> - int cmp;
> - struct vy_read_src *src = &itr->src[itr->mem_src];
> -
> - rc = vy_mem_iterator_restore(&src->mem_iterator,
> - itr->last, &src->history);
> - if (rc < 0)
> - return -1; /* memory allocation error */
> - if (rc == 0)
> - return 0; /* nothing changed */
> -
> - if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
> - assert(src->mem_iterator.curr.stmt == NULL);
> - return 1;
> - }
> - struct vy_entry entry = vy_history_last_stmt(&src->history);
> - cmp = vy_read_iterator_cmp_stmt(itr, entry, *next);
> - if (cmp > 0) {
> - /*
> - * Memory trees are append-only so if the
> - * source is not on top of the heap after
> - * restoration, it was not before.
> - */
> - assert(src->front_id < itr->front_id);
> - return 0;
> - }
> - if (cmp < 0) {
> - /*
> - * The new statement precedes the current
> - * candidate for the next key.
> - */
> - *next = entry;
> - itr->front_id++;
> - } else {
> - /*
> - * The new statement updates the next key.
> - * Make sure we don't read the old value
> - * from the cache while applying UPSERTs.
> - */
> - struct vy_read_src *cache_src = &itr->src[itr->cache_src];
> - if (cache_src->front_id == itr->front_id)
> - vy_history_cleanup(&cache_src->history);
> - }
> - src->front_id = itr->front_id;
> - return 0;
> -}
> -
> static void
> vy_read_iterator_restore(struct vy_read_iterator *itr);
>
> @@ -536,10 +477,8 @@ rescan_disk:
> * as it is owned exclusively by the current fiber so the only
> * source to check is the active in-memory tree.
> */
> - int rc = vy_read_iterator_restore_mem(itr, &next);
> - if (rc < 0)
> - return -1;
> - if (rc > 0) {
> + struct vy_mem_iterator *mem_itr = &itr->src[itr->mem_src].mem_iterator;
> + if (mem_itr->version != mem_itr->mem->version) {
> vy_read_iterator_restore(itr);
> goto restart;
> }
>
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 12:24 ` Nikita Pettik
2020-06-19 12:42 ` Konstantin Osipov
@ 2020-06-19 13:01 ` Aleksandr Lyapunov
2020-06-24 13:41 ` Nikita Pettik
2020-06-20 16:33 ` Vladislav Shpilevoy
2 siblings, 1 reply; 12+ messages in thread
From: Aleksandr Lyapunov @ 2020-06-19 13:01 UTC (permalink / raw)
To: Nikita Pettik; +Cc: tarantool-patches, v.shpilevoy
Great, lgtm.
Btw, I didn't try to run tests, I hope you did it.
On 6/19/20 3:24 PM, Nikita Pettik wrote:
>
> Ok, I'm fine with it. Consider this change:
>
> diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> index 62a8722d9..409796910 100644
> --- a/src/box/vy_read_iterator.c
> +++ b/src/box/vy_read_iterator.c
> @@ -378,65 +378,6 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
> return 0;
> }
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 12:24 ` Nikita Pettik
2020-06-19 12:42 ` Konstantin Osipov
2020-06-19 13:01 ` Aleksandr Lyapunov
@ 2020-06-20 16:33 ` Vladislav Shpilevoy
2 siblings, 0 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-06-20 16:33 UTC (permalink / raw)
To: tarantool-patches
On 19/06/2020 14:24, Nikita Pettik wrote:
> On 16 Jun 15:10, Aleksandr Lyapunov wrote:
>> Thanks for the patch! See my 2 comments below:
>>
>> On 6/1/20 7:46 PM, Nikita Pettik wrote:
>>> + if (vy_mem_tree_iterator_is_invalid(&src->mem_iterator.curr_pos)) {
>>> + assert(src->mem_iterator.curr.stmt == NULL);
>>> + return 1;
>>> + }
>> I'm afraid that the iterator will not always be invalid in the given case.
>> As I see, if a mem holds any older tuple with the same key (e.g. older
>> version of that tuple), the restoration routine will find the older tuple
>> with the non-invalid iterator.
>> I also think that mem_restore must handle all the possible data
>> changes by itself without concern of read_iterator.
>
> You are likely to be right, but I followed suggestion below
> to simplify resotration procedure.
What is a problem if the older tuple is returned? This is the
expected behaviour, isn't it?
>>> - if (vy_read_iterator_restore_mem(itr, &next) != 0)
>>> + int rc = vy_read_iterator_restore_mem(itr, &next);
>>> + if (rc < 0)
>>> return -1;
>>> + if (rc > 0) {
>>> + vy_read_iterator_restore(itr);
>>> + goto restart;
>>> + }
> diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> index 62a8722d9..409796910 100644
> --- a/src/box/vy_read_iterator.c
> +++ b/src/box/vy_read_iterator.c
> @@ -536,10 +477,8 @@ rescan_disk:
> * as it is owned exclusively by the current fiber so the only
> * source to check is the active in-memory tree.
> */
> - int rc = vy_read_iterator_restore_mem(itr, &next);
> - if (rc < 0)
> - return -1;
> - if (rc > 0) {
> + struct vy_mem_iterator *mem_itr = &itr->src[itr->mem_src].mem_iterator;
> + if (mem_itr->version != mem_itr->mem->version) {
> vy_read_iterator_restore(itr);
> goto restart;
I don't like this solution, because it is easy to change mem version while
it was used in a read iterator. Previously it was solved without returning
back to disk, so the iterator couldn't "infinitely" follow the disk <-> mem
cycle, when lots of updates land to this mem. Now it can.
I tend to agree here with Kostja. I don't understand why should we restart
the whole iterator when only mem has changed. This looks like a regression.
However my LGTM is not necessary here, after Alexander L. gave it. So I am not
going to insist on anything.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 12:34 ` Nikita Pettik
@ 2020-06-20 16:33 ` Vladislav Shpilevoy
0 siblings, 0 replies; 12+ messages in thread
From: Vladislav Shpilevoy @ 2020-06-20 16:33 UTC (permalink / raw)
To: Nikita Pettik; +Cc: tarantool-patches
>> Talking of the tests, I tried running multiple instances of the new
>> test:
>>
>> python test-run.py gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un gh-3395-read-prepared-un
>>
>> And it fails sometimes. So it is flaky. Better fix it before push.
>
> I've added a tiny fix:
>
> diff --git a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
> index 10b70dd85..70136ddff 100644
> --- a/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
> +++ b/test/vinyl/gh-3395-read-prepared-uncommitted.test.lua
> @@ -55,9 +55,7 @@ function read_prepared_with_delay(is_tx_faster_than_wal)
> errinj.set("ERRINJ_RELAY_FASTER_THAN_TX", true)
> end
> errinj.set("ERRINJ_WAL_DELAY", false)
> - if not is_tx_faster_than_wal then
> - fiber.sleep(0.1)
> - end
> + fiber.sleep(0.1)
> errinj.set("ERRINJ_VY_READ_PAGE_DELAY", false)
> local res = c:get()
> errinj.set("ERRINJ_WAL_WRITE", false)
>
>
> Could you please check if it still fails on your machine?
Now it works.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 12:42 ` Konstantin Osipov
@ 2020-06-23 5:15 ` Aleksandr Lyapunov
2020-06-23 11:08 ` Konstantin Osipov
0 siblings, 1 reply; 12+ messages in thread
From: Aleksandr Lyapunov @ 2020-06-23 5:15 UTC (permalink / raw)
To: Konstantin Osipov, Nikita Pettik, tarantool-patches, v.shpilevoy
>>> The read iterator was rewritten several times and still have at least
>>> several bugs. I think we should admit that we cannot support such a
>>> complicated solution. How about some stupid solution: if ANY change
>>> has been happened during yield - restart advancing?
> Does this statement have technical merit? Is it supported by
> tests? I'd gladly support the change if it was grounded in reason -
> evaluation of the performance impact, for example, could serve as
> a confirmation that a simple solution would be just fine.
>
> Without it, I'd it's a regress, a signature of helplessness,
> lack of courage to make things right.
Two reasons:
1. since we have to read from disk it's not a big deal to scan memory again.
2. a dozen of lines of code is removed - the cost of support is lowered
a bit.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-23 5:15 ` Aleksandr Lyapunov
@ 2020-06-23 11:08 ` Konstantin Osipov
0 siblings, 0 replies; 12+ messages in thread
From: Konstantin Osipov @ 2020-06-23 11:08 UTC (permalink / raw)
To: Aleksandr Lyapunov; +Cc: v.shpilevoy, tarantool-patches
* Aleksandr Lyapunov <alyapunov@tarantool.org> [20/06/23 08:16]:
>
> > > > The read iterator was rewritten several times and still have at least
> > > > several bugs. I think we should admit that we cannot support such a
> > > > complicated solution. How about some stupid solution: if ANY change
> > > > has been happened during yield - restart advancing?
> > Does this statement have technical merit? Is it supported by
> > tests? I'd gladly support the change if it was grounded in reason -
> > evaluation of the performance impact, for example, could serve as
> > a confirmation that a simple solution would be just fine.
> >
> > Without it, I'd it's a regress, a signature of helplessness,
> > lack of courage to make things right.
> Two reasons:
> 1. since we have to read from disk it's not a big deal to scan memory again.
Makes sense. Make sense now, when we yield only to read from disk.
In future we may consider yielding for other reasons, but this
looks like something rather distant..
> 2. a dozen of lines of code is removed - the cost of support is lowered a
> bit.
--
Konstantin Osipov, Moscow, Russia
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL
2020-06-19 13:01 ` Aleksandr Lyapunov
@ 2020-06-24 13:41 ` Nikita Pettik
0 siblings, 0 replies; 12+ messages in thread
From: Nikita Pettik @ 2020-06-24 13:41 UTC (permalink / raw)
To: Aleksandr Lyapunov; +Cc: tarantool-patches, v.shpilevoy
On 19 Jun 16:01, Aleksandr Lyapunov wrote:
> Great, lgtm.
Pushed to master, 2.4, 2.3 and 1.10. Changelogs are updated
correspondingly. Branch is dropped.
> Btw, I didn't try to run tests, I hope you did it.
>
> On 6/19/20 3:24 PM, Nikita Pettik wrote:
> >
> > Ok, I'm fine with it. Consider this change:
> >
> > diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
> > index 62a8722d9..409796910 100644
> > --- a/src/box/vy_read_iterator.c
> > +++ b/src/box/vy_read_iterator.c
> > @@ -378,65 +378,6 @@ vy_read_iterator_scan_disk(struct vy_read_iterator *itr, uint32_t disk_src,
> > return 0;
> > }
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2020-06-24 13:41 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-01 16:46 [Tarantool-patches] [PATCH v2] vinyl: restart read iterator in case of rolled back WAL Nikita Pettik
2020-06-01 17:40 ` Vladislav Shpilevoy
2020-06-19 12:34 ` Nikita Pettik
2020-06-20 16:33 ` Vladislav Shpilevoy
2020-06-16 12:10 ` Aleksandr Lyapunov
2020-06-19 12:24 ` Nikita Pettik
2020-06-19 12:42 ` Konstantin Osipov
2020-06-23 5:15 ` Aleksandr Lyapunov
2020-06-23 11:08 ` Konstantin Osipov
2020-06-19 13:01 ` Aleksandr Lyapunov
2020-06-24 13:41 ` Nikita Pettik
2020-06-20 16:33 ` Vladislav Shpilevoy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox