Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: tml <tarantool-patches@dev.tarantool.org>,
	Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v22 3/3] test: add gh-6036-qsync-order test
Date: Wed, 13 Oct 2021 11:20:49 +0300	[thread overview]
Message-ID: <78fb87b0-ff39-d2f7-13bd-22148673f753@tarantool.org> (raw)
In-Reply-To: <YWXv7TO9TJttV+Yb@grain>



12.10.2021 23:28, Cyrill Gorcunov пишет:
> On Tue, Oct 12, 2021 at 12:47:06PM +0300, Serge Petrenko wrote:
>>
>> 11.10.2021 22:16, Cyrill Gorcunov пишет:
>>> To test that promotion requests are handled only when appropriate
>>> write to WAL completes, because we update memory data before the
>>> write finishes.
>>>
>>> Note that without the patch "qsync: order access to the limbo terms"
>>> this test fires the assertion
>>>
>>>> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.
>> Thanks for the changes!
>> Sorry for the nitpicking, there are just a couple of minor comments left.
> Thanks! Here is force-pushed variant. Please take a look, hopefully I counted
> number of writes correctly, lets wait for CI results.

Thanks!
Looks like you've updated the branch for v21 instead of v22.
Please make sure everything's consistent and repush to v22.
Or maybe to a branch with no postfix so that no one is confused.

The test itself looks good now, but I see a lot of CI failures, like this:

https://github.com/tarantool/tarantool/runs/3875211570#step:4:9709
https://github.com/tarantool/tarantool/runs/3875211924#step:5:6039
https://github.com/tarantool/tarantool/runs/3875211332#step:5:6192

and so on.

When I run the gh-5566 test locally, with debug build I get the
following assertion failure:

[001] replication/gh-5566-final-join-synchro.test.lua
[001]
[001] [Instance "replica" killed by signal: 6 (SIGABRT)]
[001]
[001] Last 15 lines of Tarantool Log file [Instance 
"replica"][/Users/s.petrenko/Source/tarantool/test/var/001_replication/replica.log]:

<stripped>

[001] Assertion failed: (l->owner != fiber()), function 
latch_lock_timeout, file 
/Users/s.petrenko/Source/tarantool/src/lib/core/latch.h, line 122.
[001] [ fail ]

> ---
>  From 49f05ca2b31512b6555aecf1bb4d3ac1ce59729a Mon Sep 17 00:00:00 2001
> From: Cyrill Gorcunov <gorcunov@gmail.com>
> Date: Mon, 20 Sep 2021 17:22:38 +0300
> Subject: [PATCH v22 3/3] test: add gh-6036-qsync-order test
>
> To test that promotion requests are handled only when appropriate
> write to WAL completes, because we update memory data before the
> write finishes.
>
> Note that without the patch "qsync: order access to the limbo terms"
> this test fires the assertion
>
>> tarantool: src/box/txn_limbo.c:481: txn_limbo_read_rollback: Assertion `e->txn->signature >= 0' failed.
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>   test/replication/gh-6036-qsync-order.result   | 194 ++++++++++++++++++
>   test/replication/gh-6036-qsync-order.test.lua |  94 +++++++++
>   test/replication/suite.cfg                    |   1 +
>   test/replication/suite.ini                    |   2 +-
>   4 files changed, 290 insertions(+), 1 deletion(-)
>   create mode 100644 test/replication/gh-6036-qsync-order.result
>   create mode 100644 test/replication/gh-6036-qsync-order.test.lua
>
> diff --git a/test/replication/gh-6036-qsync-order.result b/test/replication/gh-6036-qsync-order.result
> new file mode 100644
> index 000000000..0e93d429b
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-order.result
> @@ -0,0 +1,194 @@
> +-- test-run result file version 2
> +--
> +-- gh-6036: verify that terms are locked when we're inside journal
> +-- write routine, because parallel appliers may ignore the fact that
> +-- the term is updated already but not yet written leading to data
> +-- inconsistency.
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +SERVERS={"election_replica1", "election_replica2", "election_replica3"}
> + | ---
> + | ...
> +test_run:create_cluster(SERVERS, "replication", {args='1 nil manual 1'})
> + | ---
> + | ...
> +test_run:wait_fullmesh(SERVERS)
> + | ---
> + | ...
> +
> +--
> +-- Create a synchro space on the master node and make
> +-- sure the write processed just fine.
> +test_run:switch("election_replica1")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +s = box.schema.create_space('test', {is_sync = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +s:insert{1}
> + | ---
> + | - [1]
> + | ...
> +
> +test_run:wait_lsn('election_replica2', 'election_replica1')
> + | ---
> + | ...
> +test_run:wait_lsn('election_replica3', 'election_replica1')
> + | ---
> + | ...
> +
> +--
> +-- Drop connection between election_replica1 and election_replica2.
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./election_replica1.sock",   \
> +        "unix/:./election_replica3.sock",   \
> +    },                                      \
> +})
> + | ---
> + | ...
> +
> +--
> +-- Drop connection between election_replica2 and election_replica1.
> +test_run:switch("election_replica2")
> + | ---
> + | - true
> + | ...
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./election_replica2.sock",   \
> +        "unix/:./election_replica3.sock",   \
> +    },                                      \
> +})
> + | ---
> + | ...
> +
> +--
> +-- Here we have the following scheme
> +--
> +--              election_replica3 (will be delayed)
> +--              /                \
> +--    election_replica1    election_replica2
> +
> +--
> +-- Initiate disk delay in a bit tricky way: the next write will
> +-- fall into forever sleep.
> +test_run:switch("election_replica3")
> + | ---
> + | - true
> + | ...
> +write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
> + | ---
> + | ...
> +box.error.injection.set("ERRINJ_WAL_DELAY", true)
> + | ---
> + | - ok
> + | ...
> +--
> +-- Make election_replica2 been a leader and start writting data,
> +-- the PROMOTE request get queued on election_replica3 and not
> +-- yet processed, same time INSERT won't complete either
> +-- waiting for PROMOTE completion first. Note that we
> +-- enter election_replica3 as well just to be sure the PROMOTE
> +-- reached it.
> +test_run:switch("election_replica2")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +test_run:switch("election_replica3")
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") > write_cnt end)
> + | ---
> + | - true
> + | ...
> +test_run:switch("election_replica2")
> + | ---
> + | - true
> + | ...
> +box.space.test:insert{2}
> + | ---
> + | - [2]
> + | ...
> +
> +--
> +-- The election_replica1 node has no clue that there is a new leader
> +-- and continue writing data with obsolete term. Since election_replica3
> +-- is delayed now the INSERT won't proceed yet but get queued.
> +test_run:switch("election_replica1")
> + | ---
> + | - true
> + | ...
> +box.space.test:insert{3}
> + | ---
> + | - [3]
> + | ...
> +
> +--
> +-- Finally enable election_replica3 back. Make sure the data from new election_replica2
> +-- leader get writing while old leader's data ignored.
> +test_run:switch("election_replica3")
> + | ---
> + | - true
> + | ...
> +assert(box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") == write_cnt + 2)
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
> + | ---
> + | - true
> + | ...
> +box.space.test:select{}
> + | ---
> + | - - [1]
> + |   - [2]
> + | ...
> +
> +test_run:switch("default")
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server election_replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server election_replica2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server election_replica3')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('delete server election_replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server election_replica2')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server election_replica3')
> + | ---
> + | - true
> + | ...
> diff --git a/test/replication/gh-6036-qsync-order.test.lua b/test/replication/gh-6036-qsync-order.test.lua
> new file mode 100644
> index 000000000..20030161e
> --- /dev/null
> +++ b/test/replication/gh-6036-qsync-order.test.lua
> @@ -0,0 +1,94 @@
> +--
> +-- gh-6036: verify that terms are locked when we're inside journal
> +-- write routine, because parallel appliers may ignore the fact that
> +-- the term is updated already but not yet written leading to data
> +-- inconsistency.
> +--
> +test_run = require('test_run').new()
> +
> +SERVERS={"election_replica1", "election_replica2", "election_replica3"}
> +test_run:create_cluster(SERVERS, "replication", {args='1 nil manual 1'})
> +test_run:wait_fullmesh(SERVERS)
> +
> +--
> +-- Create a synchro space on the master node and make
> +-- sure the write processed just fine.
> +test_run:switch("election_replica1")
> +box.ctl.promote()
> +s = box.schema.create_space('test', {is_sync = true})
> +_ = s:create_index('pk')
> +s:insert{1}
> +
> +test_run:wait_lsn('election_replica2', 'election_replica1')
> +test_run:wait_lsn('election_replica3', 'election_replica1')
> +
> +--
> +-- Drop connection between election_replica1 and election_replica2.
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./election_replica1.sock",   \
> +        "unix/:./election_replica3.sock",   \
> +    },                                      \
> +})
> +
> +--
> +-- Drop connection between election_replica2 and election_replica1.
> +test_run:switch("election_replica2")
> +box.cfg({                                   \
> +    replication = {                         \
> +        "unix/:./election_replica2.sock",   \
> +        "unix/:./election_replica3.sock",   \
> +    },                                      \
> +})
> +
> +--
> +-- Here we have the following scheme
> +--
> +--              election_replica3 (will be delayed)
> +--              /                \
> +--    election_replica1    election_replica2
> +
> +--
> +-- Initiate disk delay in a bit tricky way: the next write will
> +-- fall into forever sleep.
> +test_run:switch("election_replica3")
> +write_cnt = box.error.injection.get("ERRINJ_WAL_WRITE_COUNT")
> +box.error.injection.set("ERRINJ_WAL_DELAY", true)
> +--
> +-- Make election_replica2 been a leader and start writting data,
> +-- the PROMOTE request get queued on election_replica3 and not
> +-- yet processed, same time INSERT won't complete either
> +-- waiting for PROMOTE completion first. Note that we
> +-- enter election_replica3 as well just to be sure the PROMOTE
> +-- reached it.
> +test_run:switch("election_replica2")
> +box.ctl.promote()
> +test_run:switch("election_replica3")
> +test_run:wait_cond(function() return box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") > write_cnt end)
> +test_run:switch("election_replica2")
> +box.space.test:insert{2}
> +
> +--
> +-- The election_replica1 node has no clue that there is a new leader
> +-- and continue writing data with obsolete term. Since election_replica3
> +-- is delayed now the INSERT won't proceed yet but get queued.
> +test_run:switch("election_replica1")
> +box.space.test:insert{3}
> +
> +--
> +-- Finally enable election_replica3 back. Make sure the data from new election_replica2
> +-- leader get writing while old leader's data ignored.
> +test_run:switch("election_replica3")
> +assert(box.error.injection.get("ERRINJ_WAL_WRITE_COUNT") == write_cnt + 2)
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
> +box.space.test:select{}
> +
> +test_run:switch("default")
> +test_run:cmd('stop server election_replica1')
> +test_run:cmd('stop server election_replica2')
> +test_run:cmd('stop server election_replica3')
> +
> +test_run:cmd('delete server election_replica1')
> +test_run:cmd('delete server election_replica2')
> +test_run:cmd('delete server election_replica3')
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index 3eee0803c..ed09b2087 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -59,6 +59,7 @@
>       "gh-6094-rs-uuid-mismatch.test.lua": {},
>       "gh-6127-election-join-new.test.lua": {},
>       "gh-6035-applier-filter.test.lua": {},
> +    "gh-6036-qsync-order.test.lua": {},
>       "election-candidate-promote.test.lua": {},
>       "*": {
>           "memtx": {"engine": "memtx"},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index 77eb95f49..080e4fbf4 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>   script =  master.lua
>   description = tarantool/box, replication
>   disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5430-qsync-promote-crash.test.lua gh-5430-cluster-mvcc.test.lua  gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua gh-5447-downstream-lag.test.lua gh-4040-invalid-msgpack.test.lua gh-6036-qsync-order.test.lua
>   config = suite.cfg
>   lua_libs = lua/fast_replica.lua lua/rlimit.lua
>   use_unix_sockets = True

-- 
Serge Petrenko


      reply	other threads:[~2021-10-13  8:20 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-11 19:16 [Tarantool-patches] [PATCH v22 0/3] qsync: implement packet filtering (part 1) Cyrill Gorcunov via Tarantool-patches
2021-10-11 19:16 ` [Tarantool-patches] [PATCH v22 1/3] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
2021-10-11 19:16 ` [Tarantool-patches] [PATCH v22 2/3] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
2021-10-12  9:40   ` Serge Petrenko via Tarantool-patches
2021-10-12 20:26     ` Cyrill Gorcunov via Tarantool-patches
2021-10-13  7:56       ` Serge Petrenko via Tarantool-patches
2021-10-11 19:16 ` [Tarantool-patches] [PATCH v22 3/3] test: add gh-6036-qsync-order test Cyrill Gorcunov via Tarantool-patches
2021-10-12  9:47   ` Serge Petrenko via Tarantool-patches
2021-10-12 20:28     ` Cyrill Gorcunov via Tarantool-patches
2021-10-13  8:20       ` Serge Petrenko via Tarantool-patches [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=78fb87b0-ff39-d2f7-13bd-22148673f753@tarantool.org \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=sergepetrenko@tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v22 3/3] test: add gh-6036-qsync-order test' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox