[Tarantool-patches] [PATCH v30 3/3] test: add gh-6036-qsync-order test

Serge Petrenko sergepetrenko at tarantool.org
Mon Feb 28 11:13:06 MSK 2022



24.02.2022 23:18, Cyrill Gorcunov wrote:
> To test that promotion requests are handled only when appropriate
> write to WAL completes, because we update memory data before the
> write finishes.
>
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov at gmail.com>

Thanks for the patch and the fixes overall!

The test finally works fine on my machine.
I've experienced some flakiness, but I was able to fix
it with the following diff. Please consider:

======================

diff --git a/test/replication-luatest/gh_6036_qsync_order_test.lua b/test/replication-luatest/gh_6036_qsync_order_test.lua
index 95ed3a517..d71739dcc 100644
--- a/test/replication-luatest/gh_6036_qsync_order_test.lua
+++ b/test/replication-luatest/gh_6036_qsync_order_test.lua
@@ -142,10 +142,19 @@ g.test_qsync_order = function(cg)
      cg.r2:wait_vclock(vclock)
      cg.r3:wait_vclock(vclock)

+    -- Drop connection between r1 and the rest of the cluster.
+    -- Otherwise r1 might become Raft follower before attempting insert{4}.
+    cg.r1:exec(function() box.cfg{replication=""} end)
      cg.r3:exec(function()
          box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 2)
          require('fiber').create(function() box.ctl.promote() end)
      end)
+    t.helpers.retrying({}, function()
+        t.assert(cg.r3:exec(function()
+            return box.info.synchro.queue.latched
+        end))
+    end)
+    t.assert(cg.r1:exec(function() return box.info.ro == false end))
      cg.r1:eval("box.space.test:insert{4}")
      cg.r3:exec(function()
          assert(box.info.synchro.queue.latched == true)

=======================

Also please address a couple of style-related comments below:


> ---
>   .../gh_6036_qsync_order_test.lua              | 157 ++++++++++++++++++
>   test/replication-luatest/suite.ini            |   1 +
>   2 files changed, 158 insertions(+)
>   create mode 100644 test/replication-luatest/gh_6036_qsync_order_test.lua
>
> diff --git a/test/replication-luatest/gh_6036_qsync_order_test.lua b/test/replication-luatest/gh_6036_qsync_order_test.lua
> new file mode 100644
> index 000000000..95ed3a517
> --- /dev/null
> +++ b/test/replication-luatest/gh_6036_qsync_order_test.lua
> @@ -0,0 +1,157 @@
> +local t = require('luatest')
> +local cluster = require('test.luatest_helpers.cluster')
> +local server = require('test.luatest_helpers.server')
> +local fiber = require('fiber')
> +
> +local g = t.group('gh-6036')
> +
> +g.before_each(function(cg)
> +    cg.cluster = cluster:new({})
> +
> +    local box_cfg = {
> +        replication = {
> +            server.build_instance_uri('r1'),
> +            server.build_instance_uri('r2'),
> +            server.build_instance_uri('r3'),
> +        },
> +        replication_timeout         = 0.1,
> +        replication_connect_quorum  = 1,
> +        election_mode               = 'manual',
> +        election_timeout            = 0.1,
> +        replication_synchro_quorum  = 1,
> +        replication_synchro_timeout = 0.1,
> +        log_level                   = 6,
> +    }
> +
> +    cg.r1 = cg.cluster:build_server({ alias = 'r1', box_cfg = box_cfg })
> +    cg.r2 = cg.cluster:build_server({ alias = 'r2', box_cfg = box_cfg })
> +    cg.r3 = cg.cluster:build_server({ alias = 'r3', box_cfg = box_cfg })
> +
> +    cg.cluster:add_server(cg.r1)
> +    cg.cluster:add_server(cg.r2)
> +    cg.cluster:add_server(cg.r3)
> +    cg.cluster:start()
> +end)
> +
> +g.after_each(function(cg)
> +    cg.cluster:drop()
> +    cg.cluster.servers = nil
> +end)
> +
> +g.test_qsync_order = function(cg)
> +    cg.cluster:wait_fullmesh()
> +
> +    --
> +    -- Create a synchro space on the r1 node and make
> +    -- sure the write processed just fine.
> +    cg.r1:exec(function()
> +        box.ctl.promote()
> +        box.ctl.wait_rw()
> +        local s = box.schema.create_space('test', {is_sync = true})
> +        s:create_index('pk')
> +        s:insert{1}
> +    end)
> +
> +    local vclock = cg.r1:get_vclock()
> +    vclock[0] = nil
> +    cg.r2:wait_vclock(vclock)
> +    cg.r3:wait_vclock(vclock)
> +
> +    t.assert_equals(cg.r1:eval("return box.space.test:select()"), {{1}})
> +    t.assert_equals(cg.r2:eval("return box.space.test:select()"), {{1}})
> +    t.assert_equals(cg.r3:eval("return box.space.test:select()"), {{1}})
> +
> +    local function update_replication(...)
> +        return (box.cfg{ replication = { ... } })
> +    end
> +
> +    --
> +    -- Drop connection between r1 and r2.
> +    cg.r1:exec(update_replication, {
> +            server.build_instance_uri("r1"),
> +            server.build_instance_uri("r3"),
> +        })
> +
> +    --
> +    -- Drop connection between r2 and r1.
> +    cg.r2:exec(update_replication, {
> +        server.build_instance_uri("r2"),
> +        server.build_instance_uri("r3"),
> +    })
> +
> +    --
> +    -- Here we have the following scheme
> +    --
> +    --      r3 (WAL delay)
> +    --      /            \
> +    --    r1              r2
> +    --
> +
> +    --
> +    -- Initiate disk delay in a bit tricky way: the next write will
> +    -- fall into forever sleep.
> +    cg.r3:eval("box.error.injection.set('ERRINJ_WAL_DELAY', true)")

1. Sometimes you use 'eval' and sometimes you use 'exec', and I don't see
    a pattern behind that. Please check every case with 'eval' and replace it
    with 'exec' when possible.
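
    For instance, the 'eval' calls could be rewritten roughly like this. It's
    an untested sketch that only reuses calls already present in this test
    (cg.r3, exec with a function argument, t.assert_equals):

```lua
-- Sketch only: the same statements as in the patch, passed to exec()
-- as functions instead of being passed to eval() as strings.
cg.r3:exec(function()
    box.error.injection.set('ERRINJ_WAL_DELAY', true)
end)

-- A value returned from the function is handed back to the test,
-- so the select() checks can go through exec() too.
t.assert_equals(cg.r3:exec(function()
    return box.space.test:select()
end), {{1}, {2}})
```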

> +
> +    --
> +    -- Make r2 been a leader and start writting data, the PROMOTE
> +    -- request get queued on r3 and not yet processed, same time
> +    -- the INSERT won't complete either waiting for the PROMOTE
> +    -- completion first. Note that we enter r3 as well just to be
> +    -- sure the PROMOTE has reached it via queue state test.
> +    cg.r2:exec(function()
> +        box.ctl.promote()
> +        box.ctl.wait_rw()
> +    end)
> +    t.helpers.retrying({}, function()
> +        assert(cg.r3:exec(function()
> +            return box.info.synchro.queue.latched == true
> +        end))

2. Here you use a plain 'assert' instead of 't.assert'. Please avoid
    plain assertions in luatest tests.
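
    Something along these lines, for both 'retrying' blocks in this test
    (an untested sketch; only the outer 'assert' changes, the function
    bodies are taken from the patch as is):

```lua
-- Wait until the PROMOTE request has reached r3 and latched the queue.
t.helpers.retrying({}, function()
    t.assert(cg.r3:exec(function()
        return box.info.synchro.queue.latched == true
    end))
end)

-- Wait until the row written by the new leader shows up on r3.
t.helpers.retrying({}, function()
    t.assert(cg.r3:exec(function()
        return box.space.test:get{2} ~= nil
    end))
end)
```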

> +    end)
> +    cg.r2:eval("box.space.test:insert{2}")

3. Like I already mentioned above, could you wrap that into an 'exec' instead?

> +
> +    --
> +    -- The r1 node has no clue that there is a new leader and continue
> +    -- writing data with obsolete term. Since r3 is delayed now
> +    -- the INSERT won't proceed yet but get queued.
> +    cg.r1:eval("box.space.test:insert{3}")
> +
> +    --
> +    -- Finally enable r3 back. Make sure the data from new r2 leader get
> +    -- writing while old leader's data ignored.
> +    cg.r3:eval("box.error.injection.set('ERRINJ_WAL_DELAY', false)")
> +    t.helpers.retrying({}, function()
> +        assert(cg.r3:exec(function()
> +            return box.space.test:get{2} ~= nil
> +        end))
> +    end)
> +
> +    t.assert_equals(cg.r3:eval("return box.space.test:select()"), {{1},{2}})
> +

4. You group two tests in one function. Let's extract the test below into
    a separate function, for example g.test_promote_order.

    First of all, you may get rid of the 3rd instance in this test (you only
    need 2 of them). Secondly, you currently enter the test with a dirty
    config left over from the previous test:
    r1 <-> r3 <-> r2 (no connection between r1 and r2).
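
    Roughly, the layout I have in mind is the skeleton below. It is only a
    sketch: the name g.test_promote_order is just a suggestion, and the
    bodies are placeholders for the code already in the patch. Since
    g.before_each builds a new cluster for every test function, the second
    test would start from a clean replication config.

```lua
g.test_qsync_order = function(cg)
    cg.cluster:wait_fullmesh()
    -- First scenario from the patch: everything up to and including
    -- the check that r3 contains {{1}, {2}}.
end

-- Hypothetical name; the body is the second half of the current test.
g.test_promote_order = function(cg)
    cg.cluster:wait_fullmesh()
    -- Second scenario from the patch: PROMOTE delayed in the WAL thread
    -- while another replica sends a new row. This test has to create
    -- the synchronous space itself, because the cluster is fresh.
end
```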

> +    --
> +    -- Make sure that while we're processing PROMOTE no other records
> +    -- get sneaked in via applier code from other replicas. For this
> +    -- sake initiate voting and stop inside wal thread just before
> +    -- PROMOTE get written. Another replica sends us new record and
> +    -- it should be dropped.
> +    cg.r1:exec(function()
> +        box.ctl.promote()
> +        box.ctl.wait_rw()
> +    end)
> +    vclock = cg.r1:get_vclock()
> +    vclock[0] = nil
> +    cg.r2:wait_vclock(vclock)
> +    cg.r3:wait_vclock(vclock)
> +
> +    cg.r3:exec(function()
> +        box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 2)
> +        require('fiber').create(function() box.ctl.promote() end)
> +    end)
> +    cg.r1:eval("box.space.test:insert{4}")
> +    cg.r3:exec(function()
> +        assert(box.info.synchro.queue.latched == true)
> +        box.error.injection.set('ERRINJ_WAL_DELAY', false)
> +        box.ctl.wait_rw()
> +    end)
> +
> +    t.assert_equals(cg.r3:eval("return box.space.test:select()"), {{1},{2}})
> +end
> diff --git a/test/replication-luatest/suite.ini b/test/replication-luatest/suite.ini
> index 374f1b87a..07ec93a52 100644
> --- a/test/replication-luatest/suite.ini
> +++ b/test/replication-luatest/suite.ini
> @@ -2,3 +2,4 @@
>   core = luatest
>   description = replication luatests
>   is_parallel = True
> +release_disabled = gh_6036_qsync_order_test.lua

-- 
Serge Petrenko


