[Tarantool-patches] [PATCH v17 5/5] test: add gh-6036-term-order test
Vladislav Shpilevoy
v.shpilevoy at tarantool.org
Sun Sep 26 17:30:38 MSK 2021
Thanks for the patch!
See 6 comments below.
> diff --git a/test/replication/gh-6036-order-node.lua b/test/replication/gh-6036-order-node.lua
> new file mode 100644
> index 000000000..b22a7cb4c
> --- /dev/null
> +++ b/test/replication/gh-6036-order-node.lua
> @@ -0,0 +1,60 @@
> +local INSTANCE_ID = string.match(arg[0], "gh%-6036%-order%-(.+)%.lua")
> +
> +local function unix_socket(name)
> + return "unix/:./" .. name .. '.sock';
> +end
> +
> +require('console').listen(os.getenv('ADMIN'))
> +
> +if INSTANCE_ID == "master" then
> + box.cfg({
> + listen = unix_socket(INSTANCE_ID),
> + replication = {
> + unix_socket(INSTANCE_ID),
> + unix_socket("replica1"),
> + unix_socket("replica2"),
> + },
> + replication_connect_quorum = 1,
> + replication_synchro_quorum = 1,
> + replication_synchro_timeout = 10000,
> + replication_sync_timeout = 5,
1. Why do you need sync_timeout 5?
> + read_only = false,
> + election_mode = "off",
> + })
> +elseif INSTANCE_ID == "replica1" then
> + box.cfg({
> + listen = unix_socket(INSTANCE_ID),
> + replication = {
> + unix_socket("master"),
> + unix_socket(INSTANCE_ID),
> + unix_socket("replica2"),
> + },
> + replication_connect_quorum = 1,
> + replication_synchro_quorum = 1,
> + replication_synchro_timeout = 10000,
> + replication_sync_timeout = 5,
> + read_only = false,
> + election_mode = "off",
> + })
> +else
> + assert(INSTANCE_ID == "replica2")
> + box.cfg({
> + listen = unix_socket(INSTANCE_ID),
> + replication = {
> + unix_socket("master"),
> + unix_socket("replica1"),
> + unix_socket(INSTANCE_ID),
> + },
> + replication_connect_quorum = 1,
> + replication_synchro_quorum = 1,
> + replication_synchro_timeout = 10000,
> + replication_sync_timeout = 5,
> + read_only = true,
> + election_mode = "off",
> + })
> +end
> +
> +--box.ctl.wait_rw()
2. Please, remove commented out code.
> +box.once("bootstrap", function()
> + box.schema.user.grant('guest', 'super')
> +end)
> diff --git a/test/replication/gh-6036-term-order.result b/test/replication/gh-6036-term-order.result
> new file mode 100644
> index 000000000..6b19fc2c8
> --- /dev/null
> +++ b/test/replication/gh-6036-term-order.result
3. Please, use prefix gh-####-qsync to be consistent with other qsync tests. Having
'qsync' in the test name helps to run all qsync tests in a single command
python test-run.py qsync
> @@ -0,0 +1,216 @@
> +-- test-run result file version 2
> +--
> +-- gh-6036: verify that terms are locked when we're inside journal
> +-- write routine, because parallel appliers may ignore the fact that
> +-- the term is updated already but not yet written leading to data
> +-- inconsistency.
> +--
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +test_run:cmd('create server master with script="replication/gh-6036-order-master.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica1 with script="replication/gh-6036-order-replica1.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica2 with script="replication/gh-6036-order-replica2.lua"')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('start server master with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica1 with wait=False')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica2 with wait=False')
> + | ---
> + | - true
> + | ...
> +
> +test_run:wait_fullmesh({"master", "replica1", "replica2"})
> + | ---
> + | ...
> +
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.ctl.demote()
> + | ---
> + | ...
> +
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +box.ctl.demote()
> + | ---
> + | ...
> +
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +box.ctl.demote()
4. I dropped all 3 demotes and the test passed. Why do you need them?
> + | ---
> + | ...
> +
> +--
> +-- Drop connection between master and replica1.
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.cfg({ \
> + replication = { \
> + "unix/:./master.sock", \
> + "unix/:./replica2.sock", \
> + }, \
> +})
> + | ---
> + | ...
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +box.cfg({ \
> + replication = { \
> + "unix/:./replica1.sock", \
> + "unix/:./replica2.sock", \
> + }, \
> +})
> + | ---
> + | ...
> +
> +--
> +-- Initiate disk delay and remember the max term seen so far.
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_WAL_DELAY', true)
> + | ---
> + | - ok
> + | ...
> +term_max_replica2 = box.info.synchro.promote.term_max
> + | ---
> + | ...
> +
> +--
> +-- Ping-pong the promote action between master and
> +-- replica1 nodes, the term updates get queued on
> +-- replica2 because of disk being disabled.
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +box.ctl.demote()
> + | ---
> + | ...
> +
> +test_run:switch("replica1")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +box.ctl.demote()
> + | ---
> + | ...
> +
> +test_run:switch("master")
> + | ---
> + | - true
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +
> +--
> +-- Since we're guarding max promote term make sure that
> +-- 1) The max term has not yet been updated because WAL
> +-- is in sleeping state.
> +-- 2) Max term on master and replica1 nodes are greater
> +-- than we have now because terms update is locked.
> +-- 3) Once WAL is unlocked we make sure that terms has
> +-- reached us.
> +test_run:switch("replica2")
> + | ---
> + | - true
> + | ...
> +assert(term_max_replica2 == box.info.synchro.promote.term_max)
> + | ---
> + | - true
> + | ...
> +
> +term_max_master = test_run:eval('master', 'box.info.synchro.promote.term_max')[1]
> + | ---
> + | ...
> +term_max_replica1 = test_run:eval('replica1', 'box.info.synchro.promote.term_max')[1]
> + | ---
> + | ...
> +assert(term_max_master > term_max_replica2)
> + | ---
> + | - true
> + | ...
> +assert(term_max_replica1 > term_max_replica2)
> + | ---
> + | - true
> + | ...
> +term_max_wait4 = term_max_master
> + | ---
> + | ...
> +if term_max_wait4 < term_max_replica1 then term_max_wait4 = term_max_replica1 end
5. How is it possible? The master did more promotes, it should have a
bigger term for sure.
> + | ---
> + | ...
> +
> +box.error.injection.set('ERRINJ_WAL_DELAY', false)
> + | ---
> + | - ok
> + | ...
> +test_run:wait_cond(function() return box.info.synchro.promote.term_max == term_max_wait4 end)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch("default")
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica2')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('delete server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica1')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica2')
6. Can you add some data to the test? Which before the patches was applied, and now
is rejected/nopified. Otherwise you added a lock, tested the lock, but if I move
the data filtering before the lock, your test still will pass.
More information about the Tarantool-patches
mailing list