[Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test
Serge Petrenko
sergepetrenko at tarantool.org
Fri Nov 13 12:18:49 MSK 2020
13.11.2020 09:03, Alexander V. Tikhonov wrote:
> Found that the replication/election_qsync_stress.test.lua test may fail on
> restarting instances. It occurs on heavily loaded hosts when the local
> call to stop an instance using SIGTERM fails to stop it. Decided to use
> SIGKILL in the local stop call options to be sure that the instance will
> be stopped.
>
> Also found that, when running the loop inline, new hangs occurred on server start:
>
> --- replication/election_qsync_stress.result Thu Nov 12 16:23:16 2020
> +++ var/128_replication/election_qsync_stress.result Thu Nov 12 16:31:22 2020
> @@ -323,7 +323,7 @@
> | ...
> test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
> | ---
> - | - true
> + | - false
> | ...
> @@ -380,7 +380,7 @@
> | ...
> test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
> | ---
> - | - true
> + | - false
> | ...
> @@ -494,687 +494,3 @@
> | ---
> | ...
> test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
> - | ---
> - | - true
> - | ...
>
> but the test had already failed before that on 'c.space.test:get{i}'.
> To avoid the hang and to make the test flow more correct, log.error
> messages and return calls were added. The test was also changed to run
> each loop iteration through a function, so that return values can be
> checked and the loop stopped right after a failure.
>
> Needed for #5395
Hi! Thanks for the patch!
Please see my comments below.
> ---
>
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/fix-election_qsync_stress
> Issue: https://github.com/tarantool/tarantool/issues/5395
>
> test/replication/election_qsync_stress.result | 33 ++++++++++++++-----
> .../election_qsync_stress.test.lua | 29 +++++++++++-----
> test/replication/suite.ini | 2 +-
> 3 files changed, 47 insertions(+), 17 deletions(-)
>
> diff --git a/test/replication/election_qsync_stress.result b/test/replication/election_qsync_stress.result
> index 9fab2f1d7..cdb00e34c 100644
> --- a/test/replication/election_qsync_stress.result
> +++ b/test/replication/election_qsync_stress.result
> @@ -5,6 +5,9 @@ test_run = require('test_run').new()
> netbox = require('net.box')
> | ---
> | ...
> +log = require('log')
> + | ---
> + | ...
>
> --
> -- gh-1146: Leader election + Qsync
> @@ -83,23 +86,32 @@ test_run:cmd('setopt delimiter ";"')
> | ---
> | - true
> | ...
> -for i = 1,10 do
> +function run_iter(i)
> c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
> c.space.test:insert({i}, {is_async=true})
> - test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
> - test_run:cmd('stop server '..old_leader)
> + if not test_run:wait_cond(function() return c.space.test ~= nil
> + and c.space.test:get{i} ~= nil end) then
It's pointless to wait for `c.space.test ~= nil` after `c.space.test:insert`.
If `c.space.test` is nil, `c.space.test:insert` will fail. Otherwise
`c.space.test` isn't nil already.
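A minimal sketch of what the simplified wait could look like (same c, i,
test_run and log as in the patch, just without the redundant space check):

    -- The space is known to exist once the insert above succeeded,
    -- so only the replicated row itself needs to be awaited.
    if not test_run:wait_cond(function() return c.space.test:get{i} ~= nil end) then
        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
        return false
    end
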
> + log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
> + return false
> + end
> + test_run:cmd('stop server '..old_leader..' with signal=KILL')
> nrs[old_leader_nr] = false
> - new_leader_nr = get_leader(nrs)
> - new_leader = 'election_replica'..new_leader_nr
> - leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> + local new_leader_nr = get_leader(nrs)
> + local new_leader = 'election_replica'..new_leader_nr
> + local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> c = netbox.connect(leader_port)
> c:eval('box.cfg{replication_synchro_timeout=1000}')
> c.space._schema:replace{'smth'}
> - c.space.test:get{i}
> + if not test_run:wait_cond(function() return c.space.test ~= nil
> + and c.space.test:get{i} ~= nil end) then
> + log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
> + return false
> + end
Same here. After a successful `space._schema:replace` on the new leader,
`c.space.test:get{i}` must succeed, because the insertion into _schema can
happen only after c.space.test has received row i from the old leader.
So if c.space.test:get{i} fails, say, because the old leader didn't send the
new row before dying, then wait_cond(c.space.test:get{i}) will also fail:
the row will never arrive.
In other words, a plain `c.space.test:get{i}` is enough here. And if
`c.space.test:get{i} == nil`, log an error message, like you do.
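A sketch of what that plain check could look like inside run_iter (same
variables as in the patch; the error message text is just an illustration):

    -- No wait_cond needed: the row either is already there or will never arrive.
    if c.space.test:get{i} == nil then
        log.error('error: row ' .. i .. ' is missing on the new leader')
        return false
    end
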
> test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
> nrs[old_leader_nr] = true
> old_leader_nr = new_leader_nr
> old_leader = new_leader
> + return true
> end;
> | ---
> | ...
> @@ -107,8 +119,13 @@ test_run:cmd('setopt delimiter ""');
> | ---
> | - true
> | ...
> +
> +for i = 1,10 do if not run_iter(i) then break end end
> + | ---
> + | ...
> +
> -- We're connected to some leader.
> -#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
> +#c.space.test:select{} == 10 or log.error(c.space.test:select{})
> | ---
> | - true
> | ...
> diff --git a/test/replication/election_qsync_stress.test.lua b/test/replication/election_qsync_stress.test.lua
> index 0ba15eef7..8b654d063 100644
> --- a/test/replication/election_qsync_stress.test.lua
> +++ b/test/replication/election_qsync_stress.test.lua
> @@ -1,5 +1,6 @@
> test_run = require('test_run').new()
> netbox = require('net.box')
> +log = require('log')
>
> --
> -- gh-1146: Leader election + Qsync
> @@ -47,26 +48,38 @@ _ = c:eval('box.space.test:create_index("pk")')
> -- Insert some data to a synchronous space, then kill the leader before the
> -- confirmation is written. Check successful confirmation on the new leader.
> test_run:cmd('setopt delimiter ";"')
> -for i = 1,10 do
> +function run_iter(i)
> c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
> c.space.test:insert({i}, {is_async=true})
> - test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
> - test_run:cmd('stop server '..old_leader)
> + if not test_run:wait_cond(function() return c.space.test ~= nil
> + and c.space.test:get{i} ~= nil end) then
> + log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
> + return false
> + end
> + test_run:cmd('stop server '..old_leader..' with signal=KILL')
> nrs[old_leader_nr] = false
> - new_leader_nr = get_leader(nrs)
> - new_leader = 'election_replica'..new_leader_nr
> - leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> + local new_leader_nr = get_leader(nrs)
> + local new_leader = 'election_replica'..new_leader_nr
> + local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> c = netbox.connect(leader_port)
> c:eval('box.cfg{replication_synchro_timeout=1000}')
> c.space._schema:replace{'smth'}
> - c.space.test:get{i}
> + if not test_run:wait_cond(function() return c.space.test ~= nil
> + and c.space.test:get{i} ~= nil end) then
> + log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
> + return false
> + end
> test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
> nrs[old_leader_nr] = true
> old_leader_nr = new_leader_nr
> old_leader = new_leader
> + return true
> end;
> test_run:cmd('setopt delimiter ""');
> +
> +for i = 1,10 do if not run_iter(i) then break end end
> +
> -- We're connected to some leader.
> -#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
> +#c.space.test:select{} == 10 or log.error(c.space.test:select{})
>
> test_run:drop_cluster(SERVERS)
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index 34ee32550..c456678c1 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -119,7 +119,7 @@ fragile = {
> },
> "election_qsync_stress.test.lua": {
> "issues": [ "gh-5395" ],
> - "checksums": [ "634bda94accdcdef7b1db3e14f28f445", "36bcdae426c18a60fd13025c09f197d0", "209c865525154a91435c63850f15eca0", "ca38ff2cdfa65b3defb26607b24454c6", "3b6c573d2aa06ee382d6f27b6a76657b", "bcf08e055cf3ccd9958af25d0fba29f8", "411e7462760bc3fc968b68b4b7267ea1", "37e671aea27e3396098f9d13c7fa7316" ]
> + "checksums": [ "0ace688f0a267c44c37a5f904fb5d6ce" ]
> },
> "gh-3711-misc-no-restart-on-same-configuration.test.lua": {
> "issues": [ "gh-5407" ],
--
Serge Petrenko