* [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times
2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc Alexander Turenko
` (3 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
To: tarantool-patches; +Cc: Alexander Turenko
This allows running `./test-run.py -j 1 replication/misc <...>
replication/misc`, which can be useful when debugging a flaky problem.

This ability was broken after 7474c14e ('test: enable cleaning of
a test environment'), because test-run started to clean package.loaded
between runs, so each run of the test calls ffi.cdef() under
require('rlimit'). This ffi.cdef() call defines a structure, so the
second and following attempts to call it raise a Lua error.

This commit does not change anything in regular testing, because each
test runs once (unless stated otherwise in a configuration list).
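For illustration, a minimal Lua sketch of the failure mode and of the
pcall() guard (the struct here is simplified; the module's real cdef
block is visible in the diff below):

    local ffi = require('ffi')

    -- The first definition succeeds.
    ffi.cdef([[struct rlimit { long rlim_cur; long rlim_max; };]])

    -- Repeating it raises a Lua error about redefining the type, so the
    -- module wraps the call in pcall() and ignores the error: the type is
    -- already defined, nothing else is needed.
    local ok, err = pcall(ffi.cdef, [[struct rlimit { long rlim_cur; long rlim_max; };]])
    print(ok, err) -- false, <redefinition error> on the second and later calls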
---
test/replication/lua/rlimit.lua | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/test/replication/lua/rlimit.lua b/test/replication/lua/rlimit.lua
index 46026aea5..de9f86a35 100644
--- a/test/replication/lua/rlimit.lua
+++ b/test/replication/lua/rlimit.lua
@@ -1,6 +1,6 @@
ffi = require('ffi')
-ffi.cdef([[
+pcall(ffi.cdef, [[
typedef long rlim_t;
struct rlimit {
rlim_t rlim_cur; /* Soft limit */
--
2.20.1
* [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc
2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj Alexander Turenko
` (2 subsequent siblings)
4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
To: tarantool-patches; +Cc: Alexander Turenko
All changes are needed to eliminate sporadic failures when the tests are
run with, say, 30 parallel jobs.

First, replication_connect_timeout is increased to 30 seconds. This does
not change the meaning of the test cases.

Second, replication_timeout is increased from 0.01 to 0.03. We usually
set it to 0.1 in tests, but the duration of the gh-3160 test case ('Send
heartbeats if there are changes from a remote master only') is around
100 * replication_timeout seconds, and we don't want to make this test
much longer. Runs of the test case (without the other ones in
replication/misc.test.lua) in 30 parallel jobs show that 0.03 is enough
for the gh-3160 case to pass stably and hopefully enough for the
following test cases too.
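A back-of-the-envelope sketch of the duration trade-off (approximate; the
loop body also does a replace() on each iteration):

    -- The gh-3160 case loops 100 times; each iteration is bounded by one
    -- wait_not_follow() call, i.e. by replication_timeout.
    local iterations = 100
    print(iterations * 0.1)   -- ~10 s with the usual 0.1 timeout
    print(iterations * 0.03)  -- ~3 s with the 0.03 chosen here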
---
test/replication/misc.result | 43 ++++++++--------------------------
test/replication/misc.test.lua | 27 ++++++++-------------
2 files changed, 20 insertions(+), 50 deletions(-)
diff --git a/test/replication/misc.result b/test/replication/misc.result
index ab827c501..a5a322c81 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -100,32 +100,12 @@ SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
---
...
-- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
---
...
test_run:wait_fullmesh(SERVERS)
---
...
-test_run:cmd("switch autobootstrap1")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
-test_run:cmd("switch autobootstrap2")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
test_run:cmd("switch autobootstrap3")
---
- true
@@ -133,10 +113,7 @@ test_run:cmd("switch autobootstrap3")
test_run = require('test_run').new()
---
...
-fiber=require('fiber')
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
---
...
_ = box.schema.space.create('test_timeout'):create_index('pk')
@@ -146,11 +123,11 @@ test_run:cmd("setopt delimiter ';'")
---
- true
...
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
return test_run:wait_cond(function()
return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
- end, 0.01)
-end ;
+ end, box.cfg.replication_timeout)
+end;
---
...
function test_timeout()
@@ -158,16 +135,16 @@ function test_timeout()
local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
local follows = test_run:wait_cond(function()
return replicaA.status == 'follow' or replicaB.status == 'follow'
- end, 0.1)
- if not follows then error('replicas not in follow status') end
- for i = 0, 99 do
+ end)
+ if not follows then error('replicas are not in the follow status') end
+ for i = 0, 99 do
box.space.test_timeout:replace({1})
- if wait_follow(replicaA, replicaB) then
+ if wait_not_follow(replicaA, replicaB) then
return error(box.info.replication)
end
end
return true
-end ;
+end;
---
...
test_run:cmd("setopt delimiter ''");
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index eda5310b6..2ee6b5ac7 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -39,40 +39,33 @@ test_run:cleanup_cluster()
SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
-- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
test_run:wait_fullmesh(SERVERS)
-test_run:cmd("switch autobootstrap1")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
-test_run:cmd("switch autobootstrap2")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
test_run:cmd("switch autobootstrap3")
test_run = require('test_run').new()
-fiber=require('fiber')
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
_ = box.schema.space.create('test_timeout'):create_index('pk')
test_run:cmd("setopt delimiter ';'")
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
return test_run:wait_cond(function()
return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
- end, 0.01)
-end ;
+ end, box.cfg.replication_timeout)
+end;
function test_timeout()
local replicaA = box.info.replication[1].upstream or box.info.replication[2].upstream
local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
local follows = test_run:wait_cond(function()
return replicaA.status == 'follow' or replicaB.status == 'follow'
- end, 0.1)
- if not follows then error('replicas not in follow status') end
- for i = 0, 99 do
+ end)
+ if not follows then error('replicas are not in the follow status') end
+ for i = 0, 99 do
box.space.test_timeout:replace({1})
- if wait_follow(replicaA, replicaB) then
+ if wait_not_follow(replicaA, replicaB) then
return error(box.info.replication)
end
end
return true
-end ;
+end;
test_run:cmd("setopt delimiter ''");
test_timeout()
--
2.20.1
* [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj
2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
2019-04-10 13:28 ` [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes Alexander Turenko
2019-04-10 13:43 ` [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
To: tarantool-patches; +Cc: Alexander Turenko
Needed for running the test suite in parallel.

Use the default replication_connect_timeout (30 seconds) instead of 0.5
seconds. This doesn't change the meaning of the test cases.

Increase replication_timeout from 0.01 to 0.1.

These changes allow the test to be run 100 times in 50 parallel jobs
successfully.
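A minimal sketch of what 'use the default' means here, assuming the
stock Tarantool default of 30 seconds for replication_connect_timeout:

    -- Only replication_timeout is set explicitly; replication_connect_timeout
    -- is left at its default, so it no longer needs to be passed to the
    -- replica script on the command line.
    box.cfg{replication_timeout = 0.1}
    print(box.cfg.replication_timeout)          -- 0.1
    print(box.cfg.replication_connect_timeout)  -- 30 (the default)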
---
test/replication/errinj.result | 8 ++++----
test/replication/errinj.test.lua | 8 ++++----
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 2e7d367c7..f04a38c45 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -408,14 +408,14 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
---
- ok
...
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
---
...
test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
---
- true
...
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
---
- true
...
@@ -471,7 +471,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
...
-- Check replica's ACKs don't prevent the master from sending
-- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
---
- true
...
@@ -489,7 +489,7 @@ box.info.replication[1].upstream.status -- follow
---
- follow
...
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
---
...
box.info.replication[1].upstream.status -- follow
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 32e0be912..53637e248 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -169,10 +169,10 @@ test_run:cmd("stop server replica")
test_run:cmd("cleanup server replica")
errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
test_run:cmd("switch replica_timeout")
fiber = require('fiber')
@@ -198,13 +198,13 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
-- Check replica's ACKs don't prevent the master from sending
-- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
test_run:cmd("switch replica_timeout")
fiber = require('fiber')
while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
box.info.replication[1].upstream.status -- follow
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
box.info.replication[1].upstream.status -- follow
test_run:cmd("switch default")
--
2.20.1
* [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes
2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
` (2 preceding siblings ...)
2019-04-10 13:28 ` [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
2019-04-10 13:43 ` [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
To: tarantool-patches; +Cc: Alexander Tikhonov
From: Alexander Tikhonov <avtikhon@gmail.com>
When a system is under heavy load (say, when tests are run in parallel),
it is possible that disk writes stall for some time. This can cause a
check that a test performs to fail, so now we retry such checks for up
to 60 seconds until the condition is met.

This change targets the replication test suite.
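As an illustration, a minimal sketch of the retry pattern the patch
introduces, assuming test-run's test_run:wait_cond() with its default
timeout (the 60 seconds mentioned above):

    fio = require('fio')
    -- Instead of checking the xlog count once, retry the predicate until it
    -- returns true or wait_cond() gives up; the file list is returned along
    -- with the boolean as a diagnostic, mirroring the patch below.
    test_run:wait_cond(function()
        local files = fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog'))
        return #files == 1, files
    end)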
---
test/replication/gc_no_space.result | 18 ++++++++++--------
test/replication/gc_no_space.test.lua | 18 ++++++++++--------
test/replication/replica_rejoin.result | 10 +++++-----
test/replication/replica_rejoin.test.lua | 6 +++---
test/replication/sync.result | 2 +-
test/replication/sync.test.lua | 2 +-
6 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index b2d3e2075..e860ab00f 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -20,22 +20,24 @@ test_run:cmd("setopt delimiter ';'")
---
- true
...
-function check_file_count(dir, glob, count)
- local files = fio.glob(fio.pathjoin(dir, glob))
- if #files == count then
- return true
- end
- return false, files
+function wait_file_count(dir, glob, count)
+ return test_run:wait_cond(function()
+ local files = fio.glob(fio.pathjoin(dir, glob))
+ if #files == count then
+ return true
+ end
+ return false, files
+ end)
end;
---
...
function check_wal_count(count)
- return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+ return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
end;
---
...
function check_snap_count(count)
- return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+ return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
end;
---
...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index 6940996fe..98ccd401b 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -11,18 +11,20 @@ fio = require('fio')
errinj = box.error.injection
test_run:cmd("setopt delimiter ';'")
-function check_file_count(dir, glob, count)
- local files = fio.glob(fio.pathjoin(dir, glob))
- if #files == count then
- return true
- end
- return false, files
+function wait_file_count(dir, glob, count)
+ return test_run:wait_cond(function()
+ local files = fio.glob(fio.pathjoin(dir, glob))
+ if #files == count then
+ return true
+ end
+ return false, files
+ end)
end;
function check_wal_count(count)
- return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+ return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
end;
function check_snap_count(count)
- return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+ return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
end;
test_run:cmd("setopt delimiter ''");
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 87d626e20..0a617c314 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -102,9 +102,9 @@ _ = box.space.test:insert{30}
fio = require('fio')
---
...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
---
-- 1
+- true
...
box.cfg{checkpoint_count = checkpoint_count}
---
@@ -203,9 +203,9 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
fio = require('fio')
---
...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
---
-- 1
+- true
...
box.cfg{checkpoint_count = checkpoint_count}
---
@@ -330,7 +330,7 @@ box.cfg{checkpoint_count = default_checkpoint_count}
fio = require('fio')
---
...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
---
- true
...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 9bf43eff8..603ef4d15 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -40,7 +40,7 @@ box.snapshot()
_ = box.space.test:delete{3}
_ = box.space.test:insert{30}
fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
box.cfg{checkpoint_count = checkpoint_count}
-- Restart the replica. Since xlogs have been removed,
@@ -76,7 +76,7 @@ for i = 1, 3 do box.space.test:delete{i * 10} end
box.snapshot()
for i = 1, 3 do box.space.test:insert{i * 100} end
fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
box.cfg{checkpoint_count = checkpoint_count}
test_run:cmd("start server replica")
test_run:cmd("switch replica")
@@ -121,7 +121,7 @@ box.cfg{checkpoint_count = 1}
box.snapshot()
box.cfg{checkpoint_count = default_checkpoint_count}
fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
-- Bump vclock on the replica again.
test_run:cmd("switch replica")
for i = 1, 10 do box.space.test:replace{2} end
diff --git a/test/replication/sync.result b/test/replication/sync.result
index b34501dae..eddc7cbc8 100644
--- a/test/replication/sync.result
+++ b/test/replication/sync.result
@@ -298,7 +298,7 @@ box.info.replication[1].upstream.status -- follow
---
- follow
...
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
---
- 'ER_CFG: Incorrect value for option ''replication'': duplicate connection with the
same replica UUID'
diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
index cae97a26f..52ce88fe2 100644
--- a/test/replication/sync.test.lua
+++ b/test/replication/sync.test.lua
@@ -154,7 +154,7 @@ box.cfg{replication = replication}
box.info.status -- running
box.info.ro -- false
box.info.replication[1].upstream.status -- follow
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
test_run:cmd("switch default")
test_run:cmd("stop server replica")
--
2.20.1
* [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run ***
2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
` (3 preceding siblings ...)
2019-04-10 13:28 ` [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes Alexander Turenko
@ 2019-04-10 13:43 ` Alexander Turenko
4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:43 UTC (permalink / raw)
To: tarantool-patches; +Cc: Alexander V. Tikhonov
Pushed to master and 2.1.
WBR, Alexander Turenko.
On Wed, Apr 10, 2019 at 04:28:41PM +0300, Alexander Turenko wrote:
> This patchset eliminates some of the flaky failures observed when tests
> are run in parallel. It increases replication_connect_timeout from 0.5
> to 30 seconds and increases replication_timeout from 0.01 to 0.03
> (where we wait for replication to stop) or to 0.1 (where it should not
> affect the duration of a test).
>
> It also eliminates problems when a write to an xlog/snap/log file
> stalls for some time because of system load (say, many writes to disk
> from other tests): waiting for the expected changes was added.
>
> I filed https://github.com/tarantool/tarantool/issues/4129 re
> rewriting replication/sync.test.lua, because it seems that we have no
> easy way to make it stable with the current approach, which slows down
> sending rows from the relay. Proposed to stop the applier at a certain
> LSN instead.
>
> This patchset does not fix all problems with running the replication/
> test suite in parallel, but it fixes some of them.
>
> no issue
> https://github.com/tarantool/tarantool/tree/Totktonada/test-replication-fix-flaky-fails
>
> Alexander Tikhonov (1):
> test: wait for xlog/snap/log file changes
>
> Alexander Turenko (3):
> test: allow to run replication/misc multiple times
> test: increase timeouts in replication/misc
> test: increase timeouts in replication/errinj
>
> test/replication/errinj.result | 8 ++---
> test/replication/errinj.test.lua | 8 ++---
> test/replication/gc_no_space.result | 18 +++++-----
> test/replication/gc_no_space.test.lua | 18 +++++-----
> test/replication/lua/rlimit.lua | 2 +-
> test/replication/misc.result | 43 ++++++------------------
> test/replication/misc.test.lua | 27 ++++++---------
> test/replication/replica_rejoin.result | 10 +++---
> test/replication/replica_rejoin.test.lua | 6 ++--
> test/replication/sync.result | 2 +-
> test/replication/sync.test.lua | 2 +-
> 11 files changed, 59 insertions(+), 85 deletions(-)
>
> --
> 2.20.1
>