Tarantool development patches archive
* [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run ***
@ 2019-04-10 13:28 Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times Alexander Turenko
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander Turenko

This patchset eliminates some of the flaky failures observed when tests
are run in parallel. It increases replication_connect_timeout from 0.5
to 30 seconds and increases replication_timeout from 0.01 to 0.03 (where
we wait for replication to stop) or to 0.1 (where it should not affect
the duration of a test).

It also eliminates problems where a write to an xlog/snap/log file
stalls for some time because of system load (say, many writes to a disk
from other tests): the affected checks now wait for the expected
changes, along the lines of the sketch below.
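
A minimal sketch of the polling approach (the helper name and the
60-second default here are illustrative; the actual patches use
test-run's wait_cond, as shown in patch 4):

    fiber = require('fiber')
    fio = require('fio')

    -- Retry a file-count check until it holds or the deadline passes.
    function wait_file_count(dir, glob, count, timeout)
        local deadline = fiber.clock() + (timeout or 60)
        while fiber.clock() < deadline do
            local files = fio.glob(fio.pathjoin(dir, glob))
            if #files == count then
                return true
            end
            fiber.sleep(0.1)
        end
        return false
    end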

I filed https://github.com/tarantool/tarantool/issues/4129 about
rewriting replication/sync.test.lua, because it seems we have no easy
way to make it stable with the current approach, which slows down
sending rows from the relay. I proposed stopping the applier at a
certain LSN instead.

This patchset does not fix all problems with running the replication/
test suite in parallel, but it fixes some of them.

no issue
https://github.com/tarantool/tarantool/tree/Totktonada/test-replication-fix-flaky-fails

Alexander Tikhonov (1):
  test: wait for xlog/snap/log file changes

Alexander Turenko (3):
  test: allow to run replication/misc multiple times
  test: increase timeouts in replication/misc
  test: increase timeouts in replication/errinj

 test/replication/errinj.result           |  8 ++---
 test/replication/errinj.test.lua         |  8 ++---
 test/replication/gc_no_space.result      | 18 +++++-----
 test/replication/gc_no_space.test.lua    | 18 +++++-----
 test/replication/lua/rlimit.lua          |  2 +-
 test/replication/misc.result             | 43 ++++++------------------
 test/replication/misc.test.lua           | 27 ++++++---------
 test/replication/replica_rejoin.result   | 10 +++---
 test/replication/replica_rejoin.test.lua |  6 ++--
 test/replication/sync.result             |  2 +-
 test/replication/sync.test.lua           |  2 +-
 11 files changed, 59 insertions(+), 85 deletions(-)

-- 
2.20.1

* [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times
  2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc Alexander Turenko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander Turenko

This allows running `./test-run.py -j 1 replication/misc <...>
replication/misc`, which can be useful when debugging a flaky problem.

This ability was broken after 7474c14e ('test: enable cleaning of a test
environment'), because test-run started to clean package.loaded between
runs, so each run of the test calls ffi.cdef() under require('rlimit').
This ffi.cdef() call defines a structure, so the second and subsequent
calls raise a Lua error; the failure mode is illustrated below.

This commit does not change anything in regular testing, because each
test runs once (unless stated otherwise in a configuration list).
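
A minimal reproduction of the error and the pcall() workaround (the
struct name here is hypothetical, chosen for illustration):

    ffi = require('ffi')

    -- The first definition succeeds.
    ffi.cdef([[struct my_rlimit { long cur; long max; };]])

    -- Repeating the same cdef raises "attempt to redefine 'my_rlimit'".
    -- Wrapping it in pcall() turns the hard error into an ignorable
    -- failure, so the module can be require()'d again after a cleanup.
    ok, err = pcall(ffi.cdef, [[struct my_rlimit { long cur; long max; };]])
    -- ok is false on the repeated call; err holds the redefinition error.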
---
 test/replication/lua/rlimit.lua | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/replication/lua/rlimit.lua b/test/replication/lua/rlimit.lua
index 46026aea5..de9f86a35 100644
--- a/test/replication/lua/rlimit.lua
+++ b/test/replication/lua/rlimit.lua
@@ -1,6 +1,6 @@
 
 ffi = require('ffi')
-ffi.cdef([[
+pcall(ffi.cdef, [[
 typedef long rlim_t;
 struct rlimit {
     rlim_t rlim_cur;  /* Soft limit */
-- 
2.20.1

* [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc
  2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj Alexander Turenko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander Turenko

All of these changes are needed to eliminate sporadic failures when
testing runs with, say, 30 parallel jobs.

First, replication_connect_timeout is increased to 30 seconds. This
parameter change doesn't alter the meaning of the test cases.

Second, replication_timeout is increased from 0.01 to 0.03. We usually
set it to 0.1 in tests, but the duration of the gh-3160 test case ('Send
heartbeats if there are changes from a remote master only') is around
100 * replication_timeout seconds, and we don't want to make this test
much longer; the arithmetic is sketched below. Runs of that test case
(without the other cases in replication/misc.test.lua) in 30 parallel
jobs show that 0.03 is enough for the gh-3160 case to pass stably and,
hopefully, enough for the following test cases too.
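
A back-of-the-envelope estimate of the trade-off (the loop comes from
the test case below; each of its 100 iterations waits up to
replication_timeout for a 'not follow' status):

    -- Worst-case duration of the gh-3160 loop for two timeout values.
    iterations = 100
    print(iterations * 0.03) -- ~3 seconds with the new value
    print(iterations * 0.1)  -- ~10 seconds with the usual test value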
---
 test/replication/misc.result   | 43 ++++++++--------------------------
 test/replication/misc.test.lua | 27 ++++++++-------------
 2 files changed, 20 insertions(+), 50 deletions(-)

diff --git a/test/replication/misc.result b/test/replication/misc.result
index ab827c501..a5a322c81 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -100,32 +100,12 @@ SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
 ---
 ...
-test_run:cmd("switch autobootstrap1")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
-test_run:cmd("switch autobootstrap2")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
 test_run:cmd("switch autobootstrap3")
 ---
 - true
@@ -133,10 +113,7 @@ test_run:cmd("switch autobootstrap3")
 test_run = require('test_run').new()
 ---
 ...
-fiber=require('fiber')
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
 ---
 ...
 _ = box.schema.space.create('test_timeout'):create_index('pk')
@@ -146,11 +123,11 @@ test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
     return test_run:wait_cond(function()
         return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
-    end, 0.01)
-end ;
+    end, box.cfg.replication_timeout)
+end;
 ---
 ...
 function test_timeout()
@@ -158,16 +135,16 @@ function test_timeout()
     local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
     local follows = test_run:wait_cond(function()
         return replicaA.status == 'follow' or replicaB.status == 'follow'
-    end, 0.1)
-    if not follows then error('replicas not in follow status') end
-    for i = 0, 99 do 
+    end)
+    if not follows then error('replicas are not in the follow status') end
+    for i = 0, 99 do
         box.space.test_timeout:replace({1})
-        if wait_follow(replicaA, replicaB) then
+        if wait_not_follow(replicaA, replicaB) then
             return error(box.info.replication)
         end
     end
     return true
-end ;
+end;
 ---
 ...
 test_run:cmd("setopt delimiter ''");
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index eda5310b6..2ee6b5ac7 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -39,40 +39,33 @@ test_run:cleanup_cluster()
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
 test_run:wait_fullmesh(SERVERS)
-test_run:cmd("switch autobootstrap1")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
-test_run:cmd("switch autobootstrap2")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
 test_run:cmd("switch autobootstrap3")
 test_run = require('test_run').new()
-fiber=require('fiber')
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
 _ = box.schema.space.create('test_timeout'):create_index('pk')
 test_run:cmd("setopt delimiter ';'")
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
     return test_run:wait_cond(function()
         return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
-    end, 0.01)
-end ;
+    end, box.cfg.replication_timeout)
+end;
 function test_timeout()
     local replicaA = box.info.replication[1].upstream or box.info.replication[2].upstream
     local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
     local follows = test_run:wait_cond(function()
         return replicaA.status == 'follow' or replicaB.status == 'follow'
-    end, 0.1)
-    if not follows then error('replicas not in follow status') end
-    for i = 0, 99 do 
+    end)
+    if not follows then error('replicas are not in the follow status') end
+    for i = 0, 99 do
         box.space.test_timeout:replace({1})
-        if wait_follow(replicaA, replicaB) then
+        if wait_not_follow(replicaA, replicaB) then
             return error(box.info.replication)
         end
     end
     return true
-end ;
+end;
 test_run:cmd("setopt delimiter ''");
 test_timeout()
 
-- 
2.20.1

* [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj
  2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 1/4] test: allow to run replication/misc multiple times Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 2/4] test: increase timeouts in replication/misc Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes Alexander Turenko
  2019-04-10 13:43 ` [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander Turenko

Needed for parallel running of the test suite.

Use the default replication_connect_timeout (30 seconds) instead of 0.5
seconds. This doesn't change the meaning of the test cases.

Increase replication_timeout from 0.01 to 0.1.

These changes allow running the test 100 times in 50 parallel jobs
successfully. The change from hard-coded sleeps to box.cfg-derived ones
is sketched below.
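
A sketch of the loop as changed in the hunks below: the sleep now reads
box.cfg.replication_timeout instead of a hard-coded 0.01, so it keeps
tracking the configured value:

    fiber = require('fiber')

    -- Poll the upstream status, sleeping one replication timeout per
    -- iteration; stop early if the replica leaves the 'follow' state.
    for i = 0, 15 do
        fiber.sleep(box.cfg.replication_timeout)
        if box.info.replication[1].upstream.status ~= 'follow' then
            break
        end
    end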
---
 test/replication/errinj.result   | 8 ++++----
 test/replication/errinj.test.lua | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 2e7d367c7..f04a38c45 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -408,14 +408,14 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 ---
 - ok
 ...
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
 ---
 ...
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
 ---
 - true
 ...
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 ---
 - true
 ...
@@ -471,7 +471,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ...
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 ---
 - true
 ...
@@ -489,7 +489,7 @@ box.info.replication[1].upstream.status -- follow
 ---
 - follow
 ...
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
 ---
 ...
 box.info.replication[1].upstream.status -- follow
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 32e0be912..53637e248 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -169,10 +169,10 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
 
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -198,13 +198,13 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
 
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
 while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
 box.info.replication[1].upstream.status -- follow
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
 box.info.replication[1].upstream.status -- follow
 
 test_run:cmd("switch default")
-- 
2.20.1

* [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes
  2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
                   ` (2 preceding siblings ...)
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 3/4] test: increase timeouts in replication/errinj Alexander Turenko
@ 2019-04-10 13:28 ` Alexander Turenko
  2019-04-10 13:43 ` [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:28 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander Tikhonov

From: Alexander Tikhonov <avtikhon@gmail.com>

When a system is under heavy load (say, when tests are run in parallel),
disk writes may stall for some time. This can make a check that a test
performs fail, so now we retry such checks for up to 60 seconds until
the condition is met.

This change targets the replication test suite; the common pattern is
sketched below.
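
The pattern, extracted from the hunks below (the 60-second limit comes
from this commit message; it is assumed here to be wait_cond's default
timeout):

    test_run = require('test_run').new()
    fio = require('fio')

    -- Wrap a one-shot check in wait_cond so a stalled disk write only
    -- delays the test instead of failing it.
    test_run:wait_cond(function()
        return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
    end)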
---
 test/replication/gc_no_space.result      | 18 ++++++++++--------
 test/replication/gc_no_space.test.lua    | 18 ++++++++++--------
 test/replication/replica_rejoin.result   | 10 +++++-----
 test/replication/replica_rejoin.test.lua |  6 +++---
 test/replication/sync.result             |  2 +-
 test/replication/sync.test.lua           |  2 +-
 6 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index b2d3e2075..e860ab00f 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -20,22 +20,24 @@ test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
-function check_file_count(dir, glob, count)
-    local files = fio.glob(fio.pathjoin(dir, glob))
-    if #files == count then
-        return true
-    end
-    return false, files
+function wait_file_count(dir, glob, count)
+    return test_run:wait_cond(function()
+        local files = fio.glob(fio.pathjoin(dir, glob))
+        if #files == count then
+            return true
+        end
+        return false, files
+    end)
 end;
 ---
 ...
 function check_wal_count(count)
-    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+    return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
 end;
 ---
 ...
 function check_snap_count(count)
-    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+    return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
 end;
 ---
 ...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index 6940996fe..98ccd401b 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -11,18 +11,20 @@ fio = require('fio')
 errinj = box.error.injection
 
 test_run:cmd("setopt delimiter ';'")
-function check_file_count(dir, glob, count)
-    local files = fio.glob(fio.pathjoin(dir, glob))
-    if #files == count then
-        return true
-    end
-    return false, files
+function wait_file_count(dir, glob, count)
+    return test_run:wait_cond(function()
+        local files = fio.glob(fio.pathjoin(dir, glob))
+        if #files == count then
+            return true
+        end
+        return false, files
+    end)
 end;
 function check_wal_count(count)
-    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+    return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
 end;
 function check_snap_count(count)
-    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+    return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
 end;
 test_run:cmd("setopt delimiter ''");
 
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 87d626e20..0a617c314 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -102,9 +102,9 @@ _ = box.space.test:insert{30}
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
-- 1
+- true
 ...
 box.cfg{checkpoint_count = checkpoint_count}
 ---
@@ -203,9 +203,9 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
-- 1
+- true
 ...
 box.cfg{checkpoint_count = checkpoint_count}
 ---
@@ -330,7 +330,7 @@ box.cfg{checkpoint_count = default_checkpoint_count}
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
 - true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 9bf43eff8..603ef4d15 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -40,7 +40,7 @@ box.snapshot()
 _ = box.space.test:delete{3}
 _ = box.space.test:insert{30}
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
@@ -76,7 +76,7 @@ for i = 1, 3 do box.space.test:delete{i * 10} end
 box.snapshot()
 for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
 test_run:cmd("start server replica")
 test_run:cmd("switch replica")
@@ -121,7 +121,7 @@ box.cfg{checkpoint_count = 1}
 box.snapshot()
 box.cfg{checkpoint_count = default_checkpoint_count}
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 -- Bump vclock on the replica again.
 test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
diff --git a/test/replication/sync.result b/test/replication/sync.result
index b34501dae..eddc7cbc8 100644
--- a/test/replication/sync.result
+++ b/test/replication/sync.result
@@ -298,7 +298,7 @@ box.info.replication[1].upstream.status -- follow
 ---
 - follow
 ...
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
 ---
 - 'ER_CFG: Incorrect value for option ''replication'': duplicate connection with the
   same replica UUID'
diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
index cae97a26f..52ce88fe2 100644
--- a/test/replication/sync.test.lua
+++ b/test/replication/sync.test.lua
@@ -154,7 +154,7 @@ box.cfg{replication = replication}
 box.info.status -- running
 box.info.ro -- false
 box.info.replication[1].upstream.status -- follow
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
 
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
-- 
2.20.1

* [tarantool-patches] Re: [PATCH 0/4] *** test: replication/ fixes for parallel run ***
  2019-04-10 13:28 [tarantool-patches] [PATCH 0/4] *** test: replication/ fixes for parallel run *** Alexander Turenko
                   ` (3 preceding siblings ...)
  2019-04-10 13:28 ` [tarantool-patches] [PATCH 4/4] test: wait for xlog/snap/log file changes Alexander Turenko
@ 2019-04-10 13:43 ` Alexander Turenko
  4 siblings, 0 replies; 6+ messages in thread
From: Alexander Turenko @ 2019-04-10 13:43 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Alexander V. Tikhonov

Pushed to master and 2.1.

WBR, Alexander Turenko.
