Tarantool development patches archive
* [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests
@ 2019-11-26  6:21 Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites Alexander V. Tikhonov
                   ` (11 more replies)
  0 siblings, 12 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

Part of #2436, #3232

(cherry picked from commit 4d47162d3c36c5aa0fbffc8f1833fc4d115b1ed0)
---
 test/xlog/suite.ini | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/xlog/suite.ini b/test/xlog/suite.ini
index dfadc48bc..df54537ff 100644
--- a/test/xlog/suite.ini
+++ b/test/xlog/suite.ini
@@ -9,4 +9,4 @@ config = suite.cfg
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 long_run = snap_io_rate.test.lua
-is_parallel = False
+is_parallel = True
-- 
2.17.1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 03/12] test: replication parallel mode on Alexander V. Tikhonov
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

Part of #3232

(cherry picked from commit e01c58fa097302570b377211f5b392a82fbda1af)
---
 test/engine_long/suite.ini | 2 +-
 test/long_run-py/suite.ini | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/engine_long/suite.ini b/test/engine_long/suite.ini
index 43f1c97fa..0ebb7c9f8 100644
--- a/test/engine_long/suite.ini
+++ b/test/engine_long/suite.ini
@@ -7,4 +7,4 @@ lua_libs = suite.lua
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 config = engine.cfg
-is_parallel = False
+is_parallel = True
diff --git a/test/long_run-py/suite.ini b/test/long_run-py/suite.ini
index 9385cf28d..110bbb548 100644
--- a/test/long_run-py/suite.ini
+++ b/test/long_run-py/suite.ini
@@ -8,4 +8,4 @@ release_disabled =
 lua_libs = suite.lua
 use_unix_sockets = True
 use_unix_sockets_iproto = True
-is_parallel = False
+is_parallel = True
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 03/12] test: replication parallel mode on
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 04/12] test: enable cleaning of a test environment Alexander V. Tikhonov
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

Part of #2436, #3232

(cherry picked from commit f5c8b825cf194ee9bf927a1a727adfbabe1a354e)
---
 test/replication/suite.ini | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index e3d932990..d817f81a8 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -9,7 +9,7 @@ lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 long_run = prune.test.lua
-is_parallel = False
+is_parallel = True
 fragile = errinj.test.lua            ; gh-3870
           join_vclock.test.lua       ; gh-4160
           long_row_timeout.test.lua  ; gh-4351
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 04/12] test: enable cleaning of a test environment
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 03/12] test: replication parallel mode on Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 05/12] test: allow to run replication/misc multiple times Alexander V. Tikhonov
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

From: Alexander Turenko <alexander.turenko@tarantool.org>

This commit enables the pretest_clean test-run option on 'core = tarantool'
test suites with Lua tests and on 'core = app' test suites. See #4094
for an example of a problem that is eliminated by this option.

For 'core = tarantool': this option drops non-system spaces, resets data
in system spaces and global variables to the initial state, and unloads
packages except built-in ones.

For 'core = app': this option deletes xlog and snap files before running
a test.

test-run doesn't remove global variables that are listed in the
'protected_globals' global variable. Use it, say, for functions that are
defined in an instance file and called from tests.
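
An instance file might protect its helpers like this (a minimal sketch;
`my_helper` is a hypothetical name, not taken from the patch):

```lua
-- Hypothetical instance-file fragment: helpers listed in
-- protected_globals survive test-run's pretest_clean between tests.
function my_helper()  -- defined globally so tests can call it
    return 42
end

_G.protected_globals = {'my_helper'}
```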

See test-run/README.md for details on how exactly the option works.

Also removed the unused cfg_filter() function from test/engine/box.lua.

Fixes #4094.

(cherry picked from commit 7474c14e9cdb4bae3ab073d375aa11e628e13aa5)
---
 test/app-tap/suite.ini     |  1 +
 test/app/suite.ini         |  1 +
 test/box-tap/suite.ini     |  1 +
 test/box/box.lua           |  4 +++-
 test/box/suite.ini         |  1 +
 test/engine/box.lua        | 21 ---------------------
 test/engine/suite.ini      |  1 +
 test/engine_long/suite.ini |  1 +
 test/engine_long/suite.lua |  5 +++--
 test/replication/suite.ini |  1 +
 test/vinyl/suite.ini       |  1 +
 test/wal_off/suite.ini     |  1 +
 test/xlog/suite.ini        |  1 +
 13 files changed, 16 insertions(+), 24 deletions(-)

diff --git a/test/app-tap/suite.ini b/test/app-tap/suite.ini
index 86af82637..9629dfad5 100644
--- a/test/app-tap/suite.ini
+++ b/test/app-tap/suite.ini
@@ -3,3 +3,4 @@ core = app
 description = application server tests (TAP)
 lua_libs = lua/require_mod.lua lua/serializer_test.lua
 is_parallel = True
+pretest_clean = True
diff --git a/test/app/suite.ini b/test/app/suite.ini
index 134cba6d8..79432e29a 100644
--- a/test/app/suite.ini
+++ b/test/app/suite.ini
@@ -6,4 +6,5 @@ lua_libs = lua/fiber.lua
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 is_parallel = True
+pretest_clean = True
 fragile = socket.test.lua ; gh-4426 gh-4451
diff --git a/test/box-tap/suite.ini b/test/box-tap/suite.ini
index d00e93e72..8d9e32d3f 100644
--- a/test/box-tap/suite.ini
+++ b/test/box-tap/suite.ini
@@ -2,5 +2,6 @@
 core = app
 description = Database tests with #! using TAP
 is_parallel = True
+pretest_clean = True
 fragile = cfg.test.lua     ; gh-4344
           key_def.test.lua ; gh-4252
diff --git a/test/box/box.lua b/test/box/box.lua
index b3b10ffd4..2a8e0e4fa 100644
--- a/test/box/box.lua
+++ b/test/box/box.lua
@@ -29,7 +29,7 @@ function cfg_filter(data)
     return result
 end
 
-function compare(a,b)
+local function compare(a,b)
     return a[1] < b[1]
 end
 
@@ -37,3 +37,5 @@ function sorted(data)
     table.sort(data, compare)
     return data
 end
+
+_G.protected_globals = {'cfg_filter', 'sorted'}
diff --git a/test/box/suite.ini b/test/box/suite.ini
index d636c45b3..6e8508188 100644
--- a/test/box/suite.ini
+++ b/test/box/suite.ini
@@ -8,6 +8,7 @@ lua_libs = lua/fifo.lua lua/utils.lua lua/bitset.lua lua/index_random_test.lua l
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 is_parallel = True
+pretest_clean = True
 fragile = bitset.test.lua      ; gh-4095
           func_reload.test.lua ; gh-4425
           function1.test.lua   ; gh-4199
diff --git a/test/engine/box.lua b/test/engine/box.lua
index c68bd626c..b1a379daf 100644
--- a/test/engine/box.lua
+++ b/test/engine/box.lua
@@ -21,24 +21,3 @@ box.cfg{
 }
 
 require('console').listen(os.getenv('ADMIN'))
-
-_to_exclude = {
-    'pid_file', 'log', 'vinyl_dir',
-    'memtx_dir', 'wal_dir',
-    'memtx_min_tuple_size', 'memtx_max_tuple_size'
-}
-
-_exclude = {}
-for _, f in pairs(_to_exclude) do
-    _exclude[f] = 1
-end
-
-function cfg_filter(data)
-    local result = {}
-    for field, val in pairs(data) do
-        if _exclude[field] == nil then
-            result[field] = val
-        end
-    end
-    return result
-end
diff --git a/test/engine/suite.ini b/test/engine/suite.ini
index 7edd49ddc..7c260eea0 100644
--- a/test/engine/suite.ini
+++ b/test/engine/suite.ini
@@ -9,5 +9,6 @@ config = engine.cfg
 #disabled = replica_join.test.lua
 lua_libs = conflict.lua ../box/lua/utils.lua ../box/lua/push.lua
 is_parallel = True
+pretest_clean = True
 fragile = ddl.test.lua         ; gh-4353
           recover_wal.test.lua ; gh-3767
diff --git a/test/engine_long/suite.ini b/test/engine_long/suite.ini
index 0ebb7c9f8..97d869042 100644
--- a/test/engine_long/suite.ini
+++ b/test/engine_long/suite.ini
@@ -8,3 +8,4 @@ use_unix_sockets = True
 use_unix_sockets_iproto = True
 config = engine.cfg
 is_parallel = True
+pretest_clean = True
diff --git a/test/engine_long/suite.lua b/test/engine_long/suite.lua
index 464138db1..9ac2bff9f 100644
--- a/test/engine_long/suite.lua
+++ b/test/engine_long/suite.lua
@@ -1,5 +1,4 @@
-
-function string_function()
+local function string_function()
     local random_number
     local random_string
     random_string = ""
@@ -107,3 +106,5 @@ function delete_insert(engine_name, iterations)
     box.space.tester:drop()
     return {counter, string_value_2}
 end
+
+_G.protected_globals = {'delete_replace_update', 'delete_insert'}
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index d817f81a8..15dd05d25 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -10,6 +10,7 @@ use_unix_sockets = True
 use_unix_sockets_iproto = True
 long_run = prune.test.lua
 is_parallel = True
+pretest_clean = True
 fragile = errinj.test.lua            ; gh-3870
           join_vclock.test.lua       ; gh-4160
           long_row_timeout.test.lua  ; gh-4351
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index 90e326b46..1417d7156 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -9,6 +9,7 @@ use_unix_sockets = True
 use_unix_sockets_iproto = True
 long_run = stress.test.lua large.test.lua write_iterator_rand.test.lua dump_stress.test.lua select_consistency.test.lua throttle.test.lua
 is_parallel = False
+pretest_clean = True
 fragile = errinj.test.lua             ; gh-4346
           select_consistency.test.lua ; gh-4385
           throttle.test.lua           ; gh-4168
diff --git a/test/wal_off/suite.ini b/test/wal_off/suite.ini
index 10a02e999..14f531df7 100644
--- a/test/wal_off/suite.ini
+++ b/test/wal_off/suite.ini
@@ -5,4 +5,5 @@ description = tarantool/box, wal_mode = none
 use_unix_sockets = True
 use_unix_sockets_iproto = True
 is_parallel = True
+pretest_clean = True
 fragile = iterator_lt_gt.test.lua ; gh-3925
diff --git a/test/xlog/suite.ini b/test/xlog/suite.ini
index df54537ff..babf625ae 100644
--- a/test/xlog/suite.ini
+++ b/test/xlog/suite.ini
@@ -10,3 +10,4 @@ use_unix_sockets = True
 use_unix_sockets_iproto = True
 long_run = snap_io_rate.test.lua
 is_parallel = True
+pretest_clean = True
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 05/12] test: allow to run replication/misc multiple times
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (2 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 04/12] test: enable cleaning of a test environment Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 06/12] test: increase timeouts in replication/errinj Alexander V. Tikhonov
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

From: Alexander Turenko <alexander.turenko@tarantool.org>

This allows running `./test-run.py -j 1 replication/misc <...>
replication/misc`, which can be useful when debugging a flaky problem.

This ability was broken after 7474c14e ('test: enable cleaning of
a test environment'), because test-run started cleaning package.loaded
between runs, so each run of the test calls ffi.cdef() under
require('rlimit'). This ffi.cdef() call defines a structure, so the
second and following calls to ffi.cdef() raise a Lua error.

This commit does not change anything in regular testing, because each
test runs once (unless stated otherwise in a configuration list).
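
The fix applies a common guard pattern: wrapping ffi.cdef() in pcall()
makes the definition idempotent. A minimal sketch (assumes LuaJIT's ffi;
the struct name is illustrative, not the one from rlimit.lua):

```lua
local ffi = require('ffi')

local decl = 'struct my_limit { long cur; long max; };'

-- The first definition succeeds; repeating the same cdef raises a
-- redefinition error, but pcall() keeps it from propagating.
local first = pcall(ffi.cdef, decl)
local second = pcall(ffi.cdef, decl)
```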

(cherry picked from commit 7a2c31d39b3753fdee41424cdf17dfad396b2d3d)
---
 test/replication/lua/rlimit.lua | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test/replication/lua/rlimit.lua b/test/replication/lua/rlimit.lua
index 46026aea5..de9f86a35 100644
--- a/test/replication/lua/rlimit.lua
+++ b/test/replication/lua/rlimit.lua
@@ -1,6 +1,6 @@
 
 ffi = require('ffi')
-ffi.cdef([[
+pcall(ffi.cdef, [[
 typedef long rlim_t;
 struct rlimit {
     rlim_t rlim_cur;  /* Soft limit */
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 06/12] test: increase timeouts in replication/errinj
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (3 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 05/12] test: allow to run replication/misc multiple times Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 07/12] test: wait for xlog/snap/log file changes Alexander V. Tikhonov
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

From: Alexander Turenko <alexander.turenko@tarantool.org>

Needed for parallel running of the test suite.

Use the default replication_connect_timeout (30 seconds) instead of 0.5
seconds. This does not change the meaning of the test cases.

Increase replication_timeout from 0.01 to 0.1.

These changes allow the test to be run 100 times in 50 parallel jobs
successfully.

(cherry picked from commit e257eb27b95c2d3c1cb0d299b4bd35afa17525fe)
---
 test/replication/errinj.result   | 8 ++++----
 test/replication/errinj.test.lua | 8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 2e7d367c7..f04a38c45 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -408,14 +408,14 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 ---
 - ok
 ...
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
 ---
 ...
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
 ---
 - true
 ...
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 ---
 - true
 ...
@@ -471,7 +471,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ...
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 ---
 - true
 ...
@@ -489,7 +489,7 @@ box.info.replication[1].upstream.status -- follow
 ---
 - follow
 ...
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
 ---
 ...
 box.info.replication[1].upstream.status -- follow
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 32e0be912..53637e248 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -169,10 +169,10 @@ test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 
-box.cfg{replication_timeout = 0.01}
+box.cfg{replication_timeout = 0.1}
 
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -198,13 +198,13 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
 
-test_run:cmd("start server replica_timeout with args='0.009 0.5'")
+test_run:cmd("start server replica_timeout with args='0.1'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
 while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
 box.info.replication[1].upstream.status -- follow
-for i = 0, 15 do fiber.sleep(0.01) if box.info.replication[1].upstream.status ~= 'follow' then break end end
+for i = 0, 15 do fiber.sleep(box.cfg.replication_timeout) if box.info.replication[1].upstream.status ~= 'follow' then break end end
 box.info.replication[1].upstream.status -- follow
 
 test_run:cmd("switch default")
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 07/12] test: wait for xlog/snap/log file changes
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (4 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 06/12] test: increase timeouts in replication/errinj Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 08/12] test: use wait_cond to check follow status Alexander V. Tikhonov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Alexander Tikhonov, tarantool-patches

From: Alexander Tikhonov <avtikhon@gmail.com>

When a system is under heavy load (say, when tests are run in parallel),
disk writes may stall for some time. This can make a check that a test
performs fail, so such checks are now retried for up to 60 seconds until
the condition is met.

This change targets the replication test suite.
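
The retry loop used here follows the shape of test-run's wait_cond; a
simplified, self-contained sketch (the 60-second default mirrors the
commit message; the function body is illustrative, not test-run's
actual implementation):

```lua
-- Re-evaluate cond() until it returns true or the timeout expires.
local function wait_cond(cond, timeout)
    local deadline = os.time() + (timeout or 60)
    repeat
        if cond() then
            return true
        end
    until os.time() >= deadline
    return false
end
```

Judging by the diff below, the real wait_cond also passes through extra
return values from the condition (here, the file list) for diagnostics.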

(cherry picked from commit def75c88fce61d27baf62aabfc7a8e5b4126c73a)
---
 test/replication/gc_no_space.result      | 18 ++++++++++--------
 test/replication/gc_no_space.test.lua    | 18 ++++++++++--------
 test/replication/replica_rejoin.result   | 10 +++++-----
 test/replication/replica_rejoin.test.lua |  6 +++---
 test/replication/sync.result             |  2 +-
 test/replication/sync.test.lua           |  2 +-
 6 files changed, 30 insertions(+), 26 deletions(-)

diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index 8e663cdf0..a7c7203c4 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -20,22 +20,24 @@ test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
-function check_file_count(dir, glob, count)
-    local files = fio.glob(fio.pathjoin(dir, glob))
-    if #files == count then
-        return true
-    end
-    return false, files
+function wait_file_count(dir, glob, count)
+    return test_run:wait_cond(function()
+        local files = fio.glob(fio.pathjoin(dir, glob))
+        if #files == count then
+            return true
+        end
+        return false, files
+    end)
 end;
 ---
 ...
 function check_wal_count(count)
-    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+    return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
 end;
 ---
 ...
 function check_snap_count(count)
-    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+    return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
 end;
 ---
 ...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index 4bab2b0e9..b0e4fedae 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -11,18 +11,20 @@ fio = require('fio')
 errinj = box.error.injection
 
 test_run:cmd("setopt delimiter ';'")
-function check_file_count(dir, glob, count)
-    local files = fio.glob(fio.pathjoin(dir, glob))
-    if #files == count then
-        return true
-    end
-    return false, files
+function wait_file_count(dir, glob, count)
+    return test_run:wait_cond(function()
+        local files = fio.glob(fio.pathjoin(dir, glob))
+        if #files == count then
+            return true
+        end
+        return false, files
+    end)
 end;
 function check_wal_count(count)
-    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+    return wait_file_count(box.cfg.wal_dir, '*.xlog', count)
 end;
 function check_snap_count(count)
-    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+    return wait_file_count(box.cfg.memtx_dir, '*.snap', count)
 end;
 test_run:cmd("setopt delimiter ''");
 
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index c76332814..f71292da1 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -102,9 +102,9 @@ _ = box.space.test:insert{30}
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
-- 1
+- true
 ...
 box.cfg{checkpoint_count = checkpoint_count}
 ---
@@ -203,9 +203,9 @@ for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
-- 1
+- true
 ...
 box.cfg{checkpoint_count = checkpoint_count}
 ---
@@ -330,7 +330,7 @@ box.cfg{checkpoint_count = default_checkpoint_count}
 fio = require('fio')
 ---
 ...
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 ---
 - true
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index ca5fc5388..22a91d8d7 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -40,7 +40,7 @@ box.snapshot()
 _ = box.space.test:delete{3}
 _ = box.space.test:insert{30}
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
 
 -- Restart the replica. Since xlogs have been removed,
@@ -76,7 +76,7 @@ for i = 1, 3 do box.space.test:delete{i * 10} end
 box.snapshot()
 for i = 1, 3 do box.space.test:insert{i * 100} end
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 box.cfg{checkpoint_count = checkpoint_count}
 test_run:cmd("start server replica")
 test_run:cmd("switch replica")
@@ -121,7 +121,7 @@ box.cfg{checkpoint_count = 1}
 box.snapshot()
 box.cfg{checkpoint_count = default_checkpoint_count}
 fio = require('fio')
-#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1
+test_run:wait_cond(function() return #fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) == 1 end) or fio.pathjoin(box.cfg.wal_dir, '*.xlog')
 -- Bump vclock on the replica again.
 test_run:cmd("switch replica")
 for i = 1, 10 do box.space.test:replace{2} end
diff --git a/test/replication/sync.result b/test/replication/sync.result
index b2381ac59..862952fe6 100644
--- a/test/replication/sync.result
+++ b/test/replication/sync.result
@@ -286,7 +286,7 @@ box.info.replication[1].upstream.status -- follow
 ---
 - follow
 ...
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
 ---
 - 'ER_CFG: Incorrect value for option ''replication'': duplicate connection with the
   same replica UUID'
diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
index 51131667d..500c5a396 100644
--- a/test/replication/sync.test.lua
+++ b/test/replication/sync.test.lua
@@ -140,7 +140,7 @@ box.cfg{replication = replication}
 box.info.status -- running
 box.info.ro -- false
 box.info.replication[1].upstream.status -- follow
-test_run:grep_log('replica', 'ER_CFG.*')
+test_run:wait_log("replica", "ER_CFG.*", nil, 200)
 
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 08/12] test: use wait_cond to check follow status
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (5 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 07/12] test: wait for xlog/snap/log file changes Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 09/12] test: increase timeouts in replication/misc Alexander V. Tikhonov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

After setting timeouts in `box.cfg` and before making a `replace`, the
test needs to wait for the replicas to reach the `follow` status. Then,
if `wait_follow()` finds a status other than `follow`, it returns true,
which immediately causes an error.

Fixes #3734
Part of #2436, #3232

(cherry picked from commit f41548b7667b2f0ff28c9bb81a563f9d276f7107)
---
 test/replication/misc.result   | 21 +++++++++++++++------
 test/replication/misc.test.lua | 19 +++++++++++++------
 2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/test/replication/misc.result b/test/replication/misc.result
index df3167991..b4af9e41f 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -163,15 +163,24 @@ test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
+function wait_follow(replicaA, replicaB)
+    return test_run:wait_cond(function()
+        return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
+    end, 0.01)
+end ;
+---
+...
 function test_timeout()
+    local replicaA = box.info.replication[1].upstream or box.info.replication[2].upstream
+    local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
+    local follows = test_run:wait_cond(function()
+        return replicaA.status == 'follow' or replicaB.status == 'follow'
+    end, 0.1)
+    if not follows then error('replicas not in follow status') end
     for i = 0, 99 do 
         box.space.test_timeout:replace({1})
-        fiber.sleep(0.005)
-        local rinfo = box.info.replication
-        if rinfo[1].upstream and rinfo[1].upstream.status ~= 'follow' or
-           rinfo[2].upstream and rinfo[2].upstream.status ~= 'follow' or
-           rinfo[3].upstream and rinfo[3].upstream.status ~= 'follow' then
-            return error('Replication broken')
+        if wait_follow(replicaA, replicaB) then
+            return error(box.info.replication)
         end
     end
     return true
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index f3d0f2b95..dd374a210 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -58,15 +58,22 @@ fiber=require('fiber')
 box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
 _ = box.schema.space.create('test_timeout'):create_index('pk')
 test_run:cmd("setopt delimiter ';'")
+function wait_follow(replicaA, replicaB)
+    return test_run:wait_cond(function()
+        return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
+    end, 0.01)
+end ;
 function test_timeout()
+    local replicaA = box.info.replication[1].upstream or box.info.replication[2].upstream
+    local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
+    local follows = test_run:wait_cond(function()
+        return replicaA.status == 'follow' or replicaB.status == 'follow'
+    end, 0.1)
+    if not follows then error('replicas not in follow status') end
     for i = 0, 99 do 
         box.space.test_timeout:replace({1})
-        fiber.sleep(0.005)
-        local rinfo = box.info.replication
-        if rinfo[1].upstream and rinfo[1].upstream.status ~= 'follow' or
-           rinfo[2].upstream and rinfo[2].upstream.status ~= 'follow' or
-           rinfo[3].upstream and rinfo[3].upstream.status ~= 'follow' then
-            return error('Replication broken')
+        if wait_follow(replicaA, replicaB) then
+            return error(box.info.replication)
         end
     end
     return true
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 09/12] test: increase timeouts in replication/misc
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (6 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 08/12] test: use wait_cond to check follow status Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 10/12] test: put require in proper places Alexander V. Tikhonov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

From: Alexander Turenko <alexander.turenko@tarantool.org>

All changes are needed to eliminate sporadic failures when testing is
run with, say, 30 parallel jobs.

First, replication_connect_timeout is increased to 30 seconds. This
parameter change does not alter the meaning of the test cases.

Second, increase replication_timeout from 0.01 to 0.03. We usually set
it to 0.1 in tests, but the duration of the gh-3160 test case ('Send
heartbeats if there are changes from a remote master only') is around
100 * replication_timeout seconds and we don't want to make this test
much longer. Runs of the test case (without the other ones in
replication/misc.test.lua) in 30 parallel jobs show that 0.03 is enough
for the gh-3160 case to pass stably and hopefully enough for the

(cherry picked from commit 697caa6b731ae89627958b1fda2aa1da49ecee5d)
---
 test/replication/misc.result   | 43 ++++++++--------------------------
 test/replication/misc.test.lua | 27 ++++++++-------------
 2 files changed, 20 insertions(+), 50 deletions(-)

diff --git a/test/replication/misc.result b/test/replication/misc.result
index b4af9e41f..6c9582035 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -117,32 +117,12 @@ SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
 ---
 ...
-test_run:cmd("switch autobootstrap1")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
-test_run:cmd("switch autobootstrap2")
----
-- true
-...
-test_run = require('test_run').new()
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
----
-...
 test_run:cmd("switch autobootstrap3")
 ---
 - true
@@ -150,10 +130,7 @@ test_run:cmd("switch autobootstrap3")
 test_run = require('test_run').new()
 ---
 ...
-fiber=require('fiber')
----
-...
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
 ---
 ...
 _ = box.schema.space.create('test_timeout'):create_index('pk')
@@ -163,11 +140,11 @@ test_run:cmd("setopt delimiter ';'")
 ---
 - true
 ...
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
     return test_run:wait_cond(function()
         return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
-    end, 0.01)
-end ;
+    end, box.cfg.replication_timeout)
+end;
 ---
 ...
 function test_timeout()
@@ -175,16 +152,16 @@ function test_timeout()
     local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
     local follows = test_run:wait_cond(function()
         return replicaA.status == 'follow' or replicaB.status == 'follow'
-    end, 0.1)
-    if not follows then error('replicas not in follow status') end
-    for i = 0, 99 do 
+    end)
+    if not follows then error('replicas are not in the follow status') end
+    for i = 0, 99 do
         box.space.test_timeout:replace({1})
-        if wait_follow(replicaA, replicaB) then
+        if wait_not_follow(replicaA, replicaB) then
             return error(box.info.replication)
         end
     end
     return true
-end ;
+end;
 ---
 ...
 test_run:cmd("setopt delimiter ''");
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index dd374a210..bdfeea11c 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -44,40 +44,33 @@ test_run:cleanup_cluster()
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS, "replication", {args="0.1"})
+test_run:create_cluster(SERVERS, "replication", {args="0.03"})
 test_run:wait_fullmesh(SERVERS)
-test_run:cmd("switch autobootstrap1")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
-test_run:cmd("switch autobootstrap2")
-test_run = require('test_run').new()
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
 test_run:cmd("switch autobootstrap3")
 test_run = require('test_run').new()
-fiber=require('fiber')
-box.cfg{replication_timeout = 0.01, replication_connect_timeout=0.01}
+fiber = require('fiber')
 _ = box.schema.space.create('test_timeout'):create_index('pk')
 test_run:cmd("setopt delimiter ';'")
-function wait_follow(replicaA, replicaB)
+function wait_not_follow(replicaA, replicaB)
     return test_run:wait_cond(function()
         return replicaA.status ~= 'follow' or replicaB.status ~= 'follow'
-    end, 0.01)
-end ;
+    end, box.cfg.replication_timeout)
+end;
 function test_timeout()
     local replicaA = box.info.replication[1].upstream or box.info.replication[2].upstream
     local replicaB = box.info.replication[3].upstream or box.info.replication[2].upstream
     local follows = test_run:wait_cond(function()
         return replicaA.status == 'follow' or replicaB.status == 'follow'
-    end, 0.1)
-    if not follows then error('replicas not in follow status') end
-    for i = 0, 99 do 
+    end)
+    if not follows then error('replicas are not in the follow status') end
+    for i = 0, 99 do
         box.space.test_timeout:replace({1})
-        if wait_follow(replicaA, replicaB) then
+        if wait_not_follow(replicaA, replicaB) then
             return error(box.info.replication)
         end
     end
     return true
-end ;
+end;
 test_run:cmd("setopt delimiter ''");
 test_timeout()
 
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 10/12] test: put require in proper places
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (7 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 09/12] test: increase timeouts in replication/misc Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures Alexander V. Tikhonov
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

* put `require('fiber')` after each switch server command, because
  otherwise a 'fiber' not defined error sometimes occurred
* use `require('fio')` after `require('test_run').new()`, because
  otherwise a 'fio' not defined error sometimes occurred

Part of #2436, #3232

(cherry picked from commit d2f28afaf7f045687df191cd93c935dcc442811b)
---
 test/replication/catch.test.lua      | 1 -
 test/replication/gc.result           | 6 +++---
 test/replication/gc.test.lua         | 2 +-
 test/replication/on_replace.result   | 3 +++
 test/replication/on_replace.test.lua | 1 +
 5 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/test/replication/catch.test.lua b/test/replication/catch.test.lua
index d5de88642..7a531df39 100644
--- a/test/replication/catch.test.lua
+++ b/test/replication/catch.test.lua
@@ -2,7 +2,6 @@ env = require('test_run')
 test_run = env.new()
 engine = test_run:get_cfg('engine')
 
-
 net_box = require('net.box')
 errinj = box.error.injection
 
diff --git a/test/replication/gc.result b/test/replication/gc.result
index 5b44284bf..cbdeffb11 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -1,6 +1,3 @@
-fio = require 'fio'
----
-...
 test_run = require('test_run').new()
 ---
 ...
@@ -13,6 +10,9 @@ replica_set = require('fast_replica')
 fiber = require('fiber')
 ---
 ...
+fio = require('fio')
+---
+...
 test_run:cleanup_cluster()
 ---
 ...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index fee1fe968..4710fd9e3 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -1,8 +1,8 @@
-fio = require 'fio'
 test_run = require('test_run').new()
 engine = test_run:get_cfg('engine')
 replica_set = require('fast_replica')
 fiber = require('fiber')
+fio = require('fio')
 
 test_run:cleanup_cluster()
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
diff --git a/test/replication/on_replace.result b/test/replication/on_replace.result
index 8fef8fb14..2e95b90ea 100644
--- a/test/replication/on_replace.result
+++ b/test/replication/on_replace.result
@@ -63,6 +63,9 @@ test_run:cmd("switch replica")
 ---
 - true
 ...
+fiber = require('fiber')
+---
+...
 while box.space.test:count() < 2 do fiber.sleep(0.01) end
 ---
 ...
diff --git a/test/replication/on_replace.test.lua b/test/replication/on_replace.test.lua
index 23a3313b5..e34832103 100644
--- a/test/replication/on_replace.test.lua
+++ b/test/replication/on_replace.test.lua
@@ -26,6 +26,7 @@ session_type
 test_run:cmd("switch default")
 box.space.test:insert{2}
 test_run:cmd("switch replica")
+fiber = require('fiber')
 while box.space.test:count() < 2 do fiber.sleep(0.01) end
 --
 -- applier
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (8 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 10/12] test: put require in proper places Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 12/12] test: errinj for pause relay_send Alexander V. Tikhonov
  2019-11-26  6:54 ` [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Kirill Yukhin
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

From: avtikhon <avtikhon@tarantool.org>

Two problems are fixed here. The first one is about correctness of the
test case. The second is about flaky failures.

About correctness. The test case contains the following lines:

 | test_run:cmd("switch replica")
 | -- Unblock the replica and break replication.
 | box.error.injection.set("ERRINJ_WAL_DELAY", false)
 | box.cfg{replication = {}}

Usually rows are applied and the new vclock is sent to the master before
replication is disabled. So the master removes the old xlog before the
replica restarts, and the next case tests nothing.

This commit uses test-run's new ability to stop a tarantool instance
with a custom signal and stops the replica with SIGKILL w/o dropping
ERRINJ_WAL_DELAY. This change fixes the race between applying rows and
disabling replication and so makes the test case correct.

About flaky failures. They looked like this:

 | [029] --- replication/gc.result Mon Apr 15 14:58:09 2019
 | [029] +++ replication/gc.reject Tue Apr 16 09:17:47 2019
 | [029] @@ -290,7 +290,12 @@
 | [029] ...
 | [029] wait_xlog(1) or fio.listdir('./master')
 | [029] ---
 | [029] -- true
 | [029] +- - 00000000000000000305.vylog
 | [029] + - 00000000000000000305.xlog
 | [029] + - '512'
 | [029] + - 00000000000000000310.xlog
 | [029] + - 00000000000000000310.vylog
 | [029] + - 00000000000000000310.snap
 | [029] ...
 | [029] -- Stop the replica.
 | [029] test_run:cmd("stop server replica")
 | <...next cases could have induced mismatches too...>

The reason for the failure is that the replica applied all rows from the
old xlog, but didn't send an ACK with the new vclock to the master,
because replication was disabled before that. The master stops the relay
and keeps the old xlog. When the replica starts again it subscribes with
a vclock value that instructs the relay to open the new xlog.

Tarantool can remove an old xlog just after a replica's ACK when it
observes that the xlog was fully read by all replicas. But tarantool
does not remove xlogs when a replica subscribes. This is not a big
problem, because such a 'stuck' xlog file will be removed with the next
xlog removal.

There was an attempt to fix this behaviour and remove old xlogs at
subscribe, see the following commits:

* b5b4809cf2e6d48230eb9e4301eac188b080e0f4 ('replication: update replica
  gc state on subscribe');
* 766cd3e1015f6f76460a748c37212fb4c8791500 ('Revert "replication: update
  replica gc state on subscribe"').

Anyway, this commit fixes these flaky failures, because it stops the
replica before the rows from the old xlog are applied. So when the
replica starts it continues reading from the old xlog, and the xlog file
will be removed once it has been fully read.

Closes #4162

(cherry picked from commit 35b5095ab0f437df6e78b1bacf4d41c3737a540e)
---
 test/replication/gc.result   | 16 +++++++---------
 test/replication/gc.test.lua | 16 ++++++++--------
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index cbdeffb11..5d55403b0 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -236,20 +236,18 @@ fiber.sleep(0.1) -- wait for master to relay data
 ---
 - true
 ...
-test_run:cmd("switch replica")
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=KILL")
 ---
 - true
 ...
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
----
-- ok
-...
-box.cfg{replication = {}}
+test_run:cmd("start server replica")
 ---
+- true
 ...
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 ---
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 4710fd9e3..40a349167 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -110,14 +110,14 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-#box.info.gc().checkpoints == 1 or box.info.gc()
-#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
-test_run:cmd("switch replica")
--- Unblock the replica and break replication.
-box.error.injection.set("ERRINJ_WAL_DELAY", false)
-box.cfg{replication = {}}
--- Restart the replica to reestablish replication.
-test_run:cmd("restart server replica")
+wait_gc(1) or box.info.gc()
+wait_xlog(2) or fio.listdir('./master')
+-- Imitate the replica crash and, then, wake up.
+-- Just 'stop server replica' (SIGTERM) is not sufficient to stop
+-- a tarantool instance when ERRINJ_WAL_DELAY is set, because
+-- "tarantool" thread wait for paused "wal" thread infinitely.
+test_run:cmd("stop server replica with signal=KILL")
+test_run:cmd("start server replica")
 -- Wait for the replica to catch up.
 test_run:cmd("switch replica")
 test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
-- 
2.17.1


* [Tarantool-patches] [PATCH v1 12/12] test: errinj for pause relay_send
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (9 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures Alexander V. Tikhonov
@ 2019-11-26  6:21 ` Alexander V. Tikhonov
  2019-11-26  6:54 ` [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Kirill Yukhin
  11 siblings, 0 replies; 13+ messages in thread
From: Alexander V. Tikhonov @ 2019-11-26  6:21 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: Sergei Voronezhskii, tarantool-patches

From: Sergei Voronezhskii <sergw@tarantool.org>

This commit is the remaining part of the changes cherry-picked from
commit 1c34c91fa725ab254619d23c2f1d99f1e8269324. The initial part of the
changes was cherry-picked in commit 8f2bd50105e62b0133032a717cfaa6f8fab26c29.

Also, look up the xlog files in a loop with a short sleep until the file
count matches the expected value.
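The "look up in a loop with a little sleep" pattern can be sketched like this;
a hypothetical Python illustration only (the tests implement it in Lua via
test_run:wait_cond and fio.glob, and the helper name and defaults below are
assumptions):

```python
import glob
import time

def wait_file_count(pattern, expected, timeout=10.0, interval=0.1):
    # Poll the filesystem until the number of files matching `pattern`
    # equals `expected`, or give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if len(glob.glob(pattern)) == expected:
            return True
        time.sleep(interval)
    return len(glob.glob(pattern)) == expected
```

Polling like this avoids asserting on a single snapshot of the directory,
which is what made the original `#fio.glob(...) == N` checks flaky under
parallel test runs.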

Part of #3232

(cherry picked from commit 1c34c91fa725ab254619d23c2f1d99f1e8269324)
---
 test/replication/gc.result   | 86 ++++++++++++++++++++----------------
 test/replication/gc.test.lua | 62 ++++++++++++++------------
 2 files changed, 82 insertions(+), 66 deletions(-)

diff --git a/test/replication/gc.result b/test/replication/gc.result
index 5d55403b0..050a6100c 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -27,6 +27,28 @@ default_checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 ---
 ...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function wait_gc(n)
+    return test_run:wait_cond(function()
+        return #box.info.gc().checkpoints == n
+    end, 10)
+end;
+---
+...
+function wait_xlog(n, timeout)
+    return test_run:wait_cond(function()
+        return #fio.glob('./master/*.xlog') == n
+    end, 10)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
 -- Grant permissions needed for replication.
 box.schema.user.grant('guest', 'replication')
 ---
@@ -63,14 +85,13 @@ for i = 1, 100 do s:auto_increment{} end
 ...
 -- Make sure replica join will take long enough for us to
 -- invoke garbage collection.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.05)
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
 ---
 - ok
 ...
 -- While the replica is receiving the initial data set,
 -- make a snapshot and invoke garbage collection, then
--- remove the timeout injection so that we don't have to
--- wait too long for the replica to start.
+-- remove delay to allow replica to start.
 test_run:cmd("setopt delimiter ';'")
 ---
 - true
@@ -78,7 +99,7 @@ test_run:cmd("setopt delimiter ';'")
 fiber.create(function()
     fiber.sleep(0.1)
     box.snapshot()
-    box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+    box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
 end)
 test_run:cmd("setopt delimiter ''");
 ---
@@ -110,21 +131,16 @@ test_run:cmd("switch default")
 ...
 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-test_run:wait_cond(function() return #box.info.gc().checkpoints == 1 end, 10)
----
-- true
-...
-#box.info.gc().checkpoints == 1 or box.info.gc()
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
+wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
 ---
 - true
 ...
--- Make sure the replica will receive data it is subscribed
--- to long enough for us to invoke garbage collection.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.05)
+-- we test garbage collection.
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
 ---
 - ok
 ...
@@ -152,17 +168,17 @@ box.snapshot()
 ---
 - ok
 ...
-#box.info.gc().checkpoints == 1 or box.info.gc()
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
+wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
--- Remove the timeout injection so that the replica catches
+-- Resume replication so that the replica catches
 -- up quickly.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
 ---
 - ok
 ...
@@ -185,11 +201,11 @@ test_run:cmd("switch default")
 ...
 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 0 end, 10)
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-#fio.glob('./master/*.xlog') == 0 or fio.listdir('./master')
+wait_xlog(0) or fio.listdir('./master')
 ---
 - true
 ...
@@ -228,11 +244,11 @@ fiber.sleep(0.1) -- wait for master to relay data
 -- Garbage collection must not delete the old xlog file
 -- because it is still needed by the replica, but remove
 -- the old snapshot.
-#box.info.gc().checkpoints == 1 or box.info.gc()
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
+wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
@@ -266,11 +282,11 @@ test_run:cmd("switch default")
 - true
 ...
 -- Now it's safe to drop the old xlog.
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 1 end, 10)
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
+wait_xlog(1) or fio.listdir('./master')
 ---
 - true
 ...
@@ -302,17 +318,11 @@ box.snapshot()
 ---
 - ok
 ...
-#box.info.gc().checkpoints == 1 or box.info.gc()
+wait_gc(1) or box.info.gc()
 ---
 - true
 ...
-xlog_count = #fio.glob('./master/*.xlog')
----
-...
--- the replica may have managed to download all data
--- from xlog #1 before it was stopped, in which case
--- it's OK to collect xlog #1
-xlog_count == 3 or xlog_count == 2 or fio.listdir('./master')
+wait_xlog(2) or fio.listdir('./master')
 ---
 - true
 ...
@@ -321,7 +331,11 @@ xlog_count == 3 or xlog_count == 2 or fio.listdir('./master')
 test_run:cleanup_cluster()
 ---
 ...
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+---
+- true
+...
+wait_xlog(1) or fio.listdir('./master')
 ---
 - true
 ...
@@ -409,7 +423,7 @@ box.snapshot()
 ---
 - ok
 ...
-#fio.glob('./master/*.xlog') == 3 or fio.listdir('./master')
+wait_xlog(3) or fio.listdir('./master')
 ---
 - true
 ...
@@ -422,11 +436,7 @@ box.snapshot()
 ---
 - ok
 ...
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 0 end, 10)
----
-- true
-...
-#fio.glob('./master/*.xlog') == 0 or fio.listdir('./master')
+wait_xlog(0, 10) or fio.listdir('./master')
 ---
 - true
 ...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 40a349167..7cd18402c 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -11,6 +11,19 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
 default_checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 
+test_run:cmd("setopt delimiter ';'")
+function wait_gc(n)
+    return test_run:wait_cond(function()
+        return #box.info.gc().checkpoints == n
+    end, 10)
+end;
+function wait_xlog(n, timeout)
+    return test_run:wait_cond(function()
+        return #fio.glob('./master/*.xlog') == n
+    end, 10)
+end;
+test_run:cmd("setopt delimiter ''");
+
 -- Grant permissions needed for replication.
 box.schema.user.grant('guest', 'replication')
 
@@ -29,17 +42,16 @@ for i = 1, 100 do s:auto_increment{} end
 
 -- Make sure replica join will take long enough for us to
 -- invoke garbage collection.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.05)
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
 
 -- While the replica is receiving the initial data set,
 -- make a snapshot and invoke garbage collection, then
--- remove the timeout injection so that we don't have to
--- wait too long for the replica to start.
+-- remove delay to allow replica to start.
 test_run:cmd("setopt delimiter ';'")
 fiber.create(function()
     fiber.sleep(0.1)
     box.snapshot()
-    box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+    box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
 end)
 test_run:cmd("setopt delimiter ''");
 
@@ -57,12 +69,10 @@ test_run:cmd("switch default")
 
 -- Check that garbage collection removed the snapshot once
 -- the replica released the corresponding checkpoint.
-test_run:wait_cond(function() return #box.info.gc().checkpoints == 1 end, 10)
-#box.info.gc().checkpoints == 1 or box.info.gc()
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
--- Make sure the replica will receive data it is subscribed
--- to long enough for us to invoke garbage collection.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0.05)
+wait_gc(1) or box.info.gc()
+wait_xlog(1) or fio.listdir('./master') -- Make sure the replica will not receive data until
+-- we test garbage collection.
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", true)
 
 -- Send more data to the replica.
 -- Need to do 2 snapshots here, otherwise the replica would
@@ -76,12 +86,12 @@ box.snapshot()
 -- Invoke garbage collection. Check that it doesn't remove
 -- xlogs needed by the replica.
 box.snapshot()
-#box.info.gc().checkpoints == 1 or box.info.gc()
-#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+wait_xlog(2) or fio.listdir('./master')
 
--- Remove the timeout injection so that the replica catches
+-- Resume replication so that the replica catches
 -- up quickly.
-box.error.injection.set("ERRINJ_RELAY_TIMEOUT", 0)
+box.error.injection.set("ERRINJ_RELAY_SEND_DELAY", false)
 
 -- Check that the replica received all data from the master.
 test_run:cmd("switch replica")
@@ -91,8 +101,8 @@ test_run:cmd("switch default")
 
 -- Now garbage collection should resume and delete files left
 -- from the old checkpoint.
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 0 end, 10)
-#fio.glob('./master/*.xlog') == 0 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+wait_xlog(0) or fio.listdir('./master')
 --
 -- Check that the master doesn't delete xlog files sent to the
 -- replica until it receives a confirmation that the data has
@@ -124,8 +134,8 @@ test_run:wait_cond(function() return box.space.test:count() == 310 end, 10)
 box.space.test:count()
 test_run:cmd("switch default")
 -- Now it's safe to drop the old xlog.
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 1 end, 10)
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+wait_xlog(1) or fio.listdir('./master')
 -- Stop the replica.
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
@@ -139,17 +149,14 @@ _ = s:auto_increment{}
 box.snapshot()
 _ = s:auto_increment{}
 box.snapshot()
-#box.info.gc().checkpoints == 1 or box.info.gc()
-xlog_count = #fio.glob('./master/*.xlog')
--- the replica may have managed to download all data
--- from xlog #1 before it was stopped, in which case
--- it's OK to collect xlog #1
-xlog_count == 3 or xlog_count == 2 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+wait_xlog(2) or fio.listdir('./master')
 
 -- The xlog should only be deleted after the replica
 -- is unregistered.
 test_run:cleanup_cluster()
-#fio.glob('./master/*.xlog') == 1 or fio.listdir('./master')
+wait_gc(1) or box.info.gc()
+wait_xlog(1) or fio.listdir('./master')
 --
 -- Test that concurrent invocation of the garbage collector works fine.
 --
@@ -188,14 +195,13 @@ _ = s:auto_increment{}
 box.snapshot()
 _ = s:auto_increment{}
 box.snapshot()
-#fio.glob('./master/*.xlog') == 3 or fio.listdir('./master')
+wait_xlog(3) or fio.listdir('./master')
 
 -- Delete the replica from the cluster table and check that
 -- all xlog files are removed.
 test_run:cleanup_cluster()
 box.snapshot()
-test_run:wait_cond(function() return #fio.glob('./master/*.xlog') == 0 end, 10)
-#fio.glob('./master/*.xlog') == 0 or fio.listdir('./master')
+wait_xlog(0, 10) or fio.listdir('./master')
 
 -- Restore the config.
 box.cfg{replication = {}}
-- 
2.17.1


* Re: [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests
  2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
                   ` (10 preceding siblings ...)
  2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 12/12] test: errinj for pause relay_send Alexander V. Tikhonov
@ 2019-11-26  6:54 ` Kirill Yukhin
  11 siblings, 0 replies; 13+ messages in thread
From: Kirill Yukhin @ 2019-11-26  6:54 UTC (permalink / raw)
  To: Alexander V. Tikhonov; +Cc: Sergei Voronezhskii, tarantool-patches

Hello,

I've checked the patchset into 1.10.

In future please:
  - Prepare a cover letter
  - Mention branch name

--
Regards, Kirill Yukhin


end of thread, other threads:[~2019-11-26  6:54 UTC | newest]

Thread overview: 13+ messages
2019-11-26  6:21 [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 02/12] test: enable parallel run for long test suites Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 03/12] test: replication parallel mode on Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 04/12] test: enable cleaning of a test environment Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 05/12] test: allow to run replication/misc multiple times Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 06/12] test: increase timeouts in replication/errinj Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 07/12] test: wait for xlog/snap/log file changes Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 08/12] test: use wait_cond to check follow status Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 09/12] test: increase timeouts in replication/misc Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 10/12] test: put require in proper places Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 11/12] test: fix replication/gc flaky failures Alexander V. Tikhonov
2019-11-26  6:21 ` [Tarantool-patches] [PATCH v1 12/12] test: errinj for pause relay_send Alexander V. Tikhonov
2019-11-26  6:54 ` [Tarantool-patches] [PATCH v1 01/12] test: enable parallel mode for xlog tests Kirill Yukhin
