[PATCH] test: enable parallel mode for replication tests

Alexander Turenko alexander.turenko at tarantool.org
Mon Oct 1 04:36:57 MSK 2018


Hi, Sergei!

Below I reviewed 'before_replace' and 'catch' tests. I'm going to split the
review into parts to don't block you. Understand how the replication works, how
25 tests work and propose best way to stabilize them is significant amount of
work for me, so I'll interleave this work with other tasks and will try to
finish this within the week.

WBR, Alexander Turenko.

On Thu, Sep 27, 2018 at 06:38:50PM +0300, Sergei Voronezhskii wrote:
> - need more sleeps and timeout because increasing load on i/o
> - at the end of tests which create any replication config need to call
> `test_run:clenup_cluster()` which clears `box.space._cluster`
> - switch on `use_unix_sockets` because of 'Address already in use'
> problems

Need to update test-run too. This option is ignored for non-default
servers on the test-run version in your branch. I hit 'Address already
in use' error locally when test your branch.

> - instead of just checking `box.space.test:count()` or
> `#fio.glob('./master/*.xlog')` need to wait for values because of
> increading load in replication process
> 
> Part of #2436, #3232
> ---
> BRANCH: https://github.com/tarantool/tarantool/tree/sergw/enable-parallel-test-replication
>  test/replication/before_replace.result      |  9 ++-
>  test/replication/before_replace.test.lua    |  7 ++-
>  test/replication/catch.result               | 31 ++++++----
>  test/replication/catch.test.lua             | 14 +++--
>  test/replication/gc.result                  | 18 +++---
>  test/replication/gc.test.lua                | 12 ++--
>  test/replication/local_spaces.result        |  3 +
>  test/replication/local_spaces.test.lua      |  1 +
>  test/replication/misc.result                | 65 +++++++++++++++------
>  test/replication/misc.test.lua              | 43 ++++++++------
>  test/replication/on_replace.result          | 13 +++++
>  test/replication/on_replace.test.lua        |  4 ++
>  test/replication/once.result                |  2 +-
>  test/replication/once.test.lua              |  2 +-
>  test/replication/quorum.result              |  9 ++-
>  test/replication/quorum.test.lua            |  7 ++-
>  test/replication/replica_rejoin.result      |  7 +++
>  test/replication/replica_rejoin.test.lua    |  2 +
>  test/replication/skip_conflict_row.result   |  7 +++
>  test/replication/skip_conflict_row.test.lua |  2 +
>  test/replication/status.result              |  7 +++
>  test/replication/status.test.lua            |  2 +
>  test/replication/suite.ini                  |  3 +-
>  test/replication/sync.result                |  7 +++
>  test/replication/sync.test.lua              |  2 +
>  test/replication/wal_off.result             |  7 +++
>  test/replication/wal_off.test.lua           |  2 +
>  27 files changed, 209 insertions(+), 79 deletions(-)
> 
> diff --git a/test/replication/before_replace.test.lua b/test/replication/before_replace.test.lua
> index f1e590703..d86f74b05 100644
> --- a/test/replication/before_replace.test.lua
> +++ b/test/replication/before_replace.test.lua
> @@ -46,13 +46,13 @@ test_run:cmd("setopt delimiter ''");
>  -- Stall replication and generate incompatible data
>  -- on the replicas.
>  test_run:cmd("switch autobootstrap1")
> -box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.01)
> +box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
>  for i = 1, 10 do box.space.test:replace{i, i % 3 == 1 and i * 10 or i} end
>  test_run:cmd("switch autobootstrap2")
> -box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.01)
> +box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
>  for i = 1, 10 do box.space.test:replace{i, i % 3 == 2 and i * 10 or i} end
>  test_run:cmd("switch autobootstrap3")
> -box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.01)
> +box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
>  for i = 1, 10 do box.space.test:replace{i, i % 3 == 0 and i * 10 or i} end
>  

I have got the following miscompare locally with the new timeout value (0.1).
Don't know whether it matters, but it was got on the old test-run (TCP sockets
were used for non-default server consoles).

[001] --- replication/before_replace.result	Fri Sep 28 14:13:57 2018
[001] +++ replication/before_replace.reject	Mon Oct  1 02:21:22 2018
[001] @@ -146,8 +146,8 @@
[001]    - [6, 60]
[001]    - [7, 70]
[001]    - [8, 80]
[001] -  - [9, 90]
[001] -  - [10, 100]
[001] +  - [9, 9]
[001] +  - [10, 10]
[001]  ...
[001]  test_run:cmd("switch autobootstrap2")
[001]  ---

It is the datum on the first node after restart. But 0.01 works good for me.
Don't know why, to be honest.

Don't get why ERRINJ_RELAY_TIMEOUT calls fiber_sleep **after** the write
of xrow (changing that break the test too). As I see 'catch' test exploits this
fact (but I rewrote it, see below).

Anyway. I'm tentative about the right implementation, but propose use an
enable/disable switch instead of timeout, somehow like so:

diff --git a/src/box/relay.cc b/src/box/relay.cc
index c90383d4a..078273bd4 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -622,12 +622,19 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 static void
 relay_send(struct relay *relay, struct xrow_header *packet)
 {
+       struct errinj *inj = errinj(ERRINJ_RELAY_STOP_SEND,
+                                   ERRINJ_BOOL);
+       while (inj->bparam) {
+               fiber_sleep(0.01);
+               inj = errinj(ERRINJ_RELAY_STOP_SEND,
+                            ERRINJ_BOOL);
+       }
        packet->sync = relay->sync;
        relay->last_row_tm = ev_monotonic_now(loop());
        coio_write_xrow(&relay->io, packet);
        fiber_gc();
 
-       struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
+       inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
        if (inj != NULL && inj->dparam > 0)
                fiber_sleep(inj->dparam);
 }
diff --git a/src/errinj.h b/src/errinj.h
index 84a1fbb5e..eaac24f5d 100644
--- a/src/errinj.h
+++ b/src/errinj.h
@@ -94,6 +94,7 @@ struct errinj {
        _(ERRINJ_VY_GC, ERRINJ_BOOL, {.bparam = false}) \
        _(ERRINJ_VY_LOG_FLUSH, ERRINJ_BOOL, {.bparam = false}) \
        _(ERRINJ_VY_LOG_FLUSH_DELAY, ERRINJ_BOOL, {.bparam = false}) \
+       _(ERRINJ_RELAY_STOP_SEND, ERRINJ_BOOL, {.bparam = false}) \
        _(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
        _(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
        _(ERRINJ_RELAY_FINAL_SLEEP, ERRINJ_BOOL, {.bparam = false}) \
diff --git a/test/replication/before_replace.test.lua b/test/replication/before_replace.test.lua
index d86f74b05..bfe3ba9e8 100644
--- a/test/replication/before_replace.test.lua
+++ b/test/replication/before_replace.test.lua
@@ -46,15 +46,23 @@ test_run:cmd("setopt delimiter ''");
 -- Stall replication and generate incompatible data
 -- on the replicas.
 test_run:cmd("switch autobootstrap1")
-box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', true)
 for i = 1, 10 do box.space.test:replace{i, i % 3 == 1 and i * 10 or i} end
 test_run:cmd("switch autobootstrap2")
-box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', true)
 for i = 1, 10 do box.space.test:replace{i, i % 3 == 2 and i * 10 or i} end
 test_run:cmd("switch autobootstrap3")
-box.error.injection.set('ERRINJ_RELAY_TIMEOUT', 0.1)
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', true)
 for i = 1, 10 do box.space.test:replace{i, i % 3 == 0 and i * 10 or i} end
 
+-- Resume replication.
+test_run:cmd("switch autobootstrap1")
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', false)
+test_run:cmd("switch autobootstrap2")
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', false)
+test_run:cmd("switch autobootstrap3")
+box.error.injection.set('ERRINJ_RELAY_STOP_SEND', false)
+
 -- Synchronize.
 test_run:cmd("switch default")
 vclock = test_run:get_cluster_vclock(SERVERS)

That works for me.

>  -- Synchronize.
> @@ -80,3 +80,4 @@ box.space.test:select()
>  -- Cleanup.
>  test_run:cmd("switch default")
>  test_run:drop_cluster(SERVERS)
> +test_run:cleanup_cluster()

> diff --git a/test/replication/catch.test.lua b/test/replication/catch.test.lua
> index 8cc3242f7..68cca831e 100644
> --- a/test/replication/catch.test.lua
> +++ b/test/replication/catch.test.lua
> @@ -1,4 +1,5 @@
>  env = require('test_run')
> +fiber = require('fiber')
>  test_run = env.new()
>  engine = test_run:get_cfg('engine')
>  
> @@ -23,13 +24,15 @@ test_run:cmd("switch default")
>  test_run:cmd("stop server replica")
>  
>  -- insert values on the master while replica is stopped and can't fetch them
> -for i=1,100 do s:insert{i, 'this is test message12345'} end
> +for i=1,1000 do s:insert{i, 'this is test message12345'} end
>  
>  -- sleep after every tuple
> -errinj.set("ERRINJ_RELAY_TIMEOUT", 1000.0)
> +errinj.set("ERRINJ_RELAY_TIMEOUT", 2.0)
>  
>  test_run:cmd("start server replica with args='0.01'")
>  test_run:cmd("switch replica")
> +-- After stop server got error variable 'fiber' is not declared
> +fiber = require('fiber')
>  
>  -- Check that replica doesn't enter read-write mode before
>  -- catching up with the master: to check that we inject sleep into
> @@ -42,23 +45,26 @@ test_run:cmd("switch replica")
>  -- #1: delete tuple on replica
>  --
>  box.space.test ~= nil
> +repeat fiber.sleep(0.001) until box.space.test:get(1) ~= nil
>  d = box.space.test:delete{1}
> -box.space.test:get(1) ~= nil
>  
>  -- case #2: delete tuple by net.box
>  
>  test_run:cmd("switch default")
>  test_run:cmd("set variable r_uri to 'replica.listen'")
>  c = net_box.connect(r_uri)
> +repeat fiber.sleep(0.001) until c.space.test:get(1) ~= nil
>  d = c.space.test:delete{1}
> -c.space.test:get(1) ~= nil
>  
>  -- check sync
>  errinj.set("ERRINJ_RELAY_TIMEOUT", 0)
> +fiber.sleep(2.0) -- wait until release errinj sleep
> +
>  
>  -- cleanup
>  test_run:cmd("stop server replica")
>  test_run:cmd("cleanup server replica")
> +test_run:cleanup_cluster()
>  box.space.test:drop()
>  box.schema.user.revoke('guest', 'replication')
>  

Proposed to use the replication enable/disable too. Also I think it is
easier to use replace instead of waiting for the first tuple and then
use delete.

The diff:

diff --git a/test/replication/catch.test.lua b/test/replication/catch.test.lua
index 68cca831e..92bb645b7 100644
--- a/test/replication/catch.test.lua
+++ b/test/replication/catch.test.lua
@@ -3,7 +3,6 @@ fiber = require('fiber')
 test_run = env.new()
 engine = test_run:get_cfg('engine')
 
-
 net_box = require('net.box')
 errinj = box.error.injection
 
@@ -14,7 +13,7 @@ test_run:cmd("switch replica")
 
 test_run:cmd("switch default")
 s = box.schema.space.create('test', {engine = engine});
--- vinyl does not support hash index
+-- Vinyl does not support hash index.
 index = s:create_index('primary', {type = (engine == 'vinyl' and 'tree' or 'hash') })
 
 test_run:cmd("switch replica")
@@ -23,48 +22,42 @@ while box.space.test == nil do fiber.sleep(0.01) end
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
 
--- insert values on the master while replica is stopped and can't fetch them
-for i=1,1000 do s:insert{i, 'this is test message12345'} end
-
--- sleep after every tuple
-errinj.set("ERRINJ_RELAY_TIMEOUT", 2.0)
+-- Insert values on the master while replica is stopped and can't
+-- fetch them.
+errinj.set("ERRINJ_RELAY_STOP_SEND", true)
+for i = 1, 100 do s:insert{i, 'this is test message12345'} end
 
 test_run:cmd("start server replica with args='0.01'")
 test_run:cmd("switch replica")
--- After stop server got error variable 'fiber' is not declared
-fiber = require('fiber')
 
 -- Check that replica doesn't enter read-write mode before
--- catching up with the master: to check that we inject sleep into
--- the master relay_send function and attempt a data modifying
--- statement in replica while it's still fetching data from the
--- master.
--- In the next two cases we try to delete a tuple while replica is
--- catching up with the master (local delete, remote delete) case
+-- catching up with the master: to check that we stop sending
+-- rows on the master in relay_send function and attempt a data
+-- modifying statement in replica while it's still fetching data
+-- from the master.
 --
--- #1: delete tuple on replica
+-- In the next two cases we try to replace a tuple while replica
+-- is catching up with the master (local replace, remote replace)
+-- case.
+--
+-- Case #1: replace tuple on replica locally.
 --
 box.space.test ~= nil
-repeat fiber.sleep(0.001) until box.space.test:get(1) ~= nil
-d = box.space.test:delete{1}
+box.space.test:replace{1}
 
--- case #2: delete tuple by net.box
+-- Case #2: replace tuple on replica by net.box.
 
 test_run:cmd("switch default")
 test_run:cmd("set variable r_uri to 'replica.listen'")
 c = net_box.connect(r_uri)
-repeat fiber.sleep(0.001) until c.space.test:get(1) ~= nil
-d = c.space.test:delete{1}
-
--- check sync
-errinj.set("ERRINJ_RELAY_TIMEOUT", 0)
-fiber.sleep(2.0) -- wait until release errinj sleep
+c.space.test:replace{1}
 
+-- Resume replicaton.
+errinj.set("ERRINJ_RELAY_STOP_SEND", false)
 
--- cleanup
+-- Cleanup.
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 test_run:cleanup_cluster()
 box.space.test:drop()
 box.schema.user.revoke('guest', 'replication')
-



More information about the Tarantool-patches mailing list