From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 8 Oct 2018 22:07:01 +0300 From: Alexander Turenko Subject: Re: [PATCH 2/4] test: errinj for pause relay_send Message-ID: <20181008190701.jcq2d2uuxvt3sxzi@tkn_work_nb> References: <20181003145057.68820-1-sergw@tarantool.org> <20181005090215.6160-1-sergw@tarantool.org> <20181005090215.6160-3-sergw@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <20181005090215.6160-3-sergw@tarantool.org> To: Sergei Voronezhskii Cc: Vladimir Davydov , tarantool-patches@freelists.org List-ID: On Fri, Oct 05, 2018 at 12:02:13PM +0300, Sergei Voronezhskii wrote: > Instead of using timeout we need just pause `relay_send`. Can't relay Typo: relay on -> rely on. > on timeout because of various system load in parallel mode. Add new > errinj which checks boolean in loop and until it is not `True` do not > pass the method `relay_send` to the next statement. > > Also here we change `delete` to `replace`. And lookup the xlog files Ok, we changed it. Why? The cite from our developers guide [1]: > Explain the problem that this commit is solving. Focus on why you > are making this change as opposed to how (the code explains that). [1]: https://www.tarantool.io/en/doc/1.9/dev_guide/developer_guidelines/#how-to-write-a-commit-message > in loop with a little sleep, until the file count is not as expected. > > Part of #2436, #3232 > --- > src/box/relay.cc | 7 +++++- > src/errinj.h | 1 + > test/replication/catch.result | 44 ++++++++++++++------------------- > test/replication/catch.test.lua | 36 +++++++++++++-------------- > test/replication/gc.result | 18 ++++++-------- > test/replication/gc.test.lua | 16 ++++++------ > 6 files changed, 58 insertions(+), 64 deletions(-) > > diff --git a/src/box/relay.cc b/src/box/relay.cc > index c90383d4a..8618fa81a 100644 > --- a/src/box/relay.cc > +++ b/src/box/relay.cc > @@ -622,12 +622,17 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync, > static void > relay_send(struct relay *relay, struct xrow_header *packet) > { > + struct errinj *inj = errinj(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL); > + while (inj->bparam) { > + fiber_sleep(0.01); > + inj = errinj(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL); > + } Code style: tab should be used instead of spaces. Please, consider [1]. [1]: https://www.tarantool.io/en/doc/1.9/dev_guide/c_style_guide/#chapter-1-indentation > packet->sync = relay->sync; > relay->last_row_tm = ev_monotonic_now(loop()); > coio_write_xrow(&relay->io, packet); > fiber_gc(); > > - struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE); > + inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE); > if (inj != NULL && inj->dparam > 0) > fiber_sleep(inj->dparam); > } > diff --git a/test/replication/catch.test.lua b/test/replication/catch.test.lua > index 217328772..5223e3a24 100644 > --- a/test/replication/catch.test.lua > +++ b/test/replication/catch.test.lua > @@ -13,7 +13,7 @@ test_run:cmd("switch replica") > > test_run:cmd("switch default") > s = box.schema.space.create('test', {engine = engine}); > --- vinyl does not support hash index > +-- Vinyl does not support hash index Are intend to fix the test according to our code style? If so, then period at the end is needed. I commented such things below with 'Nit' (nitpicking) mark. > index = s:create_index('primary', {type = (engine == 'vinyl' and 'tree' or 'hash') }) > > test_run:cmd("switch replica") > @@ -22,41 +22,39 @@ while box.space.test == nil do fiber.sleep(0.01) end > test_run:cmd("switch default") > test_run:cmd("stop server replica") > > --- insert values on the master while replica is stopped and can't fetch them > +-- Insert values on the master while replica is stopped and can't fetch them. Nit: Comments should be 66 chars long at max. > +errinj.set('ERRINJ_RELAY_SEND_DELAY', true) > for i=1,100 do s:insert{i, 'this is test message12345'} end > Nit: for i=1,100 -> for i = 1, 100 > --- sleep after every tuple > -errinj.set("ERRINJ_RELAY_TIMEOUT", 1000.0) > - > test_run:cmd("start server replica with args='0.01'") > test_run:cmd("switch replica") > > -- Check that replica doesn't enter read-write mode before > --- catching up with the master: to check that we inject sleep into > --- the master relay_send function and attempt a data modifying > --- statement in replica while it's still fetching data from the > --- master. > --- In the next two cases we try to delete a tuple while replica is > +-- catching up with the master: to check that we stop sending > +-- rows on the master in relay_send function and attempt a data > +-- modifying statement in replica while it's still fetching data > +-- from the master. > +-- > +-- In the next two cases we try to replace a tuple while replica is Nit: 67 chars. > -- catching up with the master (local delete, remote delete) case > -- > --- #1: delete tuple on replica > +-- Case #1: replace tuple on replica locally. > -- > box.space.test ~= nil > -d = box.space.test:delete{1} > -box.space.test:get(1) ~= nil > +box.space.test:replace{1} > > --- case #2: delete tuple by net.box > +-- Case #2: replace tuple on replica by net.box. > > test_run:cmd("switch default") > test_run:cmd("set variable r_uri to 'replica.listen'") > c = net_box.connect(r_uri) > -d = c.space.test:delete{1} > -c.space.test:get(1) ~= nil > +d = c.space.test:replace{1} > + > +-- Resume replicaton > +errinj.set('ERRINJ_RELAY_SEND_DELAY', false) > Nit: no period at the end. > --- check sync > -errinj.set("ERRINJ_RELAY_TIMEOUT", 0) > Nit: two empty lines instead of one. > --- cleanup > +-- Cleanup Nit: no period at the end. > test_run:cmd("stop server replica") > test_run:cmd("cleanup server replica") Are not we should delete the server replica here? > test_run:cleanup_cluster() Nit: extra empty line at the end. > diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua > index eed76850c..ec3bf6baa 100644 > --- a/test/replication/gc.test.lua > +++ b/test/replication/gc.test.lua We still use ERRINJ_RELAY_TIMEOUT in lines 30-40. I think this should be changed to using delay. Please, ask me or Vova for details about the test case. The commit bae6f037c0df9bcde56611d411bf600341e008b3 was pushed to 1.10 since your commit. You need to rebase your patchset on top of fresh 1.10. > @@ -78,11 +78,11 @@ box.snapshot() > -- xlogs needed by the replica. > box.snapshot() > #box.info.gc().checkpoints == 1 or box.info.gc() > -#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master') > +while #fio.glob('./master/*.xlog') ~= 2 do fiber.sleep(0.01) end > This change was performed for some file checks, but didn't for some other. I think some consistency should be here. Also I propose to wrap it into a function like wait_gc (say, wait_xlog) to reuse the pattern and so make the test more readable.