From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Mon, 27 Aug 2018 15:08:08 +0300 From: Vladimir Davydov Subject: Re: [tarantool-patches] [PATCH v3] box: adds replication sync after cfg. update Message-ID: <20180827120808.ss3oo3weilhbescm@esperanza> References: <20180826082537.38552-1-krishtal.olja@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180826082537.38552-1-krishtal.olja@gmail.com> To: Olga Arkhangelskaia Cc: tarantool-patches@freelists.org List-ID: On Sun, Aug 26, 2018 at 11:25:37AM +0300, Olga Arkhangelskaia wrote: > diff --git a/src/box/box.cc b/src/box/box.cc > index 8d7454d1f..617067ab5 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -634,6 +634,11 @@ box_set_replication(void) > box_sync_replication(true); > /* Follow replica */ > replicaset_follow(); > + /* Sync replica up to quorum */ > + if (!replicaset_sync()) { > + tnt_raise(ClientError, ER_REPLICA_SYNC, cfg_gets("instance_uuid"), > + cfg_gets("replicaset_uuid")); > + } I guess that if synchronization fails, we should rollback to the previous configuration, like we do in case we fail to connect. > } > > void > diff --git a/src/box/errcode.h b/src/box/errcode.h > index 7e3ea1ed1..4059930c0 100644 > --- a/src/box/errcode.h > +++ b/src/box/errcode.h > @@ -207,6 +207,7 @@ struct errcode_record { > /*152 */_(ER_NULLABLE_PRIMARY, "Primary index of the space '%s' can not contain nullable parts") \ > /*153 */_(ER_NULLABLE_MISMATCH, "Field %d is %s in space format, but %s in index parts") \ > /*154 */_(ER_TRANSACTION_YIELD, "Transaction has been aborted by a fiber yield") \ > + /*155 */_(ER_REPLICA_SYNC, "Failed to synchronize replica %s with replicaset %s") \ I don't think we need a separate error code for this - ER_CFG will do. > > /* > * !IMPORTANT! Please follow instructions at start of the file > diff --git a/src/box/replication.cc b/src/box/replication.cc > index 861ce34ea..9ccb13fa2 100644 > --- a/src/box/replication.cc > +++ b/src/box/replication.cc > @@ -661,13 +661,13 @@ replicaset_follow(void) > } > } > > -void > +bool > replicaset_sync(void) > { > int quorum = replicaset_quorum(); > > if (quorum == 0) > - return; > + return true; > > say_verbose("synchronizing with %d replicas", quorum); > > @@ -687,11 +687,12 @@ replicaset_sync(void) > * in 'orphan' state. > */ > say_crit("entering orphan mode"); > - return; > + return false; We don't enter orphan state on replication reconfiguration, instead we throw an error. So you should move this say_crit to box_cfg_xc. > } > > say_crit("replica set sync complete, quorum of %d " > "replicas formed", quorum); > + return true; > } > > void > diff --git a/src/box/replication.h b/src/box/replication.h > index 06a2867b6..d4e6f7e3e 100644 > --- a/src/box/replication.h > +++ b/src/box/replication.h > @@ -373,10 +373,10 @@ replicaset_follow(void); > > /** > * Wait until a replication quorum is formed. > - * Return immediately if a quorum cannot be > - * formed because of errors. > + * @return true in case of success. > + * @return false if a quorum cannot be formed because of errors. > */ > -void > +bool > replicaset_sync(void); > > /** > diff --git a/test/replication/config_change_sync.test.lua b/test/replication/config_change_sync.test.lua I think that you'd better call the test simply replication/sync.test.lua. > new file mode 100644 > index 000000000..b1a71dbaa > --- /dev/null > +++ b/test/replication/config_change_sync.test.lua > @@ -0,0 +1,43 @@ > +-- > +-- gh-3427: no sync after configuration update > +-- > + > +env = require('test_run') > +test_run = env.new() > +engine = test_run:get_cfg('engine') > + > +box.schema.user.grant('guest', 'read,write,execute', 'universe') You don't need this - replication role should be enough. > + > +box.schema.user.grant('guest', 'replication') > +test_run:cmd("create server replica with rpl_master=default, script='replication/replica_orphan.lua'") > +test_run:cmd("start server replica") > + > +repl = test_run:eval('replica', 'return box.cfg.listen')[1] > +box.cfg{replication = repl} Why? You just want to check that the replica doesn't leave box.cfg() until it synchronizes with the master (default) so why do you need to connect the master to the replica? > + > +s = box.schema.space.create('test', {engine = engine}) > +index = s:create_index('primary') > + > +-- change replica configuration > +test_run:cmd("switch replica") > +box.cfg{replication={}} > + > +test_run:cmd("switch default") > +-- insert values on the master while replica is unconfigured > +a = 50000 box.begin() while a > 0 do a = a-1 box.space.test:insert{a,a} end box.commit() That's going to take too long. Please use error injection instead, see ERRINJ_RELAY_TIMEOUT. > + > +test_run:cmd("switch replica") > +box.cfg{replication = os.getenv("MASTER")} Nit: please remember the old configuration before turning off replication and then re-enable it instead: replication = box.cfg.replication ... box.cfg{replication = replication} > +require'fiber'.sleep(0.1) This shouldn't be here, should it? > + > + > +box.info.replication[1].upstream.lag > 0.1 Nit: I expect that an expression like this returns true if everything is right, not false so please change > to <=. > + > +test_run:cmd("switch default") > + > +-- cleanup > +test_run:cmd("stop server replica") > +test_run:cmd("cleanup server replica") > +box.space.test:drop() > +box.schema.user.revoke('guest', 'replication') > +box.schema.user.revoke('guest', 'read,write,execute', 'universe') > diff --git a/test/replication/replica_orphan.lua b/test/replication/replica_orphan.lua > new file mode 100644 > index 000000000..97740d69a > --- /dev/null > +++ b/test/replication/replica_orphan.lua > @@ -0,0 +1,12 @@ > +#!/usr/bin/env tarantool Since box.cfg.replication_sync_lag is now taken into account during replication reconfiguration, it totally makes sense to make this option dynamic. Then you wouldn't need this script and could use replica.lua instead. Please do it in a separate patch. Don't forget to write a documentation request in the commit message. > + > +local TIMEOUT = tonumber(arg[1]) > + > +box.cfg({ > + listen = os.getenv("LISTEN"), > + replication = os.getenv("MASTER"), > + replication_connect_timeout = 0.5, > + replication_sync_lag = 0.01, > +}) > + > +require('console').listen(os.getenv('ADMIN'))