From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Date: Thu, 9 Aug 2018 12:11:01 +0300 From: Vladimir Davydov Subject: Re: [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum. Message-ID: <20180809091101.hzsxrym3zgndgmgz@esperanza> References: <20180807124413.14947-1-sergepetrenko@tarantool.org> <20180807172847.z7h7476jedxktwno@esperanza> <34F1BCA9-2C61-4C3A-939D-F90E745B349D@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <34F1BCA9-2C61-4C3A-939D-F90E745B349D@tarantool.org> To: Sergey Petrenko Cc: Georgy Kirichenko , tarantool-patches@freelists.org List-ID: On Thu, Aug 09, 2018 at 10:50:46AM +0300, Sergey Petrenko wrote: > >> @@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int count, > >> double wait_start = ev_monotonic_now(loop()); > >> if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0) > >> break; > >> - if (state.failed > 0 && connect_all) > >> - break; > > > > I guess you should break the loop if > > > > state.failed > count - min_count_to_connect > > AFAIU, if replication_timeout is less than replication_connect_timeout, the appliers, which have > failed, will have time to try and reconnect during replicaset_connect(). So failing here is essentially ignoring > replication_connect_timeout. No, replicaset_connect_state.failed is only incremented on unrecoverable errors, when an applier stops - see applier_on_connect_f, applier_f. > >> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua > >> index 0f9f7a6f0..51283efdf 100644 > >> --- a/test/replication-py/master.lua > >> +++ b/test/replication-py/master.lua > >> @@ -3,6 +3,8 @@ os = require('os') > >> box.cfg({ > >> listen = os.getenv("LISTEN"), > >> memtx_memory = 107374182, > >> + replication_connect_timeout = 1.0, > >> + replication_timeout = 0.3 > > > > Why do you need to adjust the timeouts? > > The timeouts set in all cfg files in all the tests had no effect, just like the > replication_connect_quorum option, cos both of the options were ignored > During bootstrap. We wanted every replica to connect, and the timeout was > Set to TIMEOUT_INFINITY. Now when we actually start passing > replication_connect_timeout, all these timeouts become too small. Some tests now take too long because of this. E.g. replication/quorum runs for 15 seconds for memtx+vinyl without this patch, which is already long enough, but with this patch it takes more than 30 seconds. We need to do something about that. > >> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua > >> index 40fef2c7a..7cd921e3c 100644 > >> --- a/test/replication/autobootstrap_guest.lua > >> +++ b/test/replication/autobootstrap_guest.lua > >> @@ -20,7 +20,7 @@ box.cfg({ > >> instance_uri(2); > >> instance_uri(3); > >> }; > >> - replication_connect_timeout = 0.5, > >> + replication_connect_timeout = 5, > > > > Why do you use different timeouts in different tests? > > I tried to find the lowest possible boundary in every test. It happens so that they differ. > I believe, any finite timeout is better that the infinite one. I'd like to see some system behind choosing timeouts. Currently, they look random :-( > diff --git a/src/box/replication.h b/src/box/replication.h > index 95122eb45..9ce9910f8 100644 > --- a/src/box/replication.h > +++ b/src/box/replication.h > @@ -356,13 +356,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid); > * > * \param appliers the array of appliers > * \param count size of appliers array > - * \param timeout connection timeout > - * \param connect_all if this flag is set, fail unless all > - * appliers have successfully connected > + * \param connect_quorum if this flag is set, fail unless at > + * least replication_connect_quorum Malformed indentation: this line should start exactly below the parameter description ('if this flag is set ...') - it simply looks better that way. Please fix. > + * appliers have successfully connected. > */ > void > replicaset_connect(struct applier **appliers, int count, > - double timeout, bool connect_all); > + bool connect_quorum); > > /** > * Resume all appliers registered with the replica set. > diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua > index 463d89a8f..f00b98eed 100644 > --- a/test/replication/errinj.test.lua > +++ b/test/replication/errinj.test.lua > @@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0) > box.cfg{replication_timeout = 0.01} > > test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'") > -test_run:cmd("start server replica_timeout with args='0.01'") > +test_run:cmd("start server replica_timeout with args='0.1, 0.5'") This won't work as expected - arguments are supposed to be delimited by space, not by comma. That is, TIMEOUT will turn into null in the script. Also, I think you should write some tests that check that the quorum is not ignored on bootstrap and on replication reconfiguration anymore.