From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Thu, 30 Aug 2018 13:02:23 +0300
From: Vladimir Davydov
Subject: Re: [tarantool-patches] [PATCH 2/3] box: add replication_sync_lag_timeout
Message-ID: <20180830100223.dsax42fgvqocfplh@esperanza>
References: <20180829185642.49479-1-krishtal.olja@gmail.com>
 <20180829185642.49479-2-krishtal.olja@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180829185642.49479-2-krishtal.olja@gmail.com>
To: Olga Arkhangelskaia
Cc: tarantool-patches@freelists.org
List-ID:

On Wed, Aug 29, 2018 at 09:56:41PM +0300, Olga Arkhangelskaia wrote:
> In scope of gh-3427 we need timeout in case if replicaset will wait for
> synchronization for too long, or even forever. Default value is TIMEOUT_INFINITY.
>
> @TarantoolBot document
> Title: Introduce new option replication_sync_lag_timeout.
> After initial bootstrap or after replication configuration changes we
> need to sync up with replication quorum. Sometimes sync can take too
> long or replication_sync_lag can be smaller than network latency we
> replica will stuck in sync loop that can't be cancelled.To avoid this
> situations replication_sync_lag_timeout can be used. When time set in
> replication_sync_lag_timeout is passed replica enters orphan state.
> Can be set dynamically. Default value is TIMEOUT_INFINITY.

The option should be called replication_sync_timeout, not
replication_sync_lag_timeout. Also, it should probably have a reasonable
default, not infinity. Please consult with Georgy and Kostja about it.

>
> Closes #3674

'Closes ####' should go before the TarantoolBot documentation request;
otherwise it will be included in the documentation ticket.

> ---
> https://github.com/tarantool/tarantool/issues/3647
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3427-replication-no-sync-1.9
>
>  src/box/box.cc            | 19 +++++++++++++++++++
>  src/box/box.h             |  1 +
>  src/box/lua/cfg.cc        | 12 ++++++++++++
>  src/box/lua/load_cfg.lua  |  4 ++++
>  src/box/replication.cc    | 14 ++++++++++----
>  src/box/replication.h     |  6 ++++++
>  test/box-tap/cfg.test.lua |  9 ++++++++-
>  test/box/admin.result     |  2 ++
>  test/box/cfg.result       |  4 ++++
>  9 files changed, 66 insertions(+), 5 deletions(-)

app-tap/init_script.test.lua doesn't pass with this patch. You should
make sure that all tests pass before submitting a patch for review.

>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 7155ad085..0f8364ebc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -420,6 +420,17 @@ box_check_replication_sync_lag(void)
>  	return lag;
>  }
>
> +static double
> +box_check_replication_sync_lag_timeout(void)
> +{
> +	double timeout = cfg_getd_default("replication_sync_lag_timeout", TIMEOUT_INFINITY);

Nit: this line is too long.

> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 861ce34ea..731b05faf 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -49,7 +49,7 @@ double replication_timeout = 1.0; /* seconds */
>  double replication_connect_timeout = 30.0; /* seconds */
>  int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>  double replication_sync_lag = 10.0; /* seconds */
> -
> +double replication_sync_lag_timeout = TIMEOUT_INFINITY;
>  struct replicaset replicaset;
>
>  static int
> @@ -673,12 +673,18 @@ replicaset_sync(void)
>
>  	/*
>  	 * Wait until all connected replicas synchronize up to
> -	 * replication_sync_lag
> +	 * replication_sync_lag or return on replication_sync_lag_timeout
>  	 */
>  	while (replicaset.applier.synced < quorum &&
>  	       replicaset.applier.connected +
> -	       replicaset.applier.loading >= quorum)
> -		fiber_cond_wait(&replicaset.applier.cond);
> +	       replicaset.applier.loading >= quorum) {
> +		if (fiber_cond_wait_timeout(&replicaset.applier.cond,
> +					    replication_sync_lag_timeout) != 0) {

This is incorrect, because the fiber can be woken up spuriously, and
each wakeup restarts the full timeout. You should use
fiber_cond_wait_deadline() instead.

> +			say_crit("replication_sync_lag_timeout fired, entering orphan mode");
> +			return;

There is no need for this message. The message below is enough.
So you can replace these two lines with 'break'.

> +		}
> +
> +	}
>
>  	if (replicaset.applier.synced < quorum) {
>  		/*
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 06a2867b6..71c17dc8e 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -126,6 +126,12 @@ extern int replication_connect_quorum;
>   */
>  extern double replication_sync_lag;
>
> +/**
> + * Time to wait before enter orphan state in case of unsuccessful
> + * synchronization.
> + */
> +extern double replication_sync_lag_timeout;
> +
>  /**
>   * Wait for the given period of time before trying to reconnect
>   * to a master.
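
To illustrate the spurious-wakeup remark on replicaset_sync() above, here
is a minimal self-contained sketch. It deliberately does not call the
Tarantool fiber API: fake_clock, fake_reset(), fake_cond_wait_timeout(),
wait_restarting_timeout() and wait_with_deadline() are all hypothetical
names made up for this example, and the wait is simulated with a fake
clock. It only shows why a loop that passes the full timeout on every
iteration can overshoot, while a loop that computes the deadline once
cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Everything below is a simulation; none of these names exist in Tarantool. */

/* Fake time, in seconds, that one spurious wakeup consumes. */
#define SPURIOUS_DELAY 0.5

static double fake_clock;          /* fake monotonic clock, seconds */
static int spurious_wakeups_left;  /* wakeups still to be delivered */

static void
fake_reset(int wakeups)
{
	fake_clock = 0.0;
	spurious_wakeups_left = wakeups;
}

/*
 * Stand-in for a condition wait with a relative timeout.
 * Returns 0 on a (spurious) wakeup, -1 when the timeout expires.
 */
static int
fake_cond_wait_timeout(double timeout)
{
	assert(timeout >= 0);
	if (spurious_wakeups_left > 0 && timeout > SPURIOUS_DELAY) {
		spurious_wakeups_left--;
		fake_clock += SPURIOUS_DELAY;
		return 0;
	}
	fake_clock += timeout;
	return -1;
}

/* Restarts the full timeout after every wakeup. Returns elapsed time. */
static double
wait_restarting_timeout(double timeout, const bool *synced)
{
	double start = fake_clock;
	while (!*synced) {
		if (fake_cond_wait_timeout(timeout) != 0)
			break;
	}
	return fake_clock - start;
}

/* Fixes the deadline once; each wait gets only the remaining time. */
static double
wait_with_deadline(double timeout, const bool *synced)
{
	double start = fake_clock;
	double deadline = start + timeout;
	while (!*synced) {
		double remaining = deadline - fake_clock;
		if (remaining <= 0 ||
		    fake_cond_wait_timeout(remaining) != 0)
			break;
	}
	return fake_clock - start;
}
```

With a 1 second timeout and three simulated spurious wakeups, the first
loop keeps waiting well past the requested second, while the second one
stops at the deadline. Note that both loops leave with 'break', so any
message about the failed sync can be printed once after the loop.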
> diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
> index d315346de..dd883a020 100755
> --- a/test/box-tap/cfg.test.lua
> +++ b/test/box-tap/cfg.test.lua
> @@ -6,7 +6,7 @@ local socket = require('socket')
>  local fio = require('fio')
>  local uuid = require('uuid')
>  local msgpack = require('msgpack')
> -test:plan(91)
> +test:plan(94)
>
>  --------------------------------------------------------------------------------
>  -- Invalid values
> @@ -29,6 +29,8 @@ invalid('replication_timeout', -1)
>  invalid('replication_timeout', 0)
>  invalid('replication_sync_lag', -1)
>  invalid('replication_sync_lag', 0)
> +invalid('replication_sync_lag_timeout', -1)
> +invalid('replication_sync_lag_timeout', 0)
>  invalid('replication_connect_timeout', -1)
>  invalid('replication_connect_timeout', 0)
>  invalid('replication_connect_quorum', -1)
> @@ -100,6 +102,11 @@ status, result = pcall(box.cfg, {replication_sync_lag = 1})
>  test:ok(status, "dynamic replication_sync_lag")
>  pcall(box.cfg, {repliction_sync_lag = lag})
>
> +timeout = box.cfg.replication_sync_lag_timeout
> +status, result = pcall(box.cfg, {replication_sync_lag_timeout = 10})
> +test:ok(status, "dynamic replication_sync_lag_timeout")
> +pcall(box.cfg, {repliction_sync_lag_timeout = timeout})

I assume you will add a test in the next patch that checks that this
option actually works.