Tarantool development patches archive
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Olga Arkhangelskaia <krishtal.olja@gmail.com>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] [PATCH 2/3] box: add replication_sync_lag_timeout
Date: Thu, 30 Aug 2018 13:02:23 +0300
Message-ID: <20180830100223.dsax42fgvqocfplh@esperanza>
In-Reply-To: <20180829185642.49479-2-krishtal.olja@gmail.com>

On Wed, Aug 29, 2018 at 09:56:41PM +0300, Olga Arkhangelskaia wrote:
> In scope of gh-3427 we need a timeout in case the replica set waits for
> synchronization for too long, or even forever. Default value is TIMEOUT_INFINITY.
> 
> @TarantoolBot document
> Title: Introduce new option replication_sync_lag_timeout.
> After initial bootstrap or after replication configuration changes we
> need to sync up with replication quorum. Sometimes sync can take too
> long, or replication_sync_lag can be smaller than network latency, so
> the replica gets stuck in a sync loop that can't be cancelled. To avoid
> this situation, replication_sync_lag_timeout can be used. When the time
> set in replication_sync_lag_timeout has passed, the replica enters the
> orphan state. Can be set dynamically. Default value is TIMEOUT_INFINITY.

The option should be called replication_sync_timeout, not
replication_sync_lag_timeout.

Also, it should probably have a reasonable default, not infinity.
Please consult with Georgy and Kostja about it.

> 
> Closes #3674

'Closes ####' should go before the TarantoolBot documentation request;
otherwise it will be included in the documentation ticket.

> ---
> https://github.com/tarantool/tarantool/issues/3647
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3427-replication-no-sync-1.9
> 
>  src/box/box.cc            | 19 +++++++++++++++++++
>  src/box/box.h             |  1 +
>  src/box/lua/cfg.cc        | 12 ++++++++++++
>  src/box/lua/load_cfg.lua  |  4 ++++
>  src/box/replication.cc    | 14 ++++++++++----
>  src/box/replication.h     |  6 ++++++
>  test/box-tap/cfg.test.lua |  9 ++++++++-
>  test/box/admin.result     |  2 ++
>  test/box/cfg.result       |  4 ++++
>  9 files changed, 66 insertions(+), 5 deletions(-)

app-tap/init_script.test.lua doesn't pass with this patch. You should
make sure that all tests pass before submitting a patch for review.

> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 7155ad085..0f8364ebc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -420,6 +420,17 @@ box_check_replication_sync_lag(void)
>  	return lag;
>  }
>  
> +static double
> +box_check_replication_sync_lag_timeout(void)
> +{
> +	double timeout = cfg_getd_default("replication_sync_lag_timeout", TIMEOUT_INFINITY);

Nit: this line is too long.

> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 861ce34ea..731b05faf 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -49,7 +49,7 @@ double replication_timeout = 1.0; /* seconds */
>  double replication_connect_timeout = 30.0; /* seconds */
>  int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>  double replication_sync_lag = 10.0; /* seconds */
> -
> +double replication_sync_lag_timeout = TIMEOUT_INFINITY;
>  struct replicaset replicaset;
>  
>  static int
> @@ -673,12 +673,18 @@ replicaset_sync(void)
>  
>  	/*
>  	 * Wait until all connected replicas synchronize up to
> -	 * replication_sync_lag
> +	 * replication_sync_lag or return on replication_sync_lag_timeout
>  	 */
>  	while (replicaset.applier.synced < quorum &&
>  	       replicaset.applier.connected +
> -	       replicaset.applier.loading >= quorum)
> -		fiber_cond_wait(&replicaset.applier.cond);
> +	       replicaset.applier.loading >= quorum) {
> +		if (fiber_cond_wait_timeout(&replicaset.applier.cond,
> +				            replication_sync_lag_timeout) != 0) {

This is incorrect because the fiber can be woken up spuriously, and
every wakeup restarts the full timeout. You should use
fiber_cond_wait_deadline() instead.
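
To illustrate the difference, here is a minimal, self-contained sketch.
It does not use the real fiber API: fake_now, mock_cond_wait_timeout()
and mock_cond_wait_deadline() are hypothetical stand-ins that model a
wait which is always woken up spuriously after one second. The point is
only that a relative timeout restarted on each iteration never bounds
the total wait, while a deadline computed once before the loop does.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for the fiber runtime, not Tarantool code:
 * a fake monotonic clock, and a wait that is always woken up
 * spuriously after 1 second unless the timeout expires first.
 */
static double fake_now = 0.0;

/* Returns 0 if woken up before the timeout, -1 if the timeout expired. */
static int
mock_cond_wait_timeout(double timeout)
{
	const double wakeup_after = 1.0;
	if (timeout <= wakeup_after) {
		fake_now += timeout > 0 ? timeout : 0;
		return -1;
	}
	fake_now += wakeup_after;
	return 0;
}

/* Same wait, but bounded by an absolute point on the fake clock. */
static int
mock_cond_wait_deadline(double deadline)
{
	return mock_cond_wait_timeout(deadline - fake_now);
}

/*
 * Relative timeout restarted on every iteration. The awaited condition
 * is never satisfied here, so only max_wakeups keeps the demo finite.
 * Returns the total time spent waiting.
 */
static double
wait_with_relative_timeout(double timeout, int max_wakeups)
{
	assert(max_wakeups > 0);
	double start = fake_now;
	for (int i = 0; i < max_wakeups; i++) {
		if (mock_cond_wait_timeout(timeout) != 0)
			break;
	}
	return fake_now - start;
}

/*
 * Deadline computed once before the loop: spurious wakeups no longer
 * extend the total wait. Returns the total time spent waiting.
 */
static double
wait_with_deadline(double timeout)
{
	double start = fake_now;
	double deadline = fake_now + timeout;
	while (mock_cond_wait_deadline(deadline) == 0)
		;
	return fake_now - start;
}
```

With a 5 second timeout, wait_with_relative_timeout(5.0, 100) is still
waiting after 100 wakeups (100 seconds on the fake clock), whereas
wait_with_deadline(5.0) gives up after exactly 5 seconds.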

> +			say_crit("replication_sync_lag_timeout fired, entering orphan mode");
> +			return;

This message is not needed; the one below is enough. You can replace
these two lines with 'break'.
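
A rough model of the resulting control flow, as a sketch only:
struct sync_model and sync_model_run() are hypothetical names, and the
assumption that the check after the loop both logs the failure and
switches to orphan mode is inferred from the quoted patch, not taken
from the actual code. With 'break', a timeout falls through to that
single check instead of being reported in a second place.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the sync state; not Tarantool code. */
struct sync_model {
	int synced;	/* replicas that have caught up */
	int quorum;	/* replicas required to leave the loop */
	int messages;	/* how many failure messages were logged */
	bool orphan;	/* whether the instance ended up in orphan mode */
};

/*
 * wakeups_before_timeout models how many replicas manage to sync
 * before the wait times out.
 */
static void
sync_model_run(struct sync_model *m, int wakeups_before_timeout)
{
	assert(m->quorum >= 0);
	while (m->synced < m->quorum) {
		if (wakeups_before_timeout-- <= 0)
			break;	/* timed out: no logging, no early return */
		m->synced++;
	}
	/* The only place that reports the failure and changes state. */
	if (m->synced < m->quorum) {
		m->messages++;
		m->orphan = true;
		return;
	}
	m->orphan = false;
}
```

In this model a timeout and a lost quorum take the same path after the
loop, so exactly one message is produced in either case.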

> +		}
> +
> +	}
>  
>  	if (replicaset.applier.synced < quorum) {
>  		/*
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 06a2867b6..71c17dc8e 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -126,6 +126,12 @@ extern int replication_connect_quorum;
>   */
>  extern double replication_sync_lag;
>  
> +/**
> + * Time to wait before enter orphan state in case of unsuccessful
> + * synchronization.
> + */
> +extern double replication_sync_lag_timeout;
> +
>  /**
>   * Wait for the given period of time before trying to reconnect
>   * to a master.
> diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
> index d315346de..dd883a020 100755
> --- a/test/box-tap/cfg.test.lua
> +++ b/test/box-tap/cfg.test.lua
> @@ -6,7 +6,7 @@ local socket = require('socket')
>  local fio = require('fio')
>  local uuid = require('uuid')
>  local msgpack = require('msgpack')
> -test:plan(91)
> +test:plan(94)
>  
>  --------------------------------------------------------------------------------
>  -- Invalid values
> @@ -29,6 +29,8 @@ invalid('replication_timeout', -1)
>  invalid('replication_timeout', 0)
>  invalid('replication_sync_lag', -1)
>  invalid('replication_sync_lag', 0)
> +invalid('replication_sync_lag_timeout', -1)
> +invalid('replication_sync_lag_timeout', 0)
>  invalid('replication_connect_timeout', -1)
>  invalid('replication_connect_timeout', 0)
>  invalid('replication_connect_quorum', -1)
> @@ -100,6 +102,11 @@ status, result = pcall(box.cfg, {replication_sync_lag = 1})
>  test:ok(status, "dynamic replication_sync_lag")
>  pcall(box.cfg, {repliction_sync_lag = lag})
>  
> +timeout = box.cfg.replication_sync_lag_timeout
> +status, result = pcall(box.cfg, {replication_sync_lag_timeout = 10})
> +test:ok(status, "dynamic replication_sync_lag_timeout")
> +pcall(box.cfg, {repliction_sync_lag_timeout = timeout})

I assume you will add a test that checks that this option actually
works in the next patch.


Thread overview: 6+ messages
2018-08-29 18:56 [tarantool-patches] [PATCH v2 1/3] box: make replication_sync_lag option dynamic Olga Arkhangelskaia
2018-08-29 18:56 ` [tarantool-patches] [PATCH 2/3] box: add replication_sync_lag_timeout Olga Arkhangelskaia
2018-08-30 10:02   ` Vladimir Davydov [this message]
2018-08-29 18:56 ` [tarantool-patches] [PATCH v5 3/3] box: adds replication sync after cfg. update Olga Arkhangelskaia
2018-08-30 10:11   ` Vladimir Davydov
2018-08-30  9:48 ` [tarantool-patches] [PATCH v2 1/3] box: make replication_sync_lag option dynamic Vladimir Davydov
