[tarantool-patches] [PATCH 2/3] box: add replication_sync_lag_timeout

Vladimir Davydov vdavydov.dev at gmail.com
Thu Aug 30 13:02:23 MSK 2018


On Wed, Aug 29, 2018 at 09:56:41PM +0300, Olga Arkhangelskaia wrote:
> In scope of gh-3427 we need timeout in case if replicaset will wait for
> synchronization for too long, or even forever. Default value is TIMEOUT_INFINITY.
> 
> @TarantoolBot document
> Title: Introduce new option replication_sync_lag_timeout.
> After initial bootstrap or after replication configuration changes we
> need to sync up with replication quorum. Sometimes sync can take too
> long or replication_sync_lag can be smaller than network latency we
> replica will stuck in sync loop that can't be cancelled.To avoid this
> situations replication_sync_lag_timeout can be used. When time set in
> replication_sync_lag_timeout is passed replica enters orphan state.
> Can be set dynamically. Default value is TIMEOUT_INFINITY.

The option should be called replication_sync_timeout, not
replication_sync_lag_timeout.

Also, it should probably have a reasonable default, not infinity.
Please consult with Georgy and Kostja about it.

> 
> Closes #3674

'Closes ####' should go before the TarantoolBot documentation request;
otherwise it will be included in the documentation ticket.

> ---
> https://github.com/tarantool/tarantool/issues/3647
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3427-replication-no-sync-1.9
> 
>  src/box/box.cc            | 19 +++++++++++++++++++
>  src/box/box.h             |  1 +
>  src/box/lua/cfg.cc        | 12 ++++++++++++
>  src/box/lua/load_cfg.lua  |  4 ++++
>  src/box/replication.cc    | 14 ++++++++++----
>  src/box/replication.h     |  6 ++++++
>  test/box-tap/cfg.test.lua |  9 ++++++++-
>  test/box/admin.result     |  2 ++
>  test/box/cfg.result       |  4 ++++
>  9 files changed, 66 insertions(+), 5 deletions(-)

app-tap/init_script.test.lua doesn't pass with this patch. You should
make sure that all tests pass before submitting a patch for review.

> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 7155ad085..0f8364ebc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -420,6 +420,17 @@ box_check_replication_sync_lag(void)
>  	return lag;
>  }
>  
> +static double
> +box_check_replication_sync_lag_timeout(void)
> +{
> +	double timeout = cfg_getd_default("replication_sync_lag_timeout", TIMEOUT_INFINITY);

Nit: this line is too long.

> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 861ce34ea..731b05faf 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -49,7 +49,7 @@ double replication_timeout = 1.0; /* seconds */
>  double replication_connect_timeout = 30.0; /* seconds */
>  int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>  double replication_sync_lag = 10.0; /* seconds */
> -
> +double replication_sync_lag_timeout = TIMEOUT_INFINITY;
>  struct replicaset replicaset;
>  
>  static int
> @@ -673,12 +673,18 @@ replicaset_sync(void)
>  
>  	/*
>  	 * Wait until all connected replicas synchronize up to
> -	 * replication_sync_lag
> +	 * replication_sync_lag or return on replication_sync_lag_timeout
>  	 */
>  	while (replicaset.applier.synced < quorum &&
>  	       replicaset.applier.connected +
> -	       replicaset.applier.loading >= quorum)
> -		fiber_cond_wait(&replicaset.applier.cond);
> +	       replicaset.applier.loading >= quorum) {
> +		if (fiber_cond_wait_timeout(&replicaset.applier.cond,
> +				            replication_sync_lag_timeout) != 0) {

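This is incorrect, because the fiber can be woken up spuriously, in
which case the loop starts waiting again with the full timeout, so the
total wait can exceed replication_sync_lag_timeout. You should use
fiber_cond_wait_deadline() instead.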
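
To illustrate the difference, here is a minimal standalone sketch. It
does not use the real fiber API: fake_now, fake_wait_timeout(),
fake_wait_deadline() and the two sync_*() helpers are hypothetical names
made up for this example, and the "spurious wakeup every second"
behaviour is an assumption chosen to make the effect visible.

```c
#include <assert.h>

/* Hypothetical fake clock, in seconds. Not part of any real API. */
static double fake_now = 0.0;

/*
 * Hypothetical wait primitive. Assumption for this sketch: the
 * waiter is woken up spuriously after 1 second unless the timeout
 * expires first. Returns 0 on wakeup, -1 on timeout.
 */
static int
fake_wait_timeout(double timeout)
{
	if (timeout <= 1.0) {
		if (timeout > 0)
			fake_now += timeout;
		return -1;
	}
	fake_now += 1.0;
	return 0;
}

/* Same as above, but takes an absolute point on the fake clock. */
static int
fake_wait_deadline(double deadline)
{
	return fake_wait_timeout(deadline - fake_now);
}

/*
 * Relative timeout restarted on every iteration: each spurious
 * wakeup resets the countdown. Returns the total time waited.
 */
static double
sync_with_relative_timeout(double timeout, int max_iters)
{
	assert(max_iters > 0);
	fake_now = 0.0;
	for (int i = 0; i < max_iters; i++) {
		if (fake_wait_timeout(timeout) != 0)
			break;
	}
	return fake_now;
}

/*
 * Deadline computed once before the loop: spurious wakeups only
 * shrink the remaining time. Returns the total time waited.
 */
static double
sync_with_deadline(double timeout, int max_iters)
{
	assert(max_iters > 0);
	fake_now = 0.0;
	double deadline = fake_now + timeout;
	for (int i = 0; i < max_iters; i++) {
		if (fake_wait_deadline(deadline) != 0)
			break;
	}
	return fake_now;
}
```

With a 3 second timeout and the iteration cap set to 10, the first
variant waits until the cap is hit, while the second one stops at the
deadline. In the patch the same idea would mean computing the deadline
once before the while loop and passing it to fiber_cond_wait_deadline().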

> +			say_crit("replication_sync_lag_timeout fired, entering orphan mode");
> +			return;

There is no need for this message. The message below is enough, so you
can replace these two lines with 'break'.

> +		}
> +
> +	}
>  
>  	if (replicaset.applier.synced < quorum) {
>  		/*
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 06a2867b6..71c17dc8e 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -126,6 +126,12 @@ extern int replication_connect_quorum;
>   */
>  extern double replication_sync_lag;
>  
> +/**
> + * Time to wait before enter orphan state in case of unsuccessful
> + * synchronization.
> + */
> +extern double replication_sync_lag_timeout;
> +
>  /**
>   * Wait for the given period of time before trying to reconnect
>   * to a master.
> diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
> index d315346de..dd883a020 100755
> --- a/test/box-tap/cfg.test.lua
> +++ b/test/box-tap/cfg.test.lua
> @@ -6,7 +6,7 @@ local socket = require('socket')
>  local fio = require('fio')
>  local uuid = require('uuid')
>  local msgpack = require('msgpack')
> -test:plan(91)
> +test:plan(94)
>  
>  --------------------------------------------------------------------------------
>  -- Invalid values
> @@ -29,6 +29,8 @@ invalid('replication_timeout', -1)
>  invalid('replication_timeout', 0)
>  invalid('replication_sync_lag', -1)
>  invalid('replication_sync_lag', 0)
> +invalid('replication_sync_lag_timeout', -1)
> +invalid('replication_sync_lag_timeout', 0)
>  invalid('replication_connect_timeout', -1)
>  invalid('replication_connect_timeout', 0)
>  invalid('replication_connect_quorum', -1)
> @@ -100,6 +102,11 @@ status, result = pcall(box.cfg, {replication_sync_lag = 1})
>  test:ok(status, "dynamic replication_sync_lag")
>  pcall(box.cfg, {repliction_sync_lag = lag})
>  
> +timeout = box.cfg.replication_sync_lag_timeout
> +status, result = pcall(box.cfg, {replication_sync_lag_timeout = 10})
> +test:ok(status, "dynamic replication_sync_lag_timeout")
> +pcall(box.cfg, {repliction_sync_lag_timeout = timeout})

I assume that in the next patch you add a test that checks that this
option actually works.


