[tarantool-patches] [PATCH 2/3] box: add replication_sync_lag_timeout
Vladimir Davydov
vdavydov.dev at gmail.com
Thu Aug 30 13:02:23 MSK 2018
On Wed, Aug 29, 2018 at 09:56:41PM +0300, Olga Arkhangelskaia wrote:
> In scope of gh-3427 we need a timeout in case the replica set waits for
> synchronization for too long, or even forever. Default value is TIMEOUT_INFINITY.
>
> @TarantoolBot document
> Title: Introduce new option replication_sync_lag_timeout.
> After initial bootstrap or after replication configuration changes we
> need to sync up with the replication quorum. Sometimes the sync can take
> too long, or replication_sync_lag can be smaller than network latency, so
> the replica gets stuck in a sync loop that can't be cancelled. To avoid
> such situations, replication_sync_lag_timeout can be used. When the time
> set in replication_sync_lag_timeout has passed, the replica enters the
> orphan state. Can be set dynamically. Default value is TIMEOUT_INFINITY.
The option should be called replication_sync_timeout, not
replication_sync_lag_timeout.
Also, it should probably have a reasonable default, not infinity.
Please consult with Georgy and Kostja about it.
>
> Closes #3674
'Closes ####' should go before the TarantoolBot documentation request,
otherwise it will be included in the documentation ticket.
> ---
> https://github.com/tarantool/tarantool/issues/3647
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3427-replication-no-sync-1.9
>
> src/box/box.cc | 19 +++++++++++++++++++
> src/box/box.h | 1 +
> src/box/lua/cfg.cc | 12 ++++++++++++
> src/box/lua/load_cfg.lua | 4 ++++
> src/box/replication.cc | 14 ++++++++++----
> src/box/replication.h | 6 ++++++
> test/box-tap/cfg.test.lua | 9 ++++++++-
> test/box/admin.result | 2 ++
> test/box/cfg.result | 4 ++++
> 9 files changed, 66 insertions(+), 5 deletions(-)
app-tap/init_script.test.lua doesn't pass with this patch. You should
make sure that all tests pass before submitting a patch for review.
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 7155ad085..0f8364ebc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -420,6 +420,17 @@ box_check_replication_sync_lag(void)
> return lag;
> }
>
> +static double
> +box_check_replication_sync_lag_timeout(void)
> +{
> + double timeout = cfg_getd_default("replication_sync_lag_timeout", TIMEOUT_INFINITY);
Nit: this line is too long.
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 861ce34ea..731b05faf 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -49,7 +49,7 @@ double replication_timeout = 1.0; /* seconds */
> double replication_connect_timeout = 30.0; /* seconds */
> int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
> double replication_sync_lag = 10.0; /* seconds */
> -
> +double replication_sync_lag_timeout = TIMEOUT_INFINITY;
> struct replicaset replicaset;
>
> static int
> @@ -673,12 +673,18 @@ replicaset_sync(void)
>
> /*
> * Wait until all connected replicas synchronize up to
> - * replication_sync_lag
> + * replication_sync_lag or return on replication_sync_lag_timeout
> */
> while (replicaset.applier.synced < quorum &&
> replicaset.applier.connected +
> - replicaset.applier.loading >= quorum)
> - fiber_cond_wait(&replicaset.applier.cond);
> + replicaset.applier.loading >= quorum) {
> + if (fiber_cond_wait_timeout(&replicaset.applier.cond,
> + replication_sync_lag_timeout) != 0) {
This is incorrect, because the fiber can be woken up spuriously.
You should use fiber_cond_wait_deadline() instead.
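To illustrate the point, here is a minimal, self-contained sketch in plain C.
It does not call any Tarantool API: the clock, the wait function and all names
below are hypothetical stand-ins. It only models a waiter that is woken
spuriously once per second while the condition is never signalled. A relative
timeout restarted on every loop iteration never fires, while a deadline
computed once before the loop does.

```c
/* Simulated monotonic clock in seconds (hypothetical, not a real API). */
static double fake_now = 0.0;

/*
 * Hypothetical stand-in for a timed condition wait. The condition is
 * never signalled, but the waiter is woken spuriously after
 * spurious_period seconds. Returns 0 on wakeup and -1 if the timeout
 * expires first.
 */
static int
fake_cond_wait_timeout(double timeout, double spurious_period)
{
	if (timeout <= spurious_period) {
		if (timeout > 0)
			fake_now += timeout;
		return -1;
	}
	fake_now += spurious_period;
	return 0;
}

/*
 * Problematic pattern: the same relative timeout is passed on every
 * iteration, so each spurious wakeup restarts the countdown.
 * Returns the simulated time at which the loop ended; max_iters
 * bounds the loop so the sketch terminates.
 */
static double
wait_with_relative_timeout(double timeout, double spurious_period,
			   int max_iters)
{
	fake_now = 0.0;
	for (int i = 0; i < max_iters; i++) {
		if (fake_cond_wait_timeout(timeout, spurious_period) != 0)
			break;
	}
	return fake_now;
}

/*
 * Deadline pattern: the absolute deadline is computed once, and each
 * iteration waits only for the time that is still left.
 */
static double
wait_with_deadline(double timeout, double spurious_period, int max_iters)
{
	fake_now = 0.0;
	double deadline = fake_now + timeout;
	for (int i = 0; i < max_iters; i++) {
		if (fake_cond_wait_timeout(deadline - fake_now,
					   spurious_period) != 0)
			break;
	}
	return fake_now;
}
```

With a 5 second timeout and a wakeup every second, the first function keeps
looping until max_iters is exhausted, while the second one stops after 5
simulated seconds. A deadline-based wait gives the same effect in the real
loop without the manual bookkeeping.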
> + say_crit("replication_sync_lag_timeout fired, entering orphan mode");
> + return;
There is no need for this message. The message below is enough, so you can
replace these two lines with 'break'.
> + }
> +
> + }
>
> if (replicaset.applier.synced < quorum) {
> /*
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 06a2867b6..71c17dc8e 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -126,6 +126,12 @@ extern int replication_connect_quorum;
> */
> extern double replication_sync_lag;
>
> +/**
> + * Time to wait before enter orphan state in case of unsuccessful
> + * synchronization.
> + */
> +extern double replication_sync_lag_timeout;
> +
> /**
> * Wait for the given period of time before trying to reconnect
> * to a master.
> diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
> index d315346de..dd883a020 100755
> --- a/test/box-tap/cfg.test.lua
> +++ b/test/box-tap/cfg.test.lua
> @@ -6,7 +6,7 @@ local socket = require('socket')
> local fio = require('fio')
> local uuid = require('uuid')
> local msgpack = require('msgpack')
> -test:plan(91)
> +test:plan(94)
>
> --------------------------------------------------------------------------------
> -- Invalid values
> @@ -29,6 +29,8 @@ invalid('replication_timeout', -1)
> invalid('replication_timeout', 0)
> invalid('replication_sync_lag', -1)
> invalid('replication_sync_lag', 0)
> +invalid('replication_sync_lag_timeout', -1)
> +invalid('replication_sync_lag_timeout', 0)
> invalid('replication_connect_timeout', -1)
> invalid('replication_connect_timeout', 0)
> invalid('replication_connect_quorum', -1)
> @@ -100,6 +102,11 @@ status, result = pcall(box.cfg, {replication_sync_lag = 1})
> test:ok(status, "dynamic replication_sync_lag")
> pcall(box.cfg, {repliction_sync_lag = lag})
>
> +timeout = box.cfg.replication_sync_lag_timeout
> +status, result = pcall(box.cfg, {replication_sync_lag_timeout = 10})
> +test:ok(status, "dynamic replication_sync_lag_timeout")
> +pcall(box.cfg, {repliction_sync_lag_timeout = timeout})
I assume you will add a test that checks that this option actually works
in the next patch.