[patches] Re: [commits] [tarantool] 01/01: Break connection on timeout
Konstantin Osipov
kostja at tarantool.org
Fri Feb 2 23:32:34 MSK 2018
Hi,
I don't see this patch submitted for review in patches at . Is it
ready for review?
* Konstantin Belyavskiy <k.belyavskiy at tarantool.org> [18/02/02 16:57]:
> commit e470923d8142641da7ef7b898d1ead00a439ab0b
> Author: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> AuthorDate: Thu Feb 1 19:17:58 2018 +0300
>
> Break connection on timeout
>
> In replication schema if one of the instances was powered off, it doesn't detected
> by others and the connection hangs. Alive machines show 'follow' state.
> Add timeout to solve this issue. It's safe since applier and relay both send
> messages every replication_timeout so we can assume that if we read nothing we
> have problem with connection.
> Use replication_disconnect_timeout which is replication_timeout * 4 as for now.
Please align around 65 characters.
>
> Closes #3025
> ---
> src/box/applier.cc | 12 +++++++++++-
> src/errinj.h | 1 +
> test/replication/errinj.result | 13 +++++++++++++
> test/replication/errinj.test.lua | 7 +++++++
> 4 files changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index f0073ba..94add65 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -45,6 +45,7 @@
> #include "version.h"
> #include "trigger.h"
> #include "xrow_io.h"
> +#include "errinj.h"
> #include "error.h"
> #include "session.h"
>
> @@ -418,7 +419,16 @@ applier_subscribe(struct applier *applier)
> applier_set_state(applier, APPLIER_FOLLOW);
> }
>
> - coio_read_xrow(coio, ibuf, &row);
> + if (applier->version_id < version_id(1, 7, 7))
> + coio_read_xrow(coio, ibuf, &row);
> + else {
> + double timeout = replication_disconnect_timeout();
> + struct errinj *inj = errinj(ERRINJ_APPLIER_TIMEOUT,
> + ERRINJ_DOUBLE);
> + if (inj != NULL && inj->dparam > 0)
> + timeout = inj->dparam;
> + coio_read_xrow_timeout_xc(coio, ibuf, &row, timeout);
> + }
>
Why do you need to check for version? Because 1.7.7 and lower has
no heartbeats, is this correct? This needs a comment.
> if (iproto_type_is_error(row.type))
> xrow_decode_error_xc(&row); /* error */
> diff --git a/src/errinj.h b/src/errinj.h
> index 512d234..a5296cd 100644
> --- a/src/errinj.h
> +++ b/src/errinj.h
> @@ -105,6 +105,7 @@ struct errinj {
> _(ERRINJ_BUILD_SECONDARY, ERRINJ_INT, {.iparam = -1}) \
> _(ERRINJ_VY_POINT_ITER_WAIT, ERRINJ_BOOL, {.bparam = false}) \
> _(ERRINJ_RELAY_EXIT_DELAY, ERRINJ_DOUBLE, {.dparam = 0}) \
> + _(ERRINJ_APPLIER_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
>
> ENUM0(errinj_id, ERRINJ_LIST);
> extern struct errinj errinjs[];
> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
> index d1f1dbe..3b0b345 100644
> --- a/test/replication/errinj.result
> +++ b/test/replication/errinj.result
> @@ -426,6 +426,19 @@ test_run:cmd("switch replica_ack")
> ---
> - true
> ...
> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
> +-- test case, we should manually set it to new value to allow connection. It should be higher than
> +-- default's replication_timeout (0.1)
> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)
This may prove to be not enough under load. Better to set it to a
very large value, say, 24*60*60
0.0
> +---
> +- ok
> +...
> +fiber = require('fiber')
> +---
> +...
> +fiber.sleep(0.5)
> +---
I'm not ready to pay +0.4 second for this test, so perhaps we need to
set the timeout to an even lower value, say 0.05 seconds.
Please ensure also that you check for status reaching the desired
condition in a sleep-check loop, and not once - otherwise your
test is guaranteed to sporadically fail.
> +...
> box.info.replication[1].upstream.status
> ---
> - follow
> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index ba83481..803b0a4 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -175,8 +175,15 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
> test_run:cmd("create server replica_ack with rpl_master=default, script='replication/replica_ack.lua'")
> test_run:cmd("start server replica_ack")
> test_run:cmd("switch replica_ack")
> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
> +-- test case, we should manually set it to new value to allow connection. It should be higher than
> +-- default's replication_timeout (0.1)
Please align comments around 66 characters, never more than 80.
> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)
> +fiber = require('fiber')
> +fiber.sleep(0.5)
> box.info.replication[1].upstream.status
>
> +
> test_run:cmd("stop server default")
> test_run:cmd("deploy server default")
> test_run:cmd("start server default")
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.org - www.twitter.com/kostja_osipov
More information about the Tarantool-patches
mailing list