[PATCH] Break connection on timeout

Vladimir Davydov vdavydov.dev at gmail.com
Thu Feb 8 20:05:00 MSK 2018


On Thu, Feb 08, 2018 at 07:42:46PM +0300, Konstantin Belyavskiy wrote:
> In replication schema if one of the instances was powered off, it doesn't detected
> by others and the connection hangs. Alive machines show 'follow' state.
> Add timeout to solve this issue. It's safe since applier and relay both send
> messages every replication_timeout so we can assume that if we read nothing we
> have problem with connection.
> Use replication_disconnect_timeout which is replication_timeout * 4 as for now.

Personally, I don't think we need the new option for now, hardcocded
replication_timeout * 4 (see replication_disconnect_timeout() helper)
will do just fine IMHO. If anybody asks for it (nobody has yet), we can
introduce it anytime later. To simulate the server freeze in the test,
you can use the existing ERRINJ_RELAY_REPORT_INTERVAL.

Anyway, if you do think the option is absolutely necessary, let's do one
step at a time - first fix the actual bug, only then add the new option
in a separate patch.

> 
> Closes #3025
> ---
>  src/box/applier.cc               |  7 ++++++-
>  src/box/box.cc                   | 23 ++++++++++++++++++++-
>  src/box/box.h                    |  1 +
>  src/box/lua/cfg.cc               | 13 ++++++++++++
>  src/box/lua/load_cfg.lua         |  4 ++++
>  src/box/relay.cc                 |  2 +-
>  src/box/replication.cc           |  1 +
>  src/box/replication.h            |  7 ++-----
>  test/app-tap/init_script.result  | 43 ++++++++++++++++++++--------------------
>  test/box/admin.result            |  2 ++
>  test/box/cfg.result              |  4 ++++
>  test/replication/errinj.result   | 13 ++++++++++++
>  test/replication/errinj.test.lua |  7 +++++++
>  13 files changed, 98 insertions(+), 29 deletions(-)
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index f0073bada..bc1f8fc5e 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -418,7 +418,12 @@ applier_subscribe(struct applier *applier)
>  			applier_set_state(applier, APPLIER_FOLLOW);
>  		}
>  
> -		coio_read_xrow(coio, ibuf, &row);
> +		if (applier->version_id < version_id(1, 7, 7))

A comment why you need to check a version here is a must.

> +			coio_read_xrow(coio, ibuf, &row);
> +		else {
> +			coio_read_xrow_timeout_xc(coio, ibuf, &row,
> +						  replication_disconnect_timeout);
> +		}

> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
> index d1f1dbe91..d0121e209 100644
> --- a/test/replication/errinj.result
> +++ b/test/replication/errinj.result
> @@ -426,6 +426,19 @@ test_run:cmd("switch replica_ack")
>  ---
>  - true
>  ...
> +-- applier connection timeout is by default proportional to replication_timeout, so since it's very low
> +-- for this test case, we should manually set it to new value to allow connection. It should be higher
> +-- than master's replication_timeout (0.1 in this case)
> +box.cfg{replication_disconnect_timeout = 1.}
> +---
> +...
> +fiber = require('fiber')
> +---
> +...
> +-- wait master's replication_timeout (0.1) couple times to make sure connection is done (follow)
> +fiber.sleep(0.5)
> +---
> +...

This timeout is too long. Please try to make it shorter so that the
tests run faster.



More information about the Tarantool-patches mailing list