[patches] Re: [commits] [tarantool] 01/01: Break connection on timeout

Konstantin Osipov kostja at tarantool.org
Fri Feb 2 23:32:34 MSK 2018


Hi,

I don't see this patch submitted for review in patches at . Is it
ready for review?

* Konstantin Belyavskiy <k.belyavskiy at tarantool.org> [18/02/02 16:57]:

> commit e470923d8142641da7ef7b898d1ead00a439ab0b
> Author: Konstantin Belyavskiy <k.belyavskiy at tarantool.org>
> AuthorDate: Thu Feb 1 19:17:58 2018 +0300
> 
>     Break connection on timeout
>     
>     In replication schema if one of the instances was powered off, it doesn't detected
>     by others and the connection hangs. Alive machines show 'follow' state.
>     Add timeout to solve this issue. It's safe since applier and relay both send
>     messages every replication_timeout so we can assume that if we read nothing we
>     have problem with connection.
>     Use replication_disconnect_timeout which is replication_timeout * 4 as for now.

Please align around 65 characters.

>     
>     Closes #3025
> ---
>  src/box/applier.cc               | 12 +++++++++++-
>  src/errinj.h                     |  1 +
>  test/replication/errinj.result   | 13 +++++++++++++
>  test/replication/errinj.test.lua |  7 +++++++
>  4 files changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index f0073ba..94add65 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -45,6 +45,7 @@
>  #include "version.h"
>  #include "trigger.h"
>  #include "xrow_io.h"
> +#include "errinj.h"
>  #include "error.h"
>  #include "session.h"
>  
> @@ -418,7 +419,16 @@ applier_subscribe(struct applier *applier)
>  			applier_set_state(applier, APPLIER_FOLLOW);
>  		}
>  
> -		coio_read_xrow(coio, ibuf, &row);
> +		if (applier->version_id < version_id(1, 7, 7))
> +			coio_read_xrow(coio, ibuf, &row);
> +		else {
> +			double timeout = replication_disconnect_timeout();
> +			struct errinj *inj = errinj(ERRINJ_APPLIER_TIMEOUT,
> +						    ERRINJ_DOUBLE);
> +			if (inj != NULL && inj->dparam > 0)
> +				timeout = inj->dparam;
> +			coio_read_xrow_timeout_xc(coio, ibuf, &row, timeout);
> +		}
>  

Why do you need to check for version? Because 1.7.7 and lower has
no heartbeats, is this correct? This needs a comment.

>  		if (iproto_type_is_error(row.type))
>  			xrow_decode_error_xc(&row);  /* error */
> diff --git a/src/errinj.h b/src/errinj.h
> index 512d234..a5296cd 100644
> --- a/src/errinj.h
> +++ b/src/errinj.h
> @@ -105,6 +105,7 @@ struct errinj {
>  	_(ERRINJ_BUILD_SECONDARY, ERRINJ_INT, {.iparam = -1}) \
>  	_(ERRINJ_VY_POINT_ITER_WAIT, ERRINJ_BOOL, {.bparam = false}) \
>  	_(ERRINJ_RELAY_EXIT_DELAY, ERRINJ_DOUBLE, {.dparam = 0}) \
> +	_(ERRINJ_APPLIER_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
>  
>  ENUM0(errinj_id, ERRINJ_LIST);
>  extern struct errinj errinjs[];
> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
> index d1f1dbe..3b0b345 100644
> --- a/test/replication/errinj.result
> +++ b/test/replication/errinj.result
> @@ -426,6 +426,19 @@ test_run:cmd("switch replica_ack")
>  ---
>  - true
>  ...
> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
> +-- test case, we should manually set it to new value to allow connection. It should be higher than
> +-- default's replication_timeout (0.1)
> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)

This may prove to be not enough under load. Better to set it to a
very large value, say, 24*60*60
0.0 
> +---
> +- ok
> +...
> +fiber = require('fiber')
> +---
> +...
> +fiber.sleep(0.5)
> +---

I'm not ready to pay +0.4 second for this test, so perhaps we need to 
set the timeout to an even lower value, say 0.05 seconds.
Please ensure also that you check for status reaching the desired
condition in a sleep-check loop, and not once - otherwise your
test is guaranteed to sporadically fail.

> +...
>  box.info.replication[1].upstream.status
>  ---
>  - follow
> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index ba83481..803b0a4 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -175,8 +175,15 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
>  test_run:cmd("create server replica_ack with rpl_master=default, script='replication/replica_ack.lua'")
>  test_run:cmd("start server replica_ack")
>  test_run:cmd("switch replica_ack")
> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
> +-- test case, we should manually set it to new value to allow connection. It should be higher than
> +-- default's replication_timeout (0.1)

Please align comments around 66 characters, never more than 80.

> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)
> +fiber = require('fiber')
> +fiber.sleep(0.5)
>  box.info.replication[1].upstream.status
>  
> +
>  test_run:cmd("stop server default")
>  test_run:cmd("deploy server default")
>  test_run:cmd("start server default")

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.org - www.twitter.com/kostja_osipov



More information about the Tarantool-patches mailing list