Re: [patches] Re: [commits] [tarantool] 01/01: Break connection on timeout

Konstantin Belyavskiy k.belyavskiy at tarantool.org
Mon Feb 5 14:07:39 MSK 2018


I didn't submit it because it's reveal problems with different replication timeout on two instances. For test purposes two instance run with different replication timeouts and I have to add several workarounds. But on production setup we can have same problems. It was functional before and this change breaks it.
So decided with Roman to develop new schema.


>Пятница,  2 февраля 2018, 23:32 +03:00 от Konstantin Osipov <kostja at tarantool.org>:
>
>Hi,
>
>I don't see this patch submitted for review in patches at . Is it
>ready for review?
>
>* Konstantin Belyavskiy < k.belyavskiy at tarantool.org > [18/02/02 16:57]:
>
>> commit e470923d8142641da7ef7b898d1ead00a439ab0b
>> Author: Konstantin Belyavskiy < k.belyavskiy at tarantool.org >
>> AuthorDate: Thu Feb 1 19:17:58 2018 +0300
>> 
>>     Break connection on timeout
>> 
>>     In replication schema if one of the instances was powered off, it doesn't detected
>>     by others and the connection hangs. Alive machines show 'follow' state.
>>     Add timeout to solve this issue. It's safe since applier and relay both send
>>     messages every replication_timeout so we can assume that if we read nothing we
>>     have problem with connection.
>>     Use replication_disconnect_timeout which is replication_timeout * 4 as for now.
>
>Please align around 65 characters.
>
>> 
>>     Closes #3025
>> ---
>>  src/box/applier.cc               | 12 +++++++++++-
>>  src/errinj.h                     |  1 +
>>  test/replication/errinj.result   | 13 +++++++++++++
>>  test/replication/errinj.test.lua |  7 +++++++
>>  4 files changed, 32 insertions(+), 1 deletion(-)
>> 
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index f0073ba..94add65 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -45,6 +45,7 @@
>>  #include "version.h"
>>  #include "trigger.h"
>>  #include "xrow_io.h"
>> +#include "errinj.h"
>>  #include "error.h"
>>  #include "session.h"
>> 
>> @@ -418,7 +419,16 @@ applier_subscribe(struct applier *applier)
>>  			applier_set_state(applier, APPLIER_FOLLOW);
>>  		}
>> 
>> -		coio_read_xrow(coio, ibuf, &row);
>> +		if (applier->version_id < version_id(1, 7, 7))
>> +			coio_read_xrow(coio, ibuf, &row);
>> +		else {
>> +			double timeout = replication_disconnect_timeout();
>> +			struct errinj *inj = errinj(ERRINJ_APPLIER_TIMEOUT,
>> +						    ERRINJ_DOUBLE);
>> +			if (inj != NULL && inj->dparam > 0)
>> +				timeout = inj->dparam;
>> +			coio_read_xrow_timeout_xc(coio, ibuf, &row, timeout);
>> +		}
>> 
>
>Why do you need to check for version? Because 1.7.7 and lower has
>no heartbeats, is this correct? This needs a comment.
>
>>  		if (iproto_type_is_error(row.type))
>>  			xrow_decode_error_xc(&row);  /* error */
>> diff --git a/src/errinj.h b/src/errinj.h
>> index 512d234..a5296cd 100644
>> --- a/src/errinj.h
>> +++ b/src/errinj.h
>> @@ -105,6 +105,7 @@ struct errinj {
>>  	_(ERRINJ_BUILD_SECONDARY, ERRINJ_INT, {.iparam = -1}) \
>>  	_(ERRINJ_VY_POINT_ITER_WAIT, ERRINJ_BOOL, {.bparam = false}) \
>>  	_(ERRINJ_RELAY_EXIT_DELAY, ERRINJ_DOUBLE, {.dparam = 0}) \
>> +	_(ERRINJ_APPLIER_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
>> 
>>  ENUM0(errinj_id, ERRINJ_LIST);
>>  extern struct errinj errinjs[];
>> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
>> index d1f1dbe..3b0b345 100644
>> --- a/test/replication/errinj.result
>> +++ b/test/replication/errinj.result
>> @@ -426,6 +426,19 @@ test_run:cmd("switch replica_ack")
>>  ---
>>  - true
>>  ...
>> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
>> +-- test case, we should manually set it to new value to allow connection. It should be higher than
>> +-- default's replication_timeout (0.1)
>> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)
>
>This may prove to be not enough under load. Better to set it to a
>very large value, say, 24*60*60
>0.0 
>> +---
>> +- ok
>> +...
>> +fiber = require('fiber')
>> +---
>> +...
>> +fiber.sleep(0.5)
>> +---
>
>I'm not ready to pay +0.4 second for this test, so perhaps we need to 
>set the timeout to an even lower value, say 0.05 seconds.
>Please ensure also that you check for status reaching the desired
>condition in a sleep-check loop, and not once - otherwise your
>test is guaranteed to sporadically fail.
>
>> +...
>>  box.info.replication[1].upstream.status
>>  ---
>>  - follow
>> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
>> index ba83481..803b0a4 100644
>> --- a/test/replication/errinj.test.lua
>> +++ b/test/replication/errinj.test.lua
>> @@ -175,8 +175,15 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
>>  test_run:cmd("create server replica_ack with rpl_master=default, script='replication/replica_ack.lua'")
>>  test_run:cmd("start server replica_ack")
>>  test_run:cmd("switch replica_ack")
>> +-- applier connection timeout is proportional to replication timeout, so since it's very low for this
>> +-- test case, we should manually set it to new value to allow connection. It should be higher than
>> +-- default's replication_timeout (0.1)
>
>Please align comments around 66 characters, never more than 80.
>
>> +box.error.injection.set("ERRINJ_APPLIER_TIMEOUT", 1.)
>> +fiber = require('fiber')
>> +fiber.sleep(0.5)
>>  box.info.replication[1].upstream.status
>> 
>> +
>>  test_run:cmd("stop server default")
>>  test_run:cmd("deploy server default")
>>  test_run:cmd("start server default")
>
>-- 
>Konstantin Osipov, Moscow, Russia,  +7 903 626 22 32
>http://tarantool.org -  www.twitter.com/kostja_osipov


С уважением,
Konstantin Belyavskiy
k.belyavskiy at tarantool.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180205/02a768d0/attachment.html>


More information about the Tarantool-patches mailing list