Re: [patches] Re: [PATCH] Break connection on timeout
Konstantin Belyavskiy
k.belyavskiy at tarantool.org
Thu Feb 8 20:26:21 MSK 2018
branch: gh-3025-break-connection-timeout-v2
>Четверг, 8 февраля 2018, 20:05 +03:00 от Vladimir Davydov <vdavydov.dev at gmail.com>:
>
>On Thu, Feb 08, 2018 at 07:42:46PM +0300, Konstantin Belyavskiy wrote:
>> In replication schema if one of the instances was powered off, it doesn't detected
>> by others and the connection hangs. Alive machines show 'follow' state.
>> Add timeout to solve this issue. It's safe since applier and relay both send
>> messages every replication_timeout so we can assume that if we read nothing we
>> have problem with connection.
>> Use replication_disconnect_timeout which is replication_timeout * 4 as for now.
>
>Personally, I don't think we need the new option for now, hardcocded
>replication_timeout * 4 (see replication_disconnect_timeout() helper)
>will do just fine IMHO. If anybody asks for it (nobody has yet), we can
>introduce it anytime later. To simulate the server freeze in the test,
>you can use the existing ERRINJ_RELAY_REPORT_INTERVAL.
>
>Anyway, if you do think the option is absolutely necessary, let's do one
>step at a time - first fix the actual bug, only then add the new option
>in a separate patch.
>
>>
>> Closes #3025
>> ---
>> src/box/applier.cc | 7 ++++++-
>> src/box/box.cc | 23 ++++++++++++++++++++-
>> src/box/box.h | 1 +
>> src/box/lua/cfg.cc | 13 ++++++++++++
>> src/box/lua/load_cfg.lua | 4 ++++
>> src/box/relay.cc | 2 +-
>> src/box/replication.cc | 1 +
>> src/box/replication.h | 7 ++-----
>> test/app-tap/init_script.result | 43 ++++++++++++++++++++--------------------
>> test/box/admin.result | 2 ++
>> test/box/cfg.result | 4 ++++
>> test/replication/errinj.result | 13 ++++++++++++
>> test/replication/errinj.test.lua | 7 +++++++
>> 13 files changed, 98 insertions(+), 29 deletions(-)
>>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index f0073bada..bc1f8fc5e 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -418,7 +418,12 @@ applier_subscribe(struct applier *applier)
>> applier_set_state(applier, APPLIER_FOLLOW);
>> }
>>
>> - coio_read_xrow(coio, ibuf, &row);
>> + if (applier->version_id < version_id(1, 7, 7))
>
>A comment why you need to check a version here is a must.
Agree, fixed
>
>
>> + coio_read_xrow(coio, ibuf, &row);
>> + else {
>> + coio_read_xrow_timeout_xc(coio, ibuf, &row,
>> + replication_disconnect_timeout);
>> + }
>
>> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
>> index d1f1dbe91..d0121e209 100644
>> --- a/test/replication/errinj.result
>> +++ b/test/replication/errinj.result
>> @@ -426,6 +426,19 @@ test_run:cmd("switch replica_ack")
>> ---
>> - true
>> ...
>> +-- applier connection timeout is by default proportional to replication_timeout, so since it's very low
>> +-- for this test case, we should manually set it to new value to allow connection. It should be higher
>> +-- than master's replication_timeout (0.1 in this case)
>> +box.cfg{replication_disconnect_timeout = 1.}
>> +---
>> +...
>> +fiber = require('fiber')
>> +---
>> +...
>> +-- wait master's replication_timeout (0.1) couple times to make sure connection is done (follow)
>> +fiber.sleep(0.5)
>> +---
>> +...
>
>This timeout is too long. Please try to make it shorter so that the
>tests run faster.
Reduce it to 0.15s. Still even original test was quite long to complete..
С уважением,
Konstantin Belyavskiy
k.belyavskiy at tarantool.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180208/fd1cedb0/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Allow-http-client-connection-to-use-unix-socket.patch
Type: application/x-patch
Size: 18204 bytes
Desc: not available
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180208/fd1cedb0/attachment.bin>
More information about the Tarantool-patches
mailing list