Tarantool development patches archive
* [tarantool-patches] [PATCH v2 0/2] detect and throw away dead replicas
@ 2018-10-12 19:45 Olga Arkhangelskaia
  2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 1/2] box: added replication_dead/rw_gap options Olga Arkhangelskaia
  2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 2/2] ctl: added functionality to detect and prune dead replicas Olga Arkhangelskaia
  0 siblings, 2 replies; 7+ messages in thread
From: Olga Arkhangelskaia @ 2018-10-12 19:45 UTC
  To: tarantool-patches; +Cc: Olga Arkhangelskaia

Following the previous discussions, the way a replica's bad state is detected
has changed completely. We now maintain two time differences: between now and
the last activity of the applier, and between now and the last activity of the
relay.
These values can be found in box.info.replication.lar/law.
We use hours, but I still have some doubts: maybe we should display days,
hours and minutes.
Lar/law are compared with replication_dead/rw_gap, which must be configured
beforehand via box.cfg. One open question: I am no longer sure about
replication_rw_gap.
I added this parameter on the idea that if, on a master, the difference
between applier and relay activity is too big, there is a good chance that
something is wrong with the replica.
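
To make the comparison above concrete, here is a minimal sketch in Python. It
is a model of the logic only, not the code from the patch. The names
ReplicaActivity and is_suspected_dead are hypothetical. It assumes that lar
refers to the applier side and law to the relay side, that a threshold of 0
(the default) disables the corresponding check, and that exceeding either
threshold is enough to suspect the replica.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600.0


@dataclass
class ReplicaActivity:
    """Hypothetical record; field names are illustrative, not Tarantool's."""
    last_applier_activity: float  # UNIX time of the last applier activity
    last_relay_activity: float    # UNIX time of the last relay activity


def is_suspected_dead(activity, now, dead_gap_hours, rw_gap_hours):
    """Return True if the replica looks dead under the two thresholds.

    Assumption: a threshold of 0 disables that check.
    """
    # lar/law: hours elapsed since the last applier/relay activity.
    lar = (now - activity.last_applier_activity) / SECONDS_PER_HOUR
    law = (now - activity.last_relay_activity) / SECONDS_PER_HOUR

    # Assumption: either side being idle for too long is enough.
    if dead_gap_hours > 0 and max(lar, law) > dead_gap_hours:
        return True
    # A large difference between the two sides also raises suspicion.
    if rw_gap_hours > 0 and abs(lar - law) > rw_gap_hours:
        return True
    return False
```

With an applier active 2 hours ago and a relay active 30 hours ago, a dead gap
of 24 hours flags the replica, while a dead gap of 48 hours flags it only if
the rw gap is set below the 28-hour difference.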

The last problem I want to discuss is the test cases: the test takes too much
time, and there is no separate case for the applier. Relay and rw_gap can be
tested separately by turning off replication and tuning the gap parameters;
however, I do not see a case where only lar is lagging seriously.

If you have ideas on how to make this functionality better, please share. I
will be glad to see other opinions.
---
Branch:
https://github.com/tarantool/tarantool/tree/OKriw/gh-3110-prune-dead-replica-from-replicaset-1.10
Issue:
https://github.com/tarantool/tarantool/issues/3110

v1:
https://www.freelists.org/post/tarantool-patches/PATCH-rfc-schema-add-possibility-to-find-and-throw-away-dead-replicas

Changes v2:
- changed the way of replicas death detection
- added special box options
- changed test
- now only dead replicas are shown
- added function to throw away any replica

Olga Arkhangelskaia (2):
  box: added replication_dead/rw_gap options
  ctl: added functionality to detect and prune dead replicas

 src/box/CMakeLists.txt         |   1 +
 src/box/box.cc                 |  34 ++++++
 src/box/box.h                  |   2 +
 src/box/lua/cfg.cc             |  24 +++++
 src/box/lua/ctl.lua            |  58 ++++++++++
 src/box/lua/info.c             |  10 ++
 src/box/lua/init.c             |   2 +
 src/box/lua/load_cfg.lua       |   8 ++
 src/box/relay.cc               |   6 ++
 src/box/relay.h                |   4 +
 src/box/replication.cc         |   3 +-
 src/box/replication.h          |  12 +++
 test/box/admin.result          |   4 +
 test/box/cfg.result            |   8 ++
 test/replication/trim.lua      |  66 ++++++++++++
 test/replication/trim.result   | 237 +++++++++++++++++++++++++++++++++++++++++
 test/replication/trim.test.lua |  93 ++++++++++++++++
 test/replication/trim1.lua     |   1 +
 test/replication/trim2.lua     |   1 +
 test/replication/trim3.lua     |   1 +
 test/replication/trim4.lua     |   1 +
 21 files changed, 575 insertions(+), 1 deletion(-)
 create mode 100644 src/box/lua/ctl.lua
 create mode 100644 test/replication/trim.lua
 create mode 100644 test/replication/trim.result
 create mode 100644 test/replication/trim.test.lua
 create mode 120000 test/replication/trim1.lua
 create mode 120000 test/replication/trim2.lua
 create mode 120000 test/replication/trim3.lua
 create mode 120000 test/replication/trim4.lua

-- 
2.14.3 (Apple Git-98)

* [tarantool-patches] Re: [tarantool-patches] Re: [PATCH v2 1/2] box: added replication_dead/rw_gap options
@ 2018-10-23 18:32 Olga Arkhangelskaia
  2018-10-24 16:49 ` Konstantin Osipov
  0 siblings, 1 reply; 7+ messages in thread
From: Olga Arkhangelskaia @ 2018-10-23 18:32 UTC
  To: Konstantin Osipov, tarantool-patches


23/10/2018 10:10, Konstantin Osipov wrote:
> * Olga Arkhangelskaia <arkholga@tarantool.org> [18/10/13 08:20]:
>> In scope of gh-3110 we need options that store periods of time,
>> to be compared with time of last activity of relay and applier.
>> This patch introduces replication_dead_gap and replication_rw_gap options.
>>
>> replication_dead_gap is configured in box.cfg, with default 0 value.
>> If time that passed from now till last reader/writer activity of given replica
>> exceeds replication_dead_gap value, replica is suspected to be dead.
>> replication_dead_gap is measured in hours.
>>
>> replication_rw_gap is configured in box.cfg, with default 0 value.
>> If time difference between last reader activity and last writer activity of
>> given replica exceeds replication_rw_gap value, replica is suspected to be dead.
>> replication_rw_gap is measured in hours.
> Why do we need this if we have heartbeats?
I used to think that we need some user-settable parameters to check that a
replica is not active.
For example, if a replica has not been active for XXXX seconds, it is dead.
However, I had not considered passing this parameter as a
function argument: list_dead_replicas(XXXX). So I will throw the options away.
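
A minimal sketch of that idea in Python, as a model only: the threshold is
passed by the caller instead of being read from the configuration. The
last_activity mapping and the explicit now argument are assumptions made for
this illustration; only the name list_dead_replicas comes from the discussion.

```python
def list_dead_replicas(last_activity, max_idle_seconds, now):
    """Return ids of replicas idle for more than max_idle_seconds.

    last_activity maps a replica id to the UNIX time of its last activity
    (a hypothetical structure used only for this sketch).
    """
    return sorted(
        replica_id
        for replica_id, seen_at in last_activity.items()
        if now - seen_at > max_idle_seconds
    )
```

The caller decides what "dead" means on every call, so no new box.cfg option
is needed.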

Another question worth discussing is what kind of statistics to use
to declare a replica dead.
There are two ways. The first is to save the time of the last write/read by
the applier and relay; I implemented it. But, as Vova pointed out, maybe we
need to save the period of time that a replica spends in the stopped status.
So we decided to do the statistics in a separate patch set, implement both
ways, and then decide. However, maybe you have better ideas.
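
The two candidate statistics can be kept side by side. The sketch below is a
Python model, not code from any patch set; the class ReplicaStats and its
method names are hypothetical, and it assumes that "time spent in stopped
status" means the accumulated total, including a period that is still open.

```python
class ReplicaStats:
    """Hypothetical per-replica bookkeeping for the two candidate statistics."""

    def __init__(self):
        self.last_activity = None   # time of the last read/write
        self.stopped_since = None   # set while the replica is stopped
        self.stopped_total = 0.0    # finished stopped periods, in seconds

    def on_activity(self, now):
        self.last_activity = now

    def on_stopped(self, now):
        # Ignore repeated notifications while already stopped.
        if self.stopped_since is None:
            self.stopped_since = now

    def on_resumed(self, now):
        if self.stopped_since is not None:
            self.stopped_total += now - self.stopped_since
            self.stopped_since = None

    def idle_for(self, now):
        """Statistic 1: time since the last activity (None if never seen)."""
        if self.last_activity is None:
            return None
        return now - self.last_activity

    def stopped_for(self, now):
        """Statistic 2: total time spent stopped, including an open period."""
        if self.stopped_since is None:
            return self.stopped_total
        return self.stopped_total + (now - self.stopped_since)
```

Keeping both numbers makes it possible to compare them on the same workload
before choosing one.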
>
> And with swim on board we will have gossip information about entire replica set?
I have read about swim, and this is how I understand it:
if we have a replica set with some topology other than full-mesh, we can save
the dead replicas' mask, numbers, etc. (obtained using
list_dead_replicas on some of the replicas), and in the end, after some
questioning, we will definitely have information about every replica
in the set.
Is that what you mean?
If not, can you be more specific?
>
>> --

end of thread, other threads:[~2018-10-24 16:49 UTC | newest]

Thread overview: 7+ messages
2018-10-12 19:45 [tarantool-patches] [PATCH v2 0/2] detect and throw away dead replicas Olga Arkhangelskaia
2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 1/2] box: added replication_dead/rw_gap options Olga Arkhangelskaia
2018-10-15 10:22   ` Vladimir Davydov
2018-10-23  7:10   ` [tarantool-patches] " Konstantin Osipov
2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 2/2] ctl: added functionality to detect and prune dead replicas Olga Arkhangelskaia
2018-10-15 12:43   ` Vladimir Davydov
2018-10-23 18:32 [tarantool-patches] Re: [tarantool-patches] Re: [PATCH v2 1/2] box: added replication_dead/rw_gap options Olga Arkhangelskaia
2018-10-24 16:49 ` Konstantin Osipov
