From: Olga Arkhangelskaia
Subject: [tarantool-patches] [PATCH v2 0/2] detect and throw away dead replicas
Date: Fri, 12 Oct 2018 22:45:55 +0300
Message-Id: <20181012194557.7445-1-arkholga@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Olga Arkhangelskaia

Following the previous discussions, the way a replica's bad state is
detected has changed completely. We now maintain two time differences:
between now and the last activity of the applier, and between now and
the last activity of the relay. These values can be found in
box.info.replication.lar/law. We use hours, but I still have some
doubts: maybe we should display days, hours and minutes.

Lar/law are compared with replication_dead/rw_gap, which must be
configured beforehand via box.cfg. The open question is that I am no
longer sure about replication_rw_gap. I added this parameter on the idea
that, on a master, if the difference between applier and relay activity
is too big, there is a good chance that something is wrong with the
replica.
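
To make the comparison above concrete, here is a rough sketch in Python (chosen only for brevity; it is not the code from the patch). Everything in it is an assumption for illustration: that lar is the applier idle time and law the relay idle time, both in hours; that a replica is suspected when either idle time exceeds the dead gap or when the two idle times differ by more than the rw gap; and the names ReplicaActivity, dead_reasons and format_idle, which are hypothetical. The second helper shows one possible days/hours/minutes rendering of an hour count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaActivity:
    lar: float  # hours since last applier activity (assumed meaning)
    law: float  # hours since last relay activity (assumed meaning)


def dead_reasons(a: ReplicaActivity, dead_gap: float, rw_gap: float) -> List[str]:
    """Return why a replica looks dead; an empty list means it looks alive.

    The rule itself is an assumption, not the rule used by the patch.
    """
    reasons = []
    if a.lar > dead_gap:
        reasons.append("applier idle")
    if a.law > dead_gap:
        reasons.append("relay idle")
    # Large difference between applier and relay idle times.
    if abs(a.lar - a.law) > rw_gap:
        reasons.append("rw gap")
    return reasons


def format_idle(hours: float) -> str:
    """Render an idle time given in hours as days, hours and minutes."""
    total_minutes = round(hours * 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hrs, mins = divmod(rem, 60)
    return f"{days}d {hrs}h {mins}m"
```

With dead_gap=24 and rw_gap=12, a replica with lar=30 and law=1 is reported as ["applier idle", "rw gap"], and format_idle(26.5) gives "1d 2h 30m".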

The last problem I want to discuss is the test cases: the test takes too
much time, and there is no separate case for the applier. Relay and
rw_gap can be tested separately by turning off replication and tuning
the gap parameters, but I do not see a case where only lar lags
seriously.

If you have ideas on how to make this functionality better, please
share. I will be glad to see other opinions.
---
Branch: https://github.com/tarantool/tarantool/tree/OKriw/gh-3110-prune-dead-replica-from-replicaset-1.10
Issue: https://github.com/tarantool/tarantool/issues/3110
v1: https://www.freelists.org/post/tarantool-patches/PATCH-rfc-schema-add-possibility-to-find-and-throw-away-dead-replicas

Changes v2:
- changed the way replica death is detected
- added special box options
- changed the test
- now only dead replicas are shown
- added a function to throw away any replica

Olga Arkhangelskaia (2):
  box: added replication_dead/rw_gap options
  ctl: added functionality to detect and prune dead replicas

 src/box/CMakeLists.txt         |   1 +
 src/box/box.cc                 |  34 ++++++
 src/box/box.h                  |   2 +
 src/box/lua/cfg.cc             |  24 +++++
 src/box/lua/ctl.lua            |  58 ++++++++++
 src/box/lua/info.c             |  10 ++
 src/box/lua/init.c             |   2 +
 src/box/lua/load_cfg.lua       |   8 ++
 src/box/relay.cc               |   6 ++
 src/box/relay.h                |   4 +
 src/box/replication.cc         |   3 +-
 src/box/replication.h          |  12 +++
 test/box/admin.result          |   4 +
 test/box/cfg.result            |   8 ++
 test/replication/trim.lua      |  66 ++++++++++++
 test/replication/trim.result   | 237 +++++++++++++++++++++++++++++++++++++++++
 test/replication/trim.test.lua |  93 ++++++++++++++++
 test/replication/trim1.lua     |   1 +
 test/replication/trim2.lua     |   1 +
 test/replication/trim3.lua     |   1 +
 test/replication/trim4.lua     |   1 +
 21 files changed, 575 insertions(+), 1 deletion(-)
 create mode 100644 src/box/lua/ctl.lua
 create mode 100644 test/replication/trim.lua
 create mode 100644 test/replication/trim.result
 create mode 100644 test/replication/trim.test.lua
 create mode 120000 test/replication/trim1.lua
 create mode 120000 test/replication/trim2.lua
 create mode 120000 test/replication/trim3.lua
 create mode 120000 test/replication/trim4.lua

-- 
2.14.3 (Apple Git-98)