From: Vladislav Shpilevoy
Date: Tue, 15 Sep 2020 01:11:26 +0200
Subject: [Tarantool-patches] [PATCH v2 0/4] Boot with anon
To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com,
    sergepetrenko@tarantool.org

The patchset attempts to address the problem of anonymous replicas
being registered in _cluster if they are present during bootstrap.

The bug was found while working on another issue, related to Raft.
The problem is that Raft won't work properly during bootstrap if
non-joined replicas are registered in _cluster. When their
auto-registration by applier was removed, the anon bug was found.

The auto-registration removal is trivial, but it breaks the cluster
bootstrap in another way, by creating false-positive XlogGapError
errors. See the second commit for an explanation.

To solve the issue, quite a radical solution is applied - gap errors
are not considered critical anymore, and can be retried. I am not sure
that is the best option, but I couldn't come up with anything better
after a long struggle with it.

This is a bug, so whatever we end up with, it should be pushed to the
older versions too.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5287-anon-false-register
Issue: https://github.com/tarantool/tarantool/issues/5287

Changes in v2:
- Anon status is stored as a flag again. In v1 it was stored as an
  enum, but an alternative solution was proposed, where the enum is
  not needed.

- Ballot now has a new field is_anon. It helps to avoid the enum and
  to set the replica->anon flag to the correct value as soon as the
  replica becomes connected, through either relay or applier.

@ChangeLog
* An anonymous replica could be registered and could prevent removal
  of WAL files (gh-5287).

* XlogGapError is not a critical error anymore. This means that
  box.info.replication will show the upstream status as 'loading' if
  the error was found. The upstream will be restarted until the error
  is resolved automatically with the help of another instance, or
  until the replica is removed from box.cfg.replication (gh-5287).

Vladislav Shpilevoy (4):
  xlog: introduce an error code for XlogGapError
  replication: retry in case of XlogGapError
  replication: add is_anon flag to ballot
  replication: do not register outgoing connections

 src/box/applier.cc                          |  40 ++++
 src/box/box.cc                              |  30 +--
 src/box/errcode.h                           |   2 +
 src/box/error.cc                            |   2 +
 src/box/error.h                             |   1 +
 src/box/iproto_constants.h                  |   1 +
 src/box/recovery.h                          |   2 -
 src/box/replication.cc                      |  14 +-
 src/box/xrow.c                              |  14 +-
 src/box/xrow.h                              |   5 +
 test/box/error.result                       |   2 +
 test/replication/autobootstrap_anon.lua     |  25 +++
 test/replication/autobootstrap_anon1.lua    |   1 +
 test/replication/autobootstrap_anon2.lua    |   1 +
 test/replication/force_recovery.result      | 110 -----------
 test/replication/force_recovery.test.lua    |  43 -----
 test/replication/gh-5287-boot-anon.result   |  81 +++++++++
 test/replication/gh-5287-boot-anon.test.lua |  30 +++
 test/replication/prune.result               |  18 +-
 test/replication/prune.test.lua             |   7 +-
 test/replication/replica.lua                |   2 +
 test/replication/replica_rejoin.result      |   6 +-
 test/replication/replica_rejoin.test.lua    |   4 +-
 .../show_error_on_disconnect.result         |   2 +-
 .../show_error_on_disconnect.test.lua       |   2 +-
 test/xlog/panic_on_wal_error.result         | 171 ------------------
 test/xlog/panic_on_wal_error.test.lua       |  75 --------
 27 files changed, 262 insertions(+), 429 deletions(-)
 create mode 100644 test/replication/autobootstrap_anon.lua
 create mode 120000 test/replication/autobootstrap_anon1.lua
 create mode 120000 test/replication/autobootstrap_anon2.lua
 delete mode 100644 test/replication/force_recovery.result
 delete mode 100644 test/replication/force_recovery.test.lua
 create mode 100644 test/replication/gh-5287-boot-anon.result
 create mode 100644 test/replication/gh-5287-boot-anon.test.lua
 delete mode 100644 test/xlog/panic_on_wal_error.result
 delete mode 100644 test/xlog/panic_on_wal_error.test.lua

-- 
2.21.1 (Apple Git-122.3)