From mboxrd@z Thu Jan 1 00:00:00 1970
To: tarantool-patches@dev.tarantool.org, gorcunov@gmail.com, sergepetrenko@tarantool.org
Date: Fri, 28 May 2021 22:35:42 +0200
Message-Id: <6ed9245f407510ad3a149f62c960f89fa689909e.1622233728.git.v.shpilevoy@tarantool.org>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH
1/1] replication: check rs uuid on subscribe process
List-Id: Tarantool development patches
From: Vladislav Shpilevoy via Tarantool-patches
Reply-To: Vladislav Shpilevoy
Sender: "Tarantool-patches"

A remote node doing the subscribe might be from a different
replicaset. Before this patch the subscribe would be retried
infinitely because the node couldn't be found in _cluster, and the
master assumed it must have joined to another node, and its ID
should arrive shortly (ER_TOO_EARLY_SUBSCRIBE). The ID would never
arrive, because the node belongs to another replicaset.

The patch makes the master check whether the peer lives in the same
replicaset. Since the peer is doing a subscribe, it must have joined
already and should have a valid replicaset UUID, regardless of
whether it is anonymous or not.

The correct behaviour is to hard cut this peer off immediately,
without retries.

Closes #6094
Part of #5613
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6094-replication-bad-error
Issue: https://github.com/tarantool/tarantool/issues/6094

Ignoring the replicaset UUID on subscribe decode was introduced here:
https://github.com/tarantool/tarantool/commit/7f8cbde3555084ad6c41f137aec4faba4648c705#diff-fc276b44b551b4eac3431c9433d4bc881790ddd7df76226d7579f80da7798f6e

And I don't understand why. Maybe I am missing something? The tests
have passed. Sergey, do you remember why it was really needed?

A replicaset UUID mismatch definitely means the node can't connect.
It is not related to whether the node is anonymous or not, because
it has nothing to do with _cluster.
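For readers who want the order of checks at a glance, here is a rough standalone model of the decision the master makes on subscribe after this patch. It is only a sketch and not the box.cc code: `Uuid`, `SubscribeRequest`, `Master`, `SubscribeVerdict` and `check_subscribe` are hypothetical names invented for the illustration, UUIDs are plain strings, and the `_cluster` lookup is reduced to a set membership test. It also assumes that anonymous replicas are not looked up in `_cluster`, and it leaves out the other checks the real function performs.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for struct tt_uuid: a plain string is enough here.
using Uuid = std::string;

// Hypothetical model of the fields decoded from a subscribe request.
struct SubscribeRequest {
    Uuid replicaset_uuid; // replicaset the peer was bootstrapped in
    Uuid instance_uuid;   // the peer's own instance UUID
    bool anon;            // assumed: anonymous peers are not in _cluster
};

// Hypothetical outcomes, named after the errors mentioned in the patch.
enum class SubscribeVerdict {
    Ok,
    ConnectionToSelf,       // models ER_CONNECTION_TO_SELF
    ReplicasetUuidMismatch, // models ER_REPLICASET_UUID_MISMATCH (no retry)
    TooEarlySubscribe,      // models ER_TOO_EARLY_SUBSCRIBE (peer retries)
};

// Hypothetical model of the master's state relevant to the checks.
struct Master {
    Uuid replicaset_uuid;
    Uuid instance_uuid;
    std::set<Uuid> cluster; // instance UUIDs registered in _cluster
};

SubscribeVerdict
check_subscribe(const Master &master, const SubscribeRequest &req)
{
    // Forbid connection to itself.
    if (req.instance_uuid == master.instance_uuid)
        return SubscribeVerdict::ConnectionToSelf;
    // The check added by the patch: a peer from a foreign replicaset is
    // rejected right away, whether it is anonymous or not.
    if (req.replicaset_uuid != master.replicaset_uuid)
        return SubscribeVerdict::ReplicasetUuidMismatch;
    // Only a peer from the same replicaset can legitimately be "too
    // early": its _cluster record may simply not have arrived yet.
    if (!req.anon && master.cluster.count(req.instance_uuid) == 0)
        return SubscribeVerdict::TooEarlySubscribe;
    return SubscribeVerdict::Ok;
}
```

In this model a peer from another replicaset never reaches the `_cluster` lookup, so it can no longer be answered with the retriable "too early" verdict, which is the infinite loop described above.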
 .../unreleased/gh-6094-rs-uuid-mismatch.md    |  6 ++
 src/box/box.cc                                | 17 ++++-
 .../gh-6094-rs-uuid-mismatch.result           | 67 +++++++++++++++++++
 .../gh-6094-rs-uuid-mismatch.test.lua         | 25 +++++++
 test/replication/suite.cfg                    |  1 +
 5 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
 create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.result
 create mode 100644 test/replication/gh-6094-rs-uuid-mismatch.test.lua

diff --git a/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
new file mode 100644
index 000000000..f4e47da3d
--- /dev/null
+++ b/changelogs/unreleased/gh-6094-rs-uuid-mismatch.md
@@ -0,0 +1,6 @@
+## bugfix/replication
+
+* Fixed an error when a replica, at attempt to subscribe to a foreign cluster
+  (with different replicaset UUID), didn't notice it is not possible, and
+  instead was stuck in an infinite retry loop printing an error about "too
+  early subscribe" (gh-6094).
diff --git a/src/box/box.cc b/src/box/box.cc
index c10e0d8bf..5672073d6 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2686,17 +2686,30 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 		tnt_raise(ClientError, ER_LOADING);
 
 	struct tt_uuid replica_uuid = uuid_nil;
+	struct tt_uuid peer_replicaset_uuid = uuid_nil;
 	struct vclock replica_clock;
 	uint32_t replica_version_id;
 	vclock_create(&replica_clock);
 	bool anon;
 	uint32_t id_filter;
-	xrow_decode_subscribe_xc(header, NULL, &replica_uuid, &replica_clock,
-				 &replica_version_id, &anon, &id_filter);
+	xrow_decode_subscribe_xc(header, &peer_replicaset_uuid, &replica_uuid,
+				 &replica_clock, &replica_version_id, &anon,
+				 &id_filter);
 
 	/* Forbid connection to itself */
 	if (tt_uuid_is_equal(&replica_uuid, &INSTANCE_UUID))
 		tnt_raise(ClientError, ER_CONNECTION_TO_SELF);
+	/*
+	 * The peer should have bootstrapped from somebody since it tries to
+	 * subscribe already. If it belongs to a different replicaset, it won't
+	 * be ever found here, and would try to reconnect thinking its replica
+	 * ID wasn't replicated here yet. Prevent it right away.
+	 */
+	if (!tt_uuid_is_equal(&peer_replicaset_uuid, &REPLICASET_UUID)) {
+		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
+			  tt_uuid_str(&REPLICASET_UUID),
+			  tt_uuid_str(&peer_replicaset_uuid));
+	}
 
 	/*
 	 * Do not allow non-anonymous followers for anonymous
diff --git a/test/replication/gh-6094-rs-uuid-mismatch.result b/test/replication/gh-6094-rs-uuid-mismatch.result
new file mode 100644
index 000000000..1ba189a37
--- /dev/null
+++ b/test/replication/gh-6094-rs-uuid-mismatch.result
@@ -0,0 +1,67 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-6094: master instance didn't check if the subscribed instance has the same
+-- replicaset UUID as its own. As a result, if the peer is from a different
+-- replicaset, the master couldn't find its record in _cluster, and assumed it
+-- simply needs to wait a bit more. This led to an infinite re-subscribe.
+--
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+
+test_run:cmd('create server master2 with script="replication/master1.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master2')
+ | ---
+ | - true
+ | ...
+test_run:switch('master2')
+ | ---
+ | - true
+ | ...
+replication = test_run:eval('default', 'return box.cfg.listen')[1]
+ | ---
+ | ...
+box.cfg{replication = {replication}}
+ | ---
+ | ...
+assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
+ | ---
+ | - ER_REPLICASET_UUID_MISMATCH
+ | ...
+info = box.info
+ | ---
+ | ...
+repl_info = info.replication[1]
+ | ---
+ | ...
+assert(not repl_info.downstream and not repl_info.upstream)
+ | ---
+ | - true
+ | ...
+assert(info.status == 'orphan')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master2')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master2')
+ | ---
+ | - true
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-6094-rs-uuid-mismatch.test.lua b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
new file mode 100644
index 000000000..d3928f52b
--- /dev/null
+++ b/test/replication/gh-6094-rs-uuid-mismatch.test.lua
@@ -0,0 +1,25 @@
+test_run = require('test_run').new()
+
+--
+-- gh-6094: master instance didn't check if the subscribed instance has the same
+-- replicaset UUID as its own. As a result, if the peer is from a different
+-- replicaset, the master couldn't find its record in _cluster, and assumed it
+-- simply needs to wait a bit more. This led to an infinite re-subscribe.
+--
+box.schema.user.grant('guest', 'super')
+
+test_run:cmd('create server master2 with script="replication/master1.lua"')
+test_run:cmd('start server master2')
+test_run:switch('master2')
+replication = test_run:eval('default', 'return box.cfg.listen')[1]
+box.cfg{replication = {replication}}
+assert(test_run:grep_log('master2', 'ER_REPLICASET_UUID_MISMATCH'))
+info = box.info
+repl_info = info.replication[1]
+assert(not repl_info.downstream and not repl_info.upstream)
+assert(info.status == 'orphan')
+
+test_run:switch('default')
+test_run:cmd('stop server master2')
+test_run:cmd('delete server master2')
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index dc39e2f74..5acb28fd4 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -45,6 +45,7 @@
     "gh-5435-qsync-clear-synchro-queue-commit-all.test.lua": {},
     "gh-5536-wal-limit.test.lua": {},
     "gh-5566-final-join-synchro.test.lua": {},
+    "gh-6094-rs-uuid-mismatch.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
--
2.24.3 (Apple Git-128)