From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by dev.tarantool.org (Postfix) with ESMTPS id 3FB2444643E
	for ; Thu, 10 Sep 2020 02:17:09 +0300 (MSK)
From: Vladislav Shpilevoy
Date: Thu, 10 Sep 2020 01:16:56 +0200
Message-Id: <86b2104fd1dd49ed4bb7e432cf47cf9660ef35b3.1599693319.git.v.shpilevoy@tarantool.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Tarantool-patches] [PATCH v2 05/11] [wip] box: do not register outgoing connections
List-Id: Tarantool development patches
To: tarantool-patches@dev.tarantool.org, sergepetrenko@tarantool.org, gorcunov@gmail.com

The first stage of the replication protocol for non-anonymous replicas
is registration in _cluster, which gives the replica a unique ID
number. It happens when the replica connects to a writable node, and
that node performs the registration. In other words, registration
always happens on the master node, when an incoming request for it
appears, that is, when a relay is created.

That wasn't the case for bootstrap. If box.cfg.replication wasn't empty
on the master node doing the cluster bootstrap, it registered all the
outgoing connections in _cluster. Note that the target node could even
be anonymous and still was registered. Also, the remote replicas were
registered even before their own bootstrap.

That breaks the protocol and sometimes leads to registration of anon
replicas. The patch drops it.

The main motivation here, though, is the specifics of Raft cluster
bootstrap. During Raft bootstrap it is going to be very important that
non-bootstrapped nodes are not registered in _cluster. Otherwise the
leader election during bootstrap would break.

Closes #5287
---
The patch fixes 5287, but now the same test leads to a crash.
That is because the code does not handle the case when a non-anon
replica becomes anon. It happens when a master connects to a replica
before the replica is bootstrapped, the replica allows it, and then,
right after the replica is bootstrapped, it sends SUBSCRIBE. The master
then crashes on the first line of relay_subscribe(), because the
replica was connected as non-anon (replica->anon == false), but it does
not have an ID (replica->id == REPLICA_ID_NIL).

I am not sure how to fix it yet. I decided to think more about it and
see what reviewers think. In the current state the fix is enough to
unblock Raft, so it is not urgent.

 src/box/box.cc | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index eeb00d5e2..3214ec340 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2217,15 +2217,6 @@ bootstrap_master(const struct tt_uuid *replicaset_uuid)
 	box_register_replica(replica_id, &INSTANCE_UUID);
 	assert(replica_by_uuid(&INSTANCE_UUID)->id == 1);
 
-	/* Register other cluster members */
-	replicaset_foreach(replica) {
-		if (tt_uuid_is_equal(&replica->uuid, &INSTANCE_UUID))
-			continue;
-		assert(replica->applier != NULL);
-		box_register_replica(++replica_id, &replica->uuid);
-		assert(replica->id == replica_id);
-	}
-
 	/* Set UUID of a new replica set */
 	box_set_replicaset_uuid(replicaset_uuid);
 
-- 
2.21.1 (Apple Git-122.3)
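
For illustration only, here is a minimal standalone C sketch of the
state combination that the notes above say relay_subscribe() does not
handle: a replica connected as non-anon that still has no ID. This is
not Tarantool code. The struct replica_sketch and both helper functions
are hypothetical. Only the field names anon and id and the constant
name REPLICA_ID_NIL come from the description, and the value 0 used for
that constant is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for this sketch only; the real constant is defined in Tarantool. */
enum { REPLICA_ID_NIL = 0 };

/* Hypothetical stand-in for the two replica fields mentioned in the notes. */
struct replica_sketch {
	uint32_t id;
	bool anon;
};

/*
 * Returns true when the state is one a subscribe handler could accept:
 * the replica is either anonymous or already has an ID.
 */
static bool
subscribe_state_is_consistent(const struct replica_sketch *r)
{
	return r->anon || r->id != REPLICA_ID_NIL;
}

/* Walks through the three states discussed in the notes. */
static void
sketch_self_check(void)
{
	/* Registered through an incoming request: has an ID, not anon. */
	struct replica_sketch registered = { .id = 2, .anon = false };
	/* Anonymous replica: no ID, and none is expected. */
	struct replica_sketch anonymous = { .id = REPLICA_ID_NIL, .anon = true };
	/* The problematic case: connected as non-anon, never registered. */
	struct replica_sketch unregistered = { .id = REPLICA_ID_NIL, .anon = false };

	assert(subscribe_state_is_consistent(&registered));
	assert(subscribe_state_is_consistent(&anonymous));
	assert(!subscribe_state_is_consistent(&unregistered));
}
```

The third state is the one that becomes reachable once the master stops
registering outgoing connections during bootstrap. How the real code
should treat it is the open question raised in the notes.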