Date: Thu, 14 May 2020 03:05:32 +0300
From: Konstantin Osipov
Subject: Re: [Tarantool-patches] [RFC] Quorum-based synchronous replication
To: Vladislav Shpilevoy
Cc: tarantool-patches@dev.tarantool.org

* Vladislav Shpilevoy [20/05/14 00:47]:
> > A few more issues:
> >
> > - the spec assumes there is a full mesh. In any other topology
> >   electing a leader based on the longest wal can easily deadlock.
> >   Yet it provides no protection against non-full-mesh setups.
> >   Currently the server can't even detect that this is not a
> >   full-mesh setup, so it can't check whether the precondition for
> >   this to work correctly is met.
>
> Yes, this is a very unstable construction. But we failed to come up
> with a solution right now which would protect against an accidental
> non-full mesh. For example, how will it work when I add a new node?
> If a non-full mesh is forbidden, the new node can never be added,
> because this can't be done on all nodes simultaneously.

Again, the answer is present in the Raft spec. The node is added in
two steps: the first step commits the "add node" event to the durable
state of the entire group, the second step (which is also a Raft
transaction) enacts the new node. This could be achieved in a more or
less straightforward manner if _cluster is a sync table whose
replication group is all members of the cluster. But as I said, I
can't imagine this is possible with an external coordinator, since it
may not be available during boot.

Regarding detecting the full mesh: remember the task I created for
using swim to discover members and bring non-full-mesh setups to a
full mesh automatically? Is the reason for that task to exist clear
now? Is it clear now why I asked you (multiple times) to begin working
on sync replication by adding built-in swim instances on every replica
and using them, instead of the current replication heartbeats, for
failure detection? I believe there was a task somewhere for that, too.

> > - the spec assumes that quorum is identical to the number of
> >   replicas, and the number of replicas is stable across the cluster
> >   lifetime. Can I have quorum=2 while the number of replicas is 4?
> >   Am I allowed to increase the number of replicas online? What
> >   happens when a replica is added: how exactly, and starting from
> >   which transaction, is the leader required to collect a bigger
> >   quorum?
>
> Quorum <= number of replicas. It is a parameter, just like
> replication_connect_quorum.

I wrote in a comment to the task that it would be even better if we
listed node uuids as group members and assigned the group to a space
explicitly, so that it's not just a number of replicas, but specific
replicas identified by their uuids. The thing is, it's vague in the
spec.
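
Coming back to the two-step addition: just to make the shape of it
concrete, here is a sketch, assuming _cluster is turned into a
synchronous space replicated to the whole group and extended with a
status field (none of this exists today, the names and the extra
field are made up):

    -- Hypothetical sketch only: _cluster as a sync space plus a
    -- made-up third "status" field.
    local new_id = 5                                         -- example instance id
    local new_uuid = 'eeeeeeee-eeee-4eee-8eee-eeeeeeeeeeee'  -- example instance uuid

    -- Step 1: commit the "add node" event to the durable state of the
    -- entire group. The insert is a synchronous transaction acked by
    -- all current members before it is considered committed.
    box.space._cluster:insert{new_id, new_uuid, 'pending'}

    -- Step 2: a second synchronous transaction enacts the node; only
    -- from this point on does the leader count it in quorums and
    -- accept its ACKs.
    box.space._cluster:update(new_id, {{'=', 3, 'active'}})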
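
And to show what I mean by groups of specific replicas, a purely
hypothetical box.schema.group API (nothing of this exists either, all
the names below are made up):

    -- A group is a named set of replicas identified by uuid, plus a
    -- quorum; a space is tied to a group explicitly instead of to a
    -- bare replica count.
    box.schema.group.create('payments', {
        members = {
            'aaaaaaaa-aaaa-4aaa-8aaa-aaaaaaaaaaaa',
            'bbbbbbbb-bbbb-4bbb-8bbb-bbbbbbbbbbbb',
            'cccccccc-cccc-4ccc-8ccc-cccccccccccc',
            'dddddddd-dddd-4ddd-8ddd-dddddddddddd',
        },
        quorum = 2,
    })
    -- The 'group' option here is equally hypothetical.
    box.schema.space.create('accounts', {group = 'payments'})

Then the quorum is a property of a concrete member list, and the
maintenance operations discussed below become calls on the group.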
The spec has to be explicit about all box.schema API changes, because
they will create legacy that will be hard to deal with later.

> I think you are allowed to add new replicas. When a replica is added,
> it goes through the normal join process.

At what point does it join the group and start to ACK, i.e. become
part of a quorum? That's the question I wanted to see written down
explicitly in this document. Raft has an answer for it.

> > - the same goes for removing a replica. How is the quorum reduced?
>
> The node is just removed, I guess. If the total number of nodes
> becomes less than the quorum, obviously no transactions will be
> served.

Other vendors support 3 different scenarios here:

- a replica can be down for maintenance; in our terms it means it is
  simply shut down, without changes to _cluster or space settings;
- it can be removed forever, in which case an admin may want to reduce
  the quorum size;
- it can be replaced.

With a box.schema.group API all 3 cases can be translated into API
calls on the group itself, e.g. it would be possible to say

    box.schema.group.groupname.remove(uuid)
    box.schema.group.groupname.replace(old_uuid, new_uuid)

We don't need to implement these right away, but we must provision for
such operations in the spec, and at least have a clue how they will be
handled in the future.

> However, what to do with the existing pending transactions, which
> already counted the removed replica in their quorums? Should they be
> decremented?
>
> All I am talking about here are guesses, which should be clarified in
> the RFC in the ideal world, of course.
>
> Tbh, we discussed the sync replication for many hours in voice, and it
> is a surprise that all of it fit into such a small update of the RFC.
> Even though it didn't fit, since we obviously still didn't clarify
> many things. Especially the exact API look.

-- 
Konstantin Osipov, Moscow, Russia