From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 49C7A20F6C
	for <tarantool-patches@freelists.org>; Thu, 17 May 2018 10:36:38 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 78XsDBfVrmGo for <tarantool-patches@freelists.org>;
	Thu, 17 May 2018 10:36:38 -0400 (EDT)
Received: from smtp52.i.mail.ru (smtp52.i.mail.ru [94.100.177.112])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 7F33420F65
	for <tarantool-patches@freelists.org>; Thu, 17 May 2018 10:36:37 -0400 (EDT)
Date: Thu, 17 May 2018 17:36:35 +0300
From: Konstantin Osipov <kostja@tarantool.org>
Subject: [tarantool-patches] Re: [replication 1/1] replication: Add rfc on
 vclock implementation
Message-ID: <20180517143634.GB18384@atlas>
References: <b58de1bce01ec6584ce8424a2a7769c3d6ba20c8.1526546132.git.imarkov@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <b58de1bce01ec6584ce8424a2a7769c3d6ba20c8.1526546132.git.imarkov@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org
Cc: georgy@tarantool.org

* Ilya Markov <imarkov@tarantool.org> [18/05/17 12:09]:

The problem this RFC does not yet address is management of deleted
nodes. 

Today, due to array size limit, DBAs are forced to remove deleted
nodes from the cluster thus recycling slots in the array.

If we remove the replica set size limit, we may end up with
thousands of dead replicas which nobody cares enough
to garbage collect  and remove from the vclock set. 

This means that the cost of comparing two vclocks will be not
O(N) where N is the total number of replicas ever registered in
the cluster, which may be much higher than the total number of alive
replicas.

The second issue is that the constant of O(N) may be too high as
well. BTW, could you please study the source code and check in which
cases we compare two vclocks id-by-id, not just their signatures
and document them in the spec?

In a true mesh network the number of members may be very high, but
most members will only connect and chat (receive changes) with
their direct neighbors, with the amount of neighbors quite low. 

I was thinking about using some sort of NAT or CIDR ideas to
organize routing in such networks. In this case many nodes will
have a range of "community-local" server ids (e.g. a subnet), and
a single "public" server id, with all messages being "remapped"to
this public server id before leaving the community.

Another use of the idea of remapping is automatic garbage
collection of dead replicas. Imagine I can somehow track and
change the mapping uuid<->id in the replica set. Then I can
"remap" an existing small id to a new uuid dynamically, and thus
recycle ids.
This idea also looks interesting since even in large mesh networks
many members may stay silent for a very long time.

> Add description of possible redesigning of vector clocks.
> ---
>  doc/rfc/vclock_struct.md | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>  create mode 100644 doc/rfc/vclock_struct.md
> 
> diff --git a/doc/rfc/vclock_struct.md b/doc/rfc/vclock_struct.md
> new file mode 100644
> index 0000000..417caa9
> --- /dev/null
> +++ b/doc/rfc/vclock_struct.md
> @@ -0,0 +1,73 @@
> +# RFC Vclock implementation design
> +
> +* **Status**: in progress
> +* **Start date**: 16-05-2018
> +* **Authors**: Ilya Markov @IlyaMarkovMipt \<imarkov@tarantool.org\>
> +* **Issues**:
> +
> +## Summary
> +
> +Overview of possible implementations of vector clocks in large scale replicasets.
> +
> +## Background and motivation
> +
> +Vector clocks are used for following states(more exactly LSN) of nodes in replicasets.
> +Currently, the clocks are implemented with static arrays with size limited by constant `VCLOCK_MAX`
> +Indices of the array represent replica identifier in replicaset, value is LSN.
> +In a large scale environment array is far from the best implementation in terms of time and memory consumption.
> +
> +The main problem here is that within large scale nodes may be added and deleted and the array may contain
> + large gaps. So most of memory space might turn out to be useless.
> +
> +For example, in star topology, one replica has fully filled vclock,
> + others have large arrays with only two valuable for them cells.
> +
> +## Ideas
> +The new design must address the following requirements:
> +1. Minimize memory consumption within constantly changing replicaset.
> +2. Fast vector clock comparison, following taking into account frequent updated nodes.
> +
> +### Tree
> +As a possible solution to address the gap problem is to use a tree.
> + The tree allocates nodes only for non-empty values. So memory usage in this case is minimized.
> + Comparison and vclock following would take O(N), N -size of replicaset.
> +This time complexity is the same as in implementation with static array but with worse constant.
> +
> +Though operations like set and get take O(logN) instead constant time in array.
> +As we can notice vclock_get is highly used with replica ids, which are written in logs.
> +Under assumption that number of writing replicas is less than the size of replicaset,
> +the problem with vclock_get may be solved with some fixed size cache in front of tree,
> + which will contain frequently replicas lsns.
> +
> +### Remapping with garbage collecting
> +Another approach addressing gap problem is shifting replica id to the start of vclock array,
> +getting rid of gaps.
> +
> +This idea helps avoiding gaps and simplifies comparison, setting vector clocks.
> +On the other hand, it requires dedicated calls which follow the state of vclock and shift it, when gaps are found.
> +Also the shift requires remapping of replica identifiers which also costs something in terms of memory and time consumption.
> +
> +### Paging
> +Allocate fixed size arrays for ranges of ids and store references to them in hash/tree index.
> +For example, we have several ranges of ids: 1-10, 65-100. Let's assume size of each array is 32.
> +For this set, there would 3 ranges: 1-32, 65-96, 97-128. The index would contain 3 records, which could be get by 1, 65, 97 respectively.
> +
> +In this approach gaps are limited to the certain size, there is no need in shifting.
> +Copying and comparison are almost the same as in approach with static size array.
> +
> +### Skip-lists
> +One more possible solution to gap problem may be lists.
> +But, as we need to index sometimes, we can use skip-lists, which in terms of time complexity of indexing are almost the same as trees.
> +Moreover, traversing lists is faster than trees.
> +
> +Bad side of the idea is that it consumes memory excessively.
> +
> +## Conclusion
> +The most easiest to implement solution is a tree. Nevertheless, it needs an optimizations for vclock_get.
> +
> +The paging looks like an approach which solves the current problem with gaps and doesn't create new problems or complexities.
> +
> +The shifting with remapping looks the worst one to my mind, mostly because of its difficulty
> +and generating new maintaining processes(e.g remapping) and, therefore, new possible problems.
> +
> +Skip-lists are just one of variations of trees, but with extra memory consumption.
> -- 
> 2.7.4
> 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov