From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id ADF2C20EF5
	for <tarantool-patches@freelists.org>; Thu, 17 May 2018 04:36:20 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id cNdeepMeNAFn for <tarantool-patches@freelists.org>;
	Thu, 17 May 2018 04:36:20 -0400 (EDT)
Received: from smtp57.i.mail.ru (smtp57.i.mail.ru [217.69.128.37])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id D844F20ED2
	for <tarantool-patches@freelists.org>; Thu, 17 May 2018 04:36:19 -0400 (EDT)
From: Ilya Markov <imarkov@tarantool.org>
Subject: [tarantool-patches] [replication 1/1] replication: Add rfc on vclock implementation
Date: Thu, 17 May 2018 11:36:03 +0300
Message-Id: <b58de1bce01ec6584ce8424a2a7769c3d6ba20c8.1526546132.git.imarkov@tarantool.org>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: georgy@tarantool.org
Cc: tarantool-patches@freelists.org

Add description of possible redesigning of vector clocks.
---
 doc/rfc/vclock_struct.md | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 doc/rfc/vclock_struct.md

diff --git a/doc/rfc/vclock_struct.md b/doc/rfc/vclock_struct.md
new file mode 100644
index 0000000..417caa9
--- /dev/null
+++ b/doc/rfc/vclock_struct.md
@@ -0,0 +1,73 @@
+# RFC Vclock implementation design
+
+* **Status**: in progress
+* **Start date**: 16-05-2018
+* **Authors**: Ilya Markov @IlyaMarkovMipt \<imarkov@tarantool.org\>
+* **Issues**:
+
+## Summary
+
+Overview of possible implementations of vector clocks in large scale replicasets.
+
+## Background and motivation
+
+Vector clocks are used for following states(more exactly LSN) of nodes in replicasets.
+Currently, the clocks are implemented with static arrays with size limited by constant `VCLOCK_MAX`
+Indices of the array represent replica identifier in replicaset, value is LSN.
+In a large scale environment array is far from the best implementation in terms of time and memory consumption.
+
+The main problem here is that within large scale nodes may be added and deleted and the array may contain
+ large gaps. So most of memory space might turn out to be useless.
+
+For example, in star topology, one replica has fully filled vclock,
+ others have large arrays with only two valuable for them cells.
+
+## Ideas
+The new design must address the following requirements:
+1. Minimize memory consumption within constantly changing replicaset.
+2. Fast vector clock comparison, following taking into account frequent updated nodes.
+
+### Tree
+As a possible solution to address the gap problem is to use a tree.
+ The tree allocates nodes only for non-empty values. So memory usage in this case is minimized.
+ Comparison and vclock following would take O(N), N -size of replicaset.
+This time complexity is the same as in implementation with static array but with worse constant.
+
+Though operations like set and get take O(logN) instead constant time in array.
+As we can notice vclock_get is highly used with replica ids, which are written in logs.
+Under assumption that number of writing replicas is less than the size of replicaset,
+the problem with vclock_get may be solved with some fixed size cache in front of tree,
+ which will contain frequently replicas lsns.
+
+### Remapping with garbage collecting
+Another approach addressing gap problem is shifting replica id to the start of vclock array,
+getting rid of gaps.
+
+This idea helps avoiding gaps and simplifies comparison, setting vector clocks.
+On the other hand, it requires dedicated calls which follow the state of vclock and shift it, when gaps are found.
+Also the shift requires remapping of replica identifiers which also costs something in terms of memory and time consumption.
+
+### Paging
+Allocate fixed size arrays for ranges of ids and store references to them in hash/tree index.
+For example, we have several ranges of ids: 1-10, 65-100. Let's assume size of each array is 32.
+For this set, there would 3 ranges: 1-32, 65-96, 97-128. The index would contain 3 records, which could be get by 1, 65, 97 respectively.
+
+In this approach gaps are limited to the certain size, there is no need in shifting.
+Copying and comparison are almost the same as in approach with static size array.
+
+### Skip-lists
+One more possible solution to gap problem may be lists.
+But, as we need to index sometimes, we can use skip-lists, which in terms of time complexity of indexing are almost the same as trees.
+Moreover, traversing lists is faster than trees.
+
+Bad side of the idea is that it consumes memory excessively.
+
+## Conclusion
+The most easiest to implement solution is a tree. Nevertheless, it needs an optimizations for vclock_get.
+
+The paging looks like an approach which solves the current problem with gaps and doesn't create new problems or complexities.
+
+The shifting with remapping looks the worst one to my mind, mostly because of its difficulty
+and generating new maintaining processes(e.g remapping) and, therefore, new possible problems.
+
+Skip-lists are just one of variations of trees, but with extra memory consumption.
-- 
2.7.4