From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 49C7A20F6C for ; Thu, 17 May 2018 10:36:38 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 78XsDBfVrmGo for ; Thu, 17 May 2018 10:36:38 -0400 (EDT) Received: from smtp52.i.mail.ru (smtp52.i.mail.ru [94.100.177.112]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 7F33420F65 for ; Thu, 17 May 2018 10:36:37 -0400 (EDT) Date: Thu, 17 May 2018 17:36:35 +0300 From: Konstantin Osipov Subject: [tarantool-patches] Re: [replication 1/1] replication: Add rfc on vclock implementation Message-ID: <20180517143634.GB18384@atlas> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: georgy@tarantool.org * Ilya Markov [18/05/17 12:09]: The problem this RFC does not yet address is management of deleted nodes. Today, due to array size limit, DBAs are forced to remove deleted nodes from the cluster thus recycling slots in the array. If we remove the replica set size limit, we may end up with thousands of dead replicas which nobody cares enough to garbage collect and remove from the vclock set. This means that the cost of comparing two vclocks will be not O(N) where N is the total number of replicas ever registered in the cluster, which may be much higher than the total number of alive replicas. The second issue is that the constant of O(N) may be too high as well. BTW, could you please study the source code and check in which cases we compare two vclocks id-by-id, not just their signatures and document them in the spec? In a true mesh network the number of members may be very high, but most members will only connect and chat (receive changes) with their direct neighbors, with the amount of neighbors quite low. I was thinking about using some sort of NAT or CIDR ideas to organize routing in such networks. In this case many nodes will have a range of "community-local" server ids (e.g. a subnet), and a single "public" server id, with all messages being "remapped"to this public server id before leaving the community. Another use of the idea of remapping is automatic garbage collection of dead replicas. Imagine I can somehow track and change the mapping uuid<->id in the replica set. Then I can "remap" an existing small id to a new uuid dynamically, and thus recycle ids. This idea also looks interesting since even in large mesh networks many members may stay silent for a very long time. > Add description of possible redesigning of vector clocks. > --- > doc/rfc/vclock_struct.md | 73 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 73 insertions(+) > create mode 100644 doc/rfc/vclock_struct.md > > diff --git a/doc/rfc/vclock_struct.md b/doc/rfc/vclock_struct.md > new file mode 100644 > index 0000000..417caa9 > --- /dev/null > +++ b/doc/rfc/vclock_struct.md > @@ -0,0 +1,73 @@ > +# RFC Vclock implementation design > + > +* **Status**: in progress > +* **Start date**: 16-05-2018 > +* **Authors**: Ilya Markov @IlyaMarkovMipt \ > +* **Issues**: > + > +## Summary > + > +Overview of possible implementations of vector clocks in large scale replicasets. > + > +## Background and motivation > + > +Vector clocks are used for following states(more exactly LSN) of nodes in replicasets. > +Currently, the clocks are implemented with static arrays with size limited by constant `VCLOCK_MAX` > +Indices of the array represent replica identifier in replicaset, value is LSN. > +In a large scale environment array is far from the best implementation in terms of time and memory consumption. > + > +The main problem here is that within large scale nodes may be added and deleted and the array may contain > + large gaps. So most of memory space might turn out to be useless. > + > +For example, in star topology, one replica has fully filled vclock, > + others have large arrays with only two valuable for them cells. > + > +## Ideas > +The new design must address the following requirements: > +1. Minimize memory consumption within constantly changing replicaset. > +2. Fast vector clock comparison, following taking into account frequent updated nodes. > + > +### Tree > +As a possible solution to address the gap problem is to use a tree. > + The tree allocates nodes only for non-empty values. So memory usage in this case is minimized. > + Comparison and vclock following would take O(N), N -size of replicaset. > +This time complexity is the same as in implementation with static array but with worse constant. > + > +Though operations like set and get take O(logN) instead constant time in array. > +As we can notice vclock_get is highly used with replica ids, which are written in logs. > +Under assumption that number of writing replicas is less than the size of replicaset, > +the problem with vclock_get may be solved with some fixed size cache in front of tree, > + which will contain frequently replicas lsns. > + > +### Remapping with garbage collecting > +Another approach addressing gap problem is shifting replica id to the start of vclock array, > +getting rid of gaps. > + > +This idea helps avoiding gaps and simplifies comparison, setting vector clocks. > +On the other hand, it requires dedicated calls which follow the state of vclock and shift it, when gaps are found. > +Also the shift requires remapping of replica identifiers which also costs something in terms of memory and time consumption. > + > +### Paging > +Allocate fixed size arrays for ranges of ids and store references to them in hash/tree index. > +For example, we have several ranges of ids: 1-10, 65-100. Let's assume size of each array is 32. > +For this set, there would 3 ranges: 1-32, 65-96, 97-128. The index would contain 3 records, which could be get by 1, 65, 97 respectively. > + > +In this approach gaps are limited to the certain size, there is no need in shifting. > +Copying and comparison are almost the same as in approach with static size array. > + > +### Skip-lists > +One more possible solution to gap problem may be lists. > +But, as we need to index sometimes, we can use skip-lists, which in terms of time complexity of indexing are almost the same as trees. > +Moreover, traversing lists is faster than trees. > + > +Bad side of the idea is that it consumes memory excessively. > + > +## Conclusion > +The most easiest to implement solution is a tree. Nevertheless, it needs an optimizations for vclock_get. > + > +The paging looks like an approach which solves the current problem with gaps and doesn't create new problems or complexities. > + > +The shifting with remapping looks the worst one to my mind, mostly because of its difficulty > +and generating new maintaining processes(e.g remapping) and, therefore, new possible problems. > + > +Skip-lists are just one of variations of trees, but with extra memory consumption. > -- > 2.7.4 > -- Konstantin Osipov, Moscow, Russia, +7 903 626 22 32 http://tarantool.io - www.twitter.com/kostja_osipov