From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from smtp53.i.mail.ru (smtp53.i.mail.ru [94.100.177.113])
(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
(No client certificate requested)
by dev.tarantool.org (Postfix) with ESMTPS id 30ECD4696C3
for ;
Fri, 24 Apr 2020 13:10:41 +0300 (MSK)
From: Aleksandr Lyapunov
Message-ID: <5e27bca5-14f4-6800-84c6-53f831cb83b6@tarantool.org>
Date: Fri, 24 Apr 2020 13:10:39 +0300
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="------------ADB041D3985D3C9AC1F775BD"
Content-Language: en-US
Subject: Re: [Tarantool-patches] [PATCH] rfc: multi-directional iterators
List-Id: Tarantool development patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: korablev@tarantool.org
Cc: tarantool-patches@dev.tarantool.org
This is a multi-part message in MIME format.
--------------ADB041D3985D3C9AC1F775BD
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
> Part of #3243
> ---
> rfc in human-readable format:https://github.com/tarantool/tarantool/blob/np/gh-3243-multi-directional-iter-rfc/doc/rfc/3309-multi-directional-iterators.md
> Issue:https://github.com/tarantool/tarantool/issues/3243
>
> doc/rfc/3309-multi-directional-iterators.md | 186 ++++++++++++++++++++++++++++
> 1 file changed, 186 insertions(+)
> create mode 100644 doc/rfc/3309-multi-directional-iterators.md
>
> diff --git a/doc/rfc/3309-multi-directional-iterators.md b/doc/rfc/3309-multi-directional-iterators.md
> new file mode 100644
> index 000000000..4f8ca7c99
> --- /dev/null
> +++ b/doc/rfc/3309-multi-directional-iterators.md
> @@ -0,0 +1,186 @@
> +# Multi-directional iterators
> +
> +* **Status**: In progress
> +* **Start date**: 22-01-2020
> +* **Authors**: Nikita Pettik @korablev77korablev at tarantool.org
> +* **Issues**: [#3243](https://github.com/tarantool/tarantool/issues/3243)
> +
> +
> +## Background and motivation
> +
> +This RFC touches only Memtx engine and TREE index type (as the only available
> +in SQL and most common in user's practice). Multi-directional iterator is
> +an iterator which allows iterating through different key parts orders.
> +For instance: consider index `i` consisting of three key parts: `a`, `b` and `c`.
> +Creating and using casual iterator looks like:
> +```
> +i = box.space.T.index.i
> +i:select({1, 2, 3}, {iterator = 'EQ'}) -- returns all tuples which has
> + -- fields a == 1, b == 2, c == 3.
> +```
> +It is OK to omit one or more key parts declaring key value to be searched. In
> +this case they are assumed to be nils:
> +`i:select({1}, {iterator = 'EQ'})` is the same as
> +`i:select({1, nil, nil}, {iterator = 'EQ'})`. So all tuples which has `a == 1`
> +are getting to the result set. More formally matching rule is following:
> +```
> +if (search-key-part[i] is nil)
> +{
> + if (iterator is LT or GT) return FALSE
> + return TRUE
What does that TRUE and FALSE mean?
BTW tree index works fine with partial keys and GT/LT iterators.
E.g. `i:select({1}, {iterator = 'GT'})` returns all tuples with `a > 1`.
> +}
> +```
> +
> +Another example:
> +`i:select({1, 1, 1}, {iterator = 'GE'})`
> +
> +Here all tuples with `a >= 1`, `b >= 1` and `c >= 1` are returned. But some users
That is not correct. Tuples and keys are compared lexicographically.
Thus all tuples with
(`a = 1` AND `b = 1` AND `c >= 1') OR (`a = 1` AND `b > 1`) OR (`a > 1`)
are the result.
> +may want to select tuples with `a >= 1`, `b >= 1` but `c < 1`. Or, alternatively,
> +somebody may be willing to get tuples ordered by `a` and `b` in ascending order
> +but by `c` in descending order: `i:select({}, {iterator = {'GE', 'GE', 'LE'})`.
Those are not alternatives but independent options. One may wish to select:
`SELECT * FROM t WHERE a > 1 AND b < 1 ORDER BY a DESC, b ASC`.
> +It is analogue of common SQL query `SELECT * FROM t ORDER BY a ASC, b ASC, c DESC`.
> +These requests are obviously impossible to fulfill with current indexes and
> +iterators implementations. This RFC suggests ways to resolve mentioned problem
> +in particular for memtx TREE indexes.
> +
> +## Implementation details
> +
> +TREE indexes in memtx engine are implemented as BPS-trees (see
> +`src/lib/salad/bps_tree.h` for details). Keys correspond to particular values
> +of key parts; data - to pointers to tuples. Hence, all data
> +are sorted by their key values due to tree structure. For this reason HASH
> +indexes have only GT and EQ (and ergo GE) iterators - data stored in a hash is
> +unordered. Tree interface itself provides several functions to operate on data.
> +Iteration process starts in `tree_iterator_start()` (which is called once as
> +`iterator->next()`): depending on its type iterator is positioned to the lower
> +or upper bound (via `memtx_tree_lower_bound()`) of range of values satisfying
> +search condition. In case key is not specified (i.e. empty), iterator is simply
> +set to the first or last element of tree. At this moment first element to be
> +returned (if any) is ready. To continue iterating `next` method of iterator
> +object is changed to one of `tree_iterator_next()`, `tree_iterator_prev()` or
> +their analogues for GE and LE iterators. Actually these functions fetch next
> +element from B-tree leaf block. If iterator points to the last element in the
> +block, it is set to the first element of the next block (leaf blocks are linked
> +into list); if there's no more blocks, iterator is invalidated and iteration
> +process is finished.
> +Taking into account this information let's review several approaches how to
> +implement multi-directional iterators.
> +
> +### Solution №1
> +
> +First solution doesn't involve any additional data structures so that it deals
> +with multi-directional iterators only using existing B-tree structure.
> +It fact, first key part can be used to locate first element as a candidate
> +in the range to be selected. To illustrate this point let's consider following
> +example:
> +
> +```
> +s:create_index('i', {parts = {{1, 'integer'}, {2, 'integer'}}})`
> +s:insert({1, 0})
> +s:insert({1, 0})
> +s:insert({1, 1})
> +s:insert({2, 0})
> +s:insert({2, 1})
> +i:select({}, {iterator = {'GE', 'LE'}})
> +```
> +
> +Result should be:
> +```
> +[1, 1]
> +[1, 0]
> +[1, 0]
> +[2, 1]
> +[2, 0]
> +```
> +Note that in case of casual GE iterator (i.e. {GE, GE} in terms of
> +multi-directional iterators) result is:
> +```
> +[1, 0]
> +[1, 0]
> +[1, 1]
> +[2, 0]
> +[2, 1]
> +```
> +As one can see, results are sorted in different orders by second key part,
> +but in the same order by first key part (not surprisingly). Assume first
> +element with first key part satisfying search condition is located: {1, 0}.
> +Then let's find out the first key part with different iterating order (in our
> +case it is second key part). Since order is different for that key part, it is
> +required to locate the first tuple with next first key part value: {2, 0}.
> +After that, auxiliary iterator is created and positioned to that tuple (see
Generally a stack of iterators is required for that.
> +schema below). Since order for the second key part is different, auxiliary
> +iterator moves towards main iterator.
> +
> +```
> +[1, 0], [1, 0], [1, 1], [2, 0] ... // tuples are arranged as in B-tree
> +^ ^
> +| <----- |
> +Main iterator Aux. iterator
> +```
> +Note that auxiliary iterator is assumed to process all keys between its initial
> +position and main iterator position (since those keys are ordered using other
> +direction - sort of full scan). Auxiliary iterator is required for each key
> +part starting from that which direction is different from one of first key part.
> +So that iteration process is likely to be slow without any optimizations.
> +For EQ iterator type it is possible to simply skip those tuples which doesn't
> +satisfy equality condition. In turn, it results in necessity to extract part of
> +key value for all 'EQ' iterator key parts and compare it with reference key
> +value. This algorithm can be generalized for any number of key parts in index.
> +
> +Pros (+):
> + - it allows to specify any sets of key part iteration orders;
> + - in contrast to the second implementation, the first resulting tuple is
> + returned way much faster (since there's no time overhead to built new tree);
> + - finally, there's almost no memory overhead.
> +
> +Cons (-):
> + - obviously the main drawback of this approach is time complexity -
> + it doesn't seem to be way faster than simple full-scan (the more key parts
> + with different iterating order are declared, the slower process will be).
Actually that's not so complex. As I can see the proposed iterator will scan
only those tuples that is about to output. There is also an optimization of
a tree traverse+search that has O(log(N)) complexity that would help
in case of SELECT with LIMIT.
> +
> +### Solution №2
> +
> +Since BPS tree is built without acknowledge in which order keys should be
> +placed, it is assumed that order is always ascending: keys in blocks are sorted
> +from smaller to bigger (left-to-right); comparison between keys is made by
> +tuple_compare_with_key() function. It makes given tree be unsuitable for
> +efficient iteration in different orders. On the other hand, it is possible to
> +build new *temporary in-memory* BPS-tree featuring correct key order. It seems
> +to be easy to achieve since keys order depends on result of comparator function.
> +Reverting result of comparison for key parts corresponding to opposite iteration
> +direction gives appropriate keys ordering in the tree. Note that not all data in
> +space is needed to be in tree (in case of non-empty search key); only sub-tree
> +making up lower or upper bound of first key part is required.
> +
> +Pros (+):
> + - any sets of key part iteration orders are allowed.
> +
> +Cons (-):
> + - first tuple to selected is probably returned with significant delay;
In the worst case it'll cost N*log(N), where N is the index size.
Note that such a request will delay ALL the requests for unpredictable
timespan.
BTW, usually it's better to build a temporary array and sort it rather
than building
a temporary tree for just one output.
But still I fear that solution is not an option.
> + - tree memory construction overhead (only during iteration routine).
> +
> +### Solution №3
> +
> +It extends solution №2 in sense it allows specifying sorting direction for
> +each part right in key def, that is during index creation. For instance:
> +
> +`s:create_index('i', {parts = {{1, 'integer', 'asc'}, {2, 'integer', 'desc'}}})`
Already implemented:
`i = s:create_index('test', {type='tree', parts={{1, 'uint',
sort_order='desc'}}})`
> +
> +After that 'GT'/'LT' iterators for parts with declared 'desc' sorting order will
> +return reversed results of comparison, so only comparators are affected.
> +That's it (probably the simplest solution; what is more 'DESC' index is casual
> +SQL feature in other DBs).
> +
> +Pros (+):
> + - index search via 'desc' iterators is almost as fast as via casual
> + iterators;
> + - this approach seems to be easy in implementation and resolves
> + problem in SQL (since at the moment of ephemeral space creation it is allowed
> + to set its PK key parts orders).
> +
> +Cons (-):
> + - 'desc' indexes are not versatile - user is unable to set different
> + orders in iterator;
> + - order of iteration itself is immutable. As a result, for each different
> + iteration order user has to create separate index which in turn consumes
> + additional memory and time as any other index.
> --
> 2.15.1
My summary:
1. `select({1, 1}, {iterator='GE'})` is not equivalent to `SELECT ..
WHERE a>=1 AND b >= 1`.
2. It is still possible to implement `iterator={'GE', 'GE'}` that will
suitable for that SQL SELECT.
3. It is also possible to implement `order={'ASC', 'DESC'}` option in
select using the solution #1.
--------------ADB041D3985D3C9AC1F775BD
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: 8bit
Part of #3243
---
rfc in human-readable format: https://github.com/tarantool/tarantool/blob/np/gh-3243-multi-directional-iter-rfc/doc/rfc/3309-multi-directional-iterators.md
Issue: https://github.com/tarantool/tarantool/issues/3243
doc/rfc/3309-multi-directional-iterators.md | 186 ++++++++++++++++++++++++++++
1 file changed, 186 insertions(+)
create mode 100644 doc/rfc/3309-multi-directional-iterators.md
diff --git a/doc/rfc/3309-multi-directional-iterators.md b/doc/rfc/3309-multi-directional-iterators.md
new file mode 100644
index 000000000..4f8ca7c99
--- /dev/null
+++ b/doc/rfc/3309-multi-directional-iterators.md
@@ -0,0 +1,186 @@
+# Multi-directional iterators
+
+* **Status**: In progress
+* **Start date**: 22-01-2020
+* **Authors**: Nikita Pettik @korablev77 korablev at tarantool.org
+* **Issues**: [#3243](https://github.com/tarantool/tarantool/issues/3243)
+
+
+## Background and motivation
+
+This RFC touches only Memtx engine and TREE index type (as the only available
+in SQL and most common in user's practice). Multi-directional iterator is
+an iterator which allows iterating through different key parts orders.
+For instance: consider index `i` consisting of three key parts: `a`, `b` and `c`.
+Creating and using casual iterator looks like:
+```
+i = box.space.T.index.i
+i:select({1, 2, 3}, {iterator = 'EQ'}) -- returns all tuples which has
+ -- fields a == 1, b == 2, c == 3.
+```
+It is OK to omit one or more key parts declaring key value to be searched. In
+this case they are assumed to be nils:
+`i:select({1}, {iterator = 'EQ'})` is the same as
+`i:select({1, nil, nil}, {iterator = 'EQ'})`. So all tuples which has `a == 1`
+are getting to the result set. More formally matching rule is following:
+```
+if (search-key-part[i] is nil)
+{
+ if (iterator is LT or GT) return FALSE
+ return TRUE
What does that TRUE and FALSE mean?
BTW tree index works fine with partial keys and GT/LT iterators.
E.g. `i:select({1}, {iterator = 'GT'})` returns all tuples with `a
> 1`.
+}
+```
+
+Another example:
+`i:select({1, 1, 1}, {iterator = 'GE'})`
+
+Here all tuples with `a >= 1`, `b >= 1` and `c >= 1` are returned. But some users
That is not correct. Tuples and keys are compared
lexicographically. Thus all tuples with
(`a = 1` AND `b = 1` AND `c >= 1') OR (`a = 1` AND `b > 1`)
OR (`a > 1`) are the result.
+may want to select tuples with `a >= 1`, `b >= 1` but `c < 1`. Or, alternatively,
+somebody may be willing to get tuples ordered by `a` and `b` in ascending order
+but by `c` in descending order: `i:select({}, {iterator = {'GE', 'GE', 'LE'})`.
Those are not alternatives but independent options. One may wish
to select:
`SELECT * FROM t WHERE a > 1 AND b < 1 ORDER BY a DESC, b
ASC`.
+It is analogue of common SQL query `SELECT * FROM t ORDER BY a ASC, b ASC, c DESC`.
+These requests are obviously impossible to fulfill with current indexes and
+iterators implementations. This RFC suggests ways to resolve mentioned problem
+in particular for memtx TREE indexes.
+
+## Implementation details
+
+TREE indexes in memtx engine are implemented as BPS-trees (see
+`src/lib/salad/bps_tree.h` for details). Keys correspond to particular values
+of key parts; data - to pointers to tuples. Hence, all data
+are sorted by their key values due to tree structure. For this reason HASH
+indexes have only GT and EQ (and ergo GE) iterators - data stored in a hash is
+unordered. Tree interface itself provides several functions to operate on data.
+Iteration process starts in `tree_iterator_start()` (which is called once as
+`iterator->next()`): depending on its type iterator is positioned to the lower
+or upper bound (via `memtx_tree_lower_bound()`) of range of values satisfying
+search condition. In case key is not specified (i.e. empty), iterator is simply
+set to the first or last element of tree. At this moment first element to be
+returned (if any) is ready. To continue iterating `next` method of iterator
+object is changed to one of `tree_iterator_next()`, `tree_iterator_prev()` or
+their analogues for GE and LE iterators. Actually these functions fetch next
+element from B-tree leaf block. If iterator points to the last element in the
+block, it is set to the first element of the next block (leaf blocks are linked
+into list); if there's no more blocks, iterator is invalidated and iteration
+process is finished.
+Taking into account this information let's review several approaches how to
+implement multi-directional iterators.
+
+### Solution №1
+
+First solution doesn't involve any additional data structures so that it deals
+with multi-directional iterators only using existing B-tree structure.
+It fact, first key part can be used to locate first element as a candidate
+in the range to be selected. To illustrate this point let's consider following
+example:
+
+```
+s:create_index('i', {parts = {{1, 'integer'}, {2, 'integer'}}})`
+s:insert({1, 0})
+s:insert({1, 0})
+s:insert({1, 1})
+s:insert({2, 0})
+s:insert({2, 1})
+i:select({}, {iterator = {'GE', 'LE'}})
+```
+
+Result should be:
+```
+[1, 1]
+[1, 0]
+[1, 0]
+[2, 1]
+[2, 0]
+```
+Note that in case of casual GE iterator (i.e. {GE, GE} in terms of
+multi-directional iterators) result is:
+```
+[1, 0]
+[1, 0]
+[1, 1]
+[2, 0]
+[2, 1]
+```
+As one can see, results are sorted in different orders by second key part,
+but in the same order by first key part (not surprisingly). Assume first
+element with first key part satisfying search condition is located: {1, 0}.
+Then let's find out the first key part with different iterating order (in our
+case it is second key part). Since order is different for that key part, it is
+required to locate the first tuple with next first key part value: {2, 0}.
+After that, auxiliary iterator is created and positioned to that tuple (see
Generally a stack of iterators is required for that.
+schema below). Since order for the second key part is different, auxiliary
+iterator moves towards main iterator.
+
+```
+[1, 0], [1, 0], [1, 1], [2, 0] ... // tuples are arranged as in B-tree
+^ ^
+| <----- |
+Main iterator Aux. iterator
+```
+Note that auxiliary iterator is assumed to process all keys between its initial
+position and main iterator position (since those keys are ordered using other
+direction - sort of full scan). Auxiliary iterator is required for each key
+part starting from that which direction is different from one of first key part.
+So that iteration process is likely to be slow without any optimizations.
+For EQ iterator type it is possible to simply skip those tuples which doesn't
+satisfy equality condition. In turn, it results in necessity to extract part of
+key value for all 'EQ' iterator key parts and compare it with reference key
+value. This algorithm can be generalized for any number of key parts in index.
+
+Pros (+):
+ - it allows to specify any sets of key part iteration orders;
+ - in contrast to the second implementation, the first resulting tuple is
+ returned way much faster (since there's no time overhead to built new tree);
+ - finally, there's almost no memory overhead.
+
+Cons (-):
+ - obviously the main drawback of this approach is time complexity -
+ it doesn't seem to be way faster than simple full-scan (the more key parts
+ with different iterating order are declared, the slower process will be).
Actually that's not so complex. As I can see the proposed iterator
will scan
only those tuples that is about to output. There is also an
optimization of
a tree traverse+search that has O(log(N)) complexity that would
help
in case of SELECT with LIMIT.
+
+### Solution №2
+
+Since BPS tree is built without acknowledge in which order keys should be
+placed, it is assumed that order is always ascending: keys in blocks are sorted
+from smaller to bigger (left-to-right); comparison between keys is made by
+tuple_compare_with_key() function. It makes given tree be unsuitable for
+efficient iteration in different orders. On the other hand, it is possible to
+build new *temporary in-memory* BPS-tree featuring correct key order. It seems
+to be easy to achieve since keys order depends on result of comparator function.
+Reverting result of comparison for key parts corresponding to opposite iteration
+direction gives appropriate keys ordering in the tree. Note that not all data in
+space is needed to be in tree (in case of non-empty search key); only sub-tree
+making up lower or upper bound of first key part is required.
+
+Pros (+):
+ - any sets of key part iteration orders are allowed.
+
+Cons (-):
+ - first tuple to selected is probably returned with significant delay;
In the worst case it'll cost N*log(N), where N is the index size.
Note that such a request will delay ALL the requests for
unpredictable timespan.
BTW, usually it's better to build a temporary array and sort it
rather than building
a temporary tree for just one output.
But still I fear that solution is not an option.
+ - tree memory construction overhead (only during iteration routine).
+
+### Solution №3
+
+It extends solution №2 in sense it allows specifying sorting direction for
+each part right in key def, that is during index creation. For instance:
+
+`s:create_index('i', {parts = {{1, 'integer', 'asc'}, {2, 'integer', 'desc'}}})`
Already implemented:
`i = s:create_index('test', {type='tree', parts={{1, 'uint',
sort_order='desc'}}})`
+
+After that 'GT'/'LT' iterators for parts with declared 'desc' sorting order will
+return reversed results of comparison, so only comparators are affected.
+That's it (probably the simplest solution; what is more 'DESC' index is casual
+SQL feature in other DBs).
+
+Pros (+):
+ - index search via 'desc' iterators is almost as fast as via casual
+ iterators;
+ - this approach seems to be easy in implementation and resolves
+ problem in SQL (since at the moment of ephemeral space creation it is allowed
+ to set its PK key parts orders).
+
+Cons (-):
+ - 'desc' indexes are not versatile - user is unable to set different
+ orders in iterator;
+ - order of iteration itself is immutable. As a result, for each different
+ iteration order user has to create separate index which in turn consumes
+ additional memory and time as any other index.
--
2.15.1
My summary:
- `select({1, 1}, {iterator='GE'})` is not equivalent to `SELECT
.. WHERE a>=1 AND b >= 1`.
- It is still possible to implement `iterator={'GE', 'GE'}` that
will suitable for that SQL SELECT.
- It is also possible to implement `order={'ASC', 'DESC'}`
option in select using the solution #1.
--------------ADB041D3985D3C9AC1F775BD--