Re: [Tarantool-patches] [PATCH] rfc: multi-directional iterators

From: Aleksandr Lyapunov <alyapunov@tarantool.org>
To: korablev@tarantool.org
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH] rfc: multi-directional iterators
Date: Fri, 24 Apr 2020 13:10:39 +0300	[thread overview]
Message-ID: <5e27bca5-14f4-6800-84c6-53f831cb83b6@tarantool.org> (raw)

[-- Attachment #1: Type: text/plain, Size: 11129 bytes --]

> Part of #3243
> ---
> rfc in human-readable format:https://github.com/tarantool/tarantool/blob/np/gh-3243-multi-directional-iter-rfc/doc/rfc/3309-multi-directional-iterators.md
> Issue:https://github.com/tarantool/tarantool/issues/3243
>
>   doc/rfc/3309-multi-directional-iterators.md | 186 ++++++++++++++++++++++++++++
>   1 file changed, 186 insertions(+)
>   create mode 100644 doc/rfc/3309-multi-directional-iterators.md
>
> diff --git a/doc/rfc/3309-multi-directional-iterators.md b/doc/rfc/3309-multi-directional-iterators.md
> new file mode 100644
> index 000000000..4f8ca7c99
> --- /dev/null
> +++ b/doc/rfc/3309-multi-directional-iterators.md
> @@ -0,0 +1,186 @@
> +# Multi-directional iterators
> +
> +* **Status**: In progress
> +* **Start date**: 22-01-2020
> +* **Authors**: Nikita Pettik @korablev77korablev at tarantool.org  <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>
> +* **Issues**: [#3243](https://github.com/tarantool/tarantool/issues/3243)
> +
> +
> +## Background and motivation
> +
> +This RFC touches only Memtx engine and TREE index type (as the only available
> +in SQL and most common in user's practice). Multi-directional iterator is
> +an iterator which allows iterating through different key parts orders.
> +For instance: consider index `i` consisting of three key parts: `a`, `b` and `c`.
> +Creating and using casual iterator looks like:
> +```
> +i = box.space.T.index.i
> +i:select({1, 2, 3}, {iterator = 'EQ'}) -- returns all tuples which has
> +                                       -- fields a == 1, b == 2, c == 3.
> +```
> +It is OK to omit one or more key parts declaring key value to be searched. In
> +this case they are assumed to be nils:
> +`i:select({1}, {iterator = 'EQ'})` is the same as
> +`i:select({1, nil, nil}, {iterator = 'EQ'})`. So all tuples which has `a == 1`
> +are getting to the result set. More formally matching rule is following:
> +```
> +if (search-key-part[i] is nil)
> +{
> +  if (iterator is LT or GT) return FALSE
> +  return TRUE
What does that TRUE and FALSE mean?
BTW tree index works fine with partial keys and GT/LT iterators.
E.g. `i:select({1}, {iterator = 'GT'})` returns all tuples with `a > 1`.

> +}
> +```
> +
> +Another example:
> +`i:select({1, 1, 1}, {iterator = 'GE'})`
> +
> +Here all tuples with `a >= 1`, `b >= 1` and `c >= 1` are returned. But some users
That is not correct. Tuples and keys are compared lexicographically. 
Thus all tuples with
(`a = 1` AND `b = 1` AND `c >= 1') OR (`a = 1` AND `b > 1`) OR (`a > 1`) 
are the result.

> +may want to select tuples with `a >= 1`, `b >= 1` but `c < 1`. Or, alternatively,
> +somebody may be willing to get tuples ordered by `a` and `b` in ascending order
> +but by `c` in descending order: `i:select({}, {iterator = {'GE', 'GE', 'LE'})`.
Those are not alternatives but independent options. One may wish to select:
`SELECT * FROM t WHERE a > 1 AND b < 1 ORDER BY a DESC, b ASC`.
> +It is analogue of common SQL query `SELECT * FROM t ORDER BY a ASC, b ASC, c DESC`.
> +These requests are obviously impossible to fulfill with current indexes and
> +iterators implementations. This RFC suggests ways to resolve mentioned problem
> +in particular for memtx TREE indexes.
> +
> +## Implementation details
> +
> +TREE indexes in memtx engine are implemented as BPS-trees (see
> +`src/lib/salad/bps_tree.h` for details). Keys correspond to particular values
> +of key parts; data - to pointers to tuples. Hence, all data
> +are sorted by their key values due to tree structure. For this reason HASH
> +indexes have only GT and EQ (and ergo GE) iterators - data stored in a hash is
> +unordered. Tree interface itself provides several functions to operate on data.
> +Iteration process starts in `tree_iterator_start()` (which is called once as
> +`iterator->next()`): depending on its type iterator is positioned to the lower
> +or upper bound (via `memtx_tree_lower_bound()`) of range of values satisfying
> +search condition. In case key is not specified (i.e. empty), iterator is simply
> +set to the first or last element of tree. At this moment first element to be
> +returned (if any) is ready. To continue iterating `next` method of iterator
> +object is changed to one of `tree_iterator_next()`, `tree_iterator_prev()` or
> +their analogues for GE and LE iterators. Actually these functions fetch next
> +element from B-tree leaf block. If iterator points to the last element in the
> +block, it is set to the first element of the next block (leaf blocks are linked
> +into list); if there's no more blocks, iterator is invalidated and iteration
> +process is finished.
> +Taking into account this information let's review several approaches how to
> +implement multi-directional iterators.
> +
> +### Solution №1
> +
> +First solution doesn't involve any additional data structures so that it deals
> +with multi-directional iterators only using existing B-tree structure.
> +It fact, first key part can be used to locate first element as a candidate
> +in the range to be selected. To illustrate this point let's consider following
> +example:
> +
> +```
> +s:create_index('i', {parts = {{1, 'integer'}, {2, 'integer'}}})`
> +s:insert({1, 0})
> +s:insert({1, 0})
> +s:insert({1, 1})
> +s:insert({2, 0})
> +s:insert({2, 1})
> +i:select({}, {iterator = {'GE', 'LE'}})
> +```
> +
> +Result should be:
> +```
> +[1, 1]
> +[1, 0]
> +[1, 0]
> +[2, 1]
> +[2, 0]
> +```
> +Note that in case of casual GE iterator (i.e. {GE, GE} in terms of
> +multi-directional iterators) result is:
> +```
> +[1, 0]
> +[1, 0]
> +[1, 1]
> +[2, 0]
> +[2, 1]
> +```
> +As one can see, results are sorted in different orders by second key part,
> +but in the same order by first key part (not surprisingly). Assume first
> +element with first key part satisfying search condition is located: {1, 0}.
> +Then let's find out the first key part with different iterating order (in our
> +case it is second key part). Since order is different for that key part, it is
> +required to locate the first tuple with next first key part value: {2, 0}.
> +After that, auxiliary iterator is created and positioned to that tuple (see
Generally a stack of iterators is required for that.
> +schema below). Since order for the second key part is different, auxiliary
> +iterator moves towards main iterator.
> +
> +```
> +[1, 0], [1, 0], [1, 1], [2, 0] ... // tuples are arranged as in B-tree
> +^                      ^
> +|               <----- |
> +Main iterator         Aux. iterator
> +```
> +Note that auxiliary iterator is assumed to process all keys between its initial
> +position and main iterator position (since those keys are ordered using other
> +direction - sort of full scan). Auxiliary iterator is required for each key
> +part starting from that which direction is different from one of first key part.
> +So that iteration process is likely to be slow without any optimizations.
> +For EQ iterator type it is possible to simply skip those tuples which doesn't
> +satisfy equality condition. In turn, it results in necessity to extract part of
> +key value for all 'EQ' iterator key parts and compare it with reference key
> +value. This algorithm can be generalized for any number of key parts in index.
> +
> +Pros (+):
> + - it allows to specify any sets of key part iteration orders;
> + - in contrast to the second implementation, the first resulting tuple is
> +   returned way much faster (since there's no time overhead to built new tree);
> + - finally, there's almost no memory overhead.
> +
> +Cons (-):
> + - obviously the main drawback of this approach is time complexity -
> +   it doesn't seem to be way faster than simple full-scan (the more key parts
> +   with different iterating order are declared, the slower process will be).
Actually that's not so complex. As I can see the proposed iterator will scan
only those tuples that is about to output. There is also an optimization of
a tree traverse+search that has O(log(N)) complexity that would help
in case of SELECT with LIMIT.
> +
> +### Solution №2
> +
> +Since BPS tree is built without acknowledge in which order keys should be
> +placed, it is assumed that order is always ascending: keys in blocks are sorted
> +from smaller to bigger (left-to-right); comparison between keys is made by
> +tuple_compare_with_key() function. It makes given tree be unsuitable for
> +efficient iteration in different orders. On the other hand, it is possible to
> +build new *temporary in-memory* BPS-tree featuring correct key order. It seems
> +to be easy to achieve since keys order depends on result of comparator function.
> +Reverting result of comparison for key parts corresponding to opposite iteration
> +direction gives appropriate keys ordering in the tree. Note that not all data in
> +space is needed to be in tree (in case of non-empty search key); only sub-tree
> +making up lower or upper bound of first key part is required.
> +
> +Pros (+):
> + - any sets of key part iteration orders are allowed.
> +
> +Cons (-):
> + - first tuple to selected is probably returned with significant delay;
In the worst case it'll cost N*log(N), where N is the index size.
Note that such a request will delay ALL the requests for unpredictable 
timespan.
BTW, usually it's better to build a temporary array and sort it rather 
than building
a temporary tree for just one output.
But still I fear that solution is not an option.
> + - tree memory construction overhead (only during iteration routine).
> +
> +### Solution №3
> +
> +It extends solution №2 in sense it allows specifying sorting direction for
> +each part right in key def, that is during index creation. For instance:
> +
> +`s:create_index('i', {parts = {{1, 'integer', 'asc'}, {2, 'integer', 'desc'}}})`
Already implemented:
`i = s:create_index('test', {type='tree', parts={{1, 'uint', 
sort_order='desc'}}})`
> +
> +After that 'GT'/'LT' iterators for parts with declared 'desc' sorting order will
> +return reversed results of comparison, so only comparators are affected.
> +That's it (probably the simplest solution; what is more 'DESC' index is casual
> +SQL feature in other DBs).
> +
> +Pros (+):
> + - index search via 'desc' iterators is almost as fast as via casual
> +   iterators;
> + - this approach seems to be easy in implementation and resolves
> +   problem in SQL (since at the moment of ephemeral space creation it is allowed
> +   to set its PK key parts orders).
> +
> +Cons (-):
> + - 'desc' indexes are not versatile - user is unable to set different
> +    orders in iterator;
> + - order of iteration itself is immutable. As a result, for each different
> +   iteration order user has to create separate index which in turn consumes
> +   additional memory and time as any other index.
> -- 
> 2.15.1
My summary:

 1. `select({1, 1}, {iterator='GE'})` is not equivalent to `SELECT ..
    WHERE a>=1 AND b >= 1`.
 2. It is still possible to implement `iterator={'GE', 'GE'}` that will
    suitable for that SQL SELECT.
 3. It is also possible to implement `order={'ASC', 'DESC'}` option in
    select using the solution #1.


[-- Attachment #2: Type: text/html, Size: 12164 bytes --]