From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tsafin@tarantool.org>
Received: from smtp30.i.mail.ru (smtp30.i.mail.ru [94.100.177.90])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 18967469710
 for <tarantool-discussions@dev.tarantool.org>;
 Wed, 25 Nov 2020 00:52:36 +0300 (MSK)
From: "Timur Safin" <tsafin@tarantool.org>
References: <0b1d01d6b9a4$aecaf500$0c60df00$@tarantool.org>
 <B511A916-9B56-431D-8A41-FBF4E5B6507C@corp.mail.ru>
In-Reply-To: <B511A916-9B56-431D-8A41-FBF4E5B6507C@corp.mail.ru>
Date: Wed, 25 Nov 2020 00:52:32 +0300
Message-ID: <07dc01d6c2ac$1cea9f30$56bfdd90$@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Content-Language: ru
Subject: Re: [Tarantool-discussions] RFC - distributed SQL, step #1 - AST
List-Id: Tarantool development process
 <tarantool-discussions.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-discussions>, 
 <mailto:tarantool-discussions-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-discussions/>
List-Post: <mailto:tarantool-discussions@dev.tarantool.org>
List-Help: <mailto:tarantool-discussions-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-discussions>, 
 <mailto:tarantool-discussions-request@dev.tarantool.org?subject=subscribe>
To: "'m.semkin'" <m.semkin@corp.mail.ru>
Cc: mons@tarantool.org, tarantool-discussions@dev.tarantool.org

: From: m.semkin <m.semkin@corp.mail.ru>
: Subject: Re: RFC - distributed SQL, step #1 - AST
:=20
: I wouldn=E2=80=99t separate combiner nodes from storage nodes. I think =
combining
: should always be done on storage nodes. The aim of combining is to =
perform
: the first level of aggregation in order to:
:=20
: * reduce the network traffic
: * not to overfill router's OM
: * increase parallelism
:=20
: So I think the most suitable place for combining is storage nodes.

I thought that there is theoretical possibility to introduce several
intermediate combine steps for aggregating intermediate results for
multi-steps request. And for that we would need to introduce combiner
nodes which are like front-end router, but serving only subrequests.

But on a second thought I'd rather agree with you:
- it's front-end router responsibility to plan cluster execution, and
  Split query to subqueries if there is need;
- and there is no need to transfer raw data to the different node if=20
  combine step might be done using this storage SQL facilties. i.e.
  local aggregation (combine) is done at the storage node. Indeed.


Agreed, thanks for note!

Timur


:=20
: > On 13 Nov 2020, at 13:06, Timur Safin <tsafin@tarantool.org> wrote:
: >
: > Distributed SQL: the first step - AST for parser
: > =
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
: >
: > Summary
: > -------
: >
: > There is preliminary decision that we try to approach distributed =
SQL as
: > next big thing which come in 2021. This is a long project, which =
will be
: > approached gradually. Here we will try to observe all necessary =
steps to
: > be done both in long term and short term period of time. Longer =
terms
: goals
: > will be described briefly, but shorter term goal (extracting of AST =
from
: > SQL parser) will be given in more details, we could do it because of
: current
: > set of Tarantool capabilities available and having some PoC already
: developed.
: >
: > Bear in mind, that for longer term goals, later version of this RFC =
will
: try
: > to collect wider relevant information, showing some (quiet random)
: industry
: > precedents (but that the list of industry examples will neither be
: scientific,
: > nor complete).
: >
: >
: > Vocabulary:
: > -----------
: >
: > We use standard MapReduce vocabulary when we talks about roles of
: > cluster nodes involved to the SQL queries processing.
: >
: > =
+---------------+-----------------------------------------------------+
: > | Router(s)     | The node which processes queries and send to the   =
 |
: > |               | corresponding storage nodes for local processing.  =
 |
: > |               | It combines/reduces resultant data and sends it    =
 |
: > |               | back to client                                     =
 |
: > =
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+
: > | Combiner(s)   | Depending on the aggregation complexity needs =
there |
: > |               | may be several intermediate nodes which combine    =
 |
: > |               | (aggregate) intermediate data and send it back     =
 |
: > =
+---------------+-----------------------------------------------------+
: > | Storage nodes |                                                    =
 |
: > =
+---------------+-----------------------------------------------------+
: >
: >
: >
: > Distributed SQL scenario
: > ------------------------
: >
: > Once we move from single node case to multiple node case for SQL
: > execution all kinds of intra-node data exchange arise. Once we get
: > original SQL to the router node, it's expected that router would
: > preparse SQL query, (massage them appropriately) and then send some
: > command data to storage nodes for their local execution.
: >
: > **The question is** - what format of data should we send to storage =
node?
: >
: > We might try to send to the storage node the compiled binary VDBE
: > byte-code, but it looks to be a bad idea for several reasons:
: >
: > 1. Vdbe is not yet frozen and due to the technology used (lemon =
parser
: >   with on the fly generation of constants) it might differ very
: >   much between various versions even for builds from the same =
branch.
: >   Though, if we have to, we could take some extra measures to =
stabilize
: >   values of tokens generated;
: >
: > 2. But bigger problem is - different data distribution on different =
shard
: >   nodes in the cluster. Which may require different query plans used
: >   for the same SQL query. If we would generate blindly the single
: >   byte code for received SQL then we may degrade performance =
comparing
: >   to the case when bytecode would be generated locally, for each =
modes
: >   separately, taking local heuristics.
: >
: > So at the moment simpler approach would be more preferable:
: >
: > -  We simple transfer (modified) SQL query string to each of shard =
node
: >   involved;
: >
: > -  Or we could transfer AST serialized to some kind of binary form;
: >
: > We suggest, that in the 1st stage, take the simplest approach - =
transfer
: > simple SQL query in their textual form.
: >
: > Distributed SQL in the InterSystems IRIS
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > (TBD)
: >
: >
: > Distributed SQL in MemSQL/SingleStore
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > https://docs.singlestore.com/v7.1/introduction/how-memsql-works/
: > (TBD)
: >
: >
: > Distributed SQL in MariaDB SkySQL
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > https://mariadb.com/products/skysql/
: > (TBD)
: >
: >
: > Distributed SQL in Yugabyte
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > (TBD)
: >
: >
: > Mike Siomkin' distributed SQL PoC
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > There is working and deployed proof-of-concept project which has =
been
: > implemented by Mike Siomkin, and which implements distributed SQL =
query
: > concept using currently available Tarantool facilities.
: >
: > .. note::
: >
: >   There are some obvious limitations though, but it further proves
: >   the point that with relatively small efforts, restricted =
distributed
: >   SQL processing might be implemented in current Tarantool within
: >   relatively short time frame.
: >
: > For preliminary parsing of SQL queries Mike' code is using SQLParser
: > LuaRocks (https://github.com/tarantool/sqlparser) module which is
: > wrapping HyRise SQL parser implemented in C++
: > (https://github.com/hyrise/sql-parser) for parsing given SQL =
queries, and
: > building abstract-syntax trees (AST).
: >
: > The intermediate part between cluster controller at the Tarantool =
side
: > and SQL parser is gridql.lua module. This is gridql responsibility =
to
: > parse SQL, analyze resultant AST, and enrich it appropriately for
: > aggregate functions, and pagination support. *I.e. queries sent to
: > storage node will be different to the original SQL query*, and will =
be
: > different to the query executed by combiner/reducer node.
: >
: >
: >
: > The used sql-parser module exports only 2 methods: parse(query), and
: > tostring(ast).
: >
: > -  `sqlparser.parse(q)` uses ffi function parseSql, which wraps =
hyrise SQL
: >   parser mechanics and returns AST tree as ffi structure
: >   LuaSQLParseResult, which in turns, composed of series of
: >   LuaSQLStatement-based objects, which might be of various types
: >   (e.g. kStmtSelect,  kStmtImport,  kStmtInsert,  kStmtUpdate,
: >   kStmtDelete, etc.), each of them could be attributed different
: >   set of data, including LuaExpr lists of various kinds;
: >
: > -  `sqlparser.tostring(ast)` stringifies the passed AST object;
: >
: > Despite the fact that Hyrise SQL parser has *no knowledge about =
builtin
: > SQL functions* supported by Tarantool SQL, it's parsing facilities =
are
: > enough for AST tree traversal, because any known name is marked as
: > identifier of function, which is a good enough to start of SQL =
processing
: > in gridql module.
: >
: > Local SQL execution is being done using builtin Tarantool SQL =
engine,
: > thus such lack of functions knowledge is not a problem, iff we pass
: > transparently SQL query down to the node processor.
: >
: > Hyrise knowns all kinds of SQL queries, but at the moment gridql =
modules
: > *supports only `SELECT`s*, and not handles any other kinds of =
requests
: > (i.e. `UPDATE`).
: >
: > Unfortunately, gridql, at their current form, could not be published =
due
: > to heavy usage of customer specific virtual tables, but there are =
claims
: > that it's possible to generalize and simplify code,
: > so it might be used elsewhere beyond current context.
: >
: >
: > Long-term goals
: > ---------------
: >
: > So, having many industry precendents we see that ideally, for =
distributed
: > SQL we have to have:
: >
: > -  Some kind of router accepts SQL query, and then preparses it to =
some
: >   kind of intermediate representation (AST);
: >
: > -  Topology aware query planner analyses parsed query and having =
knowledge
: >   of data distribution it sends parsed "AST" subqueries to only =
those
: >   nodes, which has relevant data. If there is no data locality known
: >   then all cluster involved via Map-Combine-Reduce operation;
: >
: > -  Query might be split into inner subqueries for which stages would =
be
: >   planned and executed separately;
: >
: > -  If transactions are not read only then cluster wide transaction
: >   manager / conflict manager to be involved for 2PC mechanics
: >   coordination;
: >
: > -  And it would be easier if distributed SQL module should work even
: >   for single-node config (with or without vshard involved) for
: >   simpler debugging purposes;
: >
: > Timings for these long-term plans are not yet known, but at the =
moment
: > we believe that the nearest subjects should be:
: >
: > 1. SQL parser refactoring to saving AST (with serialization and
: >   deserialization if necessary);
: >
: > 2. And transaction/conflict manager should be extended with cluster
: >   wide transaction support, to make possible next steps of queries
: >   beyond simple `SELECT`s;
: >
: > 2nd item is not SQL-specific, and will be handled elsewhere =
separately,
: > this RFC we will continue to talk about SQL only plans;
: >
: >
: > Short-term distributed SQL plan
: > -------------------------------
: >
: > At the moment parser, byte-code generation, and query execution is
: > tightly coupled in SQL code in Tarantool. This is side-effect of =
SQLite
: > architecture largely inherited by Tarantool from SQLite parser. And =
such
: > close coupling might become a road-blocker for us in the longer =
term, when
: > we would have to go to different nodes.
: >
: > If we properly split query parser and query execution logics we will
: > simplify configuration, making it easier to approach distributed =
SQL.
: >
: > -  So, for the 1st, obvious step - we would need to create a
: >   separate/builtin module tarantool-sqlparser which would wrap SQL
: >   parsing in `parse(query)` method in a fashion similar to Mike
: >   Siomkin' `sqlparser` module above;
: >
: > -  The `sql.parse(query)` method would need to return AST data =
structures,
: >   exported via ffi.
: >
: >   -  At the moment only SELECT and VIEW queries build AST during
: >      parsing, this is major limitation, which would require some =
extra
: >      refactoring later, but it's ok for the 1st stage.
: >
: >   -  For the 2nd stage we would need to extend AST with more SQL
: >      statement types, e.g. UPDATE / DELETE.
: >
: >   -  Worth to mention, that current AST structures as defined in the
: >      `sqlint.h` are quite similar to that used in Mike' sqlparser
: >      module - for more details see comparison of LuaDataTypes.h to
: >      sqlint.h in the Appendix A below;
: >
: > -  As we build AST we may enrich returned AST nodes with information
: >   about builtin functions kinds and expression data types, specific
: >   for our SQL parser;
: >
: > -  So in addition to currently available ways to run SQL queries =
via:
: >
: >   -  direct `box.execute`,
: >
: >   -  Or 2 step `box.prepare + box.execute`
: >
: >   -  we would add `sql.parse` method, similar to `box.prepare`,
: >      which should be similarly executable via `box.prepare + =
box.execute`;
: >
: > -  This refactoring with separation of parse step, should still =
maintain
: >   fully working SQL execution cycle, i.e. with minimum code
: >   modifications all relevant SQL queries should pass whole
: >   Tarantool SQL test suite. (This might apply only to SELECT
: >   queries, for stage 1);
: >
: >
: >
: > Distributed testing scenario
: > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
: >
: > -  Assumption is that decoupled sql.parse method, should be powerful
: >   enough to be able replace HyRise SQL parser in the gridql
: >   proof-of-concept code. Once being published it will be another
: >   indicator of intermediate success if it will be working with
: >   Tarantool SQL parser module. (Mike is planning to publish cleaned
: >   up code soon);
: >
: >   -  There is no need though in gridql for anything beyond simple
: >      SELECTs, which makes possible to create 1st implementation
: >      without major refactorings in Tarantool SQL parser, having only
: >      current data structures and code;
: >
: >   -  Thus addition of AST support for DELETE, INSERT, UPDATE =
statements
: >      will be done at stage #2, probably the next quarter, and it's
: >      not a goal for current plan;
: >
: >   -  i.e. we start with only read-only (SELECT and VIEW) queries, =
and
: >      not support RW operations yet;
: >
: > -  INSERT/UPDATE/DELETE queries will be done afterward, once we have
: >   distributed conflict manager and transaction managers implemented.
: >   And it's subject to coordinated efforts with Alexander Lyapunov =
team;
: >
: >
: >
: > Appendix A - AST data structures
: > --------------------------------
: >
: > =
+-----------------------------------------------+------------------------=
-
: ------+
: > | ``StatementType::kStmtSelect``                |
: ``ast_type::AST_TYPE_SELECT`` |
: >
: =
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D+=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
: =3D=3D=3D=3D+
: > |.. code:: c                                    |
: |
: > |                                               |
: |
: > | typedef struct {                              | **(there is none, =
yet)**
: |
: > |    bool isValid;                              |
: |
: > |    char* errorMsg;                            |
: |
: > |    int errorLine;                             |
: |
: > |    int errorColumn;                           |
: |
: > |    size_t statementCount;                     |
: |
: > |    struct LuaSQLStatement** statements;       |
: |
: > | } LuaSQLParserResult;                         |
: |
: > |                                               |
: |
: > =
+-----------------------------------------------+------------------------=
-
: ------+
: > |.. code:: c                                    |.. code:: c
: |
: > |                                               |
: |
: > | typedef struct LuaSelectStatement {           |  struct Select {
: |
: > |    struct LuaSQLStatement base;               |        ExprList =
*pEList;
: |
: > |                                               |        u8 op;
: |
: > |    struct LuaTableRef* fromTable;             |        LogEst
: nSelectRow;     |
: > |    bool selectDistinct;                       |        u32 =
selFlags;
: |
: > |                                               |        int iLimit,
: iOffset;   |
: > |    size_t selectListSize;                     |        char
: zSelName[12];     |
: > |    struct LuaExpr** selectList;               |        int
: addrOpenEphm[2];   |
: > |                                               |        SrcList =
*pSrc;
: |
: > |    struct LuaExpr* whereClause;               |        Expr =
*pWhere;
: |
: > |                                               |        ExprList
: *pGroupBy;    |
: > |    struct LuaGroupByDescription* groupBy;     |        Expr =
*pHaving;
: |
: > |                                               |        ExprList
: *pOrderBy;    |
: > |    size_t setOperationCount;                  |        Select =
*pPrior;
: |
: > |    struct LuaSetOperation** setOperations;    |        Select =
*pNext;
: |
: > |                                               |        Expr =
*pLimit;
: |
: > |    size_t orderCount;                         |        Expr =
*pOffset;
: |
: > |    struct LuaOrderDescription** order;        |        With =
*pWith;
: |
: > |                                               |  };
: |
: > |    size_t withDescriptionCount;               |
: |
: > |    struct LuaWithDescription**                |
: |
: > |           withDescriptions;                   |
: |
: > |    struct LuaLimitDescription* limit;         |
: |
: > | } LuaSelectStatement;                         |
: |
: > |                                               |
: |
: > =
+-----------------------------------------------+------------------------=
-
: ------+
: >
: > --
: > Timur
: > <distributed-sql-ast.html>