[Tarantool-discussions] RFC - distributed SQL, step #1 - AST

m.semkin m.semkin at corp.mail.ru
Fri Nov 13 21:59:37 MSK 2020


I wouldn’t separate combiner nodes from storage nodes. I think combining
should always be done on storage nodes. The aim of combining is to perform
the first level of aggregation in order to:

* reduce the network traffic
* not to overfill router's OM
* increase parallelism

So I think the most suitable place for combining is storage nodes.

> On 13 Nov 2020, at 13:06, Timur Safin <tsafin at tarantool.org> wrote:
> 
> Distributed SQL: the first step - AST for parser
> ===============================================
> 
> Summary
> -------
> 
> There is preliminary decision that we try to approach distributed SQL as
> next big thing which come in 2021. This is a long project, which will be
> approached gradually. Here we will try to observe all necessary steps to
> be done both in long term and short term period of time. Longer terms goals
> will be described briefly, but shorter term goal (extracting of AST from 
> SQL parser) will be given in more details, we could do it because of current
> set of Tarantool capabilities available and having some PoC already developed.
> 
> Bear in mind, that for longer term goals, later version of this RFC will try 
> to collect wider relevant information, showing some (quiet random) industry
> precedents (but that the list of industry examples will neither be scientific,
> nor complete). 
> 
> 
> Vocabulary:
> -----------
> 
> We use standard MapReduce vocabulary when we talks about roles of
> cluster nodes involved to the SQL queries processing.
> 
> +---------------+-----------------------------------------------------+
> | Router(s)     | The node which processes queries and send to the    |
> |               | corresponding storage nodes for local processing.   |
> |               | It combines/reduces resultant data and sends it     |
> |               | back to client                                      |
> +===============+=====================================================+
> | Combiner(s)   | Depending on the aggregation complexity needs there |
> |               | may be several intermediate nodes which combine     |
> |               | (aggregate) intermediate data and send it back      |
> +---------------+-----------------------------------------------------+
> | Storage nodes |                                                     |
> +---------------+-----------------------------------------------------+
> 
> 
> 
> Distributed SQL scenario
> ------------------------
> 
> Once we move from single node case to multiple node case for SQL
> execution all kinds of intra-node data exchange arise. Once we get
> original SQL to the router node, it's expected that router would
> preparse SQL query, (massage them appropriately) and then send some
> command data to storage nodes for their local execution.
> 
> **The question is** - what format of data should we send to storage node?
> 
> We might try to send to the storage node the compiled binary VDBE
> byte-code, but it looks to be a bad idea for several reasons:
> 
> 1. Vdbe is not yet frozen and due to the technology used (lemon parser
>   with on the fly generation of constants) it might differ very
>   much between various versions even for builds from the same branch.
>   Though, if we have to, we could take some extra measures to stabilize
>   values of tokens generated;
> 
> 2. But bigger problem is - different data distribution on different shard 
>   nodes in the cluster. Which may require different query plans used 
>   for the same SQL query. If we would generate blindly the single 
>   byte code for received SQL then we may degrade performance comparing
>   to the case when bytecode would be generated locally, for each modes 
>   separately, taking local heuristics.
> 
> So at the moment simpler approach would be more preferable:
> 
> -  We simple transfer (modified) SQL query string to each of shard node 
>   involved;
> 
> -  Or we could transfer AST serialized to some kind of binary form;
> 
> We suggest, that in the 1st stage, take the simplest approach - transfer 
> simple SQL query in their textual form.
> 
> Distributed SQL in the InterSystems IRIS
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> (TBD)
> 
> 
> Distributed SQL in MemSQL/SingleStore
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> https://docs.singlestore.com/v7.1/introduction/how-memsql-works/
> (TBD)
> 
> 
> Distributed SQL in MariaDB SkySQL
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> https://mariadb.com/products/skysql/
> (TBD)
> 
> 
> Distributed SQL in Yugabyte
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> (TBD)
> 
> 
> Mike Siomkin' distributed SQL PoC
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> There is working and deployed proof-of-concept project which has been
> implemented by Mike Siomkin, and which implements distributed SQL query
> concept using currently available Tarantool facilities. 
> 
> .. note::
> 
>   There are some obvious limitations though, but it further proves 
>   the point that with relatively small efforts, restricted distributed
>   SQL processing might be implemented in current Tarantool within
>   relatively short time frame.
> 
> For preliminary parsing of SQL queries Mike' code is using SQLParser
> LuaRocks (https://github.com/tarantool/sqlparser) module which is
> wrapping HyRise SQL parser implemented in C++
> (https://github.com/hyrise/sql-parser) for parsing given SQL queries, and
> building abstract-syntax trees (AST).
> 
> The intermediate part between cluster controller at the Tarantool side
> and SQL parser is gridql.lua module. This is gridql responsibility to
> parse SQL, analyze resultant AST, and enrich it appropriately for
> aggregate functions, and pagination support. *I.e. queries sent to
> storage node will be different to the original SQL query*, and will be
> different to the query executed by combiner/reducer node.
> 
> 
> 
> The used sql-parser module exports only 2 methods: parse(query), and
> tostring(ast).
> 
> -  `sqlparser.parse(q)` uses ffi function parseSql, which wraps hyrise SQL
>   parser mechanics and returns AST tree as ffi structure
>   LuaSQLParseResult, which in turns, composed of series of
>   LuaSQLStatement-based objects, which might be of various types
>   (e.g. kStmtSelect,  kStmtImport,  kStmtInsert,  kStmtUpdate,
>   kStmtDelete, etc.), each of them could be attributed different
>   set of data, including LuaExpr lists of various kinds;
> 
> -  `sqlparser.tostring(ast)` stringifies the passed AST object;
> 
> Despite the fact that Hyrise SQL parser has *no knowledge about builtin
> SQL functions* supported by Tarantool SQL, it's parsing facilities are
> enough for AST tree traversal, because any known name is marked as 
> identifier of function, which is a good enough to start of SQL processing
> in gridql module.
> 
> Local SQL execution is being done using builtin Tarantool SQL engine,
> thus such lack of functions knowledge is not a problem, iff we pass
> transparently SQL query down to the node processor.
> 
> Hyrise knowns all kinds of SQL queries, but at the moment gridql modules
> *supports only `SELECT`s*, and not handles any other kinds of requests
> (i.e. `UPDATE`).
> 
> Unfortunately, gridql, at their current form, could not be published due
> to heavy usage of customer specific virtual tables, but there are claims
> that it's possible to generalize and simplify code,
> so it might be used elsewhere beyond current context.
> 
> 
> Long-term goals
> ---------------
> 
> So, having many industry precendents we see that ideally, for distributed
> SQL we have to have:
> 
> -  Some kind of router accepts SQL query, and then preparses it to some
>   kind of intermediate representation (AST);
> 
> -  Topology aware query planner analyses parsed query and having knowledge
>   of data distribution it sends parsed "AST" subqueries to only those
>   nodes, which has relevant data. If there is no data locality known
>   then all cluster involved via Map-Combine-Reduce operation;
> 
> -  Query might be split into inner subqueries for which stages would be
>   planned and executed separately;
> 
> -  If transactions are not read only then cluster wide transaction
>   manager / conflict manager to be involved for 2PC mechanics
>   coordination;
> 
> -  And it would be easier if distributed SQL module should work even
>   for single-node config (with or without vshard involved) for
>   simpler debugging purposes;
> 
> Timings for these long-term plans are not yet known, but at the moment
> we believe that the nearest subjects should be:
> 
> 1. SQL parser refactoring to saving AST (with serialization and
>   deserialization if necessary);
> 
> 2. And transaction/conflict manager should be extended with cluster
>   wide transaction support, to make possible next steps of queries
>   beyond simple `SELECT`s;
> 
> 2nd item is not SQL-specific, and will be handled elsewhere separately, 
> this RFC we will continue to talk about SQL only plans;
> 
> 
> Short-term distributed SQL plan
> -------------------------------
> 
> At the moment parser, byte-code generation, and query execution is
> tightly coupled in SQL code in Tarantool. This is side-effect of SQLite
> architecture largely inherited by Tarantool from SQLite parser. And such
> close coupling might become a road-blocker for us in the longer term, when
> we would have to go to different nodes.
> 
> If we properly split query parser and query execution logics we will 
> simplify configuration, making it easier to approach distributed SQL.
> 
> -  So, for the 1st, obvious step - we would need to create a
>   separate/builtin module tarantool-sqlparser which would wrap SQL
>   parsing in `parse(query)` method in a fashion similar to Mike
>   Siomkin' `sqlparser` module above;
> 
> -  The `sql.parse(query)` method would need to return AST data structures,
>   exported via ffi.
> 
>   -  At the moment only SELECT and VIEW queries build AST during
>      parsing, this is major limitation, which would require some extra
>      refactoring later, but it's ok for the 1st stage.
> 
>   -  For the 2nd stage we would need to extend AST with more SQL
>      statement types, e.g. UPDATE / DELETE.
> 
>   -  Worth to mention, that current AST structures as defined in the
>      `sqlint.h` are quite similar to that used in Mike' sqlparser
>      module - for more details see comparison of LuaDataTypes.h to
>      sqlint.h in the Appendix A below;
> 
> -  As we build AST we may enrich returned AST nodes with information
>   about builtin functions kinds and expression data types, specific
>   for our SQL parser;
> 
> -  So in addition to currently available ways to run SQL queries via:
> 
>   -  direct `box.execute`,
> 
>   -  Or 2 step `box.prepare + box.execute`
> 
>   -  we would add `sql.parse` method, similar to `box.prepare`,
>      which should be similarly executable via `box.prepare + box.execute`;
> 
> -  This refactoring with separation of parse step, should still maintain
>   fully working SQL execution cycle, i.e. with minimum code
>   modifications all relevant SQL queries should pass whole
>   Tarantool SQL test suite. (This might apply only to SELECT
>   queries, for stage 1);
> 
> 
> 
> Distributed testing scenario
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> -  Assumption is that decoupled sql.parse method, should be powerful
>   enough to be able replace HyRise SQL parser in the gridql
>   proof-of-concept code. Once being published it will be another
>   indicator of intermediate success if it will be working with
>   Tarantool SQL parser module. (Mike is planning to publish cleaned
>   up code soon);
> 
>   -  There is no need though in gridql for anything beyond simple
>      SELECTs, which makes possible to create 1st implementation
>      without major refactorings in Tarantool SQL parser, having only
>      current data structures and code;
> 
>   -  Thus addition of AST support for DELETE, INSERT, UPDATE statements
>      will be done at stage #2, probably the next quarter, and it's
>      not a goal for current plan;
> 
>   -  i.e. we start with only read-only (SELECT and VIEW) queries, and
>      not support RW operations yet;
> 
> -  INSERT/UPDATE/DELETE queries will be done afterward, once we have
>   distributed conflict manager and transaction managers implemented.
>   And it's subject to coordinated efforts with Alexander Lyapunov team;
> 
> 
> 
> Appendix A - AST data structures
> --------------------------------
> 
> +-----------------------------------------------+-------------------------------+
> | ``StatementType::kStmtSelect``                | ``ast_type::AST_TYPE_SELECT`` |
> +===============================================+===============================+
> |.. code:: c                                    |                               |
> |                                               |                               |
> | typedef struct {                              | **(there is none, yet)**      |
> |    bool isValid;                              |                               |
> |    char* errorMsg;                            |                               |
> |    int errorLine;                             |                               |
> |    int errorColumn;                           |                               |
> |    size_t statementCount;                     |                               |
> |    struct LuaSQLStatement** statements;       |                               |
> | } LuaSQLParserResult;                         |                               |
> |                                               |                               |
> +-----------------------------------------------+-------------------------------+
> |.. code:: c                                    |.. code:: c                    |
> |                                               |                               |
> | typedef struct LuaSelectStatement {           |  struct Select {              | 
> |    struct LuaSQLStatement base;               |        ExprList *pEList;      |
> |                                               |        u8 op;                 |
> |    struct LuaTableRef* fromTable;             |        LogEst nSelectRow;     |
> |    bool selectDistinct;                       |        u32 selFlags;          |
> |                                               |        int iLimit, iOffset;   |
> |    size_t selectListSize;                     |        char zSelName[12];     |
> |    struct LuaExpr** selectList;               |        int addrOpenEphm[2];   |
> |                                               |        SrcList *pSrc;         |
> |    struct LuaExpr* whereClause;               |        Expr *pWhere;          |
> |                                               |        ExprList *pGroupBy;    |
> |    struct LuaGroupByDescription* groupBy;     |        Expr *pHaving;         |
> |                                               |        ExprList *pOrderBy;    |
> |    size_t setOperationCount;                  |        Select *pPrior;        |
> |    struct LuaSetOperation** setOperations;    |        Select *pNext;         |
> |                                               |        Expr *pLimit;          |
> |    size_t orderCount;                         |        Expr *pOffset;         |
> |    struct LuaOrderDescription** order;        |        With *pWith;           |
> |                                               |  };                           |
> |    size_t withDescriptionCount;               |                               |
> |    struct LuaWithDescription**                |                               |
> |           withDescriptions;                   |                               |
> |    struct LuaLimitDescription* limit;         |                               |
> | } LuaSelectStatement;                         |                               |
> |                                               |                               |
> +-----------------------------------------------+-------------------------------+
> 
> --
> Timur
> <distributed-sql-ast.html>



More information about the Tarantool-discussions mailing list