Distributed SQL: the first step - AST for parser

Summary

There is preliminary decision that we try to approach distributed SQL as next big thing which come in 2021. This is a long project, which will be approached gradually. Here we will try to observe all necessary steps to be done both in long term and short term period of time. Longer terms goals will be described briefly, but shorter term goal (extracting of AST from SQL parser) will be given in more details, we could do it because of current set of Tarantool capabilities available and having some PoC already developed.

Bear in mind, that for longer term goals, later version of this RFC will try to collect wider relevant information, showing some (quiet random) industry precedents (but that the list of industry examples will neither be scientific, nor complete).

Vocabulary:

We use standard MapReduce vocabulary when we talks about roles of cluster nodes involved to the SQL queries processing.

Router(s) The node which processes queries and send to the corresponding storage nodes for local processing. It combines/reduces resultant data and sends it back to client
Combiner(s) Depending on the aggregation complexity needs there may be several intermediate nodes which combine (aggregate) intermedate data and send it back
Storage nodes  

 

Distributed SQL scenario

Once we move from single node case to multiple node case for SQL execution all kinds of intra-node data exchange arise. Once we get original SQL to the router node, it's expected that router would preparse SQL query, (massage them appropriately) and then send some command data to storage nodes for their local execution.   The question is - what format of data should we send to storage node?

We might try to send to the storage node the compiled binary VDBE byte-code, but it looks to be a bad idea for several reasons:

  1. Vdbe is not yet frozen and due to the technology used (lemon parser with on the fly generation of constants) it might differ very much between various versions even for builds from the same branch. Though, if we have to, we could take some extra measures to stabilize values of tokens generated;
  2. But bigger problem is - different data distribution on different shard nodes in the cluster. Which may require different query plans used for the same SQL query. If we would generate blindly the single byte code for received SQL then we may degrade performance comparing to the case when bytecode would be generated locally, for each modes separately, taking local heuristics.

So at the moment simpler approach would be more preferable:

We suggest, that in the 1st stage, take the simplest approach - transfer simple SQL query in their textual form.

Distributed SQL in the InterSystems IRIS

(This is preliminary information, without further verification)

InterSystems IRIS uses ECP network for connecting cluster nodes, and and building distributed cache of data in shards. It employs full-mesh topology between nodes involved, thus any other node could cache locally data of a any other remote shard data.

This mesh topology eventually simplifies execution of SQL join execution, which may be invoked on router node, regardless of a data locality. Preparsed and analyzed SQL queries were distributed to cluster nodes via the same ECP mesh network.

Question: What would happen on router for gigantic results?

If router should use ECP for caching of reduced data on the router then question is - how to proceed it effeciently without consuming entire block data memory? Do they use pagine for partial merge or what?

This is important question - and we will ask InterSystems experts for comments.

 

Distributed SQL in MemSQL/SingleStore

https://docs.singlestore.com/v7.1/introduction/how-memsql-works/ SingleStore has builtin mode for local debugging of 4 nodes / (TBD)  

Distributed SQL in MariaDB SkySQL

https://mariadb.com/products/skysql/ (TBD)

Cluster JOIN...

  Distributed SQL in Yugabyte ~~~~~~~~~~~~~~~~~~~~~~~~~~~

Yugabyte uses PostgreSQL language front-end in their SQL parser and optimizer, but has extended that optimizer with cluster aware specifics and has added distributed execution.

For row data consistency they use RAFT consensus protocol, but has modified classical RAFT for faster execution if multiple regions involved, where one needs to know locality of a data for effecient processing.

Cluster JOIN...

Mike Siomkin' distributed SQL PoC

There is working and deployed proof-of-concept project which has been implemented by Mike Siomkin, and which implements distributed SQL query concept using currently available Tarantool facilities.

Note

There are some obvious limitations though, but it further proves the point that with relatively small efforts, restricted distributed SQL processing might be implemented in current Tarantool within relatively short time frame.

For preliminary parsing of SQL queries Mike' code is using SQLParser LuaRocks (https://github.com/tarantool/sqlparser) module which is wrapping HyRise SQL parser implemented in C++ (https://github.com/hyrise/sql-parser) for parsing given SQL queries, and building abstract-syntax trees (AST).

The intermediate part between cluster controller at the Tarantool side and SQL parser is gridql.lua module. This is gridql responsibility to parse SQL, analyze resultant AST, and enrich it appropriately for aggregate functions, and pagination support. I.e. queries sent to storage node will be different to the original SQL query, and will be different to the query executed by combiner/reducer node.

 

The used sql-parser module exports only 2 methods: parse(query), and tostring(ast).

Despite the fact that Hyrise SQL parser has no knowledge about builtin SQL functions supported by Tarantool SQL, it's parsing facilities are enough for AST tree traversal, because any known name is marked as identifier of function, which is a good enough to start of SQL processing in gridql module.

Local SQL execution is being done using builtin Tarantool SQL engine, thus such lack of functions knowledge is not a problem, iff we pass transparently SQL query down to the node processor.

Hyrise knowns all kinds of SQL queries, but at the moment gridql modules supports only `SELECT`s, and not handles any other kinds of requests (i.e. UPDATE).

Unfortunately, gridql, at their current form, could not be published due to heavy usage of customer specific virtual tables, but there are claims that it's possible to generalize and simplify code, so it might be used elsewhere beyond current context.  

Long-term goals

So, having many industry precendents we see that ideally, for distributed SQL we have to have:

Timings for these long-term plans are not yet known, but at the moment we believe that the nearest subjects should be:

  1. SQL parser refactoring to saving AST (with serialization and deserialization if necessary);
  2. And transaction/conflict manager should be extended with cluster wide transaction support, to make possible next steps of queries beyond simple `SELECT`s;

2nd item is not SQL-specific, and will be handled elsewhere separately, this RFC we will continue to talk about SQL only plans;  

Short-term distributed SQL plan

At the moment parser, byte-code generation, and query execution is tightly coupled in SQL code in Tarantool. This is side-effect of SQLite architecture largely inherited by Tarantool from SQLite parser. And such close coupling might become a road-blocker for us in the longer term, when we would have to go to different nodes.

If we properly split query parser and query execution logics we will simplify configuration, making it easier to approach distributed SQL.

 

Distributed testing scenario

 

Appendix A - AST data structures

StatementType::kStmtSelect ast_type::AST_TYPE_SELECT
typedef struct {
   bool isValid;
   char* errorMsg;
   int errorLine;
   int errorColumn;
   size_t statementCount;
   struct LuaSQLStatement** statements;
} LuaSQLParserResult;
(there is none, yet)
typedef struct LuaSelectStatement {
   struct LuaSQLStatement base;

   struct LuaTableRef* fromTable;
   bool selectDistinct;

   size_t selectListSize;
   struct LuaExpr** selectList;

   struct LuaExpr* whereClause;

   struct LuaGroupByDescription* groupBy;

   size_t setOperationCount;
   struct LuaSetOperation** setOperations;

   size_t orderCount;
   struct LuaOrderDescription** order;

   size_t withDescriptionCount;
   struct LuaWithDescription**
          withDescriptions;
   struct LuaLimitDescription* limit;
} LuaSelectStatement;
struct Select {
      ExprList *pEList;
      u8 op;
      LogEst nSelectRow;
      u32 selFlags;
      int iLimit, iOffset;
      char zSelName[12];
      int addrOpenEphm[2];
      SrcList *pSrc;
      Expr *pWhere;
      ExprList *pGroupBy;
      Expr *pHaving;
      ExprList *pOrderBy;
      Select *pPrior;
      Select *pNext;
      Expr *pLimit;
      Expr *pOffset;
      With *pWith;
};