<h1 id="distributed-sql-the-first-step---ast-for-parser">Distributed SQL: the first step - AST for parser</h1>
<ul>
<li><strong>Status</strong>: In progress</li>
<li><strong>Start date</strong>: 11-2020</li>
<li><strong>Authors</strong>: Timur Safin <<a href="mailto:tsafin@tarantool.org">tsafin@tarantool.org</a>></li>
<li><strong>Issues</strong>: N/A</li>
</ul>
<h2 id="summary">Summary</h2>
<p>There is a preliminary decision to approach distributed SQL as the next big thing coming in 2021. This is a long project, which will be approached gradually. Here we try to lay out all the necessary steps, both long-term and short-term. The longer-term goals are described briefly, while the shorter-term goal (extracting an AST from the SQL parser) is given in more detail; we can do that thanks to the current set of Tarantool capabilities and a PoC that has already been developed.</p>
<p>Bear in mind that, for the longer-term goals, a later version of this RFC will try to collect wider relevant information, showing some (quite random) industry precedents (that list of industry examples will be neither scientific nor complete).</p>
<h2 id="vocabulary">Vocabulary</h2>
<p>We use the standard MapReduce vocabulary when we talk about the roles of cluster nodes involved in SQL query processing.</p>
<table style="width:97%;">
<colgroup>
<col style="width: 22%" />
<col style="width: 75%" />
</colgroup>
<thead>
<tr class="header">
<th>Router(s)</th>
<th>The node which processes queries and sends them to the corresponding storage nodes for local processing. It combines/reduces the resultant data and sends it back to the client</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Combiner(s)</td>
<td>Depending on aggregation complexity, there may be several intermediate nodes which combine (aggregate) intermediate data and send it back</td>
</tr>
<tr class="even">
<td>Storage nodes</td>
<td>The nodes which store the data and execute received (sub)queries locally, sending results back for combining/reducing</td>
</tr>
</tbody>
</table>
<h2 id="distributed-sql-scenario">Distributed SQL scenario</h2>
<p>Once we move from the single-node case to the multiple-node case for SQL execution, all kinds of inter-node data exchange arise. Once the original SQL reaches the router node, it's expected that the router will preparse the SQL query (massaging it appropriately) and then send some command data to the storage nodes for their local execution. <strong>The question is</strong>: what format of data should we send to a storage node?</p>
<p>We might try to send the storage node compiled binary VDBE byte-code, but that looks like a bad idea for several reasons:</p>
<ol type="1">
<li>The VDBE format is not yet frozen and, due to the technology used (the Lemon parser with on-the-fly generation of token constants), it might differ significantly between versions, even for builds from the same branch. Though, if we had to, we could take extra measures to stabilize the values of the generated tokens;</li>
<li>The bigger problem is different data distribution on different shard nodes in the cluster, which may require different query plans for the same SQL query. If we blindly generated a single byte-code program for the received SQL, we might degrade performance compared to the case where byte-code is generated locally on each node, taking local heuristics into account.</li>
</ol>
<p>So at the moment a simpler approach is preferable:</p>
<ul>
<li>We simply transfer the (modified) SQL query string to each shard node involved;</li>
<li>Or we transfer the AST serialized to some kind of binary form;</li>
</ul>
<p>We suggest that at the 1st stage we take the simplest approach: transfer the SQL query in its textual form.</p>
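<p>The 1st-stage approach can be sketched in a few lines of Lua. This is a minimal illustration only, assuming a vshard-based cluster and Tarantool 2.x with SQL support on the storage nodes; the function names <span class="title-ref">storage_execute</span> and <span class="title-ref">map_execute</span> are invented for this sketch:</p>

```lua
local vshard = require('vshard')

-- Runs on every storage node: execute the (possibly rewritten)
-- textual SQL query locally with the builtin SQL engine.
function storage_execute(query)
    return box.execute(query)
end

-- Router side: broadcast the textual SQL to every replicaset and
-- concatenate the partial result sets. The "reduce" step here is a
-- plain merge; real aggregation needs query rewriting (see gridql).
local function map_execute(query)
    local rows = {}
    for _, replicaset in pairs(vshard.router.routeall()) do
        local res = replicaset:callro('storage_execute', {query})
        for _, row in ipairs(res.rows) do
            table.insert(rows, row)
        end
    end
    return rows
end
```

A real router would also rewrite the query per node and merge results with aggregation awareness, rather than simply concatenating rows.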
<h3 id="distributed-sql-in-the-intersystems-iris">Distributed SQL in the InterSystems IRIS</h3>
<p>(This is preliminary information, without further verification)</p>
<p>InterSystems IRIS uses the ECP network for connecting cluster nodes and building a distributed cache of data in shards. It employs a full-mesh topology between the nodes involved, thus any node can locally cache data of any other remote shard.</p>
<p>This mesh topology simplifies execution of SQL joins, which may be invoked on the router node regardless of data locality. Preparsed and analyzed SQL queries are distributed to cluster nodes via the same ECP mesh network.</p>
<p><strong>Question:</strong> What would happen on the router with gigantic results?</p>
<p>If the router uses ECP for caching reduced data, then the question is how to proceed efficiently without loading the entire data block into memory. Do they use paging for partial merges, or something else?</p>
<p>This is an important question, and we will ask InterSystems experts for comments.</p>
<h3 id="distributed-sql-in-memsqlsinglestore">Distributed SQL in MemSQL/SingleStore</h3>
<p><a href="https://docs.singlestore.com/v7.1/introduction/how-memsql-works/">https://docs.singlestore.com/v7.1/introduction/how-memsql-works/</a> SingleStore has a builtin mode for local debugging with 4 nodes. (TBD)</p>
<h3 id="distributed-sql-in-mariadb-skysql">Distributed SQL in MariaDB SkySQL</h3>
<p><a href="https://mariadb.com/products/skysql/">https://mariadb.com/products/skysql/</a> (TBD)</p>
<p>Cluster JOIN...</p>
<h3 id="distributed-sql-in-yugabyte">Distributed SQL in Yugabyte</h3>
<p>Yugabyte uses the PostgreSQL language front-end in its SQL parser and optimizer, but has extended that optimizer with cluster-aware specifics and added distributed execution.</p>
<p>For row data consistency they use the Raft consensus protocol, but have modified classical Raft for faster execution when multiple regions are involved, where one needs to know the locality of data for efficient processing.</p>
<p>Cluster JOIN...</p>
<h3 id="mike-siomkin-distributed-sql-poc">Mike Siomkin's distributed SQL PoC</h3>
<p>There is a working and deployed proof-of-concept project, implemented by Mike Siomkin, which implements the distributed SQL query concept using currently available Tarantool facilities.</p>
<div class="note">
<div class="title">
<p>Note</p>
</div>
<p>There are some obvious limitations, but the PoC proves the point that, with relatively small effort, restricted distributed SQL processing might be implemented in current Tarantool within a relatively short time frame.</p>
</div>
<p>For preliminary parsing of SQL queries, Mike's code uses the SQLParser LuaRocks module (<a href="https://github.com/tarantool/sqlparser">https://github.com/tarantool/sqlparser</a>), which wraps the HyRise SQL parser implemented in C++ (<a href="https://github.com/hyrise/sql-parser">https://github.com/hyrise/sql-parser</a>) for parsing the given SQL queries and building abstract syntax trees (AST).</p>
<p>The intermediate part between the cluster controller on the Tarantool side and the SQL parser is the gridql.lua module. It is gridql's responsibility to parse SQL, analyze the resultant AST, and enrich it appropriately for aggregate functions and pagination support. <em>I.e. queries sent to a storage node will differ from the original SQL query</em>, and will differ from the query executed by the combiner/reducer node.</p>
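<p>To make this concrete, consider how an aggregate query must be rewritten. This is a hypothetical illustration (the actual gridql code is not published, and all names here are invented): an AVG cannot be computed by averaging per-shard averages, so storage nodes are asked for SUM and COUNT instead, and the reducer finishes the computation:</p>

```lua
-- Original client query arriving at the router:
local client_query = 'SELECT AVG(price) FROM items'

-- Per-storage-node query after rewriting: AVG is decomposed into
-- SUM and COUNT, because an average of per-shard averages is wrong.
local storage_query = 'SELECT SUM(price), COUNT(price) FROM items'

-- Reducer: combine partial sums and counts into the final average.
local function reduce_avg(partials)
    local sum, count = 0, 0
    for _, p in ipairs(partials) do
        sum = sum + p[1]
        count = count + p[2]
    end
    return sum / count
end
```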
<p>The sql-parser module exports only two methods: parse(query) and tostring(ast).</p>
<ul>
<li><span class="title-ref">sqlparser.parse(q)</span> uses the ffi function parseSql, which wraps the HyRise SQL parser mechanics and returns the AST as the ffi structure LuaSQLParserResult, which in turn is composed of a series of LuaSQLStatement-based objects of various types (e.g. kStmtSelect, kStmtImport, kStmtInsert, kStmtUpdate, kStmtDelete, etc.), each of which may carry a different set of data, including LuaExpr lists of various kinds;</li>
<li><span class="title-ref">sqlparser.tostring(ast)</span> stringifies the passed AST object;</li>
</ul>
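<p>A minimal usage sketch of the two exported methods follows. The field names <span class="title-ref">isValid</span>, <span class="title-ref">statementCount</span>, and <span class="title-ref">statements</span> follow the LuaSQLParserResult structure shown in Appendix A; the <span class="title-ref">type</span> field of a statement is an assumption about LuaDataTypes.h:</p>

```lua
local sqlparser = require('sqlparser')

-- Parse a query into an ffi LuaSQLParserResult structure.
local ast = sqlparser.parse('SELECT id, name FROM users WHERE age > 30')
assert(ast.isValid)

-- Walk the top-level statements; a SELECT comes back as a
-- kStmtSelect node (statements is a C array, hence 0-based).
for i = 0, tonumber(ast.statementCount) - 1 do
    local stmt = ast.statements[i]
    print('statement type:', stmt.type)
end

-- Stringify the AST back into SQL, e.g. after modifying it.
print(sqlparser.tostring(ast))
```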
<p>Despite the fact that the HyRise SQL parser has <em>no knowledge of the builtin SQL functions</em> supported by Tarantool SQL, its parsing facilities are enough for AST traversal, because any such name is marked as a function identifier, which is good enough to start SQL processing in the gridql module.</p>
<p>Local SQL execution is done using the builtin Tarantool SQL engine, so this lack of function knowledge is not a problem, as long as we pass the SQL query transparently down to the node processor.</p>
<p>HyRise knows all kinds of SQL queries, but at the moment the gridql module <em>supports only `SELECT`s</em> and does not handle any other kind of request (e.g. <span class="title-ref">UPDATE</span>).</p>
<p>Unfortunately, gridql in its current form cannot be published due to heavy usage of customer-specific virtual tables, but there are claims that it is possible to generalize and simplify the code so that it might be used elsewhere, beyond the current context.</p>
<h2 id="long-term-goals">Long-term goals</h2>
<p>So, having many industry precedents, we see that ideally, for distributed SQL, we have to have:</p>
<ul>
<li>Some kind of router accepts an SQL query and preparses it into some kind of intermediate representation (AST);</li>
<li>A topology-aware query planner analyzes the parsed query and, having knowledge of data distribution, sends parsed "AST" subqueries only to those nodes which have the relevant data. If no data locality is known, the whole cluster is involved via a Map-Combine-Reduce operation;</li>
<li>A query might be split into inner subqueries, for which stages would be planned and executed separately;</li>
<li>If transactions are not read-only, then a cluster-wide transaction manager / conflict manager is to be involved for 2PC coordination;</li>
<li>And it would be easier if the distributed SQL module worked even in a single-node configuration (with or without vshard involved), for simpler debugging;</li>
</ul>
<p>Timings for these long-term plans are not yet known, but at the moment we believe the nearest subjects should be:</p>
<ol type="1">
<li>SQL parser refactoring to save the AST (with serialization and deserialization if necessary);</li>
<li>The transaction/conflict manager should be extended with cluster-wide transaction support, to make possible the next steps of queries beyond simple `SELECT`s;</li>
</ol>
<p>The 2nd item is not SQL-specific and will be handled separately elsewhere; in this RFC we will continue to talk about SQL-only plans.</p>
<h2 id="short-term-distributed-sql-plan">Short-term distributed SQL plan</h2>
<p>At the moment, parsing, byte-code generation, and query execution are tightly coupled in the Tarantool SQL code. This is a side-effect of the SQLite architecture largely inherited by Tarantool along with the SQLite parser, and such close coupling might become a road-blocker in the longer term, when we have to go to different nodes.</p>
<p>If we properly split the query parsing and query execution logic, we will simplify the configuration, making it easier to approach distributed SQL.</p>
<ul>
<li>So, as the 1st, obvious step, we would need to create a separate/builtin module tarantool-sqlparser which would wrap SQL parsing in a <span class="title-ref">parse(query)</span> method, in a fashion similar to Mike Siomkin's <span class="title-ref">sqlparser</span> module above;</li>
<li>The <span class="title-ref">sql.parse(query)</span> method would need to return AST data structures, exported via ffi.
<ul>
<li>At the moment only SELECT and VIEW queries build an AST during parsing; this is a major limitation which will require some extra refactoring later, but it's OK for the 1st stage.</li>
<li>For the 2nd stage we would need to extend AST with more SQL statement types, e.g. UPDATE / DELETE.</li>
<li>It is worth mentioning that the current AST structures, as defined in <span class="title-ref">sqlint.h</span>, are quite similar to those used in Mike's sqlparser module - for more details see the comparison of LuaDataTypes.h to sqlint.h in Appendix A below;</li>
</ul></li>
<li>As we build the AST, we may enrich the returned AST nodes with information about builtin function kinds and expression data types specific to our SQL parser;</li>
<li>So, in addition to the currently available ways to run SQL queries via:
<ul>
<li>direct <span class="title-ref">box.execute</span>,</li>
<li>or the 2-step <span class="title-ref">box.prepare + box.execute</span>,</li>
<li>we would add an <span class="title-ref">sql.parse</span> method, similar to <span class="title-ref">box.prepare</span>, whose result should similarly be executable via <span class="title-ref">box.prepare + box.execute</span>;</li>
</ul></li>
<li>This refactoring, with the parse step separated, should still maintain a fully working SQL execution cycle, i.e. with minimal code modifications all relevant SQL queries should pass the whole Tarantool SQL test suite. (For stage 1 this might apply only to SELECT queries);</li>
</ul>
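<p>A sketch of how the proposed <span class="title-ref">sql.parse</span> entry point could line up with the existing execution paths. Note that <span class="title-ref">sql.parse</span> does not exist yet; the module name and the ability of <span class="title-ref">box.prepare</span> to accept an AST are the proposed behavior, not a working API:</p>

```lua
-- Today: one-shot execution.
box.execute('SELECT a FROM t WHERE a > 1')

-- Today: two-step prepare + execute.
local stmt = box.prepare('SELECT a FROM t WHERE a > ?')
stmt:execute({1})

-- Proposed (hypothetical): parse into an AST first; the AST can be
-- inspected, enriched, or shipped to another node, then prepared and
-- executed as usual.
local sql = require('sql')
local ast = sql.parse('SELECT a FROM t WHERE a > ?')
-- ... analyze / rewrite the AST here ...
local stmt2 = box.prepare(ast)
stmt2:execute({1})
```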
<h3 id="distributed-testing-scenario">Distributed testing scenario</h3>
<ul>
<li>The assumption is that the decoupled sql.parse method should be powerful enough to replace the HyRise SQL parser in the gridql proof-of-concept code. Once the PoC is published, it will be another indicator of intermediate success if it works with the Tarantool SQL parser module. (Mike is planning to publish the cleaned-up code soon);
<ul>
<li>There is no need in gridql for anything beyond simple SELECTs, which makes it possible to create the 1st implementation without major refactorings in the Tarantool SQL parser, using only the current data structures and code;</li>
<li>Thus, adding AST support for DELETE, INSERT, and UPDATE statements will be done at stage #2, probably the next quarter; it's not a goal of the current plan;</li>
<li>I.e. we start with read-only queries only (SELECT and VIEW) and do not support RW operations yet;</li>
</ul></li>
<li>INSERT/UPDATE/DELETE queries will be done afterward, once we have the distributed conflict manager and transaction manager implemented. This is subject to coordinated efforts with Alexander Lyapunov's team;</li>
</ul>
<h2 id="appendix-a---ast-data-structures">Appendix A - AST data structures</h2>
<table>
<colgroup>
<col style="width: 60%" />
<col style="width: 40%" />
</colgroup>
<thead>
<tr class="header">
<th><code>StatementType::kStmtSelect</code></th>
<th><code>ast_type::AST_TYPE_SELECT</code></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><div class="sourceCode" id="cb1"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>typedef struct {</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a> <span class="dt">bool</span> isValid;</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span>* errorMsg;</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> errorLine;</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> errorColumn;</span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> statementCount;</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaSQLStatement** statements;</span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>} LuaSQLParserResult;</span></code></pre></div></td>
<td><strong>(there is none, yet)</strong></td>
</tr>
<tr class="even">
<td><div class="sourceCode" id="cb2"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="kw">typedef</span> <span class="kw">struct</span> LuaSelectStatement {</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaSQLStatement base;</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaTableRef* fromTable;</span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">bool</span> selectDistinct;</span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> selectListSize;</span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaExpr** selectList;</span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaExpr* whereClause;</span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaGroupByDescription* groupBy;</span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-14"><a href="#cb2-14" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> setOperationCount;</span>
<span id="cb2-15"><a href="#cb2-15" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaSetOperation** setOperations;</span>
<span id="cb2-16"><a href="#cb2-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-17"><a href="#cb2-17" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> orderCount;</span>
<span id="cb2-18"><a href="#cb2-18" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaOrderDescription** order;</span>
<span id="cb2-19"><a href="#cb2-19" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb2-20"><a href="#cb2-20" aria-hidden="true" tabindex="-1"></a> <span class="dt">size_t</span> withDescriptionCount;</span>
<span id="cb2-21"><a href="#cb2-21" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaWithDescription**</span>
<span id="cb2-22"><a href="#cb2-22" aria-hidden="true" tabindex="-1"></a> withDescriptions;</span>
<span id="cb2-23"><a href="#cb2-23" aria-hidden="true" tabindex="-1"></a> <span class="kw">struct</span> LuaLimitDescription* limit;</span>
<span id="cb2-24"><a href="#cb2-24" aria-hidden="true" tabindex="-1"></a>} LuaSelectStatement;</span></code></pre></div></td>
<td><div class="sourceCode" id="cb3"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">struct</span> Select {</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a> ExprList *pEList;</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> u8 op;</span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> LogEst nSelectRow;</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a> u32 selFlags;</span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> iLimit, iOffset;</span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a> <span class="dt">char</span> zSelName[<span class="dv">12</span>];</span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a> <span class="dt">int</span> addrOpenEphm[<span class="dv">2</span>];</span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a> SrcList *pSrc;</span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a> Expr *pWhere;</span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a> ExprList *pGroupBy;</span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a> Expr *pHaving;</span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a> ExprList *pOrderBy;</span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a> Select *pPrior;</span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a> Select *pNext;</span>
<span id="cb3-16"><a href="#cb3-16" aria-hidden="true" tabindex="-1"></a> Expr *pLimit;</span>
<span id="cb3-17"><a href="#cb3-17" aria-hidden="true" tabindex="-1"></a> Expr *pOffset;</span>
<span id="cb3-18"><a href="#cb3-18" aria-hidden="true" tabindex="-1"></a> With *pWith;</span>
<span id="cb3-19"><a href="#cb3-19" aria-hidden="true" tabindex="-1"></a>};</span></code></pre></div></td>
</tr>
</tbody>
</table>