From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp40.i.mail.ru (smtp40.i.mail.ru [94.100.177.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 16F62469710 for ; Wed, 25 Nov 2020 00:55:41 +0300 (MSK) From: "Timur Safin" References: <0b1d01d6b9a4$aecaf500$0c60df00$@tarantool.org> <49e6f534-1d53-3fe1-a363-3e42fcf916bb@ocelot.ca> In-Reply-To: <49e6f534-1d53-3fe1-a363-3e42fcf916bb@ocelot.ca> Date: Wed, 25 Nov 2020 00:55:37 +0300 Message-ID: <07de01d6c2ac$8b80b600$a2822200$@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Content-Language: ru Subject: Re: [Tarantool-discussions] RFC - distributed SQL, step #1 - AST List-Id: Tarantool development process List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: 'Peter Gulutzan' Cc: m.semkin@corp.mail.ru, mons@tarantool.org, tarantool-discussions@dev.tarantool.org Thanks for your ideas, Peter! We are not yet ready to discuss in all details distributed SQL=20 modifications we would need to introduce for cluster execution, because we are in much earlier stages of development, and need=20 to proceed the 1st step yet - extracting AST from SQL parse tree. But I promise, we will return back to this discussion, once=20 we would go beyond the single node. I'll return back then more prepared. Thanks, Timur : From: Peter Gulutzan : Subject: Re: [Tarantool-discussions] RFC - distributed SQL, step #1 - = AST :=20 : Hi, :=20 : I only want to address the possible SQL syntax adjustments. :=20 : You mentioned "vshard" so I take it that we must assume use of : = https://www.tarantool.io/en/doc/latest/reference/reference_rock/vshard/ : so I mean: what SQL adjustments can take advantage of vshard. :=20 : If we have some appropriate things in SQL, we can: : () make it appear that migration from other SQL DBMSs is easier : () partly fulfill the "distributed SQL" by really doing it in SQL. : () allow users to design tables and queries with shard awareness. : However, if I guess something looks hard, I'll say: make it illegal. :=20 : BUCKET_ID :=20 : All tuples, and most queries, need a column which contains numbers : between 1 and bucket_count, and has a non-unique index. : More flexibility would be nice but assume that's true for a year. : Since a typical SQL database will contain column values that have : different definitions, we are looking for a way that an SQL user : can declare, for each table, "here is a column that can be mapped : to bucket_id". Thus, in SQL terms, bucket_id is a GENERATED column : which is derived from other columns in the table, or system values. : For example, if column_x is an unsigned primary key, perhaps : bucket_id =3D mod(primary_key_value, bucket_count). : Any deterministic function will do. : Changing bucket_count will be difficult. : Changing the source column will be illegal. :=20 : Of course the initial assumption is that a router will handle : distribution questions automatically but the feature might : look better if there is clear support in the syntax. :=20 : "Distributed SQL" should mean that you can specify something in SQL. : There are some common attributes that can be specified, : and the most obvious one is: there is a choice of distribution : methods for each table. : In other words: saying : "It is distributed SQL but it's not in SQL and the only thing : you can do is distribute by a hash" : will not look like what some people think is distributed SQL. : So I am describing some additional SQL syntax : which is non-standard but very common, : with slight extra functionality that I hope will not look hard. :=20 : CREATE TABLE statement :=20 : I am suggesting a new optional clause, and a change to an : existing clause, that would at least indicate whether and how : the table is sharded. :=20 : It will show up in information_schema : and in error messages in a way that SQL users will understand. : Also, a syntax check can determine whether the clause is illegal : because there are other CREATE TABLE clauses that won't work : in combination with the new or changed clause. : (I am thinking of foreign-key and check clauses, possibly.) :=20 : CREATE TABLE ... PARTITION BY ... :=20 : Should the clause be PARTITION BY? : Of course the clause could start with the keyword SHARD, or something = else. : And of course we do not wish to appear to be saying that : shards are synonymous with partitions. : But we can decide, as : PostgreSQL did when they "implemented sharding on top of partitioning" : https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in- : postgresql-and-how-it-relates-to-mongodbs/ : that a partition description will affect a shard definition, : so it is okay unless we plan to do partitioning without sharding = later. : And PARTITION BY is the clause that is already used in : MySQL/MariaDB : https://dev.mysql.com/doc/refman/8.0/en/create-table.html : https://mariadb.com/kb/en/create-table/#partitions : PostgreSQL: : https://www.postgresql.org/docs/11/sql-createtable.html : Oracle (though they add "You cannot partition a table that is part of = a : cluster."): : = https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406 : = https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406 : DB2: : = https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/= db2 : z_sql_createtable.html : So, although this is only part of what those other vendors do, I = suggest : PARTITION BY some-keyword, and we can add more options as time goes = by. :=20 : The word PARTITION is already a reserved word. :=20 : REJECTING REDSHIFT :=20 : To be fair, I must acknowledge the existence of other syntaxes, : such as SQL Server's (which allows specifying functions) and most : interestingly Amazon's. For example their ALTER TABLE options include : https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html : | ALTER DISTKEY column_name : | ALTER DISTSTYLE ALL : | ALTER DISTSTYLE EVEN : | ALTER DISTSTYLE KEY DISTKEY column_name : | ALTER DISTSTYLE AUTO : where ALL means "all nodes should get a copy of the table", : DISTKEY i.e. Distribution Key means something like partition key, : AUTO means let the system figure it out. : I like ALL but it would require a significant change to vshard rules. : I like AUTO but it is nearly useless without ALL. : Therefore, given that these options are uncommon, I ignore them. :=20 : (By the way Redshift's "leader node" can itself have tables so it : doesn't need to pass all queries to other nodes, maybe our "router" : could do something similar, but that's not an issue here.) :=20 : CREATE TABLE ... PARTITION BY HASH (partition-key-column-names); :=20 : I am assuming that we don't need to change vshard hashing or : add more hash algorithms. And I don't refer here to the way : that vshard handles bucket_id, I only expect the hash here to : provide us with a bucket_id value when partition-key-columns are = known. : So perhaps it is uncontroversial that HASH (primary-key-column-names) : would be legal. But is it default and is it compulsory? :=20 : I think the answers are 'no' and 'no' but am uncertain what does it : mean if this clause is absent. It could mean : "iff the whole database is sharded then assume it" : or it could mean : "even if the database is sharded this is local and probably = temporary". :=20 : It would be simplest if partition-key-column-names happened to be : a primary-key-column, because updating a primary-key-column is : illegal already. Anyway, updating partition-key-column-names is = illegal. :=20 : The list of columns in CREATE TABLE must include "bucket_id" (lower = case). : Perhaps we could assume that it exists and not show it with SELECT *. : But I think it is useful to know its value and useful to be able to : specify it anywhere in the column list. So now an example: :=20 : CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 = STRING) : PARTITION BY HASH (column_2); : INSERT INTO t VALUES (1, 'A'); : SELECT * FROM t; : Result: 1, 41, 'A'. : Notice assumption: when inserting, I don't need to specify = "bucket_id". : Notice assumption: bucket_id =3D code point of first letter of string. : Maybe those are bad assumptions, I just want the example to be easy. :=20 : I suppose that PARTITION BY clause comes after WITH clause, : as in MySQL where partition options come after table options. :=20 : CREATE TABLE ... PARTITION BY RANGE (expression) : PARTITION partition_name VALUES LESS THAN (value) [, ... ]); :=20 : This will be a bit more detailed than the first CREATE TABLE, : and will allow the user more control over what bucket a row goes to, : but will not be default and maybe will not be necessary. :=20 : An example could be : CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 = STRING) : PARTITION BY RANGE (column_2) : PARTITION p1 VALUES LESS THAN ('J'), : PARTITION p2 VALUES LESS THAN ('Z'); : The result would be that there are only two possible bucket_id values, : 1 for 'p1', and 2 for 'p2'. :=20 : Notice that this definition means that column_2 will always be less : than 'Z', so an additional clause : CHECK (column_2 < 'Y') would be redundant : and an additional clause : CHECK (column_2 > '=D0=94') should be illegal. :=20 : It may be useful that this means there is a one-to-one association : between partition_name and bucket_id value, as we will see later. :=20 : It may not be useful that this means a column with monotonically : increasing values will tend to be cause re-use of the same partition. :=20 : FOREIGN KEYS :=20 : The good news about foreign keys is: : if the foreign key and the referenced key are also partition keys : and the partition clauses are the same, then rows with the same : values will be in the same buckets. In such cases we can guarantee : that INSERT/UPDATE/DELETE will work without needing multiple nodes, : and equijoins will work if they are ANDED with single-row : selections of either key. :=20 : The bad news is: : Otherwise there has to be at least one lookup which might : be for a value in a different bucket, and the : ON UPDATE | ON DELETE clauses cause changes of values in : different buckets. That is, they are "multi-row". :=20 : So the recommendation is: let foreign keys be illegal. :=20 : TABLE REFERENCES :=20 : Now that we know there is sharding, we can change : FROM table_name : to : FROM table_name [PARTITION (partition_name)] : which would override router decisions because p1 is associated : with bucket. :=20 : This is manual "pruning". : "Pruning" can mean "only go to the necessary partitions". : Usually we hope an optimizer will do it, but let's not assume so. :=20 : In some systems one can say PARTITION (expression) instead, : but that's unnecessary, I think. :=20 : Table references can have PARTITION clauses in SELECT, : INSERT, UPDATE, DELETE, REPLACE, and perhaps other statements. : But they should be unnecessary for single-row INSERTs, : because single-row INSERTs will always contain the partition-key : values and therefore bucket_id is known already and therefore : passing partition name would be redundant. :=20 : INSERT :=20 : As stated, INSERT should be easy because bucket_id is easy : to calculate from the rest of the statement. :=20 : On the other hand, INSERT ... SELECT is hard, so I suggest : that it should be illegal in the first release. :=20 : UPDATE, SELECT, DELETE :=20 : Either these statements should have a partition name or they should : have a WHERE clause that contains the exact phrase : partition_key_value =3D literal : so that even a very simple program can calculate bucket_id. : Otherwise they should be illegal in the first release. :=20 : Updating partition-key columns is illegal too. :=20 : ALTER :=20 : When ALTER ... DROP is illegal, it should be illegal to drop the : "_bucket" column : or any column used for partitioning. :=20 : DROP INDEX :=20 : It should be illegal to drop indexes on the _bucket column. : It probably should be legal to drop indexes on columns used for : partitioning. :=20 : INFORMATION_SCHEMA :=20 : The _TABLES table should show the partition definitions. : The _COLUMNS table should show whether a column is in a partition key. : Minimally, a user or a program should be able to see enough to decide : whether manual pruning is possible for a simple query. :=20 : Actually I suppose this is my job, since the only implementation of : INFORMATION_SCHEMA is the series of functions that I wrote for the : user manual. Remember that one of those functions will crash until : a certain server bug is fixed. :=20 : If it's desirable, I can write other routines that put bucket : information in INFORMATION_SCHEMA. That's easy for anyone to do, : it merely involves reading system tables and calling Lua functions. : However, I believe the display should be as a table -- that's what : any SQL person would expect -- so for example suppose we had this. : (not a system table but similar to an example in the manual): : tarantool> cfg =3D { : > servers =3D { : > { uri =3D 'host1:33131', zone =3D '1' }, : > { uri =3D 'host2:33131', zone =3D '2' } : > }, : > login =3D 'test_user', : > password =3D 'pass', : > redundancy =3D '2' : > } : I want this display if I select: : | --------- + -------- + ---------- + ------------ + ------------- + : -------------+ --------------| : | login | password | redundancy | server_1_uri | server_1_zone | : server_2_uri | server_2_zone | : | --------- + -------- + ---------- + ------------ + ------------- + : -------------+ --------------| : | test_user | pass | 2 | host:33131 | 1 | : host2:33131 | 2 | : | --------- + -------- + ---------- + ------------ + ------------- + : -------------+ --------------| : This is not a server issue, it is a client feature. :=20 : EXPLAIN | ANALYZE :=20 : I expect that the simple phrase "this can be done on a single node" : is the most useful thing that can be said. More is not necessary. :=20 : ON CLAUSE AND WHERE CLAUSE :=20 : These are the places where we're most likely to see joins or = subqueries. : Suppose they require a search of more than one = partition/bucket/node/etc. : We could cause a warning if someday we support warnings. : We could cause an error if the user fails to add a keyword (like = FORCE) : to indicate an understanding of the possible consequences. : We could require 'admin' privilege to do slow queries of any kind. : But maybe we don't need to do anything, it might not be a big problem. :=20 : I believe, though, that everyone will be happier if we say: : joins and subqueries are illegal in the first release. :=20 : I could add more details but am unsure whether any of this is : the sort of feedback that you expected. :=20 : Peter Gulutzan