From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from rhino.ch-server.com (rhino.ch-server.com [209.59.190.103]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id B08ED469710 for ; Fri, 20 Nov 2020 03:05:23 +0300 (MSK) References: <0b1d01d6b9a4$aecaf500$0c60df00$@tarantool.org> From: Peter Gulutzan Message-ID: <49e6f534-1d53-3fe1-a363-3e42fcf916bb@ocelot.ca> Date: Thu, 19 Nov 2020 17:05:18 -0700 MIME-Version: 1.0 In-Reply-To: <0b1d01d6b9a4$aecaf500$0c60df00$@tarantool.org> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Transfer-Encoding: 8bit Content-Language: en-US Subject: Re: [Tarantool-discussions] RFC - distributed SQL, step #1 - AST List-Id: Tarantool development process List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Timur Safin Cc: m.semkin@corp.mail.ru, mons@tarantool.org, tarantool-discussions@dev.tarantool.org Hi, I only want to address the possible SQL syntax adjustments. You mentioned "vshard" so I take it that we must assume use of https://www.tarantool.io/en/doc/latest/reference/reference_rock/vshard/ so I mean: what SQL adjustments can take advantage of vshard. If we have some appropriate things in SQL, we can: () make it appear that migration from other SQL DBMSs is easier () partly fulfill the "distributed SQL" by really doing it in SQL. () allow users to design tables and queries with shard awareness. However, if I guess something looks hard, I'll say: make it illegal. BUCKET_ID All tuples, and most queries, need a column which contains numbers between 1 and bucket_count, and has a non-unique index. More flexibility would be nice but assume that's true for a year. Since a typical SQL database will contain column values that have different definitions, we are looking for a way that an SQL user can declare, for each table, "here is a column that can be mapped to bucket_id". Thus, in SQL terms, bucket_id is a GENERATED column which is derived from other columns in the table, or system values. For example, if column_x is an unsigned primary key, perhaps bucket_id = mod(primary_key_value, bucket_count). Any deterministic function will do. Changing bucket_count will be difficult. Changing the source column will be illegal. Of course the initial assumption is that a router will handle distribution questions automatically but the feature might look better if there is clear support in the syntax. "Distributed SQL" should mean that you can specify something in SQL. There are some common attributes that can be specified, and the most obvious one is: there is a choice of distribution methods for each table. In other words: saying "It is distributed SQL but it's not in SQL and the only thing you can do is distribute by a hash" will not look like what some people think is distributed SQL. So I am describing some additional SQL syntax which is non-standard but very common, with slight extra functionality that I hope will not look hard. CREATE TABLE statement I am suggesting a new optional clause, and a change to an existing clause, that would at least indicate whether and how the table is sharded. It will show up in information_schema and in error messages in a way that SQL users will understand. Also, a syntax check can determine whether the clause is illegal because there are other CREATE TABLE clauses that won't work in combination with the new or changed clause. (I am thinking of foreign-key and check clauses, possibly.) CREATE TABLE ... PARTITION BY ... Should the clause be PARTITION BY? Of course the clause could start with the keyword SHARD, or something else. And of course we do not wish to appear to be saying that shards are synonymous with partitions. But we can decide, as PostgreSQL did when they "implemented sharding on top of partitioning" https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/ that a partition description will affect a shard definition, so it is okay unless we plan to do partitioning without sharding later. And PARTITION BY is the clause that is already used in MySQL/MariaDB https://dev.mysql.com/doc/refman/8.0/en/create-table.html https://mariadb.com/kb/en/create-table/#partitions PostgreSQL: https://www.postgresql.org/docs/11/sql-createtable.html Oracle (though they add "You cannot partition a table that is part of a cluster."): https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406 https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406 DB2: https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_sql_createtable.html So, although this is only part of what those other vendors do, I suggest PARTITION BY some-keyword, and we can add more options as time goes by. The word PARTITION is already a reserved word. REJECTING REDSHIFT To be fair, I must acknowledge the existence of other syntaxes, such as SQL Server's (which allows specifying functions) and most interestingly Amazon's. For example their ALTER TABLE options include https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html | ALTER DISTKEY column_name | ALTER DISTSTYLE ALL | ALTER DISTSTYLE EVEN | ALTER DISTSTYLE KEY DISTKEY column_name | ALTER DISTSTYLE AUTO where ALL means "all nodes should get a copy of the table", DISTKEY i.e. Distribution Key means something like partition key, AUTO means let the system figure it out. I like ALL but it would require a significant change to vshard rules. I like AUTO but it is nearly useless without ALL. Therefore, given that these options are uncommon, I ignore them. (By the way Redshift's "leader node" can itself have tables so it doesn't need to pass all queries to other nodes, maybe our "router" could do something similar, but that's not an issue here.) CREATE TABLE ... PARTITION BY HASH (partition-key-column-names); I am assuming that we don't need to change vshard hashing or add more hash algorithms. And I don't refer here to the way that vshard handles bucket_id, I only expect the hash here to provide us with a bucket_id value when partition-key-columns are known. So perhaps it is uncontroversial that HASH (primary-key-column-names) would be legal. But is it default and is it compulsory? I think the answers are 'no' and 'no' but am uncertain what does it mean if this clause is absent. It could mean "iff the whole database is sharded then assume it" or it could mean "even if the database is sharded this is local and probably temporary". It would be simplest if partition-key-column-names happened to be a primary-key-column, because updating a primary-key-column is illegal already. Anyway, updating partition-key-column-names is illegal. The list of columns in CREATE TABLE must include "bucket_id" (lower case). Perhaps we could assume that it exists and not show it with SELECT *. But I think it is useful to know its value and useful to be able to specify it anywhere in the column list. So now an example: CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 STRING) PARTITION BY HASH (column_2); INSERT INTO t VALUES (1, 'A'); SELECT * FROM t; Result: 1, 41, 'A'. Notice assumption: when inserting, I don't need to specify "bucket_id". Notice assumption: bucket_id = code point of first letter of string. Maybe those are bad assumptions, I just want the example to be easy. I suppose that PARTITION BY clause comes after WITH clause, as in MySQL where partition options come after table options. CREATE TABLE ... PARTITION BY RANGE (expression) PARTITION partition_name VALUES LESS THAN (value) [, ... ]); This will be a bit more detailed than the first CREATE TABLE, and will allow the user more control over what bucket a row goes to, but will not be default and maybe will not be necessary. An example could be CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 STRING) PARTITION BY RANGE (column_2) PARTITION p1 VALUES LESS THAN ('J'), PARTITION p2 VALUES LESS THAN ('Z'); The result would be that there are only two possible bucket_id values, 1 for 'p1', and 2 for 'p2'. Notice that this definition means that column_2 will always be less than 'Z', so an additional clause CHECK (column_2 < 'Y') would be redundant and an additional clause CHECK (column_2 > 'Д') should be illegal. It may be useful that this means there is a one-to-one association between partition_name and bucket_id value, as we will see later. It may not be useful that this means a column with monotonically increasing values will tend to be cause re-use of the same partition. FOREIGN KEYS The good news about foreign keys is: if the foreign key and the referenced key are also partition keys and the partition clauses are the same, then rows with the same values will be in the same buckets. In such cases we can guarantee that INSERT/UPDATE/DELETE will work without needing multiple nodes, and equijoins will work if they are ANDED with single-row selections of either key. The bad news is: Otherwise there has to be at least one lookup which might be for a value in a different bucket, and the ON UPDATE | ON DELETE clauses cause changes of values in different buckets. That is, they are "multi-row". So the recommendation is: let foreign keys be illegal. TABLE REFERENCES Now that we know there is sharding, we can change FROM table_name to FROM table_name [PARTITION (partition_name)] which would override router decisions because p1 is associated with bucket. This is manual "pruning". "Pruning" can mean "only go to the necessary partitions". Usually we hope an optimizer will do it, but let's not assume so. In some systems one can say PARTITION (expression) instead, but that's unnecessary, I think. Table references can have PARTITION clauses in SELECT, INSERT, UPDATE, DELETE, REPLACE, and perhaps other statements. But they should be unnecessary for single-row INSERTs, because single-row INSERTs will always contain the partition-key values and therefore bucket_id is known already and therefore passing partition name would be redundant. INSERT As stated, INSERT should be easy because bucket_id is easy to calculate from the rest of the statement. On the other hand, INSERT ... SELECT is hard, so I suggest that it should be illegal in the first release. UPDATE, SELECT, DELETE Either these statements should have a partition name or they should have a WHERE clause that contains the exact phrase partition_key_value = literal so that even a very simple program can calculate bucket_id. Otherwise they should be illegal in the  first release. Updating partition-key columns is illegal too. ALTER When ALTER ... DROP is illegal, it should be illegal to drop the "_bucket" column or any column used for partitioning. DROP INDEX It should be illegal to drop indexes on the _bucket column. It probably should be legal to drop indexes on columns used for partitioning. INFORMATION_SCHEMA The _TABLES table should show the partition definitions. The _COLUMNS table should show whether a column is in a partition key. Minimally, a user or a program should be able to see enough to decide whether manual pruning is possible for a simple query. Actually I suppose this is my job, since the only implementation of INFORMATION_SCHEMA is the series of functions that I wrote for the user manual. Remember that one of those functions will crash until a certain server bug is fixed. If it's desirable, I can write other routines that put bucket information in INFORMATION_SCHEMA. That's easy for anyone to do, it merely involves reading system tables and calling Lua functions. However, I believe the display should be as a table -- that's what any SQL person would expect -- so for example suppose we had this. (not a system table but similar to an example in the manual): tarantool> cfg = {          >   servers = {          >     { uri = 'host1:33131', zone = '1' },          >     { uri = 'host2:33131', zone = '2' }          >   },          >   login = 'test_user',          >   password = 'pass',          >   redundancy = '2'          > } I want this display if I select: | --------- + -------- + ---------- + ------------ + ------------- + -------------+ --------------| | login     | password | redundancy | server_1_uri | server_1_zone | server_2_uri | server_2_zone | | --------- + -------- + ---------- + ------------ + ------------- + -------------+ --------------| | test_user | pass     | 2          | host:33131   | 1             | host2:33131  | 2             | | --------- + -------- + ---------- + ------------ + ------------- + -------------+ --------------| This is not a server issue, it is a client feature. EXPLAIN | ANALYZE I expect that the simple phrase "this can be done on a single node" is the most useful thing that can be said. More is not necessary. ON CLAUSE AND WHERE CLAUSE These are the places where we're most likely to see joins or subqueries. Suppose they require a search of more than one partition/bucket/node/etc. We could cause a warning if someday we support warnings. We could cause an error if the user fails to add a keyword (like FORCE) to indicate an understanding of the possible consequences. We could require 'admin' privilege to do slow queries of any kind. But maybe we don't need to do anything, it might not be a big problem. I believe, though, that everyone will be happier if we say: joins and subqueries are illegal in the first release. I could add more details but am unsure whether any of this is the sort of feedback that you expected. Peter Gulutzan