[Tarantool-discussions] RFC - distributed SQL, step #1 - AST

Peter Gulutzan pgulutzan at ocelot.ca
Fri Nov 20 03:05:18 MSK 2020


Hi,

I only want to address the possible SQL syntax adjustments.

You mentioned "vshard" so I take it that we must assume use of
https://www.tarantool.io/en/doc/latest/reference/reference_rock/vshard/
so I mean: what SQL adjustments can take advantage of vshard.

If we have some appropriate things in SQL, we can:
() make it appear that migration from other SQL DBMSs is easier
() partly fulfill the "distributed SQL" by really doing it in SQL.
() allow users to design tables and queries with shard awareness.
However, if I guess something looks hard, I'll say: make it illegal.

BUCKET_ID

All tuples, and most queries, need a column which contains numbers
between 1 and bucket_count, and has a non-unique index.
More flexibility would be nice but assume that's true for a year.
Since a typical SQL database will contain column values that have
different definitions, we are looking for a way that an SQL user
can declare, for each table, "here is a column that can be mapped
to bucket_id". Thus, in SQL terms, bucket_id is a GENERATED column
which is derived from other columns in the table, or system values.
For example, if column_x is an unsigned primary key, perhaps
bucket_id = mod(primary_key_value, bucket_count).
Any deterministic function will do.
Changing bucket_count will be difficult.
Changing the source column will be illegal.

Of course the initial assumption is that a router will handle
distribution questions automatically but the feature might
look better if there is clear support in the syntax.

"Distributed SQL" should mean that you can specify something in SQL.
There are some common attributes that can be specified,
and the most obvious one is: there is a choice of distribution
methods for each table.
In other words: saying
"It is distributed SQL but it's not in SQL and the only thing
you can do is distribute by a hash"
will not look like what some people think is distributed SQL.
So I am describing some additional SQL syntax
which is non-standard but very common,
with slight extra functionality that I hope will not look hard.

CREATE TABLE statement

I am suggesting a new optional clause, and a change to an
existing clause, that would at least indicate whether and how
the table is sharded.

It will show up in information_schema
and in error messages in a way that SQL users will understand.
Also, a syntax check can determine whether the clause is illegal
because there are other CREATE TABLE clauses that won't work
in combination with the new or changed clause.
(I am thinking of foreign-key and check clauses, possibly.)

CREATE TABLE ... PARTITION BY ...

Should the clause be PARTITION BY?
Of course the clause could start with the keyword SHARD, or something else.
And of course we do not wish to appear to be saying that
shards are synonymous with partitions.
But we can decide, as
PostgreSQL did when they "implemented sharding on top of partitioning"
https://www.percona.com/blog/2019/05/24/an-overview-of-sharding-in-postgresql-and-how-it-relates-to-mongodbs/
that a partition description will affect a shard definition,
so it is okay unless we plan to do partitioning without sharding later.
And PARTITION BY is the clause that is already used in
MySQL/MariaDB
https://dev.mysql.com/doc/refman/8.0/en/create-table.html
https://mariadb.com/kb/en/create-table/#partitions
PostgreSQL:
https://www.postgresql.org/docs/11/sql-createtable.html
Oracle (though they add "You cannot partition a table that is part of a 
cluster."):
https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406
https://docs.oracle.com/database/121/SQLRF/statements_7002.htm#i2215406
DB2:
https://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_sql_createtable.html
So, although this is only part of what those other vendors do, I suggest
PARTITION BY some-keyword, and we can add more options as time goes by.

The word PARTITION is already a reserved word.

REJECTING REDSHIFT

To be fair, I must acknowledge the existence of other syntaxes,
such as SQL Server's (which allows specifying functions) and most
interestingly Amazon's. For example their ALTER TABLE options include
https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html
| ALTER DISTKEY column_name
| ALTER DISTSTYLE ALL
| ALTER DISTSTYLE EVEN
| ALTER DISTSTYLE KEY DISTKEY column_name
| ALTER DISTSTYLE AUTO
where ALL means "all nodes should get a copy of the table",
DISTKEY i.e. Distribution Key means something like partition key,
AUTO means let the system figure it out.
I like ALL but it would require a significant change to vshard rules.
I like AUTO but it is nearly useless without ALL.
Therefore, given that these options are uncommon, I ignore them.

(By the way Redshift's "leader node" can itself have tables so it
doesn't need to pass all queries to other nodes, maybe our "router"
could do something similar, but that's not an issue here.)

CREATE TABLE ... PARTITION BY HASH (partition-key-column-names);

I am assuming that we don't need to change vshard hashing or
add more hash algorithms. And I don't refer here to the way
that vshard handles bucket_id, I only expect the hash here to
provide us with a bucket_id value when partition-key-columns are known.
So perhaps it is uncontroversial that HASH (primary-key-column-names)
would be legal. But is it default and is it compulsory?

I think the answers are 'no' and 'no' but am uncertain what does it
mean if this clause is absent. It could mean
"iff the whole database is sharded then assume it"
or it could mean
"even if the database is sharded this is local and probably temporary".

It would be simplest if partition-key-column-names happened to be
a primary-key-column, because updating a primary-key-column is
illegal already. Anyway, updating partition-key-column-names is illegal.

The list of columns in CREATE TABLE must include "bucket_id" (lower case).
Perhaps we could assume that it exists and not show it with SELECT *.
But I think it is useful to know its value and useful to be able to
specify it anywhere in the column list. So now an example:

CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 STRING)
PARTITION BY HASH (column_2);
INSERT INTO t VALUES (1, 'A');
SELECT * FROM t;
Result: 1, 41, 'A'.
Notice assumption: when inserting, I don't need to specify "bucket_id".
Notice assumption: bucket_id = code point of first letter of string.
Maybe those are bad assumptions, I just want the example to be easy.

I suppose that PARTITION BY clause comes after WITH clause,
as in MySQL where partition options come after table options.

CREATE TABLE ... PARTITION BY RANGE (expression)
PARTITION partition_name VALUES LESS THAN (value) [, ... ]);

This will be a bit more detailed than the first CREATE TABLE,
and will allow the user more control over what bucket a row goes to,
but will not be default and maybe will not be necessary.

An example could be
CREATE TABLE t (column_1 INT PRIMARY KEY, "bucket" INT, column_2 STRING)
PARTITION BY RANGE (column_2)
PARTITION p1 VALUES LESS THAN ('J'),
PARTITION p2 VALUES LESS THAN ('Z');
The result would be that there are only two possible bucket_id values,
1 for 'p1', and 2 for 'p2'.

Notice that this definition means that column_2 will always be less
than 'Z', so an additional clause
CHECK (column_2 < 'Y') would be redundant
and an additional clause
CHECK (column_2 > 'Д') should be illegal.

It may be useful that this means there is a one-to-one association
between partition_name and bucket_id value, as we will see later.

It may not be useful that this means a column with monotonically
increasing values will tend to be cause re-use of the same partition.

FOREIGN KEYS

The good news about foreign keys is:
if the foreign key and the referenced key are also partition keys
and the partition clauses are the same, then rows with the same
values will be in the same buckets. In such cases we can guarantee
that INSERT/UPDATE/DELETE will work without needing multiple nodes,
and equijoins will work if they are ANDED with single-row
selections of either key.

The bad news is:
Otherwise there has to be at least one lookup which might
be for a value in a different bucket, and the
ON UPDATE | ON DELETE clauses cause changes of values in
different buckets. That is, they are "multi-row".

So the recommendation is: let foreign keys be illegal.

TABLE REFERENCES

Now that we know there is sharding, we can change
FROM table_name
to
FROM table_name [PARTITION (partition_name)]
which would override router decisions because p1 is associated
with bucket.

This is manual "pruning".
"Pruning" can mean "only go to the necessary partitions".
Usually we hope an optimizer will do it, but let's not assume so.

In some systems one can say PARTITION (expression) instead,
but that's unnecessary, I think.

Table references can have PARTITION clauses in SELECT,
INSERT, UPDATE, DELETE, REPLACE, and perhaps other statements.
But they should be unnecessary for single-row INSERTs,
because single-row INSERTs will always contain the partition-key
values and therefore bucket_id is known already and therefore
passing partition name would be redundant.

INSERT

As stated, INSERT should be easy because bucket_id is easy
to calculate from the rest of the statement.

On the other hand, INSERT ... SELECT is hard, so I suggest
that it should be illegal in the first release.

UPDATE, SELECT, DELETE

Either these statements should have a partition name or they should
have a WHERE clause that contains the exact phrase
partition_key_value = literal
so that even a very simple program can calculate bucket_id.
Otherwise they should be illegal in the  first release.

Updating partition-key columns is illegal too.

ALTER

When ALTER ... DROP is illegal, it should be illegal to drop the 
"_bucket" column
or any column used for partitioning.

DROP INDEX

It should be illegal to drop indexes on the _bucket column.
It probably should be legal to drop indexes on columns used for 
partitioning.

INFORMATION_SCHEMA

The _TABLES table should show the partition definitions.
The _COLUMNS table should show whether a column is in a partition key.
Minimally, a user or a program should be able to see enough to decide
whether manual pruning is possible for a simple query.

Actually I suppose this is my job, since the only implementation of
INFORMATION_SCHEMA is the series of functions that I wrote for the
user manual. Remember that one of those functions will crash until
a certain server bug is fixed.

If it's desirable, I can write other routines that put bucket
information in INFORMATION_SCHEMA. That's easy for anyone to do,
it merely involves reading system tables and calling Lua functions.
However, I believe the display should be as a table -- that's what
any SQL person would expect -- so for example suppose we had this.
(not a system table but similar to an example in the manual):
tarantool> cfg = {
          >   servers = {
          >     { uri = 'host1:33131', zone = '1' },
          >     { uri = 'host2:33131', zone = '2' }
          >   },
          >   login = 'test_user',
          >   password = 'pass',
          >   redundancy = '2'
          > }
I want this display if I select:
| --------- + -------- + ---------- + ------------ + ------------- + 
-------------+ --------------|
| login     | password | redundancy | server_1_uri | server_1_zone | 
server_2_uri | server_2_zone |
| --------- + -------- + ---------- + ------------ + ------------- + 
-------------+ --------------|
| test_user | pass     | 2          | host:33131   | 1             | 
host2:33131  | 2             |
| --------- + -------- + ---------- + ------------ + ------------- + 
-------------+ --------------|
This is not a server issue, it is a client feature.

EXPLAIN | ANALYZE

I expect that the simple phrase "this can be done on a single node"
is the most useful thing that can be said. More is not necessary.

ON CLAUSE AND WHERE CLAUSE

These are the places where we're most likely to see joins or subqueries.
Suppose they require a search of more than one partition/bucket/node/etc.
We could cause a warning if someday we support warnings.
We could cause an error if the user fails to add a keyword (like FORCE)
to indicate an understanding of the possible consequences.
We could require 'admin' privilege to do slow queries of any kind.
But maybe we don't need to do anything, it might not be a big problem.

I believe, though, that everyone will be happier if we say:
joins and subqueries are illegal in the first release.

I could add more details but am unsure whether any of this is
the sort of feedback that you expected.

Peter Gulutzan



More information about the Tarantool-discussions mailing list