[Tarantool-discussions] Implicit cast for COMPARISON

Peter Gulutzan pgulutzan at ocelot.ca
Sat Feb 8 01:30:57 MSK 2020


Hi,

On 2020-02-07 7:24 a.m., Nikita Pettik wrote:

 > <cut>
 > To sum up, my proposal is to allow comparing scalar values with values of
 > any other type. On the other hand, result of such operations like
 > assignment or addition should depend on particular values of SCALAR
 > type.
 >
 > Alternatively, we can abandon this idea and operate apply comparison
 > rules depending purely on mp_ types of particular scalar values. But
 > then rules in NoSQL and SQL will be different.
 >

I hope that you will abandon this idea,
but let us pretend that both these behaviour changes are accepted:
(1) No implicit cast
(2) Values can be SCALAR
with these implications or follow-ups.

1. A value is SCALAR if and only if:
It is an item in a column defined as SCALAR, or
It is the result of CAST(value AS SCALAR), or
It is the result LUA('return scalar-value') or similar Lua function.

2. If a value is SCALAR, then it has two types:
according to SQL, and according to Lua.
Example:
box.execute([[INSERT INTO b VALUES ('a');]])
box.execute([[CREATE TABLE b (s1 SCALAR PRIMARY KEY);]]
box.execute([[SELECT TYPEOF(s1) FROM b;]])
type(box.space.B:select('a')[1][1])
The SQL statement will return 'scalar',
the Lua statement will return 'string'.
You might decide to redefine or rename TYPEOF(),
but you cannot avoid that there are now two types,
and probably need a new built-in function PRIMITIVE_TYPE().

3. For deciding whether an operation "is legal",
we depend on defined type. For example, LENGTH(x)
is legal if defined type is STRING or VARBINARY
or SCALAR. For deciding whether an operation
"is an error", we depend on primitive type.
For example, if scalar_value = 5.5 then
LENGTH(scalar_value) is an error although it is legal.
Presumably the error is a "runtime error" although I
suppose it is possible to know in advance whether all
values of a scalar column have the same primitive type.

4. These situations are subject to "legal but runtime error"
because the scalar value's primitive type is expected to be a number:
Arithmetic operators + - * / % << >> & |
Functions ABS(), CHAR(), AVG(), SUM(), TOTAL().
The result is never SCALAR.

5. These situations are subject to "legal but runtime error"
because the scalar value's primitive type is expected to be a BOOLEAN:
Logical operators NOT AND OR.
The result is never SCALAR.
Example: if scalar_value is FALSE, then
SELECT scalar_value OR scalar_value is legal and
result data type is BOOLEAN (not SCALAR).
Example: if scalar_value is 'FALSE', then
SELECT NOT scalar_value is a runtime error (never cast).

6. These situations are subject to "legal but runtime error"
because the scalar value's primitive type is expected to be a STRING:
There are situations where a primitive value is expected to be a STRING:
Functions LOWER() PRINTF() QUOTE() REPLACE() SOUNDEX() TRIM() UNICODE() 
UPPER()
Clause COLLATE

7. These situations are subject to "legal but runtime error"
because the scalar value's primitive type is expected to be STRING or 
VARBINARY:
Situations where primitive value is expected to be STRING or VARBINARY:
Functions HEX() LENGTH() POSITION() SUBSTR()

8. The concatenation situation looks like a special case
because the scalar value's primitive type is expected to be STRING or 
VARBINARY:
If primitive types are not the same, error.
If primitive types are not either STRING or VARBINARY, error.
If scalar_value || scalar_value: okay, result is SCALAR.
If scalar_value || string_value: okay, result is STRING.
If scalar_value || varbinary_value: okay, result is VARBINARY.
This is just a guess, and it is slightly different from what you
decided for arithmetic operations.
The same rules apply for GROUP_CONCAT().

9. These situations are not subject to "legal but runtime error":
Functions GREATEST() LEAST()
Example: LEAST(FALSE, 1, 'x', X'44') is FALSE and result data type
is BOOLEAN. This is the current behaviour, and it means that we
follow SCALAR rules even though none of the operands is SCALAR.
The same rules apply for MIN() or MAX() if the operand is SCALAR.

10. Comparison operations involving a SCALAR value cause:
Comparison according to SCALAR rules, therefore always legal.
Comparison operations > = < <= > >= LIKE BETWEEN CASE
Implied comparisons UNIQUE DISTINCT GROUP ORDER UNION
Example:
CREATE TABLE t (s1 SCALAR PRIMARY KEY);
INSERT INTO t VALUES (FALSE), (1), ('1'), (X'31');
SELECT * FROM t WHERE s1 = '1' OR s1 > '1';
Currently this causes a "type mismatch" error.
But this error will be fixed if your plan is accepted.

11. Assignment to scalar from any data type is always legal.
Assignment from scalar to any data type is legal if and only
if the primitive type is compatible.

12 SELECT scalar_column UNION SELECT non_scalar_column is SCALAR.
(Also EXCEPT and INTERSECT)


In emails in March 2019 I wrote about compatibility
with other DBMSs and with ODBC. The thread was
"The rules for the scalar data type in SQL".
I'll quote the first email as a reminder,
some of it is still relevant. By the way K. Yukhin replied:
"As far as I understand SCALAR is a bag of types. No value
can ever have type SCALAR."



-------- Forwarded Message --------
Subject:     The rules for the scalar data type in SQL
Date:     Fri, 8 Mar 2019 16:31:02 -0700
From:     Peter Gulutzan <pgulutzan at ocelot.ca>
To:     Nikita Pettik <korablev at tarantool.org>
CC:     dev at tarantool.org


Hi,

What are the rules for the SCALAR data type in SQL?
The decisions have been made, but eventually I'll need to find out what 
they are.
So I'll guess and see whether anyone objects.

Tarantool had two choices:
Choice#1: SCALAR is a "primitive" data type, on the same level as 
INTEGER, VARCHAR, BLOB, etc.
Choice#2: SCALAR is a "complex" data type, two instances can have two 
different primitive data types.
Tarantool chose #2.
I could describe this with terms like "union as in C but without labels",
or "limited polymorphism", etc. but fear that using analogies would 
cause arguments.

I can think of only three SQL DBMSs that have something close to this:
Hive UNIONTYPE 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypesunionUnionTypes
SQL Server SQL_VARIANT Re SQL Server SQL_VARIANT 
https://docs.microsoft.com/en-us/sql/t-sql/data-types/sql-variant-transact-sql?view=sql-server-2017
Oracle ANYDATA 
https://docs.oracle.com/database/121/ARPLS/t_anydat.htm#ARPLS077
This is good because Tarantool can claim to have a big feature which
other open-source DBMSs lack, that has resulted from its "NoSQL + SQL" 
heritage.
Although some of the same functionality is available via structured 
user-defined
types or JSON or XML, we're talking about a simple fully-predefined 
solution.

1. The name of the data type is SCALAR, not UNIONTYPE or SQL_VARIANT
or ANYDATA, to be compatible with Tarantool/NoSQL.
There are no other words in a SCALAR item's definition,
for example "SCALAR(5000)" is illegal and "SCALAR COLLATE x" is illegal
and Hive-like "SCALAR<integer,double>" is illegal.
SCALAR is a new reserved word.

2. Any item defined as SCALAR has an underlying primitive type.
For example, after
CREATE TABLE t (s1 SCALAR PRIMARY KEY);
INSERT INTO t VALUES (55),(X'41');
The underlying primitive type of the item in the first row is INTEGER
because literal 55 has data type INTEGER, and the underlying primitive
type in the second row is BLOB (when we start supporting BLOBs again).
An item's primitive type is far more important than its defined type.
Incidentally Tarantool might find the primitive type by looking at the
way MsgPack stores, but that is an implementation detail.

3. SCALAR is not a supertype. Before Choice#1 was accepted, I
tried to apply the logic of supertypes for further rules,
for example saying: "in integer-versus-real comparisons we can
coerce the integer to the real because real is a supertype,
and SCALAR is a supertype of both integer and real because all
their instances can be represented in SCALAR, therefore for
real-to-scalar comparisons we can coerce the real to a scalar".
But Choice#1 was rejected, so supertype logic is illegal.

4. There is no literal syntax which implies data type SCALAR.
Standard SQL indicates date/time/timestamp literals by preceding
them with a data type name e.g. DATE '1990-01-01', and we could
do the equivalent thing by saying SCALAR X'41' etc., but we won't
because it would be meaningless, the primitive type of SCALAR X'41'
would still be BLOB and that is what matters.

5. TYPEOF(x) is never SCALAR, it is always the
underlying data type. This is true even if x is null
(we still think there is a "data type" named "null").
In fact there is no function that is guaranteed to
return the defined data type.
For example, TYPEOF(CAST(1 AS SCALAR)); returns
INTEGER, not SCALAR. But sometimes I still see 'blob'.

6. For any operation that requires casting from an item defined
as SCALAR, the syntax is legal but the operation may fail at runtime.
At runtime, Tarantool detects the underlying primitive data type and 
applies the
rules for that. For example, if a definition is
CREATE TABLE t (s1 INT PRIMARY KEY, s2 SCALAR);
and within any row s2 = 'a', that is, its underlying primitive type is
varchar/char/text or whatever we decide to call character strings,
then UPDATE t SET s1 = s2; is illegal.
We usually do not know that in advance.

7. For any dyadic operation that requires implicit casting to an item 
defined
as SCALAR, the syntax is legal and the operation will not fail at runtime.
Take this situation: comparison with SCALAR and a primitive type.
The implicit cast is primitive-to-scalar not scalar-to-primitive.
For example, after
CREATE TABLE t (s1 SCALAR PRIMARY KEY);
INSERT INTO t VALUES (X'41');
SELECT * FROM t WHERE s1 > 'a';
The comparison is valid, because
we know the ordering of X'41' and 'a' in Tarantool/NoSQL
'scalar'. This is not because of supertyping (see Rule#3),
and we do not have implicit casting to SCALAR
(see Rule#6), so I think this is inconsistent, but it
looks better than an error.

8. The result data type of min/max/sum operation
on a column defined as SCALAR is SCALAR,
unless the result values is NULL. For example:
CREATE TABLE t (s1 INT, s2 SCALAR PRIMARY KEY);
INSERT INTO t VALUES (1,X'44'),(2,11),(3,1E4),(4,'a');
SELECT MIN(s2),MAX(s2),SUM(s2) FROM t;
The result is: - - [11, 'D', 10011].
That is only possible with Tarantool/NoSQL scalar rules,
but we have to ignore the illegal arithmetic (that X'44' + 11 is illegal).
Sometimes typeof(min-max-sum(s2)) is 'blob' but that's just a bug.

9. The result data type of a primitive combination is never
SCALAR, because we in effect use TYPEOF(item) not
the defined data type of the item.
(I use the word "combination" in the way that the
standard document uses it for section ""Result of
data type combinations".) Therefore for
MAX(1E308, 'a', 0, X'00')
which was the subject of an argument in issue#4032
https://github.com/tarantool/tarantool/issues/4032
There I used a "supertype" argument, and Nikita Pettik
rejected it, which is proper (see Rule#3).
I also admit that standard SQL wouldn't allow this.
However, since primitive-to-scalar comparisons are
legal, and because I believe the standard-SQL limitation
is solely due to lack of a SCALAR data type, I am not
convinced that MAX(1E308, 'a', 0, X'00') should cause
an error rather than a scalar result.

10. The result data type of union/except/intersect is
based on the primitive data type, but does not matter because
select x'41' union all select 5;
is legal anyway.

11. Casting SCALAR to char/varchar/text results in a char/varchar/text
with indefinite size and binary collation. Casting
SCALAR to any other primitive type results, as far as I
know, in no loss of information.

12. Probably for ODBC the type is binary, as for sql_variant.

Peter Gulutzan



More information about the Tarantool-discussions mailing list