From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 2057B469719 for ; Tue, 11 Feb 2020 16:32:34 +0300 (MSK) Date: Tue, 11 Feb 2020 16:32:30 +0300 From: Mergen Imeev Message-ID: <20200211133230.GA7374@tarantool.org> References: <20200121111108.GA18881@tarantool.org> <20200130185216.GC26109@atlas> <20200203113519.GA9896@tarantool.org> <20200204185011.GF1049@tarantool.org> <20200207142446.GA1110@tarantool.org> <5cc6a434-18d2-c149-aa62-e5fa097b0882@ocelot.ca> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <5cc6a434-18d2-c149-aa62-e5fa097b0882@ocelot.ca> Subject: Re: [Tarantool-discussions] Implicit cast for COMPARISON List-Id: Tarantool development process List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Peter Gulutzan Cc: tarantool-discussions@dev.tarantool.org Hi, Thank you for the answer. My comments below. On Fri, Feb 07, 2020 at 03:30:57PM -0700, Peter Gulutzan wrote: > Hi, > > On 2020-02-07 7:24 a.m., Nikita Pettik wrote: > > > > > To sum up, my proposal is to allow comparing scalar values with values of > > any other type. On the other hand, result of such operations like > > assignment or addition should depend on particular values of SCALAR > > type. > > > > Alternatively, we can abandon this idea and operate apply comparison > > rules depending purely on mp_ types of particular scalar values. But > > then rules in NoSQL and SQL will be different. > > > > I hope that you will abandon this idea, I am not sure that we can do this. For example: tarantool> box.execute("select true in (1, '2', 3, '4');") --- - metadata: - name: true in (1, '2', 3, '4') type: boolean rows: - [false] ... Here we use an ephemeral space to store all the values from a given vector. To search for "true" we use BOX, which automatically applies SCALAR rules for comparison. (At the moment, all columns of ephemeral spaces are of SCALAR type). At the same time: tarantool> box.execute("select true in (1, '2');") --- - null - 'Type mismatch: can not convert unsigned to boolean' ... Here we have only two values for comparison, so we use OP_Eq operation twice. This means that we apply SQL rules for comparison. > but let us pretend that both these behaviour changes are accepted: > (1) No implicit cast > (2) Values can be SCALAR > with these implications or follow-ups. > > 1. A value is SCALAR if and only if: > It is an item in a column defined as SCALAR, or > It is the result of CAST(value AS SCALAR), or > It is the result LUA('return scalar-value') or similar Lua function. > I think it is true. > 2. If a value is SCALAR, then it has two types: > according to SQL, and according to Lua. > Example: > box.execute([[INSERT INTO b VALUES ('a');]]) > box.execute([[CREATE TABLE b (s1 SCALAR PRIMARY KEY);]] > box.execute([[SELECT TYPEOF(s1) FROM b;]]) > type(box.space.B:select('a')[1][1]) > The SQL statement will return 'scalar', > the Lua statement will return 'string'. > You might decide to redefine or rename TYPEOF(), > but you cannot avoid that there are now two types, > and probably need a new built-in function PRIMITIVE_TYPE(). > Not exacly: tarantool> box.execute([[CREATE TABLE b (s1 SCALAR PRIMARY KEY);]]) --- - row_count: 1 ... tarantool> box.execute([[INSERT INTO b VALUES ('a');]]) --- - row_count: 1 ... tarantool> box.execute([[SELECT TYPEOF(s1) FROM b;]]) --- - metadata: - name: TYPEOF(s1) type: string rows: - ['string'] ... Function typeof() returns only 'string', 'integer', 'double', 'boolean' and 'varbinary'. Though, I am not sure if this a feature or a bug. > 3. For deciding whether an operation "is legal", > we depend on defined type. For example, LENGTH(x) > is legal if defined type is STRING or VARBINARY > or SCALAR. For deciding whether an operation > "is an error", we depend on primitive type. > For example, if scalar_value = 5.5 then > LENGTH(scalar_value) is an error although it is legal. > Presumably the error is a "runtime error" although I > suppose it is possible to know in advance whether all > values of a scalar column have the same primitive type. > If we ignore the implicit cast (and UDCF), then I think you are right. But I do not think that we can always find out in advance whether all the values in a column are of the same type. > 4. These situations are subject to "legal but runtime error" > because the scalar value's primitive type is expected to be a number: > Arithmetic operators + - * / % << >> & | > Functions ABS(), CHAR(), AVG(), SUM(), TOTAL(). > The result is never SCALAR. > I think it is true. > 5. These situations are subject to "legal but runtime error" > because the scalar value's primitive type is expected to be a BOOLEAN: > Logical operators NOT AND OR. > The result is never SCALAR. > Example: if scalar_value is FALSE, then > SELECT scalar_value OR scalar_value is legal and > result data type is BOOLEAN (not SCALAR). > Example: if scalar_value is 'FALSE', then > SELECT NOT scalar_value is a runtime error (never cast). > I think it is true. However, in case we implement UDCF user can define implicit cast from STRING to BOOLEAN. I won't say it anymore, but I think we have to keep this in mind. > 6. These situations are subject to "legal but runtime error" > because the scalar value's primitive type is expected to be a STRING: > There are situations where a primitive value is expected to be a STRING: > Functions LOWER() PRINTF() QUOTE() REPLACE() SOUNDEX() TRIM() UNICODE() > UPPER() > Clause COLLATE > I think it is true. > 7. These situations are subject to "legal but runtime error" > because the scalar value's primitive type is expected to be STRING or > VARBINARY: > Situations where primitive value is expected to be STRING or VARBINARY: > Functions HEX() LENGTH() POSITION() SUBSTR() > I think it is true. > 8. The concatenation situation looks like a special case > because the scalar value's primitive type is expected to be STRING or > VARBINARY: > If primitive types are not the same, error. > If primitive types are not either STRING or VARBINARY, error. > If scalar_value || scalar_value: okay, result is SCALAR. > If scalar_value || string_value: okay, result is STRING. > If scalar_value || varbinary_value: okay, result is VARBINARY. > This is just a guess, and it is slightly different from what you > decided for arithmetic operations. > The same rules apply for GROUP_CONCAT(). > Why not use the same rules as for arithmetic operations? I mean: 1) If primitive types are not the same, error. 2) If primitive types are not either STRING or VARBINARY, error. 3) If scalar_value || scalar_value: okay, result is SCALAR. 4) If scalar_value || string_value: okay, result is SCALAR. 5) If scalar_value || varbinary_value: okay, result is SCALAR. > 9. These situations are not subject to "legal but runtime error": > Functions GREATEST() LEAST() > Example: LEAST(FALSE, 1, 'x', X'44') is FALSE and result data type > is BOOLEAN. This is the current behaviour, and it means that we > follow SCALAR rules even though none of the operands is SCALAR. > The same rules apply for MIN() or MAX() if the operand is SCALAR. > I think that is true. The mechanics here are the same as in the example shown at the beginning of the letter. > 10. Comparison operations involving a SCALAR value cause: > Comparison according to SCALAR rules, therefore always legal. > Comparison operations > = < <= > >= LIKE BETWEEN CASE > Implied comparisons UNIQUE DISTINCT GROUP ORDER UNION > Example: > CREATE TABLE t (s1 SCALAR PRIMARY KEY); > INSERT INTO t VALUES (FALSE), (1), ('1'), (X'31'); > SELECT * FROM t WHERE s1 = '1' OR s1 > '1'; > Currently this causes a "type mismatch" error. > But this error will be fixed if your plan is accepted. > Here you can see classes and their order: CLASS_NIL = 0 CLASS_BOOL = 1 CLASS_NUMBER = 2 CLASS_STR = 3 CLASS_BIN = 4 CLASS_ARRAY = 5 CLASS_MAP = 6 It mean that any value of type BINARY more that any value of type STRING and so on. > 11. Assignment to scalar from any data type is always legal. > Assignment from scalar to any data type is legal if and only > if the primitive type is compatible. > I think it is true. > 12 SELECT scalar_column UNION SELECT non_scalar_column is SCALAR. > (Also EXCEPT and INTERSECT) > I think it is true. > > In emails in March 2019 I wrote about compatibility > with other DBMSs and with ODBC. The thread was > "The rules for the scalar data type in SQL". > I'll quote the first email as a reminder, > some of it is still relevant. By the way K. Yukhin replied: > "As far as I understand SCALAR is a bag of types. No value > can ever have type SCALAR." > > > > -------- Forwarded Message -------- > Subject:     The rules for the scalar data type in SQL > Date:     Fri, 8 Mar 2019 16:31:02 -0700 > From:     Peter Gulutzan > To:     Nikita Pettik > CC:     dev@tarantool.org > > > Hi, > > What are the rules for the SCALAR data type in SQL? > The decisions have been made, but eventually I'll need to find out what they > are. > So I'll guess and see whether anyone objects. > > Tarantool had two choices: > Choice#1: SCALAR is a "primitive" data type, on the same level as INTEGER, > VARCHAR, BLOB, etc. > Choice#2: SCALAR is a "complex" data type, two instances can have two > different primitive data types. > Tarantool chose #2. > I could describe this with terms like "union as in C but without labels", > or "limited polymorphism", etc. but fear that using analogies would cause > arguments. > > I can think of only three SQL DBMSs that have something close to this: > Hive UNIONTYPE https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypesunionUnionTypes > SQL Server SQL_VARIANT Re SQL Server SQL_VARIANT https://docs.microsoft.com/en-us/sql/t-sql/data-types/sql-variant-transact-sql?view=sql-server-2017 > Oracle ANYDATA > https://docs.oracle.com/database/121/ARPLS/t_anydat.htm#ARPLS077 > This is good because Tarantool can claim to have a big feature which > other open-source DBMSs lack, that has resulted from its "NoSQL + SQL" > heritage. > Although some of the same functionality is available via structured > user-defined > types or JSON or XML, we're talking about a simple fully-predefined > solution. > > 1. The name of the data type is SCALAR, not UNIONTYPE or SQL_VARIANT > or ANYDATA, to be compatible with Tarantool/NoSQL. > There are no other words in a SCALAR item's definition, > for example "SCALAR(5000)" is illegal and "SCALAR COLLATE x" is illegal > and Hive-like "SCALAR" is illegal. > SCALAR is a new reserved word. > > 2. Any item defined as SCALAR has an underlying primitive type. > For example, after > CREATE TABLE t (s1 SCALAR PRIMARY KEY); > INSERT INTO t VALUES (55),(X'41'); > The underlying primitive type of the item in the first row is INTEGER > because literal 55 has data type INTEGER, and the underlying primitive > type in the second row is BLOB (when we start supporting BLOBs again). > An item's primitive type is far more important than its defined type. > Incidentally Tarantool might find the primitive type by looking at the > way MsgPack stores, but that is an implementation detail. > > 3. SCALAR is not a supertype. Before Choice#1 was accepted, I > tried to apply the logic of supertypes for further rules, > for example saying: "in integer-versus-real comparisons we can > coerce the integer to the real because real is a supertype, > and SCALAR is a supertype of both integer and real because all > their instances can be represented in SCALAR, therefore for > real-to-scalar comparisons we can coerce the real to a scalar". > But Choice#1 was rejected, so supertype logic is illegal. > > 4. There is no literal syntax which implies data type SCALAR. > Standard SQL indicates date/time/timestamp literals by preceding > them with a data type name e.g. DATE '1990-01-01', and we could > do the equivalent thing by saying SCALAR X'41' etc., but we won't > because it would be meaningless, the primitive type of SCALAR X'41' > would still be BLOB and that is what matters. > > 5. TYPEOF(x) is never SCALAR, it is always the > underlying data type. This is true even if x is null > (we still think there is a "data type" named "null"). > In fact there is no function that is guaranteed to > return the defined data type. > For example, TYPEOF(CAST(1 AS SCALAR)); returns > INTEGER, not SCALAR. But sometimes I still see 'blob'. > > 6. For any operation that requires casting from an item defined > as SCALAR, the syntax is legal but the operation may fail at runtime. > At runtime, Tarantool detects the underlying primitive data type and applies > the > rules for that. For example, if a definition is > CREATE TABLE t (s1 INT PRIMARY KEY, s2 SCALAR); > and within any row s2 = 'a', that is, its underlying primitive type is > varchar/char/text or whatever we decide to call character strings, > then UPDATE t SET s1 = s2; is illegal. > We usually do not know that in advance. > > 7. For any dyadic operation that requires implicit casting to an item > defined > as SCALAR, the syntax is legal and the operation will not fail at runtime. > Take this situation: comparison with SCALAR and a primitive type. > The implicit cast is primitive-to-scalar not scalar-to-primitive. > For example, after > CREATE TABLE t (s1 SCALAR PRIMARY KEY); > INSERT INTO t VALUES (X'41'); > SELECT * FROM t WHERE s1 > 'a'; > The comparison is valid, because > we know the ordering of X'41' and 'a' in Tarantool/NoSQL > 'scalar'. This is not because of supertyping (see Rule#3), > and we do not have implicit casting to SCALAR > (see Rule#6), so I think this is inconsistent, but it > looks better than an error. > > 8. The result data type of min/max/sum operation > on a column defined as SCALAR is SCALAR, > unless the result values is NULL. For example: > CREATE TABLE t (s1 INT, s2 SCALAR PRIMARY KEY); > INSERT INTO t VALUES (1,X'44'),(2,11),(3,1E4),(4,'a'); > SELECT MIN(s2),MAX(s2),SUM(s2) FROM t; > The result is: - - [11, 'D', 10011]. > That is only possible with Tarantool/NoSQL scalar rules, > but we have to ignore the illegal arithmetic (that X'44' + 11 is illegal). > Sometimes typeof(min-max-sum(s2)) is 'blob' but that's just a bug. > > 9. The result data type of a primitive combination is never > SCALAR, because we in effect use TYPEOF(item) not > the defined data type of the item. > (I use the word "combination" in the way that the > standard document uses it for section ""Result of > data type combinations".) Therefore for > MAX(1E308, 'a', 0, X'00') > which was the subject of an argument in issue#4032 > https://github.com/tarantool/tarantool/issues/4032 > There I used a "supertype" argument, and Nikita Pettik > rejected it, which is proper (see Rule#3). > I also admit that standard SQL wouldn't allow this. > However, since primitive-to-scalar comparisons are > legal, and because I believe the standard-SQL limitation > is solely due to lack of a SCALAR data type, I am not > convinced that MAX(1E308, 'a', 0, X'00') should cause > an error rather than a scalar result. > > 10. The result data type of union/except/intersect is > based on the primitive data type, but does not matter because > select x'41' union all select 5; > is legal anyway. > > 11. Casting SCALAR to char/varchar/text results in a char/varchar/text > with indefinite size and binary collation. Casting > SCALAR to any other primitive type results, as far as I > know, in no loss of information. > > 12. Probably for ODBC the type is binary, as for sql_variant. > > Peter Gulutzan >