[Tarantool-discussions] The result type and argument types of the built-in SQL functions.
Peter Gulutzan
pgulutzan at ocelot.ca
Wed Jul 29 19:47:01 MSK 2020
Hi,
I am aware that Timur Safin has already replied,
but I will answer one thing at a time.
On 2020-07-28 5:28 a.m., Mergen Imeev wrote:
> Hi!
>
> On Mon, Jul 27, 2020 at 01:39:29PM -0600, Peter Gulutzan wrote:
>> Hi,
>>
>>
>> On 2020-07-27 6:24 a.m., Mergen Imeev wrote:
> <cut>
>>>
>>
>> Re your table of "current" result data types. I think we must define
>> "current".
>> In version 2.4, SELECT TYPEOF(LENGTH('')); returns 'integer'.
> When I say function result type, I mean this:
>
> tarantool> SELECT LENGTH('');
> ---
> - metadata:
> - name: COLUMN_1
> type: integer
> rows:
> - [0]
> ...
>
> As you can see, it says that the result type of the LENGTH() function is
> INTEGER. However, as you said, the type of the value we got is actually
> UNSIGNED. I will fix this after we come to an agreement.
>
Okay, I guess you mean that the metadata type will be the same as TYPEOF,
but that might mean they both say 'integer' or that might mean that they
both say 'unsigned'. Fine for now.
>> In version 2.6, SELECT TYPEOF(LENGTH('')); returns 'unsigned'.
>> In other words, somebody has already made changes, for tarantool-master.
>> However, I did not document that the return will be 'integer' so this is
>> a "change in behaviour" but not a "change in documented behaviour".
>>
>> And I think we must define 'result type'.
>> You write that GREATEST etc. return 'scalar'.
>> But of course SELECT TYPEOF(GREATEST(1,'a')); returns 'string'.
>> So I assume you do not mean that the result value has the Tarantool
'scalar'
>> type, you only mean that the result value will be anything that
>> Tarantool/SQL currently supports.
> I agree that this is true for the value that was received. However,
result type
> functions are more like column types:
>
> tarantool> SELECT GREATEST(1,'a');
> ---
> - metadata:
> - name: COLUMN_1
> type: scalar
> rows:
> - ['a']
> ...
>
You are right that sometimes a 'metadata' type is like a column type,
that is, it must be scalar when it might contain more than one primitive
type.
However, in the case of GREATEST, I do not see that this is necessary,
there is only one value and it is a STRING.
>>
>> Re returning UNSIGNED instead of INTEGER for ...LENGTH(),
ROW_COUNT(), etc.
>> (a) This data type is not standard and is not built-in for most
major DBMSs.
>> Even in MySQL
>> https://dev.mysql.com/doc/refman/8.0/en/numeric-type-syntax.html
>> UNSIGNED is an attribute of a data type, as in BIGINT UNSIGNED, not
a data
>> type.
>> (b) The maximum INTEGER value is the same as the maximum UNSIGNED value,
>> 18446744073709551615, so the change is not necessary.
>> (c) Although in Tarantool it is not likely, there are generic
programs that
>> might
>> be checking whether the result of a function is negative. For
example, in
>> ODBC
>>
https://docs.microsoft.com/en-us/sql/odbc/reference/syntax/sqlrowcount-function?v,
>> a result of an equivalent of our ROW_COUNT() function can be -1.
>> ... Therefore I am not enthusiastic about this change.
> I have no objection here.
>
Does anyone else have an objection here?
If not, does that mean the returned data type will be INTEGER?
>>
>> Re returning NUMBER from ROUND():
>> You say that the current return is 'integer', but when I try it,
>> I get 'double'. I think this current behaviour is acceptable.
> I talked about this:
>
> tarantool> SELECT ROUND(1.2345, 2);
> ---
> - metadata:
> - name: COLUMN_1
> type: integer
> rows:
> - [1.23]
> ...
>
> As you can see, the result is not INTEGER, even if it is written that
it is
> INTEGER.
>
Oh. So this is a bug?
>> However, if ROUND(1) returns INTEGER, that is good too --
>> the general idea, not a law, would be
>> "if the result data type can be the same as the input data type(s),
>> let it be the same".
> I believe ROUND (a) and ROUND (a, 0) should return INTEGER. I still
think that
> we should consider NUMBER as the return type of a function. However,
this does
> not mean that the type of the values we get as a result will be
NUMBER. It will
> be either INTEGER or DOUBLE.
>
> Also, I have an important question: why don't we treat NUMBER the
same way we
> treat SCALAR? Before adding the DOUBLE type, it was possible to say
that NUMBER
> contains INTEGER and real values. So it was in the same position as
INTEGER and
> UNSIGNED right now. However, after adding DOUBLE, all numeric values
can be
> either INTEGER or DOUBLE, so we don't need NUMBER values. I suggest
that we
> allow NUMBER to be the column type, but there should be no NUMBER
values. What
> do you think about this?
>
>>
>> Re types of arguments:
>> You suggest that TRIM() etc. must have 'string' arguments,
>> but currently they can be 'varbinary' and I don't see why that is bad.
>> You suggest that CHAR() must have 'unsigned' argument,
>> but currently it can be some other type, and well, *maybe* that is bad.
>> I don't object to strictness, but worry that I might have to document
>> "sometimes we do an implicit cast, some other times we are strict".
> I believe the values given as arguments should follow the "IMPLICIT
CAST FOR
> ASSIGNMENT" rules. I plan to use the same mechanism, so I don't see
any problems
> here. For exapmle, ROUND(1.234, 2.5) will work the same as
ROUND(1.234, 2).
>
>>
>> Re BLOB instead of string. This is related to the fact that
>> TRIM() etc. currently do not need to have 'string' arguments,
>> they can have 'varbinary' arguments. I admit that many (maybe
>> most) other vendors expect character strings in such cases.
>> But the standard suggests that something very similar is allowed,
>> the DB2 manual says it is allowed,
>> the MariaDB manual is not explicit but here I show it is allowed:
>> "
>> mariadb>SELECT HEX(TRIM(X'D0' FROM CAST('Д' AS BINARY)));
>> OK 1 rows affected (0.0 seconds)
>> +--------------------------------------------+
>> | HEX(TRIM(X'D0' FROM CAST('Д' AS BINARY))) |
>> +--------------------------------------------+
>> | 94 |
>> +--------------------------------------------+
>> "
>> I believe that Tarantool should continue to allow varbinary arguments.
> So you mean that all functions that can accept STRING must accept
VARBINARY?
>
> Do you think we should be thinking about a new type that should
contain STRING
> and VARBINARY the same way NUMBER contains INTEGER and DOUBLE? If you
agree with
> this, can you suggest a name for the new type?
>
Let's look at them. Version 2.6.
SELECT HEX('A'),HEX(X'41');
It works now. It's in the manual: "may be either a string or a byte
sequence".
I do not see why it should fail.
SELECT LENGTH('Д'), LENGTH(X'D094'),CHAR_LENGTH('Д'),CHAR_LENGTH(X'D094');
It works now. It's in the manual: "Return the number of characters in
the expression, or the number of bytes in the expression.
It depends on the data type: ...".
I do not see why it should fail.
Of course, ,CHAR_LENGTH(X'D094') returns the number of bytes not the number
of characters, and I know that is odd, but it was K. Osipov's order.
SELECT LOWER('I'),LOWER(X'49');
It fails now. The manual says the required syntax is
"LOWER(string-expression)".
I suppose that, since there are no chaacters, we could say that
LOWER(X'49') = X'49'.
But I do not see that that would help any users.
The same considerations apply for SELECT UPPER('a'),UPPER(X'61');.
By the way, we are assuming that 'I' is UTF-8 with a DUCET collation, but
LOWER('I' COLLATE "binary") = 'i'. I think that makes no sense, but it is
convenient for users -- if this "bug" were fixed, then they would have to
specify a non-default collation (because "binary" is default).
Am I making sense?
SELECT POSITION('A', 'ДA'),POSITION(X'41', X'D09441');
It works now. It's in the manual: "The data types of the expressions
must be either STRING or VARBINARY. If the expressions have data type
STRING, then the result is the character position. If the expressions
have data type VARBINARY, then the result is the byte position."
I do not see why it should fail.
SELECT PRINTF('%d',5),PRINTF(X'2564',5);
It works now. The manual does not clearly say that it should work,
but says that the first argument should be a "string expression".
I doubt that any user will use VARBINARY, but ...
I do not see why it should fail.
CREATE TABLE u (s1 VARCHAR(1) PRIMARY KEY, s2 VARBINARY);
INSERT INTO u VALUES ('A',X'41');
SELECT QUOTE(s1),QUOTE(s2) FROM u;
It works now. But the result is odd:
tarantool>SELECT QUOTE(s1),QUOTE(s2) FROM u;
OK 1 rows selected (0.0 seconds)
+----------+----------+
| COLUMN_1 | COLUMN_2 |
+----------+----------+
| 'A' | X'41' |
+----------+----------+
Surely putting quote marks i.e. X'27' around X'41' should result in
VARBINARY X'274127'. I think this is a bug. But ...
I do not see why it should fail.
By the way, the manual is wrong. It says the argument should be
"string-literal". But, as the example shows, it can be a column value.
I think this is a bug too. I will put it on my little "to do" list.
SELECT REPLACE('A','A','C'),REPLACE(X'41',X'41',X'43');
It works now. It is in the manual: "The expressions should all
have data type STRING or VARBINARY."
I do not see why it should fail.
SELECT SOUNDEX('A'),SOUNDEX(X'41');
It fails now. The manual does not say that it should fail, but does
say "The algorithm works with characters in the Latin alphabet and
works best with English words."
I suppose that, since there are no chaacters, we could say that
SOUNDEX(X'41') = any junk.
But I do not see that that would help any users.
SELECT SUBSTR('ДAB',3),SUBSTR(X'D09441', 3);
It works now. It is in the manual: "If expression-1 has data type
VARBINARY rather than data type STRING, then positioning and
counting is by bytes rather than by characters."
I do not see why it should fail.
TRIM()
It works now.
Earlier I showed a result from MariaDB.
Here is a result from Tarantool for the same query.
"
tarantool>SELECT HEX(TRIM(X'D0' FROM CAST('Д' AS VARBINARY)));
OK 1 rows selected (0.0 seconds)
+----------+
| COLUMN_1 |
+----------+
| 94 |
+----------+
"
I had to say VARBINARY not BINARY, anyway, the result is the same.
It is in the manual: "The expressions should have data type STRING
or VARBINARY."
I do not see why it should fail.
UPPER()
See above re LOWER().
Summary: the current behaviour is documented and is good, except:
(1) CHAR_LENGTH(varbinary-string) and PRINTF(varbinary-string...)
works but are useless.
(2) LOWER(varbinary-string) and UPPER(varbinary-string)
don't work, but they could work, but would be useless.
Also I think they are slightly buggy but for the user's benefit.
(3) QUOTE(varbinary-string)
works but I think it is slightly buggy.
I suppose you have heard the English expression:
If it is not broken, then fix it anyway.
I understand; I work that way muself.
But I bet that you have better things to do.
>>
>> Re MAP and ARRAY types "in the near future":
>> I think we must define "near future".
>> Currently in SQL we do not even have the Lua DECIMAL or UUID data types.
>> Kirill Yukhin made the issue Implement DECIMAL data type #4415
>> on August 8 2019, saying
>> "After DECIMAL type was introduced to the core, its time to
implement this
>> type in SQL frontend."
>> We are nearly at August 8 2020, so apparently it takes more than one
year to
>> put
>> a data type in SQL even though it is already in the core.
>> (
>> So I think that maps and arrays, which I think are more difficult,
>> will not exist in SQL for two years. I am not worried.
> It is actually in plans for the next release, however I won't argue
with you
> here.
>
Thank you for not arguing, but I am not as polite as you, so I will add
this.
I looked at issue#4763 sql: introduce type <MAP>
https://github.com/tarantool/tarantool/issues/4763
The last thing I see is
"kyukhin added this to the wishlist milestone on Mar 6".
It is not about supporting a MAP data type in SQL, it is about
adding a function that can read a Lua map.
We can already do something like that, as I mentioned to Nikita Pettik.
>> However, it is interesting to imagine
>> UPPER(array of strings) -- should we return upper of all elements?
>> UPPER(map) -- should we return upper of both the key and the value?
>> and so on.
> I didn't even thought about these cases. I mean, if function should
accept
> STRING and we will give her an ARRAY, that it should throw an error.
After all
> ARRAY is not STRING.
>
>> I believe Lua non-scalar values should be flattened in SQL,
>> so perhaps the questions can all be avoided.
>>
>> Peter Gulutzan
>>
>
> Could you take another look at the tables?
>
> Also, here's what the function definition would look like after my
patches:
>
> tarantool> box.execute([[select "name", "param_list", "returns",
"aggregate" from "_func" where "language" = 'SQL_BUILTIN' order by
"name" LIMIT 10;]])
> ---
> - metadata:
> - name: name
> type: string
> - name: param_list
> type: array
> - name: returns
> type: string
> - name: aggregate
> type: string
> rows:
> - ['ABS', ['number'], 'number', 'none']
> - ['AVG', ['number'], 'number', 'group']
> - ['CEIL', [], 'any', 'none']
> - ['CEILING', [], 'any', 'none']
> - ['CHAR', ['unsigned'], 'string', 'none']
> - ['CHARACTER_LENGTH', ['scalar'], 'integer', 'none']
> - ['CHAR_LENGTH', ['scalar'], 'integer', 'none']
> - ['COALESCE', ['scalar'], 'scalar', 'none']
> - ['COUNT', ['scalar'], 'integer', 'group']
> - ['CURRENT_DATE', [], 'any', 'none']
> ...
>
> I used SCALAR instead of STRING, so we may use VARBINARY instead of
STRING.
>
I suppose that you cannot say 'string or varbinary' in this output, right?
>
> Below is an updated table of function result types. Remember that
this is not
> always the type of the value that we get from the function. However,
all values
> that we receive as a result of executing the function must be of the
specified
> type.
>
>
> FUNCTION NAME CURRENT SUGGESTED
> abs number number
> avg number double
> char string string
> character_length integer integer
> char_length integer integer
> coalesce scalar scalar
> count integer integer
> greatest scalar scalar
> group_concat string string
> hex string string
> ifnull integer scalar
> least scalar scalar
> length integer integer
> like integer boolean
> likelihood boolean scalar
> likely boolean scalar
> lower string string
> max scalar scalar
> min scalar scalar
> nullif scalar scalar
> position integer integer
> printf string string
> quote string string
> random integer integer
> randomblob varbinary varbinary
> replace string string
> round integer number
> row_count integer integer
> soundex string string
> substr string string
> sum number number
> total number double
> trim string string
> typeof string string
> unicode string integer
> unlikely boolean scalar
> upper string string
> version string string
> zeroblob varbinary varbinary
>
>
> Below is an updated table of function argument types. Note that the
IMPLICIT
> CAST FOR ASSIGNMENT rules will be applied before values are passed to the
> function as arguments. This means that if the function takes STRING
and we give
> it an INTEGER value, we will get an error even before we call the
function.
>
> FUNCTION NAME TYPES OF ARGUMENTS
> abs number
> avg number
> char* unsigned
> character_length scalar
> char_length scalar
> coalesce scalar
> count scalar
> greatest scalar
> group_concat scalar, scalar
> hex scalar
> ifnull scalar, scalar
> least* scalar
> length scalar
> like scalar, scalar, scalar
> likelihood scalar, double
> likely scalar
> lower scalar
> max scalar
> min scalar
> nullif scalar, scalar
> position scalar, scalar
> printf* scalar
> quote scalar
> randomblob unsigned
> replace scalar, scalar, scalar
> round double, unsigned
> soundex scalar
> substr scalar, integer, integer
> sum number
> total number
> trim* scalar
> typeof scalar
> unicode scalar
> unlikely scalar
> upper scalar
> zeroblob unsigned
>
>
> I still believe that not all functions that accept STRING have to to
accept
> VARBINARY. But that's up to you.
>
I gave a very long answer about each function that I think is relevant.
I did not know that it is up to me --
if I am now the mail.ru boss, nobody warned me.
Peter Gulutzan
More information about the Tarantool-discussions
mailing list