[tarantool-patches] Re: [PATCH 4/9] sql: introduce type boolean

Tue Apr 23 22:58:26 MSK 2019

>> diff --git a/src/box/sql/expr.c b/src/box/sql/expr.c
>> index 6b38e8e66..d70b64c45 100644
>> --- a/src/box/sql/expr.c
>> +++ b/src/box/sql/expr.c
>> @@ -3775,7 +3766,7 @@ sqlExprCodeTarget(Parse * pParse, Expr * pExpr, int target)
>>                }
>>        case TK_TRUE:
>>        case TK_FALSE: {
>> -                       vdbe_emit_bool(pParse->pVdbe, pExpr, target);
>> +                       sqlVdbeAddOp2(v, OP_Bool, op == TK_TRUE, target);
>>                        return target;
>>                }
>> 
>>> 6. I see, that all other 'case's set pExpr->type. Why don't you
>>> do it here?
>> 
>> It is required only for non-constant values. For literals it is assigned
>> in parse.y spanExpr().
> 
> 1. Then why case TK_INTEGER sets 'pExpr->type = FIELD_TYPE_INTEGER',
> TK_STRING sets 'pExpr->type = FIELD_TYPE_STRING;', TK_BLOB sets
> 'pExpr->type = FIELD_TYPE_SCALAR;' - I see all of them being set in
> parse.y.

Those are literals, so as I said above, their types are assigned
right during parsing. If you are asking “why do we need to do so?”,
then answer is “they are leaf nodes and it is most appropriate
moment to do so”.

> On the contrary, there are places which create a new Expr, and
> does not set type even in parse.y. For example:
> 
>    1240: A.pExpr = sql_expr_new_dequoted(pParse->db, TK_INTEGER, &sqlIntTokens[N]);

Speaking of this particular example, yes, it was buggy place.
I’ve fixed it in "sql: make predicates accept and return boolean" patch

> Here the expr is created and spanExpr() is not used. How does it
> work?
> 
>>> 8. Looking at all these type comparison prohibitions I am getting
>>> afraid of how will we select from SCALAR columns?
>>> 
>>> A schema:
>>> 
>>>   box.execute('CREATE TABLE t (id INTEGER PRIMARY KEY, a SCALAR UNIQUE);')
>>>   box.execute("INSERT INTO t VALUES (1, 1), (2, true), (3, false), (4, 'str')")
>>> 
>>> SQL select:
>>> 
>>>> box.execute('SELECT * FROM t WHERE a > 1')
>>>   ---
>>>   - error: 'Type mismatch: can not convert INTEGER to boolean'
>>>   ...
>>> 
>>> The same Lua select:
>>> 
>>>> box.space.T.index[1]:select({1}, {iterator = 'GT'})
>>>   ---
>>>   - - [4, 'str']
>>>   ...
>>> 
>>> In fact, now we can not use SCALAR in SQL for any comparisons
>>> because it will raise type mismatch on literally everything.
>>> 
>>> What are we going to do with it? IMO, we could add a flag like
>>> 'is_scalar' to struct Mem in its flags section, which would allow
>>> to compare this value with anything.
>> 
>> Konstantin suggested to extend struct Mem with field_type,
>> which is going to substitute this flag.
>> 
>>> Looks crutchy though. And is
>>> not a part of this issue of course, but should be filed into the
>>> issue list.
>>> 
>>> I see a similar issue https://github.com/tarantool/tarantool/issues/4124.
>>> Probably, it is worth extending it instead of opening new.
>> 
>> I consider this as a trade-off of using scalar: users shouldn’t use this
>> type until they are really have to. Nevertheless, they still can use CAST
>> operator and switch case, or write their own lua-function converting values
>> to required type.
> 
> 2. Users may have good designed schema and work in each request with only
> one type in such a column, but still they will get errors just because
> the iterator can not walk to the needed tree node. Lets modify my example:
> 
>    SELECT * FROM t WHERE a = 1;
> 
> This request implicitly says that it will work with numbers only, but
> somehow its success depends on the content of the space, not even
> related to '1' value. In a nutshell it means that insertion of something
> into a space can break existing requests not related to the inserted
> data. It is weird. Such scalar type would be useless.

Yep, we can observe the same behaviour in DB2. For instance:

db2 => create table t (id varchar(10))
DB20000I  The SQL command completed successfully.
db2 => insert into t values(1)
DB20000I  The SQL command completed successfully.
db2 => select id from t where id = 1

ID        
----------
1         

  1 record(s) selected.

db2 => insert into t values('abc’)
DB20000I  The SQL command completed successfully.
db2 => insert into t values(1)
DB20000I  The SQL command completed successfully.
db2 => select id from t where id = 1

ID        
----------
1         
SQL0420N  Invalid character found in a character string argument of the 
function "DECFLOAT".  SQLSTATE=22018

After insertion of unconvertible to integer value ‘abc’,
select starts to fail. Note that DB2 is considered to be
most compatible with ANSI according to members of
our server team.

>>>> + *
>>>> + * @param str String to be converted to boolean.
>>>> + *            Assumed to be null terminated.
>>>> + * @param result Resulting value of cast.
>>>> + * @retval 0 If string satisfies conditions above.
>>>> + * @retval -1 Otherwise.
>>>> + */
>>>> +static int
>>>> +str_cast_to_boolean(const char *str, bool *result)
>>>> +{
>>>> +	assert(str != NULL);
>>>> +	for (; *str == ' '; str++);
>>>> +	size_t rest_str_len = strlen(str);
>>>> +	if (rest_str_len == 4 && strncasecmp(str, "true", 4) == 0) {
>>>> +		*result = true;
>>>> +		return 0;
>>>> +	}
>>>> +	if (rest_str_len == 5 && strncasecmp(str, "false", 5) == 0) {
>>>> +		*result = false;
>>>> +		return 0;
>>>> +	}
>>> 
>>> 16. I guess it is far from the most efficient implementation. You
>>> calculate length of the whole string before comparison, scanning
>>> its twice in the worst and most common case. Please, consider my
>>> fixes below. I did not test them though, so probably there are some
>>> typos.
>>> Note, that in my fixes I've already accounted my comment about
>>> trimming trailing whitespaces.
>>> 
>>> ============================================================================
>>> @@ -625,16 +625,20 @@ str_cast_to_boolean(const char *str, bool *result)
>>> {
>>> 	assert(str != NULL);
>>> 	for (; *str == ' '; str++);
>>> -	size_t rest_str_len = strlen(str);
>>> -	if (rest_str_len == 4 && strncasecmp(str, "true", 4) == 0) {
>>> +	if (strncasecmp(str, "true", 4) == 0) {
>> 
>> We can’t use strncasecmp() without verification the fact that
>> rest_str_len (i.e. string length after skipping leading spaces) is >= 4.
> 
> 3.1. Why? According to the standard, strncasecmp() compares not more than
> 'n' characters taking into account terminating zero. I tried these
> examples to simulate our case (when 'str' points to the beginning of a
> possible 'true/false' token).
> 
> 	printf("%d\n", strncasecmp("", "true", 4));
> 	printf("%d\n", strncasecmp("t", "true", 4));
> 	printf("%d\n", strncasecmp("tr", "true", 4));
> 	printf("%d\n", strncasecmp("tru", "true", 4));
> 	printf("%d\n", strncasecmp("true   ", "true", 4));
> 
> I got this output:
> 
>    -116
>    -114
>    -117
>    -101
>    0
> 
> It means, that you do not need to call 'strlen' before
> 'strncasecmp'. I've dropped 'strlen' again and the tests
> passed. If you think that explicit length check is mandatory,
> then please, provide a test. Probably I'm wrong somewhere in
> my speculations.

Well, you are right: I missed the fact that it compares UP to n symbols.
I’m sorry to have confused you.

>> -       if (rest_str_len == 4 && strncasecmp(str, "true", 4) == 0) {
>> +       if (rest_str_len >= 4 && strncasecmp(str, "true", 4) == 0) {
>>                *result = true;
>> -               return 0;
>> -       }
>> -       if (rest_str_len == 5 && strncasecmp(str, "false", 5) == 0) {
>> +               str += 4;
>> +       } else if (rest_str_len >= 5 && strncasecmp(str, "false", 5) == 0) {
>>                *result = false;
>> -               return 0;
>> +               str += 5;
>> +       } else {
>> +               return -1;
>>        }
>> -       return -1;
>> +       for (; *str != '\0'; ++str) {
>> +               if (*str != ' ')
>> +                       return -1;
>> +       }
>> +       return 0;
>> }
> 
> My fixes, on the branch:

Ok, applied.