[tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Nikita Tatunov
n.tatunov at tarantool.org
Fri Oct 26 18:19:46 MSK 2018
Hello, Alexander! Please consider this answer to the review.
Issues:
https://github.com/tarantool/tarantool/issues/3251
https://github.com/tarantool/tarantool/issues/3334
Branch:
https://github.com/tarantool/tarantool/tree/N_Tatunov/gh-3251-where-like-hangs
> On Oct 21, 2018, at 06:51, Alexander Turenko <alexander.turenko at tarantool.org> wrote:
>
> Hi!
>
> Thanks for your work.
>
> The email is big, but don't be afraid. I don't push you to rewrite the
> whole thing again :)
>
> The patch is generally okay to me. Minor comments are added below.
>
> Below I answered your investigation of the libicu code. I found
> that you were not right in some assumptions, but I propose to postpone
> the refactoring out of this task.
>
> Then I found some possible corner cases. They are out of the scope of
> your task too, so I propose to check them and file issues.
>
> WBR, Alexander Turenko.
>
> 1, 2 - ok.
>
>>> s=*source;
>>> if(sourceLimit<s) {
>>> *err=U_ILLEGAL_ARGUMENT_ERROR;
>>> return 0xffff;
>>
>> 3. This one cannot trigger in sql_utf8_pattern_compare():
>> 1) ucnv_getNextUChar is only called when !(sourceLimit<s).
>
> The discussion is about the fact that the patch leans on the check, but
> here you say it cannot be triggered. Mistake? It seems it can be
> triggered, and it is the case we check in our code. So, ok.
Yes, I was mistaken. I forgot that I changed the macro, thank you.
>
>>> /*
>>> * Make sure that the buffer sizes do not exceed the number range for
>>> * int32_t because some functions use the size (in units or bytes)
>>> * rather than comparing pointers, and because offsets are int32_t values.
>>> *
>>> * size_t is guaranteed to be unsigned and large enough for the job.
>>> *
>>> * Return with an error instead of adjusting the limits because we would
>>> * not be able to maintain the semantics that either the source must be
>>> * consumed or the target filled (unless an error occurs).
>>> * An adjustment would be sourceLimit=t+0x7fffffff; for example.
>>> */
>>> if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
>>> *err=U_ILLEGAL_ARGUMENT_ERROR;
>>> return 0xffff;
>>
>> 4. I’m not sure if string data can be this long in our context.
>> (string length > (size_t) 0x7ffffffff)
>
> Note: not 0x7ffffffff, but 0x7fffffff.
>
> This limit seems to be some weird internal thing related to using
> ucnv_getNextUChar inside libicu.
>
> I propose to lie to libicu about the buffer size when it exceeds
> this limit. A UTF-8 encoded symbol is at most 4 bytes long, so we can
> pass the following instead of pattern_end:
>
> ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end
>
> I think this trick needs to be covered with a unit test (because it is
> unclear how to create a string of size >1GiB from Lua). Not sure whether
> it is okay to allocate such an amount of memory in the test, though...
>
> Please, don't do that within this patch, because it is about another bug.
> File an issue with all the needed information instead (you can provide a
> link to this message, for example).
Ok, thank you for the advice. I think that's a good idea, but there's one
thing I'm concerned about: it will add a lot of operations, especially
when we're using LIKE to scan a lot of data. I guess even if it's
relevant, that's a discussion for the issue that's going to be filed.
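For the issue I'm going to file, here is a minimal compilable sketch of the
clamp you propose. The names `pattern` / `pattern_end` mirror
sql_utf8_pattern_compare(); the helper function itself is hypothetical, not
code that exists in func.c:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: clamp the end pointer before handing the range
 * to ucnv_getNextUChar(), so libicu never sees a buffer longer than
 * 0x7fffffff bytes. Since a UTF-8 symbol is at most 4 bytes, the
 * decoder can never read past a boundary placed this far ahead.
 */
static const char *
clamp_pattern_end(const char *pattern, const char *pattern_end)
{
	if ((size_t)(pattern_end - pattern) > (size_t)0x7fffffff)
		return pattern + 0x7fffffff;
	return pattern_end;
}
```

For any realistically sized pattern the helper is a no-op; only a >2GiB
buffer gets clamped.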
>
>>> if(c<0) {
>>> /*
>>> * call the native getNextUChar() implementation if we are
>>> * at a character boundary (toULength==0)
>>> *
>>> * unlike with _toUnicode(), getNextUChar() implementations must set
>>> * U_TRUNCATED_CHAR_FOUND for truncated input,
>>> * in addition to setting toULength/toUBytes[]
>>> */
>>> if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
>>> c=cnv->sharedData->impl->getNextUChar(&args, err);
>>> *source=s=args.source;
>>> if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
>>> /* reset the converter without calling the callback function */
>>> _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
>>> return 0xffff; /* no output */
>>
>> 5. Occurs when trying to access unindexed data.
>
> I didn't get your note here. It seems we call ucnv_getNextUChar_UTF8 from
> ucnv_u8.c here (because of the "utf8" type of the converter in pUtf8conv
> in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s >
> sourceLimit), so it cannot occur here. In case of another error
> (U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to
> _toUnicode() as the comment (below the code pasted above) suggests.
> I didn't investigate further.
>
>>> } else if(U_SUCCESS(*err) && c>=0) {
>>> return c;
>>
>> 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error).
>>
>> So if I’m not mistaken we will get results in our function either from
>> ‘return’ number 5 or number 6 and the following code will not be executed.
>
> It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or
> U_TRUNCATED_CHAR_FOUND error.
>
> To be honest I don't want to continue. It seems we should not lean on
> the fact that 0xffff always means the end of the buffer, because that is
> not guaranteed by the API and is not clear from the code.
>
> AFAIR, the problem was to choose an appropriate symbol to mark the
> end-of-buffer situation and distinguish it from a real error. It seems
> we have none. So we should fairly (and always) check the buffer bounds
> before a call to ucnv_getNextUChar() or check the status it provides
> after the call. I would prefer to check it in our code. It seems that
> this is how the API works.
>
> I propose to use the same code pattern for all Utf8Read calls, e.g.:
>
> if (pattern < pattern_end)
> c = Utf8Read(pattern, pattern_end);
> else
> return SQL_...;
> if (c == SQL_INVALID_UTF8_SYMBOL)
> return SQL_...;
> assert(U_SUCCESS(status));
>
> Note: I have added the assert, because it is not clear what we can do
> with, say, U_INVALID_TABLE_FORMAT (an improper libicu build /
> installation). I hope Nikita P. suggests the right way, but for now I
> think we should at least assert on that.
>
> It seems the code above can even be wrapped into a macro that takes
> two pointers (pattern and pattern_end / string and string_end) and two
> SQL_... error codes to handle the two possible errors. Yep, it is
> generally discouraged to return from a macro, but if it greatly improves
> the code readability, it is appropriate, I think. Just define the macro
> right before the function and undefine it after, to show a reader it is
> some purely internal thing.
>
> Note: If you go that way, don't wrap the Utf8Read macro into another
> macro. Use one with the ucnv_getNextUChar call.
>
> This is refactoring of the code and out of the scope of your issue.
> Please, file an issue and link this message in it (but please ask for
> Nikita P.'s opinion before).
>
> It is not good IMHO, but it seems it is now worth leaving the code with
> the assumption that 0xffff is the end of the buffer. This kind of splits
> the problem into parts and allows us to proceed with this patch re the
> parsing bug.
>
> About the patch
> ---------------
>
> Please, post the issue and branch links if you don't cite them.
> Sometimes it is hard to find them in a mail client history, esp. after
> a significant delay in a discussion.
>
> I'll consider the patch as a bugfix + code style fix and will not push
> you to rewrite things in a significant way. But I'll ask you to
> formalize the found problems as issues.
>
> The patch rebases on 2.1 with conflicts. This needs to be fixed.
>
>> -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status)
>> +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status))
>
> 'status' is not a parameter of the macro, so there is no need to enclose
> it in parentheses.
>
> I would prefer to have it as a parameter, but it seems that the code has
> many indent levels and will look even uglier than now if we increase the
> line lengths. So, just remove the parentheses.
Removed.
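Agreed: the rule applies only to the macro's own parameters. Parenthesizing
them prevents an argument expression from re-associating with operators
around the expansion, which is exactly what the `(s)`, `(e)` changes in
Utf8Read guard against. A tiny generic illustration (macro names are mine):

```c
#include <assert.h>

/* Unparenthesized parameter: the expansion re-associates with the
 * operators at the call site. */
#define TWICE_BAD(x) x + x

/* Parenthesized parameter and body: expands safely in any context. */
#define TWICE_OK(x) ((x) + (x))
```

`TWICE_BAD(1) * 3` expands to `1 + 1 * 3`, which is 4 rather than the
intended 6; `TWICE_OK` has no such surprise.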
>
>> + * @param matchOther The escape char (LIKE) or '[' (GLOB).
>
> It is a symbol from the ESCAPE parameter or garbage from the likeFunc
> stack frame. It seems it is worth initializing 'u32 escape' in likeFunc
> to some symbol you cannot hit within the 'c' variable in
> sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix
> it here if it is related to your changes or file an issue if it was
> already here.
Changed to SQL_END_OF_STRING.
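For clarity, the change amounts to something like the following sketch.
The helper function, the argument shape, and the sentinel's numeric value
are illustrative assumptions, not the exact func.c code:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t UChar32;

/* Assumed sentinel: a code no decoded pattern character can equal. */
enum { SQL_END_OF_STRING = 0xffff };

/*
 * Hypothetical helper mirroring the likeFunc() setup: when no ESCAPE
 * argument was passed (argc != 3), the escape slot keeps the sentinel
 * instead of whatever happened to be in the stack frame, so it can
 * never accidentally match a character from the pattern.
 */
static UChar32
pick_escape(int argc, UChar32 escape_arg)
{
	UChar32 escape = SQL_END_OF_STRING;
	if (argc == 3)
		escape = escape_arg;
	return escape;
}
```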
>> + /* Next pattern and input string chars */
>> + UChar32 c, c2;
>> + /* "?" or "_" */
>> + UChar32 matchOne = pInfo->matchOne;
>> + /* "*" or "%" */
>> + UChar32 matchAll = pInfo->matchAll;
>> + /* True if uppercase==lowercase */
>> + UChar32 noCase = pInfo->noCase;
>> + /* One past the last escaped input char */
>
> Our code style suggests having a period at the end of a comment.
Fixed.
>
>> - assert(matchOther < 0x80); /* '[' is a single-byte character */
>> + assert(matchOther < 0x80);
>
> The comment was helpful, IMHO.
>
> What if we use LIKE with an ESCAPE symbol like, say, 'ё'? We have no
> such test cases as far as I see.
>
> Again, this does not seem to be the problem you solve here. Please,
> write a test case and file an issue if this does not work correctly (it
> seems we can hit the assert above).
>
> Ouch, now I see, it will be removed in the 2nd commit of the patchset.
> So, please, mention it in the commit message of the 1st commit so as not
> to confuse a reviewer. However, the case with a non-ASCII ESCAPE
> character is worth checking anyway.
I think I will just restore the fixed comment in the first commit, as it
is going to be deleted in the next commit anyway.
Wrote tests for non-ASCII chars in e_expr.test.lua.
>
> The code of the sql_utf8_pattern_compare function looks okay to me
> (except the things I suggested to handle separately).
>
>> test/sql-tap/gh-3251-string-pattern-comparison.test.lua
>>
>> - test_name = prefix .. "8." .. tostring(i)
>> + local test_name = prefix .. "8." .. tostring(i)
>
> It is from the 2nd patch, but I think it should be here.
>
Moved to the first patch.