[tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Nikita Tatunov
n.tatunov at tarantool.org
Fri Oct 26 18:19:46 MSK 2018
Hello, Alexander! Please consider this answer to the review.
Issues:
https://github.com/tarantool/tarantool/issues/3251
https://github.com/tarantool/tarantool/issues/3334
Branch:
https://github.com/tarantool/tarantool/tree/N_Tatunov/gh-3251-where-like-hangs
> On Oct 21, 2018, at 06:51, Alexander Turenko <alexander.turenko at tarantool.org> wrote:
>
> Hi!
>
> Thanks for your work.
>
> The email is big, but don't be afraid. I don't push you to rewrite the
> whole thing again :)
>
> The patch is generally okay to me. Minor comments are added below.
>
> Below I answered your investigation of the libicu code. I found
> that you were not right in some assumptions, but I propose to postpone
> the refactoring out of this task.
>
> Then I found some possible corner cases. They are out of the scope of
> your task too, so I propose to check them and file issues.
>
> WBR, Alexander Turenko.
>
> 1, 2 - ok.
>
>>> s=*source;
>>> if(sourceLimit<s) {
>>> *err=U_ILLEGAL_ARGUMENT_ERROR;
>>> return 0xffff;
>>
>> 3. This one cannot trigger in sql_utf8_pattern_compare():
>> 1) ucnv_getNextUChar is only called when !(sourceLimit<s).
>
> The discussion is about the fact that the patch leans on the check, but
> here you say it cannot be triggered. Mistake? It seems it can be
> triggered, and it is the case we check in our code. So, ok.
Yes, I was mistaken. I forgot that I changed the macro, thank you.
>
>>> /*
>>> * Make sure that the buffer sizes do not exceed the number range for
>>> * int32_t because some functions use the size (in units or bytes)
>>> * rather than comparing pointers, and because offsets are int32_t values.
>>> *
>>> * size_t is guaranteed to be unsigned and large enough for the job.
>>> *
>>> * Return with an error instead of adjusting the limits because we would
>>> * not be able to maintain the semantics that either the source must be
>>> * consumed or the target filled (unless an error occurs).
>>> * An adjustment would be sourceLimit=t+0x7fffffff; for example.
>>> */
>>> if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
>>> *err=U_ILLEGAL_ARGUMENT_ERROR;
>>> return 0xffff;
>>
>> 4. I’m not sure if string data can be this long in our context.
>> (string length > (size_t) 0x7ffffffff)
>
> Note: not 0x7ffffffff, but 0x7fffffff.
>
> This limit seems to be some weird internal thing related to using
> ucnv_getNextUChar inside libicu.
>
> I propose to lie to libicu about the buffer size when it exceeds
> this limit. A UTF-8 encoded symbol is at most 4 bytes long, so we can
> pass the following instead of pattern_end:
>
> ((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end
>
> I think this trick needs to be covered with a unit test (because it is
> unclear how to create a string of size >1GiB from Lua). Not sure whether
> it is okay to allocate such an amount of memory in the test, though...
>
> Please, don't do that within this patch, because it is about another bug.
> File an issue with all the needed information instead (you can provide a
> link to this message, for example).
Ok, thank you for the advice. I think that's a good idea, but there's one
thing I'm concerned about: it will add a lot of operations, especially
when we're using LIKE to scan a lot of data. I guess even if it's
relevant, that's a discussion for the issue that's going to be filed.
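For the issue I'm going to file, here is a minimal compilable sketch of the
clamp you propose. The names `pattern` / `pattern_end` mirror
sql_utf8_pattern_compare(); the helper function itself is hypothetical, not
code that exists in func.c:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: clamp the end pointer before handing the range
 * to ucnv_getNextUChar(), so libicu never sees a buffer longer than
 * 0x7fffffff bytes. Since a UTF-8 symbol is at most 4 bytes, the
 * decoder can never read past a boundary placed this far ahead.
 */
static const char *
clamp_pattern_end(const char *pattern, const char *pattern_end)
{
	if ((size_t)(pattern_end - pattern) > (size_t)0x7fffffff)
		return pattern + 0x7fffffff;
	return pattern_end;
}
```

For any realistically sized pattern the helper is a no-op; only a >2GiB
buffer gets clamped.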
>
>>> if(c<0) {
>>> /*
>>> * call the native getNextUChar() implementation if we are
>>> * at a character boundary (toULength==0)
>>> *
>>> * unlike with _toUnicode(), getNextUChar() implementations must set
>>> * U_TRUNCATED_CHAR_FOUND for truncated input,
>>> * in addition to setting toULength/toUBytes[]
>>> */
>>> if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
>>> c=cnv->sharedData->impl->getNextUChar(&args, err);
>>> *source=s=args.source;
>>> if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
>>> /* reset the converter without calling the callback function */
>>> _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
>>> return 0xffff; /* no output */
>>
>> 5. Occurs when trying to access unindexed data.
>
> I didn't get your note here. It seems we call ucnv_getNextUChar_UTF8 from
> ucnv_u8.c here (because of the "utf8" type of the converter in pUtf8conv
> in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s >
> sourceLimit), so it cannot occur here. In case of another error
> (U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to
> _toUnicode() as the comment (below the code pasted above) suggests.
> I didn't investigate further.
>
>>> } else if(U_SUCCESS(*err) && c>=0) {
>>> return c;
>>
>> 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error).
>>
>> So if I’m not mistaken we will get results in our function either from
>> ‘return’ number 5 or number 6 and the following code will not be executed.
>
> It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or
> U_TRUNCATED_CHAR_FOUND error.
>
> To be honest I don't want to continue. It seems we should not lean on
> the fact that 0xffff always means the end of the buffer, because that is
> not guaranteed by the API and is not clear from the code.
>
> AFAIR, the problem was to choose an appropriate symbol to mark the
> end-of-buffer situation and distinguish it from a real error. It seems
> we have none. So we should fairly (and always) check the buffer bounds
> before a call to ucnv_getNextUChar() or check the status it provides
> after the call. I would prefer to check it in our code. It seems that
> this is how the API works.
>
> I propose to use the same code pattern for all Utf8Read calls, e.g.:
>
> if (pattern < pattern_end)
> c = Utf8Read(pattern, pattern_end);
> else
> return SQL_...;
> if (c == SQL_INVALID_UTF8_SYMBOL)
> return SQL_...;
> assert(U_SUCCESS(status));
>
> Note: I have added the assert, because it is not clear what we can do
> with, say, U_INVALID_TABLE_FORMAT (an improper libicu build /
> installation). I hope Nikita P. suggests the right way, but for now I
> think we should at least assert on that.
>
> It seems the code above can even be wrapped into a macro that takes
> two pointers (pattern and pattern_end / string and string_end) and two
> SQL_... error codes to handle the two possible errors. Yep, it is
> generally discouraged to return from a macro, but if it greatly improves
> the code readability, it is appropriate, I think. Just define the macro
> right before the function and undefine it after, to show a reader it is
> some purely internal thing.
>
> Note: If you go that way, don't wrap the Utf8Read macro into another
> macro. Use one with the ucnv_getNextUChar call.
>
> This is refactoring of the code and out of the scope of your issue.
> Please, file an issue and link this message in it (but please ask for
> Nikita P.'s opinion before).
>
> It is not good IMHO, but it seems it is now worth leaving the code with
> the assumption that 0xffff is the end of the buffer. This kind of splits
> the problem into parts and allows us to proceed with this patch re the
> parsing bug.
>
> About the patch
> ---------------
>
> Please, post the issue and branch links if you don't cite them.
> Sometimes it is hard to find them in a mail client history, esp. after
> a significant delay in a discussion.
>
> I'll consider the patch as a bugfix + code style fix and will not push
> you to rewrite things in a significant way. But I'll ask you to
> formalize the found problems as issues.
>
> The patch rebases on 2.1 with conflicts. This needs to be fixed.
>
>> -#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &s, e, &status)
>> +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status))
>
> 'status' is not a parameter of the macro, so there is no need to enclose
> it in parentheses.
>
> I would prefer to have it as a parameter, but it seems that the code has
> many indent levels and will look even uglier than now if we increase the
> line lengths. So, just remove the parentheses.
Removed.
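Agreed: the rule applies only to the macro's own parameters. Parenthesizing
them prevents an argument expression from re-associating with operators
around the expansion, which is exactly what the `(s)`, `(e)` changes in
Utf8Read guard against. A tiny generic illustration (macro names are mine):

```c
#include <assert.h>

/* Unparenthesized parameter: the expansion re-associates with the
 * operators at the call site. */
#define TWICE_BAD(x) x + x

/* Parenthesized parameter and body: expands safely in any context. */
#define TWICE_OK(x) ((x) + (x))
```

`TWICE_BAD(1) * 3` expands to `1 + 1 * 3`, which is 4 rather than the
intended 6; `TWICE_OK` has no such surprise.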
>
>> + * @param matchOther The escape char (LIKE) or '[' (GLOB).
>
> It is a symbol from the ESCAPE parameter or garbage from the likeFunc
> stack frame. It seems it is worth initializing 'u32 escape' in likeFunc
> to some symbol you cannot hit within the 'c' variable in
> sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix
> it here if it is related to your changes or file an issue if it was
> already here.
Changed to SQL_END_OF_STRING.
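For clarity, the change amounts to something like the following sketch.
The helper function, the argument shape, and the sentinel's numeric value
are illustrative assumptions, not the exact func.c code:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t UChar32;

/* Assumed sentinel: a code no decoded pattern character can equal. */
enum { SQL_END_OF_STRING = 0xffff };

/*
 * Hypothetical helper mirroring the likeFunc() setup: when no ESCAPE
 * argument was passed (argc != 3), the escape slot keeps the sentinel
 * instead of whatever happened to be in the stack frame, so it can
 * never accidentally match a character from the pattern.
 */
static UChar32
pick_escape(int argc, UChar32 escape_arg)
{
	UChar32 escape = SQL_END_OF_STRING;
	if (argc == 3)
		escape = escape_arg;
	return escape;
}
```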
>> + /* Next pattern and input string chars */
>> + UChar32 c, c2;
>> + /* "?" or "_" */
>> + UChar32 matchOne = pInfo->matchOne;
>> + /* "*" or "%" */
>> + UChar32 matchAll = pInfo->matchAll;
>> + /* True if uppercase==lowercase */
>> + UChar32 noCase = pInfo->noCase;
>> + /* One past the last escaped input char */
>
> Our code style suggests having a period at the end of a comment.
Fixed.
>
>> - assert(matchOther < 0x80); /* '[' is a single-byte character */
>> + assert(matchOther < 0x80);
>
> The comment was helpful, IMHO.
>
> What if we use LIKE with an ESCAPE symbol like, say, 'ё'? We have no
> such test cases as far as I see.
>
> Again, this does not seem to be the problem you solve here. Please,
> write a test case and file an issue if this does not work correctly (it
> seems we can hit the assert above).
>
> Ouch, now I see, it will be removed in the 2nd commit of the patchset.
> So, please, mention it in the commit message of the 1st commit so as not
> to confuse a reviewer. However, the case with a non-ASCII ESCAPE
> character is worth checking anyway.
I think I will just restore the fixed comment in the first commit, as it
is going to be deleted in the next commit anyway.
Wrote tests for non-ASCII chars in e_expr.test.lua.
>
> The code of the sql_utf8_pattern_compare function looks okay to me
> (except the things I suggested to handle separately).
>
>> test/sql-tap/gh-3251-string-pattern-comparison.test.lua
>>
>> - test_name = prefix .. "8." .. tostring(i)
>> + local test_name = prefix .. "8." .. tostring(i)
>
> It is from the 2nd patch, but I think it should be here.
>
Moved to the first patch.