[tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue

Sun Oct 21 06:51:40 MSK 2018

Hi!

Thanks for your work.

The email is big, but don't afraid. I don't push you to rewrite the
whole things again :)

The patch is generally okay for me. Minor comments are added on this.

Below I answered to your investigation about the libicu code. I found
that you was not right in some assumptions, but propose to postpone
refactoring out of this task.

Then I found some possible corner cases. They are out of the scope of
your task too, so I proposed to check and file issues.

WBR, Alexander Turenko.

1, 2 - ok.

> >        s=*source;
> >        if(sourceLimit<s) {
> >            *err=U_ILLEGAL_ARGUMENT_ERROR;
> >            return 0xffff;
> 
>    3. This one cannot trigger in sql_utf8_pattern_compare():
>    1) ucnv_getNextUChar is only called when !(sourceLimit<s).

The discussion is about that the patch lean on the check, but here you
say it cannot be triggered. Mistake? It seems it can be triggered and it
the case we check in our code. So, ok.

> >     /*
> >      * Make sure that the buffer sizes do not exceed the number range for
> >      * int32_t because some functions use the size (in units or bytes)
> >      * rather than comparing pointers, and because offsets are int32_t values.
> >      *
> >      * size_t is guaranteed to be unsigned and large enough for the job.
> >      *
> >      * Return with an error instead of adjusting the limits because we would
> >      * not be able to maintain the semantics that either the source must be
> >      * consumed or the target filled (unless an error occurs).
> >      * An adjustment would be sourceLimit=t+0x7fffffff; for example.
> >      */
> >     if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
> >         *err=U_ILLEGAL_ARGUMENT_ERROR;
> >         return 0xffff;
>
> 4. I’m not sure if string data can be this long in our context. 
>    (string length > (size_t) 0x7ffffffff)

Note: not 0x7ffffffff, but 0x7fffffff.

This limit seems to be some weird internal thing related to using
ucnv_getNextUChar inside libicu.

I propose to lie libicu about the buffer size in case when it exceeds
this limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can
pass the following instead of pattern_end:

((size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ? pattern + 0x7fffffff : pattern_end

I think this trick need to be covered with a unit test (because it is unclear
how to create a string of size >1GiB from lua). Don't sure whether it is
okay to allocate such amount of memory in the test, though...

Please, don't do that within this patch, because it is about the another bug.
File an issue with all needed information instead (you can provide a link to
this message for example).

> >     if(c<0) {
> >         /*
> >          * call the native getNextUChar() implementation if we are
> >          * at a character boundary (toULength==0)
> >          *
> >          * unlike with _toUnicode(), getNextUChar() implementations must set
> >          * U_TRUNCATED_CHAR_FOUND for truncated input,
> >          * in addition to setting toULength/toUBytes[]
> >          */
> >         if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
> >             c=cnv->sharedData->impl->getNextUChar(&args, err);
> >             *source=s=args.source;
> >             if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
> >                 /* reset the converter without calling the callback function */
> >                 _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
> >                 return 0xffff; /* no output */
>
> 5. Occurs when trying to access unindexed data.

Don't got your note here. It seems we call ucnv_getNextUChar_UTF8 from
ucnv_u8.c here (because of the "utf8" type of the converter in pUtf8conv
in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s >
sourceLimit), so it cannot occur here. In case of an other error
(U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to
_toUnicode() as the comment (below the code pasted above) suggests.
Don't investigated further.

> >             } else if(U_SUCCESS(*err) && c>=0) {
> >                 return c;
>
> 6. Returns symbol (can also be 0xfffd, as it is not treated as an actual error).
>
> So if I’m not mistaken we will get results in our function either from
> ‘return’ number 5 or number 6 and the following code will not be executed.

It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or
U_TRUNCATED_CHAR_FOUND error.

To be honest I don't want to continue. It seems we should not lean on
the fact that 0xffff always means end of the buffer, because it does not
guaranteed by the API and is not clear from the code.

AFAIR, the problem was to choose appropriate symbol to mark end of the
buffer situation and distinguish it from a real error. It seems we have
not one. So we should fairly (and always) check for the buffer before a
call to ucnv_getNextUChar() or check the status it provide after the
call. I would prefer to check it in our code. It seems that it is how
the API works.

I propose to use the same code pattern for all Utf8Read calls, e.g.:

if (pattern < pattern_end)
	c = Utf8Read(pattern, pattern_end)
else
	return SQL_...;
if (c == SQL_INVALID_UTF8_SYMBOL)
	return SQL_...;
assert(U_SUCCESS(status));

Note: I have added the assert, because it is not clear what we can do
with, say, U_INVALID_TABLE_FORMAT (improper libicu build /
installation). Hope Nikita P. suggests right way, but now I think we
should at least assert on that.

It seems the code above can be even wrapped into a macro that will get
two pointers (pattern and pattern_end / string and string_end) and two
SQL_...  error code to handle two possible errors. Yep, it is generally
discouraged to return from a macro, but if it'll greatly improves the
code readability, so it is appropriate, I think. Just define the macro
right before the function and undefne it after to show a reader it is
some pure internal thing.

Note: If you will going that way, don't wrap Utf8Read macro into another
macro. Use one with ucnv_getNextUChar call.

It is refactoring of the code and our of the scope of your issue.
Please, file an issue and link this message into it (but please ask
Nikita P. opinion before).

It is not good IMHO, but it seems now it worth to leave the code with
assumption 0xffff is the end of buffer. This is kind of splitting the
problem into parts and allow us to proceed with this patch re parsing
bug.

About the patch
---------------

Please, post an issue and a branch links if you don't cite them.
Sometimes it is hard to find them in a mail client history, esp. after
some significant delay in a discussion.

I'll consider the patch as bugfix + code style fix and will not push you
to rewrite things in significant way. But I'll ask you to formalize
found problems as issues.

It rebased on 2.1 with conflicts. Need to be fixed.

> -#define Utf8Read(s, e)    ucnv_getNextUChar(pUtf8conv, &s, e, &status)
> +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), &(status))

'status' is not a parameter of the macro, no need to enclose it into
parentheses.

I would prefer to have it as the parameter, but it seems that the code has many
indent levels and will look even more ugly then now if we'll increase lines
lengths. So, just remove the parentheses.

> + * @param matchOther The escape char (LIKE) or '[' (GLOB).

It is a symbol from ESCAPE parameter or a garbage from the likeFunc
stack frame. It seems it worth to initialize 'u32 escape' in likeFunc to
some symbol you cannot hit within 'c' variable in
sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix
it here if it is related to your changes or file an issue if it was
already here.

> +       /* Next pattern and input string chars */
> +       UChar32 c, c2;
> +       /* "?" or "_" */
> +       UChar32 matchOne = pInfo->matchOne;
> +       /* "*" or "%" */
> +       UChar32 matchAll = pInfo->matchAll;
> +       /* True if uppercase==lowercase */
> +       UChar32 noCase = pInfo->noCase;
> +       /* One past the last escaped input char */

Our code style suggests to have a period at end of a comment.

> -                                       assert(matchOther < 0x80);      /* '[' is a single-byte character */
> +                                       assert(matchOther < 0x80);

The comment was helpful, IMHO.

What if we'll use LIKE with ESCAPE with symbol, say, 'ё'? We have not
such tests cases as I see.

Again, it does not seems to be the problem you solve here. Please, write
a test case and file an issue if this does not work correctly (it seems
we can hit the assert above).

Ouch, now I see, it will be removed in the 2nd commit of the patchset.
So, please, comment it in the commit message of the 1st commit to don't
confuse a reviewer. However the case with non-ASCII ESCAPE character is
worth to be checked anyway.

The code of the sql_utf8_pattern_compare function looks okay for me
(except things I suggested to handle separately).

> test/sql-tap/gh-3251-string-pattern-comparison.test.lua
>
> -    test_name = prefix .. "8." .. tostring(i)
> +    local test_name = prefix .. "8." .. tostring(i)

It is from the 2nd patch, but I think should be here.