From mboxrd@z Thu Jan  1 00:00:00 1970
Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\))
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern
 comparison issue
From: Nikita Tatunov <n.tatunov@tarantool.org>
In-Reply-To: <20181021035140.avx6d3rokx5ta6hi@tkn_work_nb>
Date: Fri, 26 Oct 2018 18:19:46 +0300
Content-Transfer-Encoding: quoted-printable
Message-Id: <6740948F-6C40-4C0F-B237-7C3573225FBC@tarantool.org>
References: <43febf82af3702fadfea135db978ffb6426eb00d.1534436836.git.n.tatunov@tarantool.org>
 <d11b496b-1d77-c1ef-27dd-874835bee1b9@tarantool.org>
 <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb>
 <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org>
 <87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org>
 <6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org>
 <58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org>
 <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org>
 <45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org>
 <BE06C955-2ED3-42D7-87B8-93F138448E3D@tarantool.org>
 <20181021035140.avx6d3rokx5ta6hi@tkn_work_nb>
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
To: tarantool-patches@freelists.org
Cc: korablev@tarantool.org, Alexander Turenko <alexander.turenko@tarantool.org>

Hello, Alexander! Please consider my answers to the review below.

Issues:
https://github.com/tarantool/tarantool/issues/3251
https://github.com/tarantool/tarantool/issues/3334

Branch:
=
https://github.com/tarantool/tarantool/tree/N_Tatunov/gh-3251-where-like-h=
angs

> On Oct 21, 2018, at 06:51, Alexander Turenko =
<alexander.turenko@tarantool.org> wrote:
>=20
> Hi!
>=20
> Thanks for your work.
>=20
> The email is big, but don't be afraid. I am not pushing you to rewrite
> the whole thing again :)
>=20
> The patch is generally okay for me. Minor comments are added on this.
>=20
> Below I answer your investigation of the libicu code. I found that you
> were not right in some assumptions, but I propose to postpone the
> refactoring beyond this task.
>=20
> Then I found some possible corner cases. They are out of the scope of
> your task too, so I propose to check them and file issues.
>=20
> WBR, Alexander Turenko.
>=20
> 1, 2 - ok.
>=20
>>>       s=3D*source;
>>>       if(sourceLimit<s) {
>>>           *err=3DU_ILLEGAL_ARGUMENT_ERROR;
>>>           return 0xffff;
>>=20
>>   3. This one cannot trigger in sql_utf8_pattern_compare():
>>   1) ucnv_getNextUChar is only called when !(sourceLimit<s).
>=20
> The discussion is about the patch relying on this check, but here you
> say it cannot be triggered. Mistake? It seems it can be triggered, and
> it is the case we check in our code. So, ok.

Yes, I was mistaken. I forgot that I had changed the macro, thank you.

>=20
>>>    /*
>>>     * Make sure that the buffer sizes do not exceed the number range =
for
>>>     * int32_t because some functions use the size (in units or =
bytes)
>>>     * rather than comparing pointers, and because offsets are =
int32_t values.
>>>     *
>>>     * size_t is guaranteed to be unsigned and large enough for the =
job.
>>>     *
>>>     * Return with an error instead of adjusting the limits because =
we would
>>>     * not be able to maintain the semantics that either the source =
must be
>>>     * consumed or the target filled (unless an error occurs).
>>>     * An adjustment would be sourceLimit=3Dt+0x7fffffff; for =
example.
>>>     */
>>>    if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) =
{
>>>        *err=3DU_ILLEGAL_ARGUMENT_ERROR;
>>>        return 0xffff;
>>=20
>> 4. I=E2=80=99m not sure if string data can be this long in our =
context.=20
>>   (string length > (size_t) 0x7ffffffff)
>=20
> Note: not 0x7ffffffff, but 0x7fffffff.
>=20
> This limit seems to be some weird internal thing related to using
> ucnv_getNextUChar inside libicu.
>=20
> I propose to lie to libicu about the buffer size when it exceeds this
> limit. A UTF-8 encoded symbol is 4 bytes long at max, so we can pass
> the following instead of pattern_end:
>=20
> (size_t) (pattern_end - pattern) > (size_t) 0x7fffffff ?
>     pattern + 0x7fffffff : pattern_end
>=20
> I think this trick needs to be covered with a unit test (because it is
> unclear how to create a string of size >1GiB from Lua). I am not sure
> whether it is okay to allocate such an amount of memory in the test,
> though...
>=20
> Please, don't do that within this patch, because it is about another
> bug. File an issue with all needed information instead (you can provide
> a link to this message for example).

OK, thank you for the advice. I think that's a good idea, but one thing
concerns me: it will add a lot of operations, especially when we use
LIKE to scan a lot of data. I guess even if this is relevant, it belongs
in the discussion of the issue that is going to be filed.
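
For the issue, the clamping proposed above could be sketched roughly as
follows. The helper name clamp_limit is hypothetical, and 0x7fffffff is
simply the limit from the libicu check quoted above; this is not code
from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Largest span accepted by the libicu check quoted in this thread. */
#define ICU_MAX_SPAN ((size_t)0x7fffffff)

/*
 * Hypothetical helper: return the limit to pass to libicu instead of
 * `end`, so that (limit - begin) never exceeds ICU_MAX_SPAN.
 */
static const char *
clamp_limit(const char *begin, const char *end)
{
	assert(begin <= end);
	if ((size_t)(end - begin) > ICU_MAX_SPAN)
		return begin + ICU_MAX_SPAN;
	return end;
}
```

In this sketch the extra work per call is one pointer subtraction and
one comparison, which is the overhead to weigh in the issue.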

>=20
>>>    if(c<0) {
>>>        /*
>>>         * call the native getNextUChar() implementation if we are
>>>         * at a character boundary (toULength=3D=3D0)
>>>         *
>>>         * unlike with _toUnicode(), getNextUChar() implementations =
must set
>>>         * U_TRUNCATED_CHAR_FOUND for truncated input,
>>>         * in addition to setting toULength/toUBytes[]
>>>         */
>>>        if(cnv->toULength=3D=3D0 && =
cnv->sharedData->impl->getNextUChar!=3DNULL) {
>>>            c=3Dcnv->sharedData->impl->getNextUChar(&args, err);
>>>            *source=3Ds=3Dargs.source;
>>>            if(*err=3D=3DU_INDEX_OUTOFBOUNDS_ERROR) {
>>>                /* reset the converter without calling the callback =
function */
>>>                _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
>>>                return 0xffff; /* no output */
>>=20
>> 5. Occurs when trying to access unindexed data.
>=20
> I don't get your note here. It seems we call ucnv_getNextUChar_UTF8 from
> ucnv_u8.c here (because of the "utf8" type of the converter in
> pUtf8conv in func.c). U_INDEX_OUTOFBOUNDS_ERROR is returned when (s >
> sourceLimit), so it cannot occur here. In case of another error
> (U_ILLEGAL_CHAR_FOUND or U_TRUNCATED_CHAR_FOUND) we fall through to
> _toUnicode() as the comment (below the code pasted above) suggests.
> I did not investigate further.
>=20
>>>            } else if(U_SUCCESS(*err) && c>=3D0) {
>>>                return c;
>>=20
>> 6. Returns symbol (can also be 0xfffd, as it is not treated as an =
actual error).
>>=20
>> So if I=E2=80=99m not mistaken we will get results in our function =
either from
>> =E2=80=98return=E2=80=99 number 5 or number 6 and the following code =
will not be executed.
>=20
> It is not so. We'll fall through in case of U_ILLEGAL_CHAR_FOUND or
> U_TRUNCATED_CHAR_FOUND error.
>=20
> To be honest I don't want to continue. It seems we should not rely on
> the fact that 0xffff always means end of the buffer, because it is not
> guaranteed by the API and is not clear from the code.
>=20
> AFAIR, the problem was to choose an appropriate symbol to mark the end
> of the buffer and distinguish it from a real error. It seems we have
> none. So we should fairly (and always) check the buffer bounds before a
> call to ucnv_getNextUChar() or check the status it provides after the
> call. I would prefer to check it in our code. It seems that this is how
> the API works.
>=20
> I propose to use the same code pattern for all Utf8Read calls, e.g.:
>=20
> if (pattern < pattern_end)
> 	c =3D Utf8Read(pattern, pattern_end)
> else
> 	return SQL_...;
> if (c =3D=3D SQL_INVALID_UTF8_SYMBOL)
> 	return SQL_...;
> assert(U_SUCCESS(status));
>=20
> Note: I have added the assert, because it is not clear what we can do
> with, say, U_INVALID_TABLE_FORMAT (improper libicu build /
> installation). I hope Nikita P. suggests the right way, but for now I
> think we should at least assert on that.
>=20
> It seems the code above can even be wrapped into a macro that takes
> two pointers (pattern and pattern_end / string and string_end) and two
> SQL_... error codes to handle the two possible errors. Yep, it is
> generally discouraged to return from a macro, but if it greatly
> improves the code readability, it is appropriate, I think. Just define
> the macro right before the function and undefine it after to show a
> reader it is a purely internal thing.
>=20
> Note: If you go that way, don't wrap the Utf8Read macro into another
> macro. Use one with the ucnv_getNextUChar call.
>=20
> It is a refactoring of the code and out of the scope of your issue.
> Please, file an issue and link this message in it (but please ask
> Nikita P. for his opinion first).
>=20
> It is not good IMHO, but it seems that for now it is worth leaving the
> code with the assumption that 0xffff is the end of the buffer. This
> splits the problem into parts and allows us to proceed with this patch
> regarding the parsing bug.
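
For the follow-up issue, the check-before-read pattern proposed above
could also be written as a function instead of a macro. The sketch below
is only an illustration: read_symbol and the read_status codes are
hypothetical names, and the toy decoder (1- and 2-byte sequences only)
stands in for ucnv_getNextUChar(), so it does not depend on libicu.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result codes; the real code would return SQL_... values. */
enum read_status { READ_OK, READ_END, READ_INVALID };

/*
 * Read one symbol from [*pos, end). The end of the buffer is detected
 * by comparing pointers, not by a special value such as 0xffff.
 * Toy decoder: only 1- and 2-byte UTF-8 sequences are accepted.
 */
static enum read_status
read_symbol(const char **pos, const char *end, int32_t *out)
{
	assert(pos != NULL && *pos != NULL && out != NULL);
	if (*pos >= end)
		return READ_END;
	const unsigned char *p = (const unsigned char *)*pos;
	if (p[0] < 0x80) {
		*out = p[0];
		*pos += 1;
		return READ_OK;
	}
	/* Two-byte sequence: 110xxxxx 10xxxxxx, no overlong forms. */
	if ((p[0] & 0xe0) == 0xc0 && p[0] >= 0xc2 && end - *pos >= 2 &&
	    (p[1] & 0xc0) == 0x80) {
		*out = ((p[0] & 0x1f) << 6) | (p[1] & 0x3f);
		*pos += 2;
		return READ_OK;
	}
	return READ_INVALID;
}
```

A caller would map READ_END and READ_INVALID to the two SQL_... codes,
which keeps the early returns out of a macro.
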
>=20
> About the patch
> ---------------
>=20
> Please, post an issue and a branch links if you don't cite them.
> Sometimes it is hard to find them in a mail client history, esp. after
> some significant delay in a discussion.
>=20
> I'll consider the patch as bugfix + code style fix and will not push
> you to rewrite things in a significant way. But I'll ask you to
> formalize the found problems as issues.
>=20
> It rebases on 2.1 with conflicts. They need to be fixed.
>=20
>> -#define Utf8Read(s, e)    ucnv_getNextUChar(pUtf8conv, &s, e, =
&status)
>> +#define Utf8Read(s, e) ucnv_getNextUChar(pUtf8conv, &(s), (e), =
&(status))
>=20
> 'status' is not a parameter of the macro, no need to enclose it in
> parentheses.
>=20
> I would prefer to have it as a parameter, but it seems that the code
> has many indent levels and will look even uglier than now if we
> increase line lengths. So, just remove the parentheses.

Removed.

>=20
>> + * @param matchOther The escape char (LIKE) or '[' (GLOB).
>=20
> It is a symbol from the ESCAPE parameter or garbage from the likeFunc
> stack frame. It seems worth initializing 'u32 escape' in likeFunc to
> some symbol you cannot hit within the 'c' variable in
> sql_utf8_pattern_compare. I think it is SQL_END_OF_STRING. Please, fix
> it here if it is related to your changes or file an issue if it was
> already here.

Changed to SQL_END_OF_STRING.
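
Roughly, the idea is the following. This is a simplified sketch, not the
code from the branch: choose_escape is a hypothetical stand-in for the
ESCAPE handling in likeFunc, and the value 0xffff for SQL_END_OF_STRING
is assumed from the discussion above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value: this thread treats 0xffff as the end-of-string marker. */
#define SQL_END_OF_STRING 0xffff

/*
 * Start from the sentinel and overwrite it only when an ESCAPE
 * argument is present, so the variable never holds stack garbage.
 */
static uint32_t
choose_escape(int argc, uint32_t escape_arg)
{
	/* LIKE takes two arguments, or three with ESCAPE. */
	assert(argc == 2 || argc == 3);
	uint32_t escape = SQL_END_OF_STRING;
	if (argc == 3)
		escape = escape_arg;
	return escape;
}
```

Without ESCAPE the result is the sentinel, which a pattern symbol read
inside the comparison loop is not expected to equal.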

>> +       /* Next pattern and input string chars */
>> +       UChar32 c, c2;
>> +       /* "?" or "_" */
>> +       UChar32 matchOne =3D pInfo->matchOne;
>> +       /* "*" or "%" */
>> +       UChar32 matchAll =3D pInfo->matchAll;
>> +       /* True if uppercase=3D=3Dlowercase */
>> +       UChar32 noCase =3D pInfo->noCase;
>> +       /* One past the last escaped input char */
>=20
> Our code style suggests having a period at the end of a comment.

Fixed.

>=20
>> -                                       assert(matchOther < 0x80);    =
  /* '[' is a single-byte character */
>> +                                       assert(matchOther < 0x80);
>=20
> The comment was helpful, IMHO.
>=20
> What if we use LIKE with ESCAPE with a symbol such as '=D1=91'? We
> have no such test cases as far as I see.
>=20
> Again, it does not seem to be the problem you solve here. Please, write
> a test case and file an issue if this does not work correctly (it seems
> we can hit the assert above).
>=20
> Ouch, now I see, it will be removed in the 2nd commit of the patchset.
> So, please, mention it in the commit message of the 1st commit so as
> not to confuse a reviewer. However the case with a non-ASCII ESCAPE
> character is worth checking anyway.

I think I will just put the fixed comment back into the first commit,
since it is going to be deleted in the next commit anyway.

I wrote tests for non-ASCII characters in e_expr.test.lua.
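
To illustrate why a non-ASCII ESCAPE symbol matters for the assert
discussed above: the Cyrillic letter from the review is code point
U+0451, which takes two bytes in UTF-8 and is not below 0x80. The
encoder below is a generic sketch (utf8_encode is a hypothetical helper,
not a function from the patch, and it does not reject surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point `cp` as UTF-8 into `out`. Returns the number of
 * bytes written, or 0 if `cp` is above U+10FFFF.
 */
static int
utf8_encode(uint32_t cp, unsigned char out[4])
{
	assert(out != NULL);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xc0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3f));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xe0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[2] = (unsigned char)(0x80 | (cp & 0x3f));
		return 3;
	}
	if (cp < 0x110000) {
		out[0] = (unsigned char)(0xf0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
		out[3] = (unsigned char)(0x80 | (cp & 0x3f));
		return 4;
	}
	return 0;
}
```

For U+0451 this yields the bytes 0xd1 0x91, while '[' stays a single
byte, so a check of the form (matchOther < 0x80) only holds for the
latter.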

>=20
> The code of the sql_utf8_pattern_compare function looks okay for me
> (except things I suggested to handle separately).
>=20
>> test/sql-tap/gh-3251-string-pattern-comparison.test.lua
>>=20
>> -    test_name =3D prefix .. "8." .. tostring(i)
>> +    local test_name =3D prefix .. "8." .. tostring(i)
>=20
> It is from the 2nd patch, but I think should be here.
>=20


Moved to the first patch.