From: Nikita Tatunov
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Date: Tue, 11 Sep 2018 09:06:22 +0300
Message-Id: <58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org>
In-Reply-To: <6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org>
To: tarantool-patches@freelists.org
Cc: avkhatskevich@tarantool.org, Alexander Turenko, korablev@tarantool.org

> On 11 Sep 2018, at 01:20, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>
>>
>>
>>> On 17 Aug 2018, at 14:42, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>>>
>>>
>>> On 17.08.2018 14:17, Alexander Turenko wrote:
>>>> 0xffff is the result of the 'end of a string' check as well as of an internal buffer
>>>> overflow error. I pasted the relevant code in the first review of
>>>> the patch (July 18).
>>>>
>>>> // source/common/ucnv.c::ucnv_getNextUChar
>>>>     s=*source;
>>>>     if(sourceLimit<s) {
>>>>         *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>>         return 0xffff;
>>>>     }
>>>>
>>>> We should not handle the buffer overflow case as an invalid symbol. Of
>>>> course, we should not handle it as the 'end of the string' situation either.
>>>> Ideally we should perform the pointer check ourselves and raise an error in case of
>>>> 0xffff. I had thought that a buffer overflow error is unlikely to occur,
>>>> but you are right: we should differentiate these situations.
>>>>
>>>> In one of the previous versions of the patch we performed this check like so:
>>>>
>>>> #define Utf8Read(s, e) (((s) < (e)) ?\
>>>>     ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)
>>>>
>>>> Not sure why it was changed. Maybe it is an attempt to correctly handle the '\0'
>>>> symbol (it is a valid Unicode character)?
>>> The define you have pasted can return 0xffff.
>>> The reasons to change it back are described in the previous patchset.
>>> In short:
>>> 1. It is equivalent to
>>>    a. checking s < e in a while loop
>>>    b. reading the next character inside the while loop body.
>>> 2. In some usages of the code this check (s<e) was redundant (it was performed a couple of lines above).
>>> 3. There is no reason to rewrite the old version of this function. (So, we decided to use the old version of the function.)
>>>> So I see three ways to proceed:
>>>>
>>>> 1. Lean on icu's check and ignore the possibility of a buffer overflow.
>>>> 2. Use our own check and possibly run into '\0' problems.
>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat it as the end of a string, and raise
>>>>    an error for any other 0xffff.
>>>>
>>>> Alex, what do you suggest here?
>>> As I understand it, for now 0xffff is used ONLY to handle the case of an unexpectedly ended symbol.
>>> E.g. some symbol consists of 2 bytes, but the length of the input buffer is 1.
>>> In my opinion this is the same as an invalid symbol.
>>>
>>> I guess that an internal buffer overflow cannot occur in the `ucnv_getNextUChar` function.
>>>
>>> I suppose that it is Nikita's duty to investigate this problem and explain it to us all. I have just noticed a strange usage.

>> Hello, please consider my comments.
>>
>> There are some cases when 0xffff can occur, but:
>> 1) Cannot trigger in our context.
>> 2) Cannot trigger in our context.
>> 3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, I guess.)
>> 4) Only triggers if string length > (size_t) 0x7fffffff (can it actually happen? I don’t think so).
>> 5) Occurs when trying to access unindexed data.
>> 6) Cannot occur in our context.
>> 7) Cannot occur in our context.
> I do not understand what those numbers relate to. Please describe it.

They refer to the possible cases in which 0xffff is returned in the icu source code (function ucnv_getNextUChar()).
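
For illustration only, here is a minimal sketch of how the different causes of 0xffff could be told apart by a caller. It does not use ICU: `read_utf8()`, `enum read_status` and the `CP_*` constants are hypothetical stand-ins that only mimic the behaviour described in this thread (0xffff for end of input, a reversed range or a truncated sequence; 0xfffd for bytes that are not valid UTF-8). The decoder is deliberately simplified and does not reject every ill-formed sequence (for example, overlong three-byte forms).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; none of them come from ICU or Tarantool. */
enum read_status {
	READ_OK = 0,
	READ_END,       /* nothing left to read: s == e */
	READ_BAD_RANGE, /* e < s, mirrors the sourceLimit < s check */
	READ_TRUNCATED, /* multi-byte sequence cut off by the buffer end */
	READ_INVALID    /* bytes that do not form valid UTF-8 */
};

#define CP_ERROR   0xffffu
#define CP_INVALID 0xfffdu

/*
 * Simplified UTF-8 reader. On success returns the code point and moves *s
 * past it. On failure returns CP_ERROR or CP_INVALID and reports the
 * cause through *st, so 0xffff alone never has to be interpreted.
 */
uint32_t
read_utf8(const char **s, const char *e, enum read_status *st)
{
	assert(s != NULL && *s != NULL && e != NULL && st != NULL);
	const unsigned char *p = (const unsigned char *)*s;
	const unsigned char *end = (const unsigned char *)e;
	size_t need;
	uint32_t c;

	if (end < p) {
		*st = READ_BAD_RANGE;
		return CP_ERROR;
	}
	if (p == end) {
		*st = READ_END;
		return CP_ERROR;
	}
	unsigned char b = *p++;
	if (b < 0x80) {
		need = 0; c = b;
	} else if (b >= 0xc2 && b <= 0xdf) {
		need = 1; c = b & 0x1f;
	} else if (b >= 0xe0 && b <= 0xef) {
		need = 2; c = b & 0x0f;
	} else if (b >= 0xf0 && b <= 0xf4) {
		need = 3; c = b & 0x07;
	} else {
		/* Invalid lead byte: it is consumed and reported. */
		*s = (const char *)p;
		*st = READ_INVALID;
		return CP_INVALID;
	}
	for (; need > 0; need--) {
		if (p == end) {
			*s = (const char *)p;
			*st = READ_TRUNCATED;
			return CP_ERROR;
		}
		if ((*p & 0xc0) != 0x80) {
			/* Not a continuation byte: leave it for the next read. */
			*s = (const char *)p;
			*st = READ_INVALID;
			return CP_INVALID;
		}
		c = (c << 6) | (uint32_t)(*p++ & 0x3f);
	}
	*s = (const char *)p;
	*st = READ_OK;
	return c;
}
```

With such a split, a caller could treat READ_END as normal termination and report READ_TRUNCATED or READ_BAD_RANGE as errors, instead of inferring the cause from 0xffff alone.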


>> 0xfffd only means that the symbol cannot be treated as a Unicode symbol.
>>
>> Shall I change it somehow then?


>>> On 17 Aug 2018, at 12:23, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>>>
>>> I had a look at the icu code and it seems like 0xffff is an error, and it is more similar to
>>> an invalid symbol than to "end of string". Check it, and fix the code so that it is treated as
>>> an error.
>>> For example, it is not handled in the main pattern loop:
>>>
>>> +       while (pattern < pattern_end) {
>>>                 c = Utf8Read(pattern, pattern_end);
>>> +               if (c == SQL_INVALID_UTF8_SYMBOL)
>>> +                       return SQL_INVALID_PATTERN;
>>>
>>> It seems like 0xffff should be checked there too.

>> No, it should not. That would only introduce a bug where, for example, ’select “” like “”’
>> is treated as an error.
> I do not understand.
> ’select “” like “”’ should not even get inside the while loop
> (because `pattern < pattern_end` is false).

Ah, you’re right, sorry. Then it just doesn’t matter, since a failed `pattern < pattern_end` check is equivalent
to the 0xffff case, according to the comment above.
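
A small sketch of that point, under stated assumptions: `stub_read()` and `scan_pattern()` are hypothetical names, and the reader is a byte-wise stand-in for Utf8Read rather than a real UTF-8 decoder. It only shows that the loop condition filters out the empty pattern before any read happens, so rejecting 0xffff inside the loop would not turn an empty pattern into an error.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_SYMBOL 0xfffdu /* stand-in for SQL_INVALID_UTF8_SYMBOL */
#define READ_ERROR     0xffffu /* the value discussed in this thread */

/*
 * Hypothetical byte-wise stand-in for Utf8Read(): returns READ_ERROR when
 * nothing is left to read, INVALID_SYMBOL for the byte 0xff, and the byte
 * value otherwise. It is not a UTF-8 decoder.
 */
unsigned
stub_read(const char **s, const char *e)
{
	if (*s >= e)
		return READ_ERROR;
	unsigned char b = (unsigned char)*(*s)++;
	return b == 0xff ? INVALID_SYMBOL : b;
}

/*
 * Walks the pattern like the loop quoted above, with an extra READ_ERROR
 * check. Returns 0 on success and -1 for a rejected pattern; *reads
 * receives the number of stub_read() calls that were made.
 */
int
scan_pattern(const char *pattern, const char *pattern_end, size_t *reads)
{
	assert(pattern != NULL && pattern_end != NULL && reads != NULL);
	*reads = 0;
	while (pattern < pattern_end) {
		unsigned c = stub_read(&pattern, pattern_end);
		(*reads)++;
		if (c == INVALID_SYMBOL || c == READ_ERROR)
			return -1;
	}
	return 0;
}
```

When pattern == pattern_end the loop body is never entered, so `scan_pattern()` returns 0 without a single read, even though 0xffff is rejected inside the loop.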

--
WBR, Nikita Tatunov.
n.tatunov@tarantool.org
