From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nikita Tatunov
Message-Id: <87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Date: Sun, 9 Sep 2018 16:33:50 +0300
In-Reply-To: <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org>
References: <43febf82af3702fadfea135db978ffb6426eb00d.1534436836.git.n.tatunov@tarantool.org> <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb> <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org>
Reply-To: tarantool-patches@freelists.org
List-Id: tarantool-patches
To: tarantool-patches@freelists.org
Cc: Alexander Turenko, "N.Tatunov"

> On 17 Aug 2018, at 14:42, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>
> On 17.08.2018 14:17, Alexander Turenko wrote:
>> 0xffff is the result of the 'end of a string' check as well as of the
>> internal buffer overflow error. I pasted the relevant code in the first
>> review of the patch (July 18).
>>
>> // source/common/ucnv.c::ucnv_getNextUChar
>>     s=*source;
>>     if(sourceLimit<s) {
>>         *err=U_ILLEGAL_ARGUMENT_ERROR;
>>         return 0xffff;
>>     }
>>
>> We should not handle the buffer overflow case as an invalid symbol. Of
>> course, we should not handle it as the 'end of the string' situation either.
>> Ideally we should perform the pointer check ourselves and raise an error in
>> case of 0xffff. I had thought that a buffer overflow error is unlikely to
>> occur, but you are right: we should differentiate these situations.
>>
>> In one of the previous versions of the patch we performed this check like so:
>>
>> #define Utf8Read(s, e) (((s) < (e)) ?\
>> 	ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)
>>
>> Not sure why it was changed. Maybe it is an attempt to correctly handle
>> the '\0' symbol (it is a valid Unicode character)?
> The define you have pasted can return 0xffff.
> The reasons to change it back are described in the previous patchset.
> In short:
> 1. It is equivalent to
>    a. checking s < e in a while loop
>    b. reading the next character inside of the while loop body.
> 2. In some usages of the code this check (s < e) was redundant (it was
>    performed a couple of lines above).
> 3. There is no reason to rewrite the old version of this function. (So we
>    decided to use the old version of the function.)
>> So I see three ways to proceed:
>>
>> 1. Lean on ICU's check and ignore the possibility of the buffer overflow.
>> 2. Use our own check and possibly meet '\0' problems.
>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat it as the end of a string,
>>    and raise the error for any other 0xffff.
>>
>> Alex, what do you suggest here?
> As I understand, by now 0xffff is used ONLY to handle the case of an
> unexpectedly ended symbol.
> E.g. some symbol consists of 2 characters, but the length of the input
> buffer is 1.
> In my opinion this is the same as an invalid symbol.
>
> I guess that an internal buffer overflow cannot occur in the
> `ucnv_getNextUChar` function.
>
> I suppose that it is Nikita's duty to investigate this problem and explain
> it to us all. I have just noticed a strange usage.

Hello, please consider my comments.

There are some cases when 0xffff can occur, but:
1) Cannot trigger in our context.
2) Cannot trigger in our context.
3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, I guess.)
4) Only triggers if string length > (size_t) 0x7fffffff (can it actually happen? I don’t think so).
5) Occurs when trying to access data that is not indexed.
6) Cannot occur in our context.
7) Cannot occur in our context.

0xfffd only means that the symbol cannot be treated as a Unicode symbol.

Shall I change it somehow then?
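
To make the distinction between the two return values concrete, here is a
minimal sketch. It does not use ICU: sketch_utf8_read() and both enum names
are hypothetical stand-ins written for this illustration, and the decoder is
deliberately simplified (it does not reject overlong forms or surrogates). It
only assumes what was said above: 0xffff means there is nothing complete left
to read (end of the range, or a symbol cut off by the end of the buffer), and
0xfffd means the bytes are present but are not a valid symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinel values discussed in this thread; the names are hypothetical. */
enum {
	SKETCH_END_OR_ERROR = 0xffff,	/* nothing complete left in the range */
	SKETCH_INVALID_SYMBOL = 0xfffd	/* bytes present, but not a valid symbol */
};

/*
 * Hypothetical stand-in for the reader (NOT the ICU implementation).
 * Decodes one code point from [*s, e) and advances *s.
 */
uint32_t
sketch_utf8_read(const unsigned char **s, const unsigned char *e)
{
	assert(s != NULL && *s != NULL && e != NULL);
	if (*s >= e)
		return SKETCH_END_OR_ERROR;	/* empty or reversed range */
	unsigned char lead = *(*s)++;
	if (lead < 0x80)
		return lead;			/* ASCII, including '\0' */
	int extra;
	uint32_t c;
	if ((lead & 0xe0) == 0xc0) {
		extra = 1;
		c = lead & 0x1f;
	} else if ((lead & 0xf0) == 0xe0) {
		extra = 2;
		c = lead & 0x0f;
	} else if ((lead & 0xf8) == 0xf0) {
		extra = 3;
		c = lead & 0x07;
	} else {
		return SKETCH_INVALID_SYMBOL;	/* stray continuation or bad lead byte */
	}
	if (e - *s < extra) {
		*s = e;				/* symbol cut off by the end of the buffer */
		return SKETCH_END_OR_ERROR;
	}
	for (int i = 0; i < extra; i++) {
		unsigned char b = **s;
		if ((b & 0xc0) != 0x80)
			return SKETCH_INVALID_SYMBOL;
		(*s)++;
		c = (c << 6) | (b & 0x3f);
	}
	return c;
}
```

Note that a '\0' byte inside the range comes back as 0, so it stays
distinguishable from both sentinels.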


> On 17 Aug 2018, at 12:23, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>
> I had a look at the ICU code and it seems like 0xffff is an error, and it is
> more similar to an invalid symbol than to "end of string". Check it, and fix
> the code so that it is treated as an error.
> For example, it is not handled in the main pattern loop:
>
> +	while (pattern < pattern_end) {
> 		c = Utf8Read(pattern, pattern_end);
> +		if (c == SQL_INVALID_UTF8_SYMBOL)
> +			return SQL_INVALID_PATTERN;
>
> It seems like 0xffff should be checked there too.

No, it should not. That check would only introduce a bug: for example,
'select "" like ""' would be treated as an error.
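
A small sketch of the failure mode, under a loud assumption: the comparison
loop learns about the end of the pattern from the reader's return value
rather than from a pointer check (the thread does not show that part of the
patch). Everything here is hypothetical and simplified: sketch_read() reads
single bytes, sketch_compare() does literal comparison without wildcards, and
the end_is_error flag exists only to switch between the two behaviours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SKETCH_END = 0xffff };	/* sentinel returned once the range is exhausted */

enum sketch_result { SKETCH_MATCH, SKETCH_NOMATCH, SKETCH_INVALID_PATTERN };

/* Hypothetical single-byte reader over [*s, e). */
unsigned
sketch_read(const char **s, const char *e)
{
	assert(s != NULL && *s != NULL && e != NULL);
	return *s < e ? (unsigned char)*(*s)++ : SKETCH_END;
}

/*
 * Literal comparison whose loop is driven by the reader's return value.
 * With end_is_error set, the sentinel is rejected as a broken pattern.
 */
enum sketch_result
sketch_compare(const char *pat, size_t pat_len,
	       const char *str, size_t str_len, bool end_is_error)
{
	const char *pat_end = pat + pat_len;
	const char *str_end = str + str_len;
	for (;;) {
		unsigned c = sketch_read(&pat, pat_end);
		if (c == SKETCH_END) {
			if (end_is_error)
				return SKETCH_INVALID_PATTERN;
			/* Pattern consumed: match only if the string is consumed too. */
			return str == str_end ? SKETCH_MATCH : SKETCH_NOMATCH;
		}
		/* An exhausted string yields SKETCH_END, which never equals a byte. */
		if (sketch_read(&str, str_end) != c)
			return SKETCH_NOMATCH;
	}
}
```

With end_is_error set to false, an empty pattern against an empty string is a
match; with it set to true, the very first read returns the sentinel and the
same query is reported as an invalid pattern.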

--
WBR, Nikita Tatunov.
