From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 372EC2B590 for ; Tue, 11 Sep 2018 09:31:23 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jtlV4139fplv for ; Tue, 11 Sep 2018 09:31:23 -0400 (EDT) Received: from smtp36.i.mail.ru (smtp36.i.mail.ru [94.100.177.96]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 2AA562B58F for ; Tue, 11 Sep 2018 09:31:22 -0400 (EDT) From: Nikita Tatunov Message-Id: <45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org> Content-Type: multipart/alternative; boundary="Apple-Mail=_84B8E4C2-8A38-41F9-B8E8-1B4CD64ED4F8" Mime-Version: 1.0 (Mac OS X Mail 11.5 \(3445.9.1\)) Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue Date: Tue, 11 Sep 2018 16:31:18 +0300 In-Reply-To: <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org> References: <43febf82af3702fadfea135db978ffb6426eb00d.1534436836.git.n.tatunov@tarantool.org> <20180817111727.y6nsbblpm5nh4n3g@tkn_work_nb> <436d256a-f9d0-781f-8cad-179d7322c7bd@tarantool.org> <87897608-173E-45EB-80A1-8B249706D8A1@tarantool.org> <6a1352e9-425c-d656-1bec-bb04d9f0fee6@tarantool.org> <58B407E2-AF5D-4531-A9FF-9DC57CE0070B@tarantool.org> <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org> Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: Alex Khatskevich Cc: tarantool-patches@freelists.org, Alexander Turenko , korablev@tarantool.org --Apple-Mail=_84B8E4C2-8A38-41F9-B8E8-1B4CD64ED4F8 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On 11 Sep 2018, at 13:06, Alex Khatskevich = wrote: >=20 >=20 >=20 > On 11.09.2018 09:06, Nikita Tatunov wrote: >>=20 >>=20 >>> On 11 Sep 2018, at 01:20, Alex Khatskevich = > = wrote: >>>=20 >>>>=20 >>>>=20 >>>>> On 17 Aug 2018, at 14:42, Alex Khatskevich = > = wrote: >>>>>=20 >>>>>=20 >>>>> On 17.08.2018 14:17, Alexander Turenko wrote: >>>>>> 0xffff is the result of 'end of a string' check as well as = internal buffer >>>>>> overflow error. I have the relevant code pasted in the first = review of >>>>>> the patch (July, 18). >>>>>>=20 >>>>>> // source/common/ucnv.c::ucnv_getNextUChar >>>>>> 1860 s=3D*source; >>>>>> 1861 if(sourceLimit>>>>> 1862 *err=3DU_ILLEGAL_ARGUMENT_ERROR; >>>>>> 1863 return 0xffff; >>>>>> 1864 } >>>>>>=20 >>>>>> We should not handle the buffer overflow case as an invalid = symbol. Of >>>>>> course we should not handle it as the 'end of the string' = situation. >>>>>> Ideally we should perform pointer myself and raise an error in = case of >>>>>> 0xffff. I had thought that a buffer overflow error is unlikely to = meet, >>>>>> but you are right: we should differentiate these situations. >>>>>>=20 >>>>>> In one of the previous version of a patch we perform this check = like so: >>>>>>=20 >>>>>> #define Utf8Read(s, e) (((s) < (e)) ?\ >>>>>> ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) >>>>>>=20 >>>>>> Don't sure why it was changed. Maybe it is try to correctly = handle '\0' >>>>>> symbol (it is valid unicode character)? >>>>> The define you have pasted can return 0xffff. >>>>> The reasons to change it back are described in the previous = patchset. >>>>> In short: >>>>> 1. It is equivalent to >>>>> a. check s < e in a while loop >>>>> b. read next character inside of where loop body. >>>>> 2. In some usages of the code this check (s>>>> 3. There is no reason to rewrite the old version of this function. = (So, we decided to use old version of the function) >>>>>> So I see two ways to proceed: >>>>>>=20 >>>>>> 1. Lean on icu's check and ignore possibility of the buffer = overflow. >>>>>> 2. Use our own check and possibly meet '\0' problems. >>>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a = string, raise >>>>>> the error for other 0xffff. >>>>>>=20 >>>>>> Alex, what do you suggests here? >>>>> As I understand, by now the 0xffff is used ONLY to handle the case = of unexpectedly ended symbol. >>>>> E.g. some symbol consists of 2 characters, but the length of the = input buffer is 1. >>>>> In my opinion this is the same as an invalid symbol. >>>>>=20 >>>>> I guess that internal buffer overflow cannot occur in the = `ucnv_getNextChar` function. >>>>>=20 >>>>> I suppose that it is Nikitas duty to investigate this problem and = explain it to us all. I just have noticed a strange usage. >>>>=20 >>>>=20 >>>> Hello, please consider my comments. >>>>=20 >>>> There are some cases when 0xffff can occur, but: >>>> 1) Cannot trigger in our context. >>>> 2) Cannot trigger in our context. >>>> 3) Only triggers if end < start. (Cannot happen in = sql_utf8_pattern_compare, i guess) >>>> 4) Only triggers if string length > (size_t) 0x7ffffffff (can it = actually happen? I don=E2=80=99t think so). >>>> 5) Occurs when trying to access to not unindexed data. >>>> 6) Cannot occur in our context. >>>> 7) Cannot occur in our context. >>> I do not understand what are those numbers related to. Please, = describe it. >>=20 >> They are related to possible cases returning 0xffff from icu source = code (function ucnv_getNextUChar()). > Can you just copy it here, so that anyone interested in that = conversation can > analyze it without looking for source files? Ok then: U_CAPI UChar32 U_EXPORT2 ucnv_getNextUChar(UConverter *cnv, const char **source, const char *sourceLimit, UErrorCode *err) { UConverterToUnicodeArgs args; UChar buffer[U16_MAX_LENGTH]; const char *s; UChar32 c; int32_t i, length; /* check parameters */ if(err=3D=3DNULL || U_FAILURE(*err)) { return 0xffff; } if(cnv=3D=3DNULL || source=3D=3DNULL) { *err=3DU_ILLEGAL_ARGUMENT_ERROR; return 0xffff; } s=3D*source; if(sourceLimit(size_t)0x7fffffff && sourceLimit>s)) { *err=3DU_ILLEGAL_ARGUMENT_ERROR; return 0xffff; } c=3DU_SENTINEL; /* flush the target overflow buffer */ if(cnv->UCharErrorBufferLength>0) { UChar *overflow; overflow=3Dcnv->UCharErrorBuffer; i=3D0; length=3Dcnv->UCharErrorBufferLength; U16_NEXT(overflow, i, length, c); /* move the remaining overflow contents up to the beginning */ if((cnv->UCharErrorBufferLength=3D(int8_t)(length-i))>0) { uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+i, cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR); } if(!U16_IS_LEAD(c) || itoULength=3D=3D0 && = cnv->sharedData->impl->getNextUChar!=3DNULL) { c=3Dcnv->sharedData->impl->getNextUChar(&args, err); *source=3Ds=3Dargs.source; if(*err=3D=3DU_INDEX_OUTOFBOUNDS_ERROR) { /* reset the converter without calling the callback = function */ _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE); return 0xffff; /* no output */ } else if(U_SUCCESS(*err) && c>=3D0) { return c; /* * else fall through to use _toUnicode() because * UCNV_GET_NEXT_UCHAR_USE_TO_U: the native function did = not want to handle it after all * U_FAILURE: call _toUnicode() for callback handling (do = not output c) */ } } /* convert to one UChar in buffer[0], or handle getNextUChar() = errors */ _toUnicodeWithCallback(&args, err); if(*err=3D=3DU_BUFFER_OVERFLOW_ERROR) { *err=3DU_ZERO_ERROR; } i=3D0; length=3D(int32_t)(args.target-buffer); } else { /* write the lead surrogate from the overflow buffer */ buffer[0]=3D(UChar)c; args.target=3Dbuffer+1; i=3D0; length=3D1; } /* buffer contents starts at i and ends before length */ if(U_FAILURE(*err)) { c=3D0xffff; /* no output */ } else if(length=3D=3D0) { /* no input or only state changes */ *err=3DU_INDEX_OUTOFBOUNDS_ERROR; /* no need to reset explicitly because _toUnicodeWithCallback() = did it */ c=3D0xffff; /* no output */ } else { c=3Dbuffer[0]; i=3D1; if(!U16_IS_LEAD(c)) { /* consume c=3Dbuffer[0], done */ } else { /* got a lead surrogate, see if a trail surrogate follows */ UChar c2; if(cnv->UCharErrorBufferLength>0) { /* got overflow output from the conversion */ if(U16_IS_TRAIL(c2=3Dcnv->UCharErrorBuffer[0])) { /* got a trail surrogate, too */ c=3DU16_GET_SUPPLEMENTARY(c, c2); /* move the remaining overflow contents up to the = beginning */ if((--cnv->UCharErrorBufferLength)>0) { uprv_memmove(cnv->UCharErrorBuffer, = cnv->UCharErrorBuffer+1, = cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR); } } else { /* c is an unpaired lead surrogate, just return it = */ } } else if(args.sourceUCharErrorBufferLength)>0) { uprv_memmove(cnv->UCharErrorBuffer+delta, = cnv->UCharErrorBuffer, length*U_SIZEOF_UCHAR); } cnv->UCharErrorBufferLength=3D(int8_t)(length+delta); cnv->UCharErrorBuffer[0]=3Dbuffer[i++]; if(delta>1) { cnv->UCharErrorBuffer[1]=3Dbuffer[i]; } } *source=3Dargs.source; return c; } -- WBR, Nikita Tatunov. n.tatunov@tarantool.org --Apple-Mail=_84B8E4C2-8A38-41F9-B8E8-1B4CD64ED4F8 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8

On 11 Sep 2018, at 13:06, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:

=20 =20



On 11.09.2018 09:06, Nikita Tatunov wrote:


On 11 Sep 2018, at 01:20, Alex Khatskevich = <avkhatskevich@tarantool.org> wrote:



On 17 Aug 2018, at 14:42, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:


On 17.08.2018 14:17, Alexander Turenko wrote:
0xffff is the result of 'end of a string' check as well as internal buffer
overflow error. I have the relevant code pasted in the first review of
the patch (July, 18).

// source/common/ucnv.c::ucnv_getNextUChar
1860     s=3D*source;
1861     if(sourceLimit<s) = {
1862 =         *err=3DU_ILLEGAL_ARGUMENT_= ERROR;
1863 =         return 0xffff;
1864     }

We should not handle the buffer overflow case as an invalid symbol. Of
course we should not handle it as the 'end of the string' situation.
Ideally we should perform pointer myself and raise an error in case of
0xffff. I had thought that a buffer overflow error is unlikely to meet,
but you are right: we should differentiate these situations.

In one of the previous version of a patch we perform this check like so:

#define Utf8Read(s, e) (((s) < (e)) ?\
ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)

Don't sure why it was changed. Maybe it is try to correctly handle '\0'
symbol (it is valid unicode character)?
The define you have pasted can return 0xffff.
The reasons to change it back are described in the previous patchset.
In short:
1. It is equivalent = to
   a. check s < e in a while loop
   b. read next character inside of where loop = body.
2. In some usages of the code this check (s<e) was redundant (it was performed a couple lines above)
3. There is no reason to rewrite the old version of this function. (So, we decided to use old version of the = function)
So I see two ways to proceed:

1. Lean on icu's check and ignore possibility of the buffer overflow.
2. Use our own check and possibly meet '\0' problems.
3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, raise
   the error for other 0xffff.

Alex, what do you suggests here?
As I understand, by now the 0xffff is used ONLY to handle the case of unexpectedly ended symbol.
E.g. some symbol consists of 2 characters, but the length of the input buffer is 1.
In my opinion this is the same as an invalid symbol.

I guess that internal buffer overflow cannot occur in the `ucnv_getNextChar` function.

I suppose that it is Nikitas duty to investigate this problem and explain it to us all. I just have noticed a strange = usage.

Hello, please consider my comments.

There are some cases when 0xffff can = occur, but:
1) Cannot trigger in our context.
2) Cannot trigger in our = context.
3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, i = guess)
4) Only triggers if string length > (size_t) 0x7ffffffff (can it actually happen? I = don=E2=80=99t think so).
5) Occurs when trying to access to not unindexed data.
6) Cannot occur in our context.
7) Cannot occur in our context.
I= do not understand what are those numbers related to. Please, describe it.

They are related to possible cases returning = 0xffff from icu source code (function ucnv_getNextUChar()).
Can you just copy it here, so that anyone interested in that conversation can
analyze it without looking for source files?

Ok then:

U_CAPI UChar32 = U_EXPORT2
ucnv_getNextUChar(UConverter *cnv,
  =                 const char = **source, const char *sourceLimit,
        =           UErrorCode *err) {
  =   UConverterToUnicodeArgs args;
    UChar = buffer[U16_MAX_LENGTH];
    const char = *s;
    UChar32 c;
    int32_t = i, length;

    /* check = parameters */
    if(err=3D=3DNULL || = U_FAILURE(*err)) {
        return = 0xffff;
    }

    if(cnv=3D=3DNULL || source=3D=3DNULL) = {
        = *err=3DU_ILLEGAL_ARGUMENT_ERROR;
        = return 0xffff;
    }

    s=3D*source;
    = if(sourceLimit<s) {
        = *err=3DU_ILLEGAL_ARGUMENT_ERROR;
        = return 0xffff;
    }

    /*
     * = Make sure that the buffer sizes do not exceed the number range = for
     * int32_t because some functions use = the size (in units or bytes)
     * rather than = comparing pointers, and because offsets are int32_t = values.
     *
     * = size_t is guaranteed to be unsigned and large enough for the = job.
     *
     * = Return with an error instead of adjusting the limits because we = would
     * not be able to maintain the = semantics that either the source must be
     * = consumed or the target filled (unless an error occurs).
  =    * An adjustment would be sourceLimit=3Dt+0x7fffffff; for = example.
     */
    = if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && = sourceLimit>s)) {
        = *err=3DU_ILLEGAL_ARGUMENT_ERROR;
        = return 0xffff;
    }

    c=3DU_SENTINEL;

    /* flush the target overflow buffer = */
    if(cnv->UCharErrorBufferLength>0) = {
        UChar *overflow;

        = overflow=3Dcnv->UCharErrorBuffer;
      =   i=3D0;
        = length=3Dcnv->UCharErrorBufferLength;
      =   U16_NEXT(overflow, i, length, c);

        /* move the remaining = overflow contents up to the beginning */
      =   if((cnv->UCharErrorBufferLength=3D(int8_t)(length-i))>0) = {
            = uprv_memmove(cnv->UCharErrorBuffer, = cnv->UCharErrorBuffer+i,
          =               =  cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
  =       }

    =     if(!U16_IS_LEAD(c) || i<length) {
  =           return c;
    =     }
        = /*
         * Continue if the = overflow buffer contained only a lead surrogate,
    =      * in case the converter outputs single surrogates = from complete
         * input = sequences.
        =  */
    }

    /*
     * = flush=3D=3DTRUE is implied for ucnv_getNextUChar()
  =    *
     * do not simply return even = if s=3D=3DsourceLimit because the converter may
    =  * not have seen flush=3D=3DTRUE before
    =  */

    /* prepare = the converter arguments */
    = args.converter=3Dcnv;
    = args.flush=3DTRUE;
    = args.offsets=3DNULL;
    = args.source=3Ds;
    = args.sourceLimit=3DsourceLimit;
    = args.target=3Dbuffer;
    = args.targetLimit=3Dbuffer+1;
    = args.size=3Dsizeof(args);

  =   if(c<0) {
        = /*
         * call the native = getNextUChar() implementation if we are
      =    * at a character boundary (toULength=3D=3D0)
 =        *
        =  * unlike with _toUnicode(), getNextUChar() implementations must = set
         * U_TRUNCATED_CHAR_FOUND = for truncated input,
         * in = addition to setting toULength/toUBytes[]
      =    */
        = if(cnv->toULength=3D=3D0 && = cnv->sharedData->impl->getNextUChar!=3DNULL) {
  =           = c=3Dcnv->sharedData->impl->getNextUChar(&args, = err);
            = *source=3Ds=3Dargs.source;
          =   if(*err=3D=3DU_INDEX_OUTOFBOUNDS_ERROR) {
    =             /* reset the converter without = calling the callback function */
        =         _reset(cnv, UCNV_RESET_TO_UNICODE, = FALSE);
              =   return 0xffff; /* no output */
      =       } else if(U_SUCCESS(*err) && c>=3D0) = {
                = return c;
            = /*
             * else fall = through to use _toUnicode() because
      =        *   UCNV_GET_NEXT_UCHAR_USE_TO_U: the = native function did not want to handle it after all
  =            *   U_FAILURE: call = _toUnicode() for callback handling (do not output c)
  =            */
    =         }
        = }

        /* = convert to one UChar in buffer[0], or handle getNextUChar() errors = */
        = _toUnicodeWithCallback(&args, err);

        = if(*err=3D=3DU_BUFFER_OVERFLOW_ERROR) {
      =       *err=3DU_ZERO_ERROR;
      =   }

        = i=3D0;
        = length=3D(int32_t)(args.target-buffer);
    } else = {
        /* write the lead surrogate from = the overflow buffer */
        = buffer[0]=3D(UChar)c;
        = args.target=3Dbuffer+1;
        = i=3D0;
        length=3D1;
  =   }

    /* buffer = contents starts at i and ends before length */

    if(U_FAILURE(*err)) = {
        c=3D0xffff; /* no output = */
    } else if(length=3D=3D0) {
  =       /* no input or only state changes = */
        = *err=3DU_INDEX_OUTOFBOUNDS_ERROR;
        = /* no need to reset explicitly because _toUnicodeWithCallback() did it = */
        c=3D0xffff; /* no output = */
    } else {
      =   c=3Dbuffer[0];
        = i=3D1;
        if(!U16_IS_LEAD(c)) = {
            /* consume = c=3Dbuffer[0], done */
        } else = {
            /* got a lead = surrogate, see if a trail surrogate follows */
    =         UChar c2;

            = if(cnv->UCharErrorBufferLength>0) {
      =           /* got overflow output from the = conversion */
              =   if(U16_IS_TRAIL(c2=3Dcnv->UCharErrorBuffer[0])) = {
                =     /* got a trail surrogate, too */
    =                 = c=3DU16_GET_SUPPLEMENTARY(c, c2);

              =       /* move the remaining overflow contents up to the = beginning */
              =       if((--cnv->UCharErrorBufferLength)>0) = {
                =         uprv_memmove(cnv->UCharErrorBuffer, = cnv->UCharErrorBuffer+1,
          =                     =       =  cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
  =                   = }
                } = else {
                =     /* c is an unpaired lead surrogate, just return it = */
                = }
            } else = if(args.source<sourceLimit) {
        =         /* convert once more, to buffer[1] = */
                = args.targetLimit=3Dbuffer+2;
        =         _toUnicodeWithCallback(&args, = err);
                = if(*err=3D=3DU_BUFFER_OVERFLOW_ERROR) {
      =               = *err=3DU_ZERO_ERROR;
            =     }

      =           = length=3D(int32_t)(args.target-buffer);
      =           if(U_SUCCESS(*err) && = length=3D=3D2 && U16_IS_TRAIL(c2=3Dbuffer[1])) = {
                =     /* got a trail surrogate, too */
    =                 = c=3DU16_GET_SUPPLEMENTARY(c, c2);
        =             i=3D2;
    =             }
    =         }
        = }
    }

  =   /*
     * move leftover output from = buffer[i..length[
     * into the beginning of = the overflow buffer
     */
  =   if(i<length) {
        /* move = further overflow back */
        int32_t = delta=3Dlength-i;
        = if((length=3Dcnv->UCharErrorBufferLength)>0) {
  =           = uprv_memmove(cnv->UCharErrorBuffer+delta, = cnv->UCharErrorBuffer,
          =               =  length*U_SIZEOF_UCHAR);
        = }
        = cnv->UCharErrorBufferLength=3D(int8_t)(length+delta);

        = cnv->UCharErrorBuffer[0]=3Dbuffer[i++];
    =     if(delta>1) {
        =     cnv->UCharErrorBuffer[1]=3Dbuffer[i];
  =       }
    }

    = *source=3Dargs.source;
    return = c;
}

--
WBR, Nikita Tatunov.

= --Apple-Mail=_84B8E4C2-8A38-41F9-B8E8-1B4CD64ED4F8--