Tarantool development patches archive
From: Nikita Tatunov <n.tatunov@tarantool.org>
To: Alex Khatskevich <avkhatskevich@tarantool.org>
Cc: tarantool-patches@freelists.org,
	Alexander Turenko <alexander.turenko@tarantool.org>,
	korablev@tarantool.org
Subject: [tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Date: Tue, 11 Sep 2018 16:31:18 +0300
Message-ID: <45338A27-C589-4330-B206-A4E379A4DE75@tarantool.org>
In-Reply-To: <860a125b-19f3-3bf1-8705-25156ff508ab@tarantool.org>

> On 11 Sep 2018, at 13:06, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
> 
> 
> 
> On 11.09.2018 09:06, Nikita Tatunov wrote:
>> 
>> 
>>> On 11 Sep 2018, at 01:20, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>>> 
>>>> 
>>>> 
>>>>> On 17 Aug 2018, at 14:42, Alex Khatskevich <avkhatskevich@tarantool.org> wrote:
>>>>> 
>>>>> 
>>>>> On 17.08.2018 14:17, Alexander Turenko wrote:
>>>>>> 0xffff is the result of 'end of a string' check as well as internal buffer
>>>>>> overflow error. I have the relevant code pasted in the first review of
>>>>>> the patch (July, 18).
>>>>>> 
>>>>>> // source/common/ucnv.c::ucnv_getNextUChar
>>>>>>     s=*source;
>>>>>>     if(sourceLimit<s) {
>>>>>>         *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>>>>         return 0xffff;
>>>>>>     }
>>>>>> 
>>>>>> We should not handle the buffer overflow case as an invalid symbol. Of
>>>>>> course, we should not handle it as the 'end of the string' situation either.
>>>>>> Ideally we should perform the pointer check ourselves and raise an error
>>>>>> in case of 0xffff. I had thought that a buffer overflow error was unlikely
>>>>>> to occur, but you are right: we should differentiate these situations.
>>>>>> 
>>>>>> In one of the previous versions of the patch we performed this check like so:
>>>>>> 
>>>>>> #define Utf8Read(s, e) (((s) < (e)) ?\
>>>>>> 	ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)
>>>>>> 
>>>>>> I'm not sure why it was changed. Maybe it was an attempt to correctly
>>>>>> handle the '\0' symbol (it is a valid Unicode character)?
>>>>> The define you have pasted can return 0xffff.
>>>>> The reasons to change it back are described in the previous patchset.
>>>>> In short:
>>>>> 1. It is equivalent to
>>>>>    a. check s < e in a while loop
>>>>>    b. read the next character inside of the while loop body.
>>>>> 2. In some usages of the code this check (s<e) was redundant (it was performed a couple lines above)
>>>>> 3. There is no reason to rewrite the old version of this function. (So, we decided to use old version of the function)
>>>>>> So I see three ways to proceed:
>>>>>> 
>>>>>> 1. Lean on icu's check and ignore possibility of the buffer overflow.
>>>>>> 2. Use our own check and possibly meet '\0' problems.
>>>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, raise
>>>>>>    the error for other 0xffff.
>>>>>> 
>>>>>> Alex, what do you suggest here?
>>>>> As I understand it, 0xffff is currently used ONLY to handle the case of an unexpectedly truncated symbol.
>>>>> E.g. some symbol consists of 2 bytes, but the length of the input buffer is 1.
>>>>> In my opinion this is the same as an invalid symbol.
>>>>> 
>>>>> I guess that an internal buffer overflow cannot occur in the `ucnv_getNextUChar` function.
>>>>> 
>>>>> I suppose that it is Nikita's duty to investigate this problem and explain it to us all. I have just noticed a strange usage.
>>>> 
>>>> 
>>>> Hello, please consider my comments.
>>>> 
>>>> There are some cases when 0xffff can occur, but:
>>>> 	1) Cannot trigger in our context.
>>>> 	2) Cannot trigger in our context.
>>>> 	3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare(), I guess.)
>>>> 	4) Only triggers if string length > (size_t) 0x7fffffff (can it actually happen? I don’t think so).
>>>> 	5) Occurs when trying to access data that is out of bounds (U_INDEX_OUTOFBOUNDS_ERROR).
>>>> 	6) Cannot occur in our context.
>>>> 	7) Cannot occur in our context.
>>> I do not understand what those numbers relate to. Please describe them.
>> 
>> They refer to the places in the ICU source code (function ucnv_getNextUChar()) that can return 0xffff, in order of appearance.
> Can you just copy it here, so that anyone interested in that conversation can
> analyze it without looking for source files?

Ok then:

U_CAPI UChar32 U_EXPORT2
ucnv_getNextUChar(UConverter *cnv,
                  const char **source, const char *sourceLimit,
                  UErrorCode *err) {
    UConverterToUnicodeArgs args;
    UChar buffer[U16_MAX_LENGTH];
    const char *s;
    UChar32 c;
    int32_t i, length;

    /* check parameters */
    if(err==NULL || U_FAILURE(*err)) {
        return 0xffff;
    }

    if(cnv==NULL || source==NULL) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    s=*source;
    if(sourceLimit<s) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    /*
     * Make sure that the buffer sizes do not exceed the number range for
     * int32_t because some functions use the size (in units or bytes)
     * rather than comparing pointers, and because offsets are int32_t values.
     *
     * size_t is guaranteed to be unsigned and large enough for the job.
     *
     * Return with an error instead of adjusting the limits because we would
     * not be able to maintain the semantics that either the source must be
     * consumed or the target filled (unless an error occurs).
     * An adjustment would be sourceLimit=t+0x7fffffff; for example.
     */
    if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    c=U_SENTINEL;

    /* flush the target overflow buffer */
    if(cnv->UCharErrorBufferLength>0) {
        UChar *overflow;

        overflow=cnv->UCharErrorBuffer;
        i=0;
        length=cnv->UCharErrorBufferLength;
        U16_NEXT(overflow, i, length, c);

        /* move the remaining overflow contents up to the beginning */
        if((cnv->UCharErrorBufferLength=(int8_t)(length-i))>0) {
            uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+i,
                         cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
        }

        if(!U16_IS_LEAD(c) || i<length) {
            return c;
        }
        /*
         * Continue if the overflow buffer contained only a lead surrogate,
         * in case the converter outputs single surrogates from complete
         * input sequences.
         */
    }

    /*
     * flush==TRUE is implied for ucnv_getNextUChar()
     *
     * do not simply return even if s==sourceLimit because the converter may
     * not have seen flush==TRUE before
     */

    /* prepare the converter arguments */
    args.converter=cnv;
    args.flush=TRUE;
    args.offsets=NULL;
    args.source=s;
    args.sourceLimit=sourceLimit;
    args.target=buffer;
    args.targetLimit=buffer+1;
    args.size=sizeof(args);

    if(c<0) {
        /*
         * call the native getNextUChar() implementation if we are
         * at a character boundary (toULength==0)
         *
         * unlike with _toUnicode(), getNextUChar() implementations must set
         * U_TRUNCATED_CHAR_FOUND for truncated input,
         * in addition to setting toULength/toUBytes[]
         */
        if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
            c=cnv->sharedData->impl->getNextUChar(&args, err);
            *source=s=args.source;
            if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
                /* reset the converter without calling the callback function */
                _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
                return 0xffff; /* no output */
            } else if(U_SUCCESS(*err) && c>=0) {
                return c;
            /*
             * else fall through to use _toUnicode() because
             *   UCNV_GET_NEXT_UCHAR_USE_TO_U: the native function did not want to handle it after all
             *   U_FAILURE: call _toUnicode() for callback handling (do not output c)
             */
            }
        }

        /* convert to one UChar in buffer[0], or handle getNextUChar() errors */
        _toUnicodeWithCallback(&args, err);

        if(*err==U_BUFFER_OVERFLOW_ERROR) {
            *err=U_ZERO_ERROR;
        }

        i=0;
        length=(int32_t)(args.target-buffer);
    } else {
        /* write the lead surrogate from the overflow buffer */
        buffer[0]=(UChar)c;
        args.target=buffer+1;
        i=0;
        length=1;
    }

    /* buffer contents starts at i and ends before length */

    if(U_FAILURE(*err)) {
        c=0xffff; /* no output */
    } else if(length==0) {
        /* no input or only state changes */
        *err=U_INDEX_OUTOFBOUNDS_ERROR;
        /* no need to reset explicitly because _toUnicodeWithCallback() did it */
        c=0xffff; /* no output */
    } else {
        c=buffer[0];
        i=1;
        if(!U16_IS_LEAD(c)) {
            /* consume c=buffer[0], done */
        } else {
            /* got a lead surrogate, see if a trail surrogate follows */
            UChar c2;

            if(cnv->UCharErrorBufferLength>0) {
                /* got overflow output from the conversion */
                if(U16_IS_TRAIL(c2=cnv->UCharErrorBuffer[0])) {
                    /* got a trail surrogate, too */
                    c=U16_GET_SUPPLEMENTARY(c, c2);

                    /* move the remaining overflow contents up to the beginning */
                    if((--cnv->UCharErrorBufferLength)>0) {
                        uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+1,
                                     cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
                    }
                } else {
                    /* c is an unpaired lead surrogate, just return it */
                }
            } else if(args.source<sourceLimit) {
                /* convert once more, to buffer[1] */
                args.targetLimit=buffer+2;
                _toUnicodeWithCallback(&args, err);
                if(*err==U_BUFFER_OVERFLOW_ERROR) {
                    *err=U_ZERO_ERROR;
                }

                length=(int32_t)(args.target-buffer);
                if(U_SUCCESS(*err) && length==2 && U16_IS_TRAIL(c2=buffer[1])) {
                    /* got a trail surrogate, too */
                    c=U16_GET_SUPPLEMENTARY(c, c2);
                    i=2;
                }
            }
        }
    }

    /*
     * move leftover output from buffer[i..length[
     * into the beginning of the overflow buffer
     */
    if(i<length) {
        /* move further overflow back */
        int32_t delta=length-i;
        if((length=cnv->UCharErrorBufferLength)>0) {
            uprv_memmove(cnv->UCharErrorBuffer+delta, cnv->UCharErrorBuffer,
                         length*U_SIZEOF_UCHAR);
        }
        cnv->UCharErrorBufferLength=(int8_t)(length+delta);

        cnv->UCharErrorBuffer[0]=buffer[i++];
        if(delta>1) {
            cnv->UCharErrorBuffer[1]=buffer[i];
        }
    }

    *source=args.source;
    return c;
}
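
To relate the cases above to a caller, here is a minimal sketch, not the
patch itself. It assumes a hypothetical stand-in reader next_char() (a
simplified UTF-8 decoder, not ICU's ucnv_getNextUChar(), with no callback
handling) and local status names that only mirror ICU's error codes.
read_next() and its result enum are likewise made up for illustration.
The point is that 0xffff alone is ambiguous, so the caller does its own
s == e check for the end of the string and then looks at the status
rather than at the returned value:

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins; the names only mirror ICU's UErrorCode constants. */
enum status {
    ST_OK = 0,
    ST_ILLEGAL_ARGUMENT,  /* mirrors U_ILLEGAL_ARGUMENT_ERROR */
    ST_INDEX_OUTOFBOUNDS, /* mirrors U_INDEX_OUTOFBOUNDS_ERROR */
    ST_TRUNCATED_CHAR,    /* mirrors U_TRUNCATED_CHAR_FOUND */
    ST_ILLEGAL_CHAR       /* mirrors U_ILLEGAL_CHAR_FOUND */
};

#define SENTINEL 0xffff

/*
 * Hypothetical simplified UTF-8 reader (assumption, not ICU code).
 * It does not reject overlong forms or surrogates. On any failure it
 * returns SENTINEL and reports the reason through *st.
 */
static uint32_t
next_char(const char **src, const char *limit, enum status *st)
{
    const unsigned char *s = (const unsigned char *)*src;
    const unsigned char *e = (const unsigned char *)limit;
    uint32_t c;
    int extra;

    if (e < s) {                /* like case 3: end < start */
        *st = ST_ILLEGAL_ARGUMENT;
        return SENTINEL;
    }
    if (s == e) {               /* no input at all */
        *st = ST_INDEX_OUTOFBOUNDS;
        return SENTINEL;
    }
    c = *s++;
    if (c < 0x80) {
        extra = 0;
    } else if ((c & 0xe0) == 0xc0) {
        extra = 1; c &= 0x1f;
    } else if ((c & 0xf0) == 0xe0) {
        extra = 2; c &= 0x0f;
    } else if ((c & 0xf8) == 0xf0) {
        extra = 3; c &= 0x07;
    } else {
        *src = (const char *)s;
        *st = ST_ILLEGAL_CHAR;
        return SENTINEL;
    }
    while (extra-- > 0) {
        if (s == e) {           /* symbol cut off by the buffer end */
            *src = (const char *)s;
            *st = ST_TRUNCATED_CHAR;
            return SENTINEL;
        }
        if ((*s & 0xc0) != 0x80) {
            *src = (const char *)s;
            *st = ST_ILLEGAL_CHAR;
            return SENTINEL;
        }
        c = (c << 6) | (uint32_t)(*s++ & 0x3f);
    }
    *src = (const char *)s;
    *st = ST_OK;
    return c;
}

enum read_result { READ_CHAR, READ_END, READ_INVALID, READ_ERROR };

/* Classify one read by status, never by comparing the value to 0xffff. */
static enum read_result
read_next(const char **s, const char *e, uint32_t *out)
{
    enum status st = ST_OK;
    uint32_t c;

    if (*s == e)                /* our own end-of-string check */
        return READ_END;
    c = next_char(s, e, &st);
    switch (st) {
    case ST_OK:
        *out = c;               /* may legitimately be 0 or 0xffff */
        return READ_CHAR;
    case ST_TRUNCATED_CHAR:
    case ST_ILLEGAL_CHAR:
        return READ_INVALID;    /* treat as an invalid symbol */
    default:
        return READ_ERROR;      /* argument problems are real errors */
    }
}
```

With this shape '\0' and a validly encoded U+FFFF come back as ordinary
characters, a truncated sequence is reported as an invalid symbol, and
end < start surfaces as an error instead of being taken for the end of
the string.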

--
WBR, Nikita Tatunov.
n.tatunov@tarantool.org



