[tarantool-patches] Re: [PATCH 1/2] sql: LIKE & GLOB pattern comparison issue
Nikita Tatunov
n.tatunov at tarantool.org
Tue Sep 11 16:31:18 MSK 2018
> On 11 Sep 2018, at 13:06, Alex Khatskevich <avkhatskevich at tarantool.org> wrote:
>
>
>
> On 11.09.2018 09:06, Nikita Tatunov wrote:
>>
>>
>>> On 11 Sep 2018, at 01:20, Alex Khatskevich <avkhatskevich at tarantool.org> wrote:
>>>
>>>>
>>>>
>>>>> On 17 Aug 2018, at 14:42, Alex Khatskevich <avkhatskevich at tarantool.org> wrote:
>>>>>
>>>>>
>>>>> On 17.08.2018 14:17, Alexander Turenko wrote:
>>>>>> 0xffff is the result of 'end of a string' check as well as internal buffer
>>>>>> overflow error. I have the relevant code pasted in the first review of
>>>>>> the patch (July, 18).
>>>>>>
>>>>>> // source/common/ucnv.c::ucnv_getNextUChar
>>>>>> s=*source;
>>>>>> if(sourceLimit<s) {
>>>>>>     *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>>>>     return 0xffff;
>>>>>> }
>>>>>>
>>>>>> We should not handle the buffer overflow case as an invalid symbol. Of
>>>>>> course we should not handle it as the 'end of the string' situation.
>>>>>> Ideally we should perform the pointer check ourselves and raise an error in case of
>>>>>> 0xffff. I had thought that a buffer overflow error is unlikely to occur,
>>>>>> but you are right: we should differentiate these situations.
>>>>>>
>>>>>> In one of the previous versions of the patch we performed this check like so:
>>>>>>
>>>>>> #define Utf8Read(s, e) (((s) < (e)) ?\
>>>>>> ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)
>>>>>>
>>>>>> Not sure why it was changed. Maybe it was an attempt to correctly handle the '\0'
>>>>>> symbol (it is a valid Unicode character)?
>>>>> The define you have pasted can return 0xffff.
>>>>> The reasons to change it back are described in the previous patchset.
>>>>> In short:
>>>>> 1. It is equivalent to
>>>>> a. check s < e in a while loop
>>>>> b. read the next character inside the while loop body.
>>>>> 2. In some usages of the code this check (s<e) was redundant (it was performed a couple lines above)
>>>>> 3. There is no reason to rewrite the old version of this function. (So, we decided to use old version of the function)
>>>>>> So I see three ways to proceed:
>>>>>>
>>>>>> 1. Lean on icu's check and ignore possibility of the buffer overflow.
>>>>>> 2. Use our own check and possibly meet '\0' problems.
>>>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, raise
>>>>>> the error for other 0xffff.
>>>>>>
>>>>>> Alex, what do you suggest here?
>>>>> As I understand, by now the 0xffff is used ONLY to handle the case of unexpectedly ended symbol.
>>>>> E.g. some symbol consists of 2 characters, but the length of the input buffer is 1.
>>>>> In my opinion this is the same as an invalid symbol.
>>>>>
>>>>> I guess that an internal buffer overflow cannot occur in the `ucnv_getNextUChar` function.
>>>>>
>>>>> I suppose that it is Nikita's duty to investigate this problem and explain it to us all. I have just noticed a strange usage.
>>>>
>>>>
>>>> Hello, please consider my comments.
>>>>
>>>> There are some cases when 0xffff can occur, but:
>>>> 1) Cannot trigger in our context.
>>>> 2) Cannot trigger in our context.
>>>> 3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, I guess)
>>>> 4) Only triggers if string length > (size_t) 0x7fffffff (can it actually happen? I don’t think so).
>>>> 5) Occurs when trying to access data out of bounds (U_INDEX_OUTOFBOUNDS_ERROR).
>>>> 6) Cannot occur in our context.
>>>> 7) Cannot occur in our context.
>>> I do not understand what those numbers relate to. Please describe them.
>>
>> They are related to possible cases returning 0xffff from icu source code (function ucnv_getNextUChar()).
> Can you just copy it here, so that anyone interested in that conversation can
> analyze it without looking for source files?
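Ok then. Here is the function; the numbered cases above follow, in order, the
places where it returns or assigns 0xffff: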
U_CAPI UChar32 U_EXPORT2
ucnv_getNextUChar(UConverter *cnv,
                  const char **source, const char *sourceLimit,
                  UErrorCode *err) {
    UConverterToUnicodeArgs args;
    UChar buffer[U16_MAX_LENGTH];
    const char *s;
    UChar32 c;
    int32_t i, length;

    /* check parameters */
    if(err==NULL || U_FAILURE(*err)) {
        return 0xffff;
    }

    if(cnv==NULL || source==NULL) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    s=*source;
    if(sourceLimit<s) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    /*
     * Make sure that the buffer sizes do not exceed the number range for
     * int32_t because some functions use the size (in units or bytes)
     * rather than comparing pointers, and because offsets are int32_t values.
     *
     * size_t is guaranteed to be unsigned and large enough for the job.
     *
     * Return with an error instead of adjusting the limits because we would
     * not be able to maintain the semantics that either the source must be
     * consumed or the target filled (unless an error occurs).
     * An adjustment would be sourceLimit=t+0x7fffffff; for example.
     */
    if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    c=U_SENTINEL;

    /* flush the target overflow buffer */
    if(cnv->UCharErrorBufferLength>0) {
        UChar *overflow;

        overflow=cnv->UCharErrorBuffer;
        i=0;
        length=cnv->UCharErrorBufferLength;
        U16_NEXT(overflow, i, length, c);

        /* move the remaining overflow contents up to the beginning */
        if((cnv->UCharErrorBufferLength=(int8_t)(length-i))>0) {
            uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+i,
                         cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
        }

        if(!U16_IS_LEAD(c) || i<length) {
            return c;
        }
        /*
         * Continue if the overflow buffer contained only a lead surrogate,
         * in case the converter outputs single surrogates from complete
         * input sequences.
         */
    }

    /*
     * flush==TRUE is implied for ucnv_getNextUChar()
     *
     * do not simply return even if s==sourceLimit because the converter may
     * not have seen flush==TRUE before
     */

    /* prepare the converter arguments */
    args.converter=cnv;
    args.flush=TRUE;
    args.offsets=NULL;
    args.source=s;
    args.sourceLimit=sourceLimit;
    args.target=buffer;
    args.targetLimit=buffer+1;
    args.size=sizeof(args);

    if(c<0) {
        /*
         * call the native getNextUChar() implementation if we are
         * at a character boundary (toULength==0)
         *
         * unlike with _toUnicode(), getNextUChar() implementations must set
         * U_TRUNCATED_CHAR_FOUND for truncated input,
         * in addition to setting toULength/toUBytes[]
         */
        if(cnv->toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
            c=cnv->sharedData->impl->getNextUChar(&args, err);
            *source=s=args.source;
            if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
                /* reset the converter without calling the callback function */
                _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
                return 0xffff; /* no output */
            } else if(U_SUCCESS(*err) && c>=0) {
                return c;
            /*
             * else fall through to use _toUnicode() because
             *   UCNV_GET_NEXT_UCHAR_USE_TO_U: the native function did not want to handle it after all
             *   U_FAILURE: call _toUnicode() for callback handling (do not output c)
             */
            }
        }

        /* convert to one UChar in buffer[0], or handle getNextUChar() errors */
        _toUnicodeWithCallback(&args, err);

        if(*err==U_BUFFER_OVERFLOW_ERROR) {
            *err=U_ZERO_ERROR;
        }

        i=0;
        length=(int32_t)(args.target-buffer);
    } else {
        /* write the lead surrogate from the overflow buffer */
        buffer[0]=(UChar)c;
        args.target=buffer+1;
        i=0;
        length=1;
    }

    /* buffer contents starts at i and ends before length */

    if(U_FAILURE(*err)) {
        c=0xffff; /* no output */
    } else if(length==0) {
        /* no input or only state changes */
        *err=U_INDEX_OUTOFBOUNDS_ERROR;
        /* no need to reset explicitly because _toUnicodeWithCallback() did it */
        c=0xffff; /* no output */
    } else {
        c=buffer[0];
        i=1;
        if(!U16_IS_LEAD(c)) {
            /* consume c=buffer[0], done */
        } else {
            /* got a lead surrogate, see if a trail surrogate follows */
            UChar c2;

            if(cnv->UCharErrorBufferLength>0) {
                /* got overflow output from the conversion */
                if(U16_IS_TRAIL(c2=cnv->UCharErrorBuffer[0])) {
                    /* got a trail surrogate, too */
                    c=U16_GET_SUPPLEMENTARY(c, c2);

                    /* move the remaining overflow contents up to the beginning */
                    if((--cnv->UCharErrorBufferLength)>0) {
                        uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+1,
                                     cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
                    }
                } else {
                    /* c is an unpaired lead surrogate, just return it */
                }
            } else if(args.source<sourceLimit) {
                /* convert once more, to buffer[1] */
                args.targetLimit=buffer+2;
                _toUnicodeWithCallback(&args, err);
                if(*err==U_BUFFER_OVERFLOW_ERROR) {
                    *err=U_ZERO_ERROR;
                }

                length=(int32_t)(args.target-buffer);
                if(U_SUCCESS(*err) && length==2 && U16_IS_TRAIL(c2=buffer[1])) {
                    /* got a trail surrogate, too */
                    c=U16_GET_SUPPLEMENTARY(c, c2);
                    i=2;
                }
            }
        }
    }

    /*
     * move leftover output from buffer[i..length[
     * into the beginning of the overflow buffer
     */
    if(i<length) {
        /* move further overflow back */
        int32_t delta=length-i;
        if((length=cnv->UCharErrorBufferLength)>0) {
            uprv_memmove(cnv->UCharErrorBuffer+delta, cnv->UCharErrorBuffer,
                         length*U_SIZEOF_UCHAR);
        }
        cnv->UCharErrorBufferLength=(int8_t)(length+delta);

        cnv->UCharErrorBuffer[0]=buffer[i++];
        if(delta>1) {
            cnv->UCharErrorBuffer[1]=buffer[i];
        }
    }

    *source=args.source;
    return c;
}
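
To illustrate why a caller should look at the status and not only at the
returned value, here is a minimal self-contained sketch. It does not use ICU:
rd_next() and the rd_status values are hypothetical stand-ins that only imitate
the "return 0xffff and set a status" contract of the function above, on top of a
simplified UTF-8 decoder (no overlong or surrogate checks). Treat it as a sketch
under those assumptions, not as the way the patch does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; they only loosely mirror the ICU error
 * codes discussed in this thread and are not the real constants. */
enum rd_status {
    RD_OK,        /* a code point was decoded */
    RD_END,       /* s == e: clean end of the string */
    RD_BAD_ARGS,  /* NULL arguments or e < s */
    RD_TRUNCATED, /* multi-byte sequence cut off by e */
    RD_INVALID    /* malformed byte sequence */
};

#define RD_NO_OUTPUT 0xffff

/* Stand-in decoder: reads one UTF-8 code point from [*s, e).
 * Every failure returns RD_NO_OUTPUT; only *st tells the cases apart. */
static uint32_t
rd_next(const char **s, const char *e, enum rd_status *st)
{
    assert(st != NULL);
    if (s == NULL || *s == NULL || e == NULL || e < *s) {
        *st = RD_BAD_ARGS;
        return RD_NO_OUTPUT;
    }
    if (*s == e) {
        *st = RD_END;
        return RD_NO_OUTPUT;
    }
    const unsigned char *p = (const unsigned char *)*s;
    uint32_t c = p[0];
    int extra; /* number of continuation bytes expected */
    if (c < 0x80) {
        extra = 0;
    } else if ((c & 0xe0) == 0xc0) {
        extra = 1; c &= 0x1f;
    } else if ((c & 0xf0) == 0xe0) {
        extra = 2; c &= 0x0f;
    } else if ((c & 0xf8) == 0xf0) {
        extra = 3; c &= 0x07;
    } else {
        *s += 1; /* skip the bad lead byte */
        *st = RD_INVALID;
        return RD_NO_OUTPUT;
    }
    if ((size_t)(e - *s) < (size_t)extra + 1) {
        *s = e; /* the rest of the input is an incomplete character */
        *st = RD_TRUNCATED;
        return RD_NO_OUTPUT;
    }
    for (int i = 1; i <= extra; i++) {
        if ((p[i] & 0xc0) != 0x80) {
            *s += i; /* stop at the byte that is not a continuation */
            *st = RD_INVALID;
            return RD_NO_OUTPUT;
        }
        c = (c << 6) | (p[i] & 0x3f);
    }
    *s += extra + 1;
    *st = RD_OK;
    return c;
}
```

With this shape a comparison loop can treat RD_END as the normal stop
condition and report the other failures as errors, while the bytes EF BF BF
still decode to the value 0xffff with RD_OK in this stand-in. Whether the real
converter does the same for U+FFFF is not checked here.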
--
WBR, Nikita Tatunov.
n.tatunov at tarantool.org