> On 11 Sep 2018, at 13:06, Alex Khatskevich wrote:
>
> On 11.09.2018 09:06, Nikita Tatunov wrote:
>>
>>> On 11 Sep 2018, at 01:20, Alex Khatskevich wrote:
>>>
>>>>> On 17 Aug 2018, at 14:42, Alex Khatskevich wrote:
>>>>>
>>>>> On 17.08.2018 14:17, Alexander Turenko wrote:
>>>>>> 0xffff is the result of 'end of a string' check as well as internal buffer
>>>>>> overflow error. I have the relevant code pasted in the first review of
>>>>>> the patch (July, 18).
>>>>>>
>>>>>> // source/common/ucnv.c::ucnv_getNextUChar
>>>>>> 1860 s=*source;
>>>>>> 1861 if(sourceLimit<s) {
>>>>>> 1862 *err=U_ILLEGAL_ARGUMENT_ERROR;
>>>>>> 1863 return 0xffff;
>>>>>> 1864 }
>>>>>>
>>>>>> We should not handle the buffer overflow case as an invalid symbol. Of
>>>>>> course we should not handle it as the 'end of the string' situation.
>>>>>> Ideally we should perform the pointer check ourselves and raise an error in
>>>>>> case of 0xffff. I had thought that a buffer overflow error is unlikely to
>>>>>> occur, but you are right: we should differentiate these situations.
>>>>>>
>>>>>> In one of the previous versions of the patch we performed this check like so:
>>>>>>
>>>>>> #define Utf8Read(s, e) (((s) < (e)) ?\
>>>>>> ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0)
>>>>>>
>>>>>> Not sure why it was changed. Maybe it is an attempt to correctly handle the
>>>>>> '\0' symbol (it is a valid Unicode character)?
>>>>> The define you have pasted can return 0xffff.
>>>>> The reasons to change it back are described in the previous patchset.
>>>>> In short:
>>>>> 1. It is equivalent to
>>>>> a. check s < e in a while loop
>>>>> b. read the next character inside of the while loop body.
>>>>> 2. In some usages of the code this check (s [...]
>>>>> 3. There is no reason to rewrite the old version of this function. (So, we decided to use the old version of the function)
>>>>>> So I see three ways to proceed:
>>>>>>
>>>>>> 1. Lean on icu's check and ignore the possibility of the buffer overflow.
>>>>>> 2. Use our own check and possibly meet '\0' problems.
>>>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, raise
>>>>>> the error for other 0xffff.
>>>>>>
>>>>>> Alex, what do you suggest here?
>>>>> As I understand, by now the 0xffff is used ONLY to handle the case of an unexpectedly ended symbol.
>>>>> E.g. some symbol consists of 2 characters, but the length of the input buffer is 1.
>>>>> In my opinion this is the same as an invalid symbol.
>>>>>
>>>>> I guess that internal buffer overflow cannot occur in the `ucnv_getNextUChar` function.
>>>>>
>>>>> I suppose that it is Nikita's duty to investigate this problem and explain it to us all. I have just noticed a strange usage.
>>>>
>>>>
>>>> Hello, please consider my comments.
>>>>
>>>> There are some cases when 0xffff can occur, but:
>>>> 1) Cannot trigger in our context.
>>>> 2) Cannot trigger in our context.
>>>> 3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, I guess)
>>>> 4) Only triggers if string length > (size_t) 0x7fffffff (can it actually happen? I don't think so).
>>>> 5) Occurs when trying to access non-indexed data.
>>>> 6) Cannot occur in our context.
>>>> 7) Cannot occur in our context.
>>> I do not understand what those numbers are related to. Please, describe it.
>>
>> They are related to possible cases returning 0xffff from icu source code (function ucnv_getNextUChar()).
> Can you just copy it here, so that anyone interested in that conversation can
> analyze it without looking for source files?

Ok then:

U_CAPI UChar32 U_EXPORT2
ucnv_getNextUChar(UConverter *cnv,
                  const char **source, const char *sourceLimit,
                  UErrorCode *err) {
    UConverterToUnicodeArgs args;
    UChar buffer[U16_MAX_LENGTH];
    const char *s;
    UChar32 c;
    int32_t i, length;

    /* check parameters */
    if(err==NULL || U_FAILURE(*err)) {
        return 0xffff;
    }

    if(cnv==NULL || source==NULL) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    s=*source;
    if(sourceLimit<s) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    if(((size_t)(sourceLimit-s)>(size_t)0x7fffffff && sourceLimit>s)) {
        *err=U_ILLEGAL_ARGUMENT_ERROR;
        return 0xffff;
    }

    c=U_SENTINEL;

    /* flush the target overflow buffer */
    if(cnv->UCharErrorBufferLength>0) {
        UChar *overflow;

        overflow=cnv->UCharErrorBuffer;
        i=0;
        length=cnv->UCharErrorBufferLength;
        U16_NEXT(overflow, i, length, c);

        /* move the remaining overflow contents up to the beginning */
        if((cnv->UCharErrorBufferLength=(int8_t)(length-i))>0) {
            uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+i,
                         cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
        }

        if(!U16_IS_LEAD(c) || i [...]
        [...] toULength==0 && cnv->sharedData->impl->getNextUChar!=NULL) {
            c=cnv->sharedData->impl->getNextUChar(&args, err);
            *source=s=args.source;
            if(*err==U_INDEX_OUTOFBOUNDS_ERROR) {
                /* reset the converter without calling the callback function */
                _reset(cnv, UCNV_RESET_TO_UNICODE, FALSE);
                return 0xffff; /* no output */
            } else if(U_SUCCESS(*err) && c>=0) {
                return c;
            /*
             * else fall through to use _toUnicode() because
             *   UCNV_GET_NEXT_UCHAR_USE_TO_U: the native function did not want to handle it after all
             *   U_FAILURE: call _toUnicode() for callback handling (do not output c)
             */
            }
        }

        /* convert to one UChar in buffer[0], or handle getNextUChar() errors */
        _toUnicodeWithCallback(&args, err);

        if(*err==U_BUFFER_OVERFLOW_ERROR) {
            *err=U_ZERO_ERROR;
        }

        i=0;
        length=(int32_t)(args.target-buffer);
    } else {
        /* write the lead surrogate from the overflow buffer */
        buffer[0]=(UChar)c;
        args.target=buffer+1;
        i=0;
        length=1;
    }

    /* buffer contents starts at i and ends before length */

    if(U_FAILURE(*err)) {
        c=0xffff; /* no output */
    } else if(length==0) {
        /* no input or only state changes */
        *err=U_INDEX_OUTOFBOUNDS_ERROR;
        /* no need to reset explicitly because _toUnicodeWithCallback() did it */
        c=0xffff; /* no output */
    } else {
        c=buffer[0];
        i=1;
        if(!U16_IS_LEAD(c)) {
            /* consume c=buffer[0], done */
        } else {
            /* got a lead surrogate, see if a trail surrogate follows */
            UChar c2;

            if(cnv->UCharErrorBufferLength>0) {
                /* got overflow output from the conversion */
                if(U16_IS_TRAIL(c2=cnv->UCharErrorBuffer[0])) {
                    /* got a trail surrogate, too */
                    c=U16_GET_SUPPLEMENTARY(c, c2);

                    /* move the remaining overflow contents up to the beginning */
                    if((--cnv->UCharErrorBufferLength)>0) {
                        uprv_memmove(cnv->UCharErrorBuffer, cnv->UCharErrorBuffer+1,
                                     cnv->UCharErrorBufferLength*U_SIZEOF_UCHAR);
                    }
                } else {
                    /* c is an unpaired lead surrogate, just return it */
                }
            } else if(args.source [...]
        [...] UCharErrorBufferLength)>0) {
            uprv_memmove(cnv->UCharErrorBuffer+delta, cnv->UCharErrorBuffer,
                         length*U_SIZEOF_UCHAR);
        }
        cnv->UCharErrorBufferLength=(int8_t)(length+delta);

        cnv->UCharErrorBuffer[0]=buffer[i++];
        if(delta>1) {
            cnv->UCharErrorBuffer[1]=buffer[i];
        }
    }

    *source=args.source;
    return c;
}

--
WBR, Nikita Tatunov.
n.tatunov@tarantool.org
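
P.S. The core of the discussion is that a bare 0xffff return value is ambiguous, and only the accompanying status says whether it means end of input, a broken range (end < start), or a symbol cut off by the end of the buffer. Below is a minimal, self-contained sketch of that idea. It does not call ICU and is not the patch: sketch_next_char(), sketch_count() and the SKETCH_* codes are hypothetical names made up for illustration, and the decoder is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; illustrative stand-ins, not ICU's UErrorCode. */
enum sketch_status {
    SKETCH_OK = 0,
    SKETCH_END_OF_INPUT, /* s == e: nothing left to read */
    SKETCH_BAD_ARGUMENT, /* e < s: the caller passed a broken range */
    SKETCH_TRUNCATED,    /* a multi-byte symbol is cut off by e */
    SKETCH_INVALID       /* malformed lead or continuation byte */
};

#define SKETCH_SENTINEL 0xffff

/*
 * Decode one UTF-8 code point from [*s, e) and advance *s.
 * On any failure return SKETCH_SENTINEL and store the reason in *st,
 * so the caller never has to guess what the sentinel means.
 */
static uint32_t
sketch_next_char(const char **s, const char *e, enum sketch_status *st)
{
    assert(s != NULL && *s != NULL && st != NULL);
    if (e < *s) {
        *st = SKETCH_BAD_ARGUMENT;
        return SKETCH_SENTINEL;
    }
    if (*s == e) {
        *st = SKETCH_END_OF_INPUT;
        return SKETCH_SENTINEL;
    }
    unsigned char lead = (unsigned char)**s;
    int extra;      /* number of continuation bytes expected */
    uint32_t c;
    if (lead < 0x80) {
        extra = 0; c = lead;
    } else if ((lead & 0xe0) == 0xc0) {
        extra = 1; c = lead & 0x1f;
    } else if ((lead & 0xf0) == 0xe0) {
        extra = 2; c = lead & 0x0f;
    } else if ((lead & 0xf8) == 0xf0) {
        extra = 3; c = lead & 0x07;
    } else {
        (*s)++;
        *st = SKETCH_INVALID;
        return SKETCH_SENTINEL;
    }
    if (e - *s < 1 + extra) {
        /* The symbol needs more bytes than the buffer has left. */
        *s = e;
        *st = SKETCH_TRUNCATED;
        return SKETCH_SENTINEL;
    }
    for (int i = 1; i <= extra; i++) {
        unsigned char b = (unsigned char)(*s)[i];
        if ((b & 0xc0) != 0x80) {
            *s += i;
            *st = SKETCH_INVALID;
            return SKETCH_SENTINEL;
        }
        c = (c << 6) | (b & 0x3f);
    }
    *s += 1 + extra;
    *st = SKETCH_OK;
    return c;
}

/*
 * Count code points in buf[0..len). Stops at the first non-OK status and
 * stores it in *why; SKETCH_END_OF_INPUT means the whole buffer was read.
 */
static size_t
sketch_count(const char *buf, size_t len, enum sketch_status *why)
{
    const char *s = buf;
    const char *e = buf + len;
    size_t n = 0;
    for (;;) {
        (void)sketch_next_char(&s, e, why);
        if (*why != SKETCH_OK)
            return n;
        n++;
    }
}
```

With this shape the caller loops until the status is not SKETCH_OK and then decides: SKETCH_END_OF_INPUT is the normal end of the string, anything else is an error to raise. An embedded '\0' and a genuine U+FFFF in the input both come back with SKETCH_OK, so neither is mistaken for the end of the string.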