On 11.09.2018 09:06, Nikita Tatunov wrote: > > >> On 11 Sep 2018, at 01:20, Alex Khatskevich >> > wrote: >> >>> >>> >>>> On 17 Aug 2018, at 14:42, Alex Khatskevich >>>> > >>>> wrote: >>>> >>>> >>>> On 17.08.2018 14:17, Alexander Turenko wrote: >>>>> 0xffff is the result of 'end of a string' check as well as >>>>> internal buffer >>>>> overflow error. I have the relevant code pasted in the first review of >>>>> the patch (July, 18). >>>>> >>>>> // source/common/ucnv.c::ucnv_getNextUChar >>>>> 1860     s=*source; >>>>> 1861     if(sourceLimit>>>> 1862         *err=U_ILLEGAL_ARGUMENT_ERROR; >>>>> 1863         return 0xffff; >>>>> 1864     } >>>>> >>>>> We should not handle the buffer overflow case as an invalid symbol. Of >>>>> course we should not handle it as the 'end of the string' situation. >>>>> Ideally we should perform pointer myself and raise an error in case of >>>>> 0xffff. I had thought that a buffer overflow error is unlikely to >>>>> meet, >>>>> but you are right: we should differentiate these situations. >>>>> >>>>> In one of the previous version of a patch we perform this check >>>>> like so: >>>>> >>>>> #define Utf8Read(s, e) (((s) < (e)) ?\ >>>>> ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) >>>>> >>>>> Don't sure why it was changed. Maybe it is try to correctly handle >>>>> '\0' >>>>> symbol (it is valid unicode character)? >>>> The define you have pasted can return 0xffff. >>>> The reasons to change it back are described in the previous patchset. >>>> In short: >>>> 1. It is equivalent to >>>> a. check s < e in a while loop >>>> b. read next character inside of where loop body. >>>> 2. In some usages of the code this check (s>>> was performed a couple lines above) >>>> 3. There is no reason to rewrite the old version of this function. >>>> (So, we decided to use old version of the function) >>>>> So I see two ways to proceed: >>>>> >>>>> 1. Lean on icu's check and ignore possibility of the buffer overflow. >>>>> 2. Use our own check and possibly meet '\0' problems. >>>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, >>>>> raise >>>>>    the error for other 0xffff. >>>>> >>>>> Alex, what do you suggests here? >>>> As I understand, by now the 0xffff is used ONLY to handle the case >>>> of unexpectedly ended symbol. >>>> E.g. some symbol consists of 2 characters, but the length of the >>>> input buffer is 1. >>>> In my opinion this is the same as an invalid symbol. >>>> >>>> I guess that internal buffer overflow cannot occur in the >>>> `ucnv_getNextChar` function. >>>> >>>> I suppose that it is Nikitas duty to investigate this problem and >>>> explain it to us all. I just have noticed a strange usage. >>> >>> Hello, please consider my comments. >>> >>> There are some cases when 0xffff can occur, but: >>> 1) Cannot trigger in our context. >>> 2) Cannot trigger in our context. >>> 3) Only triggers if end < start. (Cannot happen in >>> sql_utf8_pattern_compare, i guess) >>> 4) Only triggers if string length > (size_t) 0x7ffffffff (can it >>> actually happen? I don’t think so). >>> 5) Occurs when trying to access to not unindexed data. >>> 6) Cannot occur in our context. >>> 7) Cannot occur in our context. >> I do not understand what are those numbers related to. Please, >> describe it. > > They are related to possible cases returning 0xffff from icu source > code (function ucnv_getNextUChar()). Can you just copy it here, so that anyone interested in that conversation can analyze it without looking for source files? > >>> >>> 0xfffd only means that symbol cannot be treated as a unicode symbol. >>> >>> Shall I change it somehow then? >>> >>> >>>> On 17 Aug 2018, at 12:23, Alex Khatskevich >>>> > >>>> wrote: >>>> >>>> I have a look at icu code and It seems like 0xffff is an error, and >>>> it is more similar to >>>> invalid symbol that to "end of string". Check it, and fix the code, >>>> so that it is treated as >>>> an error. >>>> For example it is not handled in the main pattern loop: >>>> >>>> +while (pattern < pattern_end) { >>>> c = Utf8Read(pattern, pattern_end); >>>> +if (c == SQL_INVALID_UTF8_SYMBOL) >>>> +return SQL_INVALID_PATTERN; >>>> >>>> It seems like the 0xffff should be checked there too. >>> >>> No, it should not. This way it will only cause a bug when, for >>> example ’select “” like “”’ >>> will be treated as an error. >> I do not understand. >> ’select “” like “”’ should not even trap inside of the while loop >> (because`pattern < pattern_end` is false). > > Ah, you’re right, sorry, then it just doesn’t matter, since pattern < > pattern_end is equal > to 0xffff according to the comment above. > > -- > WBR, Nikita Tatunov. > n.tatunov@tarantool.org >