> On 11 Sep 2018, at 01:20, Alex Khatskevich wrote: > >> >> >>> On 17 Aug 2018, at 14:42, Alex Khatskevich > wrote: >>> >>> >>> On 17.08.2018 14:17, Alexander Turenko wrote: >>>> 0xffff is the result of 'end of a string' check as well as internal buffer >>>> overflow error. I have the relevant code pasted in the first review of >>>> the patch (July, 18). >>>> >>>> // source/common/ucnv.c::ucnv_getNextUChar >>>> 1860 s=*source; >>>> 1861 if(sourceLimit>>> 1862 *err=U_ILLEGAL_ARGUMENT_ERROR; >>>> 1863 return 0xffff; >>>> 1864 } >>>> >>>> We should not handle the buffer overflow case as an invalid symbol. Of >>>> course we should not handle it as the 'end of the string' situation. >>>> Ideally we should perform pointer myself and raise an error in case of >>>> 0xffff. I had thought that a buffer overflow error is unlikely to meet, >>>> but you are right: we should differentiate these situations. >>>> >>>> In one of the previous version of a patch we perform this check like so: >>>> >>>> #define Utf8Read(s, e) (((s) < (e)) ?\ >>>> ucnv_getNextUChar(pUtf8conv, &s, e, &status) : 0) >>>> >>>> Don't sure why it was changed. Maybe it is try to correctly handle '\0' >>>> symbol (it is valid unicode character)? >>> The define you have pasted can return 0xffff. >>> The reasons to change it back are described in the previous patchset. >>> In short: >>> 1. It is equivalent to >>> a. check s < e in a while loop >>> b. read next character inside of where loop body. >>> 2. In some usages of the code this check (s>> 3. There is no reason to rewrite the old version of this function. (So, we decided to use old version of the function) >>>> So I see two ways to proceed: >>>> >>>> 1. Lean on icu's check and ignore possibility of the buffer overflow. >>>> 2. Use our own check and possibly meet '\0' problems. >>>> 3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a string, raise >>>> the error for other 0xffff. >>>> >>>> Alex, what do you suggests here? >>> As I understand, by now the 0xffff is used ONLY to handle the case of unexpectedly ended symbol. >>> E.g. some symbol consists of 2 characters, but the length of the input buffer is 1. >>> In my opinion this is the same as an invalid symbol. >>> >>> I guess that internal buffer overflow cannot occur in the `ucnv_getNextChar` function. >>> >>> I suppose that it is Nikitas duty to investigate this problem and explain it to us all. I just have noticed a strange usage. >> >> >> Hello, please consider my comments. >> >> There are some cases when 0xffff can occur, but: >> 1) Cannot trigger in our context. >> 2) Cannot trigger in our context. >> 3) Only triggers if end < start. (Cannot happen in sql_utf8_pattern_compare, i guess) >> 4) Only triggers if string length > (size_t) 0x7ffffffff (can it actually happen? I don’t think so). >> 5) Occurs when trying to access to not unindexed data. >> 6) Cannot occur in our context. >> 7) Cannot occur in our context. > I do not understand what are those numbers related to. Please, describe it. They are related to possible cases returning 0xffff from icu source code (function ucnv_getNextUChar()). >> >> 0xfffd only means that symbol cannot be treated as a unicode symbol. >> >> Shall I change it somehow then? >> >> >>> On 17 Aug 2018, at 12:23, Alex Khatskevich > wrote: >>> >>> I have a look at icu code and It seems like 0xffff is an error, and it is more similar to >>> invalid symbol that to "end of string". Check it, and fix the code, so that it is treated as >>> an error. >>> For example it is not handled in the main pattern loop: >>> >>> + while (pattern < pattern_end) { >>> c = Utf8Read(pattern, pattern_end); >>> + if (c == SQL_INVALID_UTF8_SYMBOL) >>> + return SQL_INVALID_PATTERN; >>> >>> It seems like the 0xffff should be checked there too. >> >> No, it should not. This way it will only cause a bug when, for example ’select “” like “”’ >> will be treated as an error. > I do not understand. > ’select “” like “”’ should not even trap inside of the while loop > (because `pattern < pattern_end` is false). Ah, you’re right, sorry, then it just doesn’t matter, since pattern < pattern_end is equal to 0xffff according to the comment above. -- WBR, Nikita Tatunov. n.tatunov@tarantool.org