On 17.08.2018 14:17, Alexander
Turenko wrote:
0xffff is the result of 'end of a string' check
as well as internal buffer
overflow error. I have the relevant code pasted in the
first review of
the patch (July, 18).
// source/common/ucnv.c::ucnv_getNextUChar
1860 s=*source;
1861 if(sourceLimit<s) {
1862 *err=U_ILLEGAL_ARGUMENT_ERROR;
1863 return 0xffff;
1864 }
We should not handle the buffer overflow case as an
invalid symbol. Of
course we should not handle it as the 'end of the string'
situation.
Ideally we should perform pointer myself and raise an
error in case of
0xffff. I had thought that a buffer overflow error is
unlikely to meet,
but you are right: we should differentiate these
situations.
In one of the previous version of a patch we perform this
check like so:
#define Utf8Read(s, e) (((s) < (e)) ?\
ucnv_getNextUChar(pUtf8conv,
&s, e, &status) : 0)
Don't sure why it was changed. Maybe it is try to
correctly handle '\0'
symbol (it is valid unicode character)?
The define you have pasted can
return 0xffff.
The reasons to change it back are
described in the previous patchset.
In short:
1. It is equivalent to
a. check s < e in a while loop
b. read next character inside of
where loop body.
2. In some usages of the code this
check (s<e) was redundant (it was performed a couple
lines above)
3. There is no reason to rewrite the
old version of this function. (So, we decided to use old
version of the function)
So I see two ways to proceed:
1. Lean on icu's check and ignore possibility of the
buffer overflow.
2. Use our own check and possibly meet '\0' problems.
3. Check for U_ILLEGAL_ARGUMENT_ERROR to treat as end of a
string, raise
the error for other 0xffff.
Alex, what do you suggests here?
As I understand, by now the 0xffff
is used ONLY to handle the case of unexpectedly ended
symbol.
E.g. some symbol consists of 2
characters, but the length of the input buffer is 1.
In my opinion this is the same as an
invalid symbol.
I guess that internal buffer
overflow cannot occur in the `ucnv_getNextChar` function.
I suppose that it is Nikitas duty to
investigate this problem and explain it to us all. I just
have noticed a strange usage.
Hello, please consider my comments.
There are some cases when 0xffff can occur, but:
1) Cannot trigger in
our context.
2)
Cannot
trigger in our context.
3)
Only
triggers if end < start. (Cannot happen in
sql_utf8_pattern_compare, i guess)
4)
Only
triggers if string length > (size_t) 0x7ffffffff (can it
actually happen? I don’t think so).
5)
Occurs
when trying to access to not unindexed data.
6)
Cannot occur in our context.
7) Cannot occur in
our context.
I do not understand what are those numbers related to. Please,
describe it.
0xfffd
only means that symbol cannot be treated as a unicode symbol.
Shall I change it somehow then?
I have a look at icu code and It seems
like 0xffff is an error, and it is more similar to
invalid symbol that to "end of string". Check
it, and fix the code, so that it is treated as
an error.
For example it is not handled in the main
pattern loop:
+ while
(pattern < pattern_end) {
c
= Utf8Read(pattern, pattern_end);
+ if
(c == SQL_INVALID_UTF8_SYMBOL)
+ return
SQL_INVALID_PATTERN;
It seems like the 0xffff should be checked
there too.
No, it should not. This way it will only cause a bug
when, for example ’select “” like “”’
will be treated as an error.
I do not understand.