On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov@tarantool.org> wrote:
On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:

On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:

Thanks to Alexander, I fixed my patch to use a function
from icu to count the length of the string.

Changes:

Travis has failed. Please, make sure it is OK before sending the patch.
It doesn’t fail on my local (Mac) machine, so I guess this fail appears
only on Linux system.
The problem is with badutf test (LENGTH tests).
I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
but with no success. It seems like that different versions of icu4c lib
provide different behavior of U8_FWD_1_UNSAFE.
I propose to just inline these two lines (which we need) into
some util function. Logic of these lines seems to be quite simple
and obvious (after you read about utf8 on wikipedia), so I see no
problem.

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
(((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
(i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}

That’s I was talking about. But using the macros with the same

name as in utf library doesn’t look like a good pattern. Yep, you

can use define guards like:

#ifdef U8_COUNT_TRAIL_BYTES_UNSAFE

#undef U8_COUNT_TRAIL_BYTES_UNSAFE

#endif

#define U8_COUNT_TRAIL_BYTES_UNSAFE

But I’d rather just give it another name.

Hence, taking into account comment below,

we are going to substitute SQL_SKIP_UTF8() with

implementation borrowed from icu library.

Furthermore, description says that it “assumes well-formed UTF-8”,
which in our case is not true. So who knows what may happen if we pass
malformed byte sequence. I am not even saying that behaviour of
this function on invalid inputs may change later.

In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
symbol length would never exceed byte_len.

static int
utf8_char_count(const unsigned char *str, int byte_len)
{
int symbol_count = 0;
for (int i = 0; i < byte_len;) {
U8_FWD_1_UNSAFE(str, i);
symbol_count++;
}
return symbol_count;
}

I agree that it is a bad idea to relay on lib behaviour which may
change lately. So maybe I would just inline these one line macros?
Or use my own implementation, since it’s more efficient (but less beautiful)

Nevermind, let's keep it as is.
I really worry only about the fact that in other places SQL_SKIP_UTF8
is used instead. It handles only two-bytes utf8 symbols, meanwhile
U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
Can we use everywhere the same pattern?
Yes, I think, we can.