> On 20 Feb 2019, at 22:24, i.koptelov wrote:
>
>>> On 20 Feb 2019, at 18:47, i.koptelov wrote:
>>>
>>> Thanks to Alexander, I fixed my patch to use a function
>>> from icu to count the length of the string.
>>>
>>> Changes:

Travis has failed. Please make sure it is OK before sending the patch.

It doesn’t fail on my local (Mac) machine, so I guess this failure appears only on Linux systems.

>> Furthermore, the description says that it “assumes well-formed UTF-8”,
>> which in our case is not true. So who knows what may happen if we pass
>> a malformed byte sequence. Not to mention that the behaviour of
>> this function on invalid inputs may change later.
>
> In its current implementation U8_FWD_1_UNSAFE satisfies our needs safely. The returned
> symbol count would never exceed byte_len.
>
> static int
> utf8_char_count(const unsigned char *str, int byte_len)
> {
> 	int symbol_count = 0;
> 	for (int i = 0; i < byte_len;) {
> 		U8_FWD_1_UNSAFE(str, i);
> 		symbol_count++;
> 	}
> 	return symbol_count;
> }
>
> I agree that it is a bad idea to rely on library behaviour which may
> change later. So maybe I should just inline these one-line macros?
> Or use my own implementation, since it’s more efficient (but less beautiful)

Never mind, let's keep it as is. My only real concern is that
SQL_SKIP_UTF8 is used in other places instead. It handles only two-byte
UTF-8 symbols, whereas U8_FWD_1_UNSAFE() also accounts for three- and
four-byte symbols. Can we use the same pattern everywhere?
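
For reference, here is a minimal sketch of what a library-independent counter could look like if we ever decide not to depend on the macro's behaviour for malformed input. It is neither the code from the patch nor ICU's implementation: utf8_lead_len() and utf8_char_count_sketch() are hypothetical names, and treating a stray continuation byte or an invalid lead byte as a one-byte symbol is an assumption made only for illustration.

```c
/*
 * Hypothetical helper (not part of the patch or of ICU): returns the
 * sequence length in bytes implied by a UTF-8 lead byte. Continuation
 * bytes and invalid lead bytes are treated as one-byte symbols so the
 * caller always makes progress.
 */
static int
utf8_lead_len(unsigned char c)
{
	if (c < 0x80)
		return 1;	/* ASCII */
	if ((c & 0xE0) == 0xC0)
		return 2;	/* 110xxxxx */
	if ((c & 0xF0) == 0xE0)
		return 3;	/* 1110xxxx */
	if ((c & 0xF8) == 0xF0)
		return 4;	/* 11110xxx */
	return 1;		/* stray continuation or invalid byte */
}

/*
 * Count symbols in str[0 .. byte_len). Only lead bytes are read, and
 * only at indexes below byte_len. The index advances by at least one
 * byte per iteration, so the result never exceeds byte_len, even for
 * malformed or truncated input. Continuation bytes are not validated.
 */
static int
utf8_char_count_sketch(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	int i = 0;
	while (i < byte_len) {
		i += utf8_lead_len(str[i]);
		symbol_count++;
	}
	return symbol_count;
}
```

A sequence that is cut off at the end of the buffer is counted as one symbol, and its missing bytes are never dereferenced. Whether that is the desired result for malformed strings is a separate question.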