From: "i.koptelov" <ivan.koptelov@tarantool.org> To: tarantool-patches@freelists.org Cc: "n.pettik" <korablev@tarantool.org> Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Mon, 25 Feb 2019 14:09:05 +0300 [thread overview] Message-ID: <DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org> (raw) In-Reply-To: <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org> > On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote: > > > >> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote: >> >>>> >>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote: >>>> >>>> Thanks to Alexander, I fixed my patch to use a function >>>> from icu to count the length of the string. >>>> >>>> Changes: > > Travis has failed. Please, make sure it is OK before sending the patch. > It doesn’t fail on my local (Mac) machine, so I guess this fail appears > only on Linux system. The problem is with badutf test (LENGTH tests). I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu), but with no success. It seems like that different versions of icu4c lib provide different behavior of U8_FWD_1_UNSAFE. I propose to just inline these two lines (which we need) into some util function. Logic of these lines seems to be quite simple and obvious (after you read about utf8 on wikipedia), so I see no problem. #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0)) #define U8_FWD_1_UNSAFE(s, i) { \ (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \ } > >>> Furthermore, description says that it “assumes well-formed UTF-8”, >>> which in our case is not true. So who knows what may happen if we pass >>> malformed byte sequence. I am not even saying that behaviour of >>> this function on invalid inputs may change later. >> >> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned >> symbol length would never exceed byte_len. >> >> static int >> utf8_char_count(const unsigned char *str, int byte_len) >> { >> int symbol_count = 0; >> for (int i = 0; i < byte_len;) { >> U8_FWD_1_UNSAFE(str, i); >> symbol_count++; >> } >> return symbol_count; >> } >> >> I agree that it is a bad idea to relay on lib behaviour which may >> change lately. So maybe I would just inline these one line macros? >> Or use my own implementation, since it’s more efficient (but less beautiful) > > Nevermind, let's keep it as is. > I really worry only about the fact that in other places SQL_SKIP_UTF8 > is used instead. It handles only two-bytes utf8 symbols, meanwhile > U8_FWD_1_UNSAFE() accounts three and four bytes length symbols. > Can we use everywhere the same pattern? Yes, I think, we can.
next prev parent reply other threads:[~2019-02-25 11:09 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-01-29 9:56 [tarantool-patches] " Ivan Koptelov 2019-01-29 16:35 ` [tarantool-patches] " n.pettik 2019-02-04 12:34 ` Ivan Koptelov 2019-02-05 13:50 ` n.pettik 2019-02-07 15:14 ` i.koptelov 2019-02-11 13:15 ` n.pettik 2019-02-13 15:46 ` i.koptelov 2019-02-14 12:57 ` n.pettik 2019-02-20 13:54 ` i.koptelov 2019-02-20 15:47 ` i.koptelov 2019-02-20 16:04 ` n.pettik 2019-02-20 18:08 ` Vladislav Shpilevoy 2019-02-20 19:24 ` i.koptelov 2019-02-22 12:59 ` n.pettik 2019-02-25 11:09 ` i.koptelov [this message] 2019-02-25 15:10 ` n.pettik 2019-02-26 13:33 ` i.koptelov 2019-02-26 17:50 ` n.pettik 2019-02-26 18:44 ` i.koptelov 2019-02-26 20:16 ` Vladislav Shpilevoy 2019-03-04 11:59 ` i.koptelov 2019-03-04 15:30 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org \ --to=ivan.koptelov@tarantool.org \ --cc=korablev@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox