From: "n.pettik" <korablev@tarantool.org> To: tarantool-patches@freelists.org Cc: Ivan Koptelov <ivan.koptelov@tarantool.org> Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0' Date: Mon, 25 Feb 2019 18:10:27 +0300 [thread overview] Message-ID: <D8CAFB0C-E9A3-40B6-95AC-6751E9892D87@tarantool.org> (raw) In-Reply-To: <DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org> [-- Attachment #1: Type: text/plain, Size: 3112 bytes --] > On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov@tarantool.org> wrote: >> On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote: >>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote: >>>>> >>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote: >>>>> >>>>> Thanks to Alexander, I fixed my patch to use a function >>>>> from icu to count the length of the string. >>>>> >>>>> Changes: >> >> Travis has failed. Please, make sure it is OK before sending the patch. >> It doesn’t fail on my local (Mac) machine, so I guess this fail appears >> only on Linux system. > The problem is with badutf test (LENGTH tests). > I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu), > but with no success. It seems like that different versions of icu4c lib > provide different behavior of U8_FWD_1_UNSAFE. > I propose to just inline these two lines (which we need) into > some util function. Logic of these lines seems to be quite simple > and obvious (after you read about utf8 on wikipedia), so I see no > problem. > > #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \ > (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0)) > > #define U8_FWD_1_UNSAFE(s, i) { \ > (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \ > } That’s I was talking about. But using the macros with the same name as in utf library doesn’t look like a good pattern. Yep, you can use define guards like: #ifdef U8_COUNT_TRAIL_BYTES_UNSAFE #undef U8_COUNT_TRAIL_BYTES_UNSAFE #endif #define U8_COUNT_TRAIL_BYTES_UNSAFE But I’d rather just give it another name. Hence, taking into account comment below, we are going to substitute SQL_SKIP_UTF8() with implementation borrowed from icu library. >>>> Furthermore, description says that it “assumes well-formed UTF-8”, >>>> which in our case is not true. So who knows what may happen if we pass >>>> malformed byte sequence. I am not even saying that behaviour of >>>> this function on invalid inputs may change later. >>> >>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned >>> symbol length would never exceed byte_len. >>> >>> static int >>> utf8_char_count(const unsigned char *str, int byte_len) >>> { >>> int symbol_count = 0; >>> for (int i = 0; i < byte_len;) { >>> U8_FWD_1_UNSAFE(str, i); >>> symbol_count++; >>> } >>> return symbol_count; >>> } >>> >>> I agree that it is a bad idea to relay on lib behaviour which may >>> change lately. So maybe I would just inline these one line macros? >>> Or use my own implementation, since it’s more efficient (but less beautiful) >> >> Nevermind, let's keep it as is. >> I really worry only about the fact that in other places SQL_SKIP_UTF8 >> is used instead. It handles only two-bytes utf8 symbols, meanwhile >> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols. >> Can we use everywhere the same pattern? > Yes, I think, we can. Ok, then will be waiting for updates. [-- Attachment #2: Type: text/html, Size: 16152 bytes --]
next prev parent reply other threads:[~2019-02-25 15:10 UTC|newest] Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top 2019-01-29 9:56 [tarantool-patches] " Ivan Koptelov 2019-01-29 16:35 ` [tarantool-patches] " n.pettik 2019-02-04 12:34 ` Ivan Koptelov 2019-02-05 13:50 ` n.pettik 2019-02-07 15:14 ` i.koptelov 2019-02-11 13:15 ` n.pettik 2019-02-13 15:46 ` i.koptelov 2019-02-14 12:57 ` n.pettik 2019-02-20 13:54 ` i.koptelov 2019-02-20 15:47 ` i.koptelov 2019-02-20 16:04 ` n.pettik 2019-02-20 18:08 ` Vladislav Shpilevoy 2019-02-20 19:24 ` i.koptelov 2019-02-22 12:59 ` n.pettik 2019-02-25 11:09 ` i.koptelov 2019-02-25 15:10 ` n.pettik [this message] 2019-02-26 13:33 ` i.koptelov 2019-02-26 17:50 ` n.pettik 2019-02-26 18:44 ` i.koptelov 2019-02-26 20:16 ` Vladislav Shpilevoy 2019-03-04 11:59 ` i.koptelov 2019-03-04 15:30 ` Kirill Yukhin
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=D8CAFB0C-E9A3-40B6-95AC-6751E9892D87@tarantool.org \ --to=korablev@tarantool.org \ --cc=ivan.koptelov@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox