Tarantool development patches archive
From: "n.pettik" <korablev@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Ivan Koptelov <ivan.koptelov@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Mon, 25 Feb 2019 18:10:27 +0300
Message-ID: <D8CAFB0C-E9A3-40B6-95AC-6751E9892D87@tarantool.org>
In-Reply-To: <DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org>




> On 25 Feb 2019, at 14:09, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
>>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>>> 
>>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>>> 
>>>>> Thanks to Alexander, I fixed my patch to use a function
>>>>> from icu to count the length of the string.
>>>>> 
>>>>> Changes:
>> 
>> Travis has failed. Please make sure it passes before sending the patch.
>> It doesn’t fail on my local (Mac) machine, so I guess this failure appears
>> only on Linux systems.
> The problem is with the badutf test (LENGTH tests).
> I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
> but with no success. It seems that different versions of the icu4c lib
> provide different behavior of U8_FWD_1_UNSAFE.
> I propose to just inline these two lines (which we need) into
> some util function. The logic of these lines seems to be quite simple
> and obvious (after you read about UTF-8 on Wikipedia), so I see no
> problem.
> 
> #define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
>   (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))
> 
> #define U8_FWD_1_UNSAFE(s, i) { \
>   (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
> }

That’s what I was talking about. But using macros with the same
names as in the ICU library doesn’t look like a good pattern. Yes, you
can use define guards like:

#ifdef U8_COUNT_TRAIL_BYTES_UNSAFE
#undef U8_COUNT_TRAIL_BYTES_UNSAFE
#endif
#define U8_COUNT_TRAIL_BYTES_UNSAFE

But I’d rather just give it another name.
Hence, taking into account the comment below,
we are going to replace SQL_SKIP_UTF8() with an
implementation borrowed from the ICU library.
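
To illustrate the renaming idea, here is a minimal sketch. The names
SQL_UTF8_COUNT_TRAIL_BYTES and SQL_UTF8_FWD_1 are hypothetical (they are
not taken from the patch); the logic is the same as in the two ICU macros
quoted above, so it still assumes well-formed UTF-8.

```c
#include <stdint.h>

/*
 * Hypothetical names, chosen only so that they do not clash with
 * the macros from ICU's headers. The logic is copied from the
 * quoted U8_COUNT_TRAIL_BYTES_UNSAFE / U8_FWD_1_UNSAFE.
 */

/* Number of trail bytes implied by a lead byte: 0, 1, 2 or 3. */
#define SQL_UTF8_COUNT_TRAIL_BYTES(lead_byte) \
	(((uint8_t)(lead_byte) >= 0xc2) + \
	 ((uint8_t)(lead_byte) >= 0xe0) + \
	 ((uint8_t)(lead_byte) >= 0xf0))

/*
 * Advance index i past one symbol of string s. No bounds check
 * and no validation of the trail bytes is done here.
 */
#define SQL_UTF8_FWD_1(s, i) do { \
	(i) += 1 + SQL_UTF8_COUNT_TRAIL_BYTES((s)[i]); \
} while (0)
```

With distinct names no #undef guard is needed, and the behaviour no
longer depends on which icu4c version is installed.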

>>>> Furthermore, the description says that it “assumes well-formed UTF-8”,
>>>> which in our case is not true. So who knows what may happen if we pass
>>>> a malformed byte sequence. Not to mention that the behaviour of
>>>> this function on invalid inputs may change later.
>>> 
>>> In its current implementation, U8_FWD_1_UNSAFE satisfies our needs safely. The returned
>>> length in symbols would never exceed byte_len.
>>> 
>>> static int
>>> utf8_char_count(const unsigned char *str, int byte_len)
>>> {
>>> 	int symbol_count = 0;
>>> 	for (int i = 0; i < byte_len;) {
>>> 		U8_FWD_1_UNSAFE(str, i);
>>> 		symbol_count++;
>>> 	}
>>> 	return symbol_count;
>>> }
>>> 
>>> I agree that it is a bad idea to rely on lib behaviour which may
>>> change later. So maybe I should just inline these one-line macros?
>>> Or use my own implementation, since it’s more efficient (but less beautiful).
>> 
>> Never mind, let's keep it as is.
>> What really worries me is that SQL_SKIP_UTF8 is used in other places
>> instead. It handles only two-byte UTF-8 symbols, whereas
>> U8_FWD_1_UNSAFE() also accounts for three- and four-byte symbols.
>> Can we use the same pattern everywhere?
> Yes, I think we can.

OK, then I’ll be waiting for the updates.
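
For reference, a rough sketch of what a single shared pattern could look
like. The function names sql_utf8_trail_bytes() and sql_utf8_char_count()
are hypothetical; the counting loop follows the quoted utf8_char_count(),
but the stepping logic lives in one inline helper instead of a library
macro, so every call site would behave the same way.

```c
#include <stdint.h>

/* Hypothetical helper: trail bytes implied by a lead byte (0..3). */
static inline int
sql_utf8_trail_bytes(uint8_t lead)
{
	return (lead >= 0xc2) + (lead >= 0xe0) + (lead >= 0xf0);
}

/*
 * Count symbols in the first byte_len bytes of str. Only bytes at
 * indexes below byte_len are read, so a truncated sequence at the
 * end is counted as one symbol and nothing past the buffer is
 * touched. A '\0' byte is counted like any other one-byte symbol.
 */
static int
sql_utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	int i = 0;
	while (i < byte_len) {
		i += 1 + sql_utf8_trail_bytes(str[i]);
		symbol_count++;
	}
	return symbol_count;
}
```

Since the length is passed explicitly, a string such as 'a', '\0', 'b'
is counted as three symbols rather than being cut at the zero byte.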




Thread overview: 22+ messages
2019-01-29  9:56 [tarantool-patches] " Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34   ` Ivan Koptelov
2019-02-05 13:50     ` n.pettik
2019-02-07 15:14       ` i.koptelov
2019-02-11 13:15         ` n.pettik
2019-02-13 15:46           ` i.koptelov
2019-02-14 12:57             ` n.pettik
2019-02-20 13:54               ` i.koptelov
2019-02-20 15:47                 ` i.koptelov
2019-02-20 16:04                   ` n.pettik
2019-02-20 18:08                     ` Vladislav Shpilevoy
2019-02-20 19:24                     ` i.koptelov
2019-02-22 12:59                       ` n.pettik
2019-02-25 11:09                         ` i.koptelov
2019-02-25 15:10                           ` n.pettik [this message]
2019-02-26 13:33                             ` i.koptelov
2019-02-26 17:50                               ` n.pettik
2019-02-26 18:44                                 ` i.koptelov
2019-02-26 20:16                                   ` Vladislav Shpilevoy
2019-03-04 11:59                                     ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin
