[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'

Tarantool development patches archive
 help / color / mirror / Atom feed

From: "i.koptelov" <ivan.koptelov@tarantool.org>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Mon, 25 Feb 2019 14:09:05 +0300	[thread overview]
Message-ID: <DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org> (raw)
In-Reply-To: <3377BC01-D943-4D24-A9A2-BA9B9C67EA92@tarantool.org>



> On 22 Feb 2019, at 15:59, n.pettik <korablev@tarantool.org> wrote:
> 
> 
> 
>> On 20 Feb 2019, at 22:24, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> 
>>>> 
>>>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>>>> 
>>>> Thanks to Alexander, I fixed my patch to use a function
>>>> from icu to count the length of the string.
>>>> 
>>>> Changes:
> 
> Travis has failed. Please, make sure it is OK before sending the patch.
> It doesn’t fail on my local (Mac) machine, so I guess this fail appears
> only on Linux system.
The problem is with badutf test (LENGTH tests).
I’ve tried to reproduce the problem on my machine (using Docker with Ubuntu),
but with no success. It seems like that different versions of icu4c lib
provide different behavior of U8_FWD_1_UNSAFE.
I propose to just inline these two lines (which we need) into
some util function. Logic of these lines seems to be quite simple
and obvious (after you read about utf8 on wikipedia), so I see no
problem.

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
   (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
   (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}


> 
>>> Furthermore, description says that it “assumes well-formed UTF-8”,
>>> which in our case is not true. So who knows what may happen if we pass
>>> malformed byte sequence. I am not even saying that behaviour of
>>> this function on invalid inputs may change later.
>> 
>> In it's current implementation U8_FWD_1_UNSAFE satisfy our needs safely. Returned
>> symbol length would never exceed byte_len.
>> 
>> static int
>> utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> 	int symbol_count = 0;
>> 	for (int i = 0; i < byte_len;) {
>> 		U8_FWD_1_UNSAFE(str, i);
>> 		symbol_count++;
>> 	}
>> 	return symbol_count;
>> }
>> 
>> I agree that it is a bad idea to relay on lib behaviour which may
>> change lately. So maybe I would just inline these one line macros?
>> Or use my own implementation, since it’s more efficient (but less beautiful)
> 
> Nevermind, let's keep it as is.
> I really worry only about the fact that in other places SQL_SKIP_UTF8
> is used instead. It handles only two-bytes utf8 symbols, meanwhile
> U8_FWD_1_UNSAFE() accounts three and four bytes length symbols.
> Can we use everywhere the same pattern?
Yes, I think, we can.

next prev parent reply	other threads:[~2019-02-25 11:09 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-29  9:56 [tarantool-patches] " Ivan Koptelov
2019-01-29 16:35 ` [tarantool-patches] " n.pettik
2019-02-04 12:34   ` Ivan Koptelov
2019-02-05 13:50     ` n.pettik
2019-02-07 15:14       ` i.koptelov
2019-02-11 13:15         ` n.pettik
2019-02-13 15:46           ` i.koptelov
2019-02-14 12:57             ` n.pettik
2019-02-20 13:54               ` i.koptelov
2019-02-20 15:47                 ` i.koptelov
2019-02-20 16:04                   ` n.pettik
2019-02-20 18:08                     ` Vladislav Shpilevoy
2019-02-20 19:24                     ` i.koptelov
2019-02-22 12:59                       ` n.pettik
2019-02-25 11:09                         ` i.koptelov [this message]
2019-02-25 15:10                           ` n.pettik
2019-02-26 13:33                             ` i.koptelov
2019-02-26 17:50                               ` n.pettik
2019-02-26 18:44                                 ` i.koptelov
2019-02-26 20:16                                   ` Vladislav Shpilevoy
2019-03-04 11:59                                     ` i.koptelov
2019-03-04 15:30 ` Kirill Yukhin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=DD611E97-01AF-4E10-A81B-E27AA6B8A681@tarantool.org \
    --to=ivan.koptelov@tarantool.org \
    --cc=korablev@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='[tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\''\0'\''' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox