Tarantool development patches archive
From: "i.koptelov" <ivan.koptelov@tarantool.org>
To: tarantool-patches@freelists.org
Cc: "n.pettik" <korablev@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH] sql: LIKE/LENGTH process '\0'
Date: Wed, 20 Feb 2019 22:24:56 +0300
Message-ID: <583EC402-D1FF-45C4-B18B-8A06D4362200@tarantool.org>
In-Reply-To: <DD522CAF-BD70-4E66-B8A4-C1837370B81D@tarantool.org>


>> On 20 Feb 2019, at 18:47, i.koptelov <ivan.koptelov@tarantool.org> wrote:
>> 
>> Thanks to Alexander, I fixed my patch to use a function
>> from icu to count the length of the string.
>> 
>> Changes:
>> 
> 
> Look, each new implementation again changes the
> results of certain tests. Let’s first define the exact behaviour of the
> length() function and then write a function that satisfies these
> requirements, not vice versa. Is this the final version?
I thought that these changes in the ‘badutf’ tests are OK because we
agreed that we don’t care about the results of LENGTH() on
invalid strings.
> Moreover, since Konstantin suggests making the implementation as fast
> as we can, I propose considering an asm-written variant:
> 
>        .global ap_strlen_utf8_s
> ap_strlen_utf8_s:
>        push %esi
>        cld
>        mov 8(%esp), %esi
>        xor %ecx, %ecx
> loopa:  dec %ecx
> loopb:  lodsb
>        shl $1, %al
>        js loopa
>        jc loopb
>        jnz loopa
>        mov %ecx, %eax
>        not %eax
>        pop %esi
>        ret
> 
> 
> It is taken from http://canonical.org/~kragen/strlen-utf8
> and the author claims that it is quite fast (it seems that it doesn’t
> handle \0, but we can patch it). I didn’t bench it, so I am
> not absolutely sure that it is ‘way faster’ than other implementations.
I also came across this solution, but I considered it to be overkill.
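
For comparison, here is a minimal C sketch of the same idea as I read the asm: count every byte that is not a UTF-8 continuation byte (10xxxxxx). Unlike the asm, it takes an explicit byte length instead of stopping at the first zero byte, so an embedded '\0' counts as a symbol. The name utf8_count_non_continuation is hypothetical and not part of the patch, and I have not benchmarked it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): count the bytes in
 * str[0..byte_len) that are not UTF-8 continuation bytes (10xxxxxx).
 * The length is explicit, so a zero byte is counted like any other
 * single-byte symbol instead of terminating the scan.
 */
static int
utf8_count_non_continuation(const unsigned char *str, int byte_len)
{
	assert(str != NULL || byte_len <= 0);
	int count = 0;
	for (int i = 0; i < byte_len; i++) {
		/* Every byte except 10xxxxxx starts a new symbol. */
		if ((str[i] & 0xc0) != 0x80)
			count++;
	}
	return count;
}
```

Note that with this approach a stray continuation byte in a malformed string contributes nothing to the count.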
> 
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 233ea2901..8ddb9780f 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -149,16 +149,7 @@ utf8_char_count(const unsigned char *str, int byte_len)
>> {
>> 	int symbol_count = 0;
>> 	for (int i = 0; i < byte_len;) {
>> -		if ((str[i] & 0x80) == 0)
>> -			i += 1;
>> -		else if ((str[i] & 0xe0) == 0xc0)
>> -			i += 2;
>> -		else if ((str[i] & 0xf0) == 0xe0)
>> -			i += 3;
>> -		else if ((str[i] & 0xf8) == 0xf0)
>> -			i += 4;
>> -		else
>> -			i += 1;
>> +		U8_FWD_1_UNSAFE(str, i);
> 
> This function does not handle strings the way we’ve discussed.
Because it always does three comparisons?

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
    (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
    (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}
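
To see where the two versions can disagree, here is a small sketch that computes the step taken for a given lead byte by the macro and by the removed if/else chain. The helper names step_icu and step_old are hypothetical; the macro is copied from above and the chain from the removed lines of the diff.

```c
#include <assert.h>
#include <stdint.h>

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
    (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

/* Hypothetical helper: bytes skipped by U8_FWD_1_UNSAFE for this lead byte. */
static int
step_icu(unsigned char lead)
{
	int step = 1 + U8_COUNT_TRAIL_BYTES_UNSAFE(lead);
	/* Three comparisons, each adding 0 or 1, so the step is 1..4. */
	assert(step >= 1 && step <= 4);
	return step;
}

/* Hypothetical helper: bytes skipped by the removed if/else chain. */
static int
step_old(unsigned char lead)
{
	if ((lead & 0x80) == 0)
		return 1;
	else if ((lead & 0xe0) == 0xc0)
		return 2;
	else if ((lead & 0xf0) == 0xe0)
		return 3;
	else if ((lead & 0xf8) == 0xf0)
		return 4;
	else
		return 1;
}
```

The two helpers return different steps only for lead bytes 0xc0, 0xc1 and 0xf8..0xff, none of which can occur in well-formed UTF-8. So only results on invalid strings can differ.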


> Furthermore, the description says that it “assumes well-formed UTF-8”,
> which in our case is not true. So who knows what may happen if we pass
> a malformed byte sequence. Not to mention that the behaviour of
> this function on invalid inputs may change later.

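In its current implementation, U8_FWD_1_UNSAFE satisfies our needs safely:
it reads only the lead byte str[i] with i < byte_len and advances i by at
least 1, so the loop never reads past the buffer and the returned symbol
count never exceeds byte_len.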

static int
utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		U8_FWD_1_UNSAFE(str, i);
		symbol_count++;
	}
	return symbol_count;
}
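
A self-contained sketch for checking this on inputs with an embedded '\0' and on truncated sequences. The macros are copied from above rather than included from ICU, which is an assumption about how the code might be inlined. The assert is added here only to state the bound explicitly.

```c
#include <assert.h>
#include <stdint.h>

#define U8_COUNT_TRAIL_BYTES_UNSAFE(leadByte) \
    (((uint8_t)(leadByte)>=0xc2)+((uint8_t)(leadByte)>=0xe0)+((uint8_t)(leadByte)>=0xf0))

#define U8_FWD_1_UNSAFE(s, i) { \
    (i)+=1+U8_COUNT_TRAIL_BYTES_UNSAFE((s)[i]); \
}

static int
utf8_char_count(const unsigned char *str, int byte_len)
{
	int symbol_count = 0;
	for (int i = 0; i < byte_len;) {
		/* Reads only str[i], and i < byte_len holds here. */
		U8_FWD_1_UNSAFE(str, i);
		symbol_count++;
	}
	/* Each iteration advances i by at least 1. */
	assert(symbol_count <= (byte_len > 0 ? byte_len : 0));
	return symbol_count;
}
```

On a truncated sequence such as a lone 0xf0, i jumps past byte_len after the step, but the loop condition stops the scan before any further read.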

I agree that it is a bad idea to rely on library behaviour which may
change later. So maybe I should just inline these one-line macros?
Or use my own implementation, since it’s more efficient (but less beautiful).
