Tarantool development patches archive
 help / color / mirror / Atom feed
From: "v.shpilevoy@tarantool.org" <v.shpilevoy@tarantool.org>
To: tarantool-patches@freelists.org
Cc: Kirill Shcherbatov <kshcherbatov@tarantool.org>
Subject: [tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support
Date: Fri, 30 Mar 2018 13:24:05 +0300	[thread overview]
Message-ID: <55E839FD-8366-4BA0-BDAF-5C13661E40F7@tarantool.org> (raw)
In-Reply-To: <db8586f2-a673-edb6-29fc-cc59eade2aac@tarantool.org>

[-- Attachment #1: Type: text/plain, Size: 4395 bytes --]

See below 5 comments. And see at the travis: https://travis-ci.org/tarantool/tarantool/builds/359964649?utm_source=github_status&utm_medium=notification <https://travis-ci.org/tarantool/tarantool/builds/359964649?utm_source=github_status&utm_medium=notification> - seems like the patch does not work. box/alter.test.lua with unicode and your own tests are failed.

> 29 марта 2018 г., в 21:04, Kirill Shcherbatov <kshcherbatov@tarantool.org> написал(а):
> 
> From eb16ee6ea2d2ca5801b6b09c403c6e2f23984bd9 Mon Sep 17 00:00:00 2001
> From: Kirill Shcherbatov <kshcherbatov@tarantool.org>
> Date: Thu, 29 Mar 2018 21:03:21 +0300
> Subject: [PATCH] Multibyte characters support
> 
> ---
> src/box/lua/tuple.c        |  1 -
> src/lib/json/path.c        | 35 ++++++++++++++++++++++++++++++-----
> src/lib/json/path.h        |  2 ++
> test/engine/tuple.result   | 16 ++++++++++++++--
> test/engine/tuple.test.lua |  5 ++++-
> 5 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/src/box/lua/tuple.c b/src/box/lua/tuple.c
> index 99b9ff2..c3a435b 100644
> --- a/src/box/lua/tuple.c
> +++ b/src/box/lua/tuple.c
> @@ -44,6 +46,26 @@ strntoull(const char *src, int len) {
> 	return value;
> }
> 

1. Please, add comments what the function does, what arguments gets, what returns.

> +static inline size_t
> +mbsize(const char *str, size_t str_size)
> +{
> +	mbstate_t ps;
> +	memset(&ps, 0, sizeof(ps));
> +	mbrlen(NULL,0,&ps);

2. I do not like this always-zero ps. Why you can not use mblen instead of mbrlen?

> +	return mbrlen(str, str_size, &ps);
> +}
> +
> +static inline int
> +mb2wcisalpha(char *mb_char, size_t mb_char_size)
> +{
> +	assert(mb_char_size < 5);
> +	wchar_t buff[2];

3. Sorry, I still can not understand, why do you need 2 wchar, if here only
one is checked? Why do you use mbsrtowcs instead of mbrtowc.

> +	mbstate_t ps;
> +	memset(&ps, 0, sizeof(ps));
> +	mbsrtowcs(buff, (const char **)&mb_char, mb_char_size + 1, &ps);
> +	return iswalpha((wint_t)buff[0]);
> +}
> +
> /**
>  * Parse string identifier in quotes. Parser either stops right
>  * after the closing quote, or returns an error position.
> @@ -126,17 +148,20 @@ json_parse_identifier(struct json_path_parser *parser,
> 	const char *str = pos;
> 	char c = *pos;
> 	/* First symbol can not be digit. */
> -	if (!isalpha(c) && c != '_')
> -		return pos - parser->src + 1;
> +	size_t mb_size = 0;
> +	if ((mb_size = mbsize(pos, end - pos)) && !mb2wcisalpha((char *)pos, mb_size) && c != '_')
> +		return pos - parser->src + mb_size;

4. I propose to join mbsize and mb2wcisalpha under the single function check_alpha()
or something, that either accepts char ** and moves it forward, or returns parsed size.

And why you do not check mbsize() for negative results? In the doc I saw, that it can
return values like (size_t)-1 or (size_t)-2. And you implementation accepts it as ok.

> 	int len = 1;
> -	for (c = *++pos; pos < end && (isalpha(c) || c == '_' || isdigit(c));
> -	     c = *++pos)
> +	for (c = *(pos += mb_size);
> +	     pos < end && (((mb_size = mbsize(pos, end - pos))
> +	                   && mb2wcisalpha((char *)pos, mb_size)) || c == '_' || isdigit(c));
> +	     c = *(pos += mb_size))
> 		++len;
> 	assert(len > 0);
> 	parser->pos = pos;
> 	node->type = JSON_PATH_STR;
> 	node->str = str;
> -	node->len = len;
> +	node->len = (int)(pos - str);
> 	return 0;
> }
> 
> diff --git a/src/lib/json/path.h b/src/lib/json/path.h
> index 6e8db4c..b1028f6 100644
> --- a/src/lib/json/path.h
> +++ b/src/lib/json/path.h
> @@ -33,6 +33,8 @@
> 
> #include <stdbool.h>
> #include <stdint.h>
> +#include <string.h>
> +#include <malloc.h>
> 
> #ifdef __cplusplus
> extern "C" {
> diff --git a/test/engine/tuple.result b/test/engine/tuple.result
> index 2d7367a..c4b361a 100644
> --- a/test/engine/tuple.result
> +++ b/test/engine/tuple.result
> @@ -611,7 +611,7 @@ s = box.schema.space.create('test', {format = format})
> pk = s:create_index('pk')
> ---
> ...
> -field2 = {1, 2, 3, "4", {5,6,7}, {key="key1", value="value1"}}
> +field2 = {1, 2, 3, "4", {5,6,7}, {key="key1", value="value1", hello中国world = {中国 = 'test'}}}

5. Please, add unit tests on unicode. See commit
49df120aaa7592e6475bf6974fe6f225db9234de (Introduce json_path_parser) for examples.
> 


[-- Attachment #2: Type: text/html, Size: 9490 bytes --]

  reply	other threads:[~2018-03-30 10:24 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-03-29 14:22 [tarantool-patches] [PATCH v2 0/3] tuple field access via a json path Kirill Shcherbatov
2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 1/3] Introduce json_path_parser Kirill Shcherbatov
2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 2/3] lua: implement json path access to tuple fields Kirill Shcherbatov
2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 3/3] Multibyte characters support Kirill Shcherbatov
2018-03-29 18:04   ` [tarantool-patches] " Kirill Shcherbatov
2018-03-30 10:24     ` v.shpilevoy [this message]
2018-03-30 10:25       ` v.shpilevoy
2018-04-02 19:19       ` Kirill Shcherbatov
2018-04-03 10:20         ` Vladislav Shpilevoy
2018-04-05 14:09           ` [tarantool-patches] [PATCH v2 1/1] ICU Unicode support for JSON parser Kirill Shcherbatov
2018-04-05 18:00             ` [tarantool-patches] " Kirill Shcherbatov
2018-04-05 23:32               ` Vladislav Shpilevoy
2018-04-04 10:37 ` [tarantool-patches] [PATCH v2 3/3] Multibyte characters support ICU Kirill Shcherbatov
2018-04-04 11:30   ` [tarantool-patches] " Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55E839FD-8366-4BA0-BDAF-5C13661E40F7@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=kshcherbatov@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='[tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox