From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id BFBA92CEE4 for ; Fri, 30 Mar 2018 06:24:10 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id EBEbSSuRYHaG for ; Fri, 30 Mar 2018 06:24:10 -0400 (EDT) Received: from smtp52.i.mail.ru (smtp52.i.mail.ru [94.100.177.112]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id DFA102CEE2 for ; Fri, 30 Mar 2018 06:24:09 -0400 (EDT) From: "v.shpilevoy@tarantool.org" Message-Id: <55E839FD-8366-4BA0-BDAF-5C13661E40F7@tarantool.org> Content-Type: multipart/alternative; boundary="Apple-Mail=_AB5A24F5-F013-4260-A5DC-22CCF4AB3F8F" Mime-Version: 1.0 (Mac OS X Mail 11.2 \(3445.5.20\)) Subject: [tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support Date: Fri, 30 Mar 2018 13:24:05 +0300 In-Reply-To: References: Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: tarantool-patches@freelists.org Cc: Kirill Shcherbatov --Apple-Mail=_AB5A24F5-F013-4260-A5DC-22CCF4AB3F8F Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 See below 5 comments. And see at the travis: = https://travis-ci.org/tarantool/tarantool/builds/359964649?utm_source=3Dgi= thub_status&utm_medium=3Dnotification = - seems like the patch does not = work. box/alter.test.lua with unicode and your own tests are failed. > 29 =D0=BC=D0=B0=D1=80=D1=82=D0=B0 2018 =D0=B3., =D0=B2 21:04, Kirill = Shcherbatov =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0= =B0=D0=BB(=D0=B0): >=20 > =46rom eb16ee6ea2d2ca5801b6b09c403c6e2f23984bd9 Mon Sep 17 00:00:00 = 2001 > From: Kirill Shcherbatov > Date: Thu, 29 Mar 2018 21:03:21 +0300 > Subject: [PATCH] Multibyte characters support >=20 > --- > src/box/lua/tuple.c | 1 - > src/lib/json/path.c | 35 ++++++++++++++++++++++++++++++----- > src/lib/json/path.h | 2 ++ > test/engine/tuple.result | 16 ++++++++++++++-- > test/engine/tuple.test.lua | 5 ++++- > 5 files changed, 50 insertions(+), 9 deletions(-) >=20 > diff --git a/src/box/lua/tuple.c b/src/box/lua/tuple.c > index 99b9ff2..c3a435b 100644 > --- a/src/box/lua/tuple.c > +++ b/src/box/lua/tuple.c > @@ -44,6 +46,26 @@ strntoull(const char *src, int len) { > return value; > } >=20 1. Please, add comments what the function does, what arguments gets, = what returns. > +static inline size_t > +mbsize(const char *str, size_t str_size) > +{ > + mbstate_t ps; > + memset(&ps, 0, sizeof(ps)); > + mbrlen(NULL,0,&ps); 2. I do not like this always-zero ps. Why you can not use mblen instead = of mbrlen? > + return mbrlen(str, str_size, &ps); > +} > + > +static inline int > +mb2wcisalpha(char *mb_char, size_t mb_char_size) > +{ > + assert(mb_char_size < 5); > + wchar_t buff[2]; 3. Sorry, I still can not understand, why do you need 2 wchar, if here = only one is checked? Why do you use mbsrtowcs instead of mbrtowc. > + mbstate_t ps; > + memset(&ps, 0, sizeof(ps)); > + mbsrtowcs(buff, (const char **)&mb_char, mb_char_size + 1, &ps); > + return iswalpha((wint_t)buff[0]); > +} > + > /** > * Parse string identifier in quotes. Parser either stops right > * after the closing quote, or returns an error position. > @@ -126,17 +148,20 @@ json_parse_identifier(struct json_path_parser = *parser, > const char *str =3D pos; > char c =3D *pos; > /* First symbol can not be digit. */ > - if (!isalpha(c) && c !=3D '_') > - return pos - parser->src + 1; > + size_t mb_size =3D 0; > + if ((mb_size =3D mbsize(pos, end - pos)) && !mb2wcisalpha((char = *)pos, mb_size) && c !=3D '_') > + return pos - parser->src + mb_size; 4. I propose to join mbsize and mb2wcisalpha under the single function = check_alpha() or something, that either accepts char ** and moves it forward, or = returns parsed size. And why you do not check mbsize() for negative results? In the doc I = saw, that it can return values like (size_t)-1 or (size_t)-2. And you implementation = accepts it as ok. > int len =3D 1; > - for (c =3D *++pos; pos < end && (isalpha(c) || c =3D=3D '_' || = isdigit(c)); > - c =3D *++pos) > + for (c =3D *(pos +=3D mb_size); > + pos < end && (((mb_size =3D mbsize(pos, end - pos)) > + && mb2wcisalpha((char *)pos, mb_size)) || c = =3D=3D '_' || isdigit(c)); > + c =3D *(pos +=3D mb_size)) > ++len; > assert(len > 0); > parser->pos =3D pos; > node->type =3D JSON_PATH_STR; > node->str =3D str; > - node->len =3D len; > + node->len =3D (int)(pos - str); > return 0; > } >=20 > diff --git a/src/lib/json/path.h b/src/lib/json/path.h > index 6e8db4c..b1028f6 100644 > --- a/src/lib/json/path.h > +++ b/src/lib/json/path.h > @@ -33,6 +33,8 @@ >=20 > #include > #include > +#include > +#include >=20 > #ifdef __cplusplus > extern "C" { > diff --git a/test/engine/tuple.result b/test/engine/tuple.result > index 2d7367a..c4b361a 100644 > --- a/test/engine/tuple.result > +++ b/test/engine/tuple.result > @@ -611,7 +611,7 @@ s =3D box.schema.space.create('test', {format =3D = format}) > pk =3D s:create_index('pk') > --- > ... > -field2 =3D {1, 2, 3, "4", {5,6,7}, {key=3D"key1", value=3D"value1"}} > +field2 =3D {1, 2, 3, "4", {5,6,7}, {key=3D"key1", value=3D"value1", = hello=E4=B8=AD=E5=9B=BDworld =3D {=E4=B8=AD=E5=9B=BD =3D 'test'}}} 5. Please, add unit tests on unicode. See commit 49df120aaa7592e6475bf6974fe6f225db9234de (Introduce json_path_parser) = for examples. >=20 --Apple-Mail=_AB5A24F5-F013-4260-A5DC-22CCF4AB3F8F Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 See = below 5 comments. And see at the travis: https://travis-ci.org/tarantool/tarantool/builds/359964649?utm_= source=3Dgithub_status&utm_medium=3Dnotification - seems = like the patch does not work. box/alter.test.lua with unicode and your = own tests are failed.

29 =D0=BC=D0=B0=D1=80=D1=82=D0=B0= 2018 =D0=B3., =D0=B2 21:04, Kirill Shcherbatov <kshcherbatov@tarantool.org> = =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB(=D0=B0):

=46rom= eb16ee6ea2d2ca5801b6b09c403c6e2f23984bd9 Mon Sep 17 00:00:00 2001
From: Kirill Shcherbatov <kshcherbatov@tarantool.org>
Date: Thu, = 29 Mar 2018 21:03:21 +0300
Subject: [PATCH] Multibyte = characters support

---
= src/box/lua/tuple.c        |  1 = -
src/lib/json/path.c =        | 35 = ++++++++++++++++++++++++++++++-----
src/lib/json/path.h =        |  2 ++
= test/engine/tuple.result   | 16 ++++++++++++++--
= test/engine/tuple.test.lua |  5 ++++-
5 files = changed, 50 insertions(+), 9 deletions(-)

diff --git a/src/box/lua/tuple.c b/src/box/lua/tuple.c
index 99b9ff2..c3a435b 100644
--- = a/src/box/lua/tuple.c
+++ b/src/box/lua/tuple.c
@@ -44,6 +46,26 @@ strntoull(const char *src, int len) {
= return value;
}


1. = Please, add comments what the function does, what arguments gets, what = returns.

+static inline size_t
+mbsize(const char *str, size_t str_size)
+{
+ = mbstate_t ps;
+ memset(&ps, 0, = sizeof(ps));
+ mbrlen(NULL,0,&ps);

2. I = do not like this always-zero ps. Why you can not use mblen instead of = mbrlen?

+ return mbrlen(str, str_size, = &ps);
+}
+
+static inline = int
+mb2wcisalpha(char *mb_char, size_t mb_char_size)
+{
+ assert(mb_char_size < 5);
+ = wchar_t buff[2];

3. Sorry, I still can not understand, why do you = need 2 wchar, if here only
one is checked? Why do you use = mbsrtowcs instead of mbrtowc.

+ mbstate_t = ps;
+ memset(&ps, 0, = sizeof(ps));
+ mbsrtowcs(buff, (const char = **)&mb_char, mb_char_size + 1, &ps);
+ return = iswalpha((wint_t)buff[0]);
+}
+
= /**
 * Parse string identifier in quotes. Parser = either stops right
 * after the closing quote, or = returns an error position.
@@ -126,17 +148,20 @@ = json_parse_identifier(struct json_path_parser *parser,
= = const char *str =3D pos;
char c =3D = *pos;
/* First symbol can not be digit. = */
- if (!isalpha(c) && c !=3D = '_')
- return pos - parser->src + = 1;
+ size_t mb_size =3D 0;
+ = if ((mb_size =3D mbsize(pos, end - pos)) && = !mb2wcisalpha((char *)pos, mb_size) && c !=3D '_')
+ = = return pos - parser->src + mb_size;

4. I = propose to join mbsize and mb2wcisalpha under the single function = check_alpha()
or something, that either accepts char ** and = moves it forward, or returns parsed size.

And why you do not check mbsize() for negative = results? In the doc I saw, that it can
return values like = (size_t)-1 or (size_t)-2. And you implementation accepts it as = ok.

int len =3D 1;
- for (c =3D = *++pos; pos < end && (isalpha(c) || c =3D=3D '_' || = isdigit(c));
-     c =3D = *++pos)
+ for (c =3D *(pos +=3D = mb_size);
+     pos < = end && (((mb_size =3D mbsize(pos, end - pos))
+ =             &n= bsp;     && mb2wcisalpha((char *)pos, = mb_size)) || c =3D=3D '_' || isdigit(c));
+ =     c =3D *(pos +=3D mb_size))
++len;
= assert(len > 0);
parser->pos =3D pos;
= node->type =3D JSON_PATH_STR;
= node->str =3D str;
- node->len =3D len;
+ = node->len =3D (int)(pos - str);
return = 0;
}

diff --git = a/src/lib/json/path.h b/src/lib/json/path.h
index = 6e8db4c..b1028f6 100644
--- a/src/lib/json/path.h
+++ b/src/lib/json/path.h
@@ -33,6 +33,8 @@

#include <stdbool.h>
= #include <stdint.h>
+#include <string.h>
+#include <malloc.h>

= #ifdef __cplusplus
extern "C" {
diff --git = a/test/engine/tuple.result b/test/engine/tuple.result
index = 2d7367a..c4b361a 100644
--- a/test/engine/tuple.result
+++ b/test/engine/tuple.result
@@ -611,7 +611,7 = @@ s =3D box.schema.space.create('test', {format =3D format})
pk =3D s:create_index('pk')
---
= ...
-field2 =3D {1, 2, 3, "4", {5,6,7}, {key=3D"key1", = value=3D"value1"}}
+field2 =3D {1, 2, 3, "4", {5,6,7}, = {key=3D"key1", value=3D"value1", hello=E4=B8=AD=E5=9B=BDworld =3D = {=E4=B8=AD=E5=9B=BD =3D 'test'}}}

5. Please, add unit tests on unicode. See = commit
49df120aaa7592e6475bf6974fe6f225db9234de (Introduce = json_path_parser) for examples.


= --Apple-Mail=_AB5A24F5-F013-4260-A5DC-22CCF4AB3F8F--