From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org> To: tarantool-patches@freelists.org, Kirill Shcherbatov <kshcherbatov@tarantool.org> Subject: [tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support ICU Date: Wed, 4 Apr 2018 14:30:47 +0300 [thread overview] Message-ID: <0bd63589-d3c6-389d-aba4-b1d451f20021@tarantool.org> (raw) In-Reply-To: <01a0181a-19f9-c38c-94ea-65ab549a4517@tarantool.org> Hello. See my 12 comments below. And seems, that you ignored my appeal to do not copy pase diff into the letter. Again: indentation here is completely broken - all tabs are turned into 2 spaces, it is unreadable. 04.04.2018 13:37, Kirill Shcherbatov пишет: > ICU Implementation > > From 8703e465382ba05817e7703550694c3790972e54 Mon Sep 17 00:00:00 2001 > Message-Id: > <8703e465382ba05817e7703550694c3790972e54.1522838002.git.kshcherbatov@tarantool.org> > In-Reply-To: <cover.1522838002.git.kshcherbatov@tarantool.org> > References: <cover.1522838002.git.kshcherbatov@tarantool.org> > From: Kirill Shcherbatov <kshcherbatov@tarantool.org> > Date: Wed, 4 Apr 2018 13:06:22 +0300 > Subject: [PATCH v2 3/3] ICU Unicode parsing implementation > > --- > src/box/lua/tuple.c | 31 +++++-- > src/lib/json/path.c | 211 > +++++++++++++++++++++++++++++++++------------ > src/lib/json/path.h | 30 ++++--- > test/engine/tuple.result | 43 +++++++-- > test/engine/tuple.test.lua | 13 ++- > test/unit/CMakeLists.txt | 2 +- > test/unit/json_path.c | 4 +- > 7 files changed, 247 insertions(+), 87 deletions(-) > > diff --git a/src/box/lua/tuple.c b/src/box/lua/tuple.c > index 99b9ff2..b89e1f9 100644 > --- a/src/box/lua/tuple.c > +++ b/src/box/lua/tuple.c > @@ -413,7 +413,6 @@ lbox_tuple_transform(struct lua_State *L) > static inline int > tuple_field_go_to_index(const char **field, uint64_t index) > { > - assert(index >= 0); > enum mp_type type = mp_typeof(**field); > if (type == MP_ARRAY) { > if (index == 0) > @@ -497,6 +496,12 @@ tuple_field_go_to_key(const char **field, const > char *key, int len) > static int > lbox_tuple_field_by_path(struct lua_State *L) > { > + int err_pos = 0; > + struct json_path_parser parser; > + /* Need uninitialized structure to > + * json_path_parser_deinit on lua_isnumber */ > + memset(&parser, 0, sizeof(parser)); > + const char *path = NULL; 1. Please, rename init back to create. In Tarantool we do not use init name for objects. We use create name to initialize an object. And new to allocate + initialize. 2. Remove deinit method. If you need UConverter, you could create it in a global static variable in path.c and initialize it in main() call via json_path_open() call or something. Look other libraries for examples. It is too expensive to do ucnv_open/ucnv_close on each parser creation. When update by JSON path will be implemented, the parser will be being created and destroyed very often. In any case, it is the comment on future, because here you can delete UConverter using U8_NEXT. See http://icu-project.org/apiref/icu4c/utf8_8h.html#a6d4c94e845b059fddba0c51e9bad87fd. It 1) does not use external state like converter, 2) handle errors for you. And please, consider how to remove errors raising from this function. It was written by me, when old errors handling convention was accepted, and now it must be fixed. It would be very good, if you will do it in a separate commit before introducing unicode. > const char *field; > struct tuple *tuple = luaT_istuple(L, 1); > /* Is checked in Lua wrapper. */ > @@ -506,6 +511,18 @@ lbox_tuple_field_by_path(struct lua_State *L) > index -= TUPLE_INDEX_BASE; > if (index < 0) { > not_found: 3. Please, find a way to simplify the code. I have tried to do it for you, and the code becomes already much simpler, if these gotos are moved into the end of the function. You can go off this hit of. > + if (!path) > + goto exit_not_found; > + uint32_t path_len = strlen(path); > + uint32_t path_hash = lua_hash(path, path_len); > + field = tuple_field_by_name(tuple, path, > + path_len, path_hash); > + if (field) > + goto push_value; > + if (err_pos || path_len == 0) > + luaL_error(L, "Error in path on position %d", err_pos); > +exit_not_found: > + json_path_parser_deinit(&parser); 4. Deinit will be completely removed, when you will use U8_NEXT. > diff --git a/src/lib/json/path.c b/src/lib/json/path.c > index 4a6174e..4aadb3a 100644 > --- a/src/lib/json/path.c > +++ b/src/lib/json/path.c > @@ -31,8 +31,11 @@ > > #include "path.h" > #include <ctype.h> > +#include <unicode/uchar.h> > #include "trivia/util.h" > > +#define REPLACEMENT_CHARACTER (0xFFFD) 5. What is replacement character? > + > /** Same as strtoull(), but with limited length. */ > static inline uint64_t > strntoull(const char *src, int len) { > @@ -45,6 +48,51 @@ strntoull(const char *src, int len) { > } > > /** > + * Parse string and update parser's state. > + * @param[out] parser JSON path parser. Upates pos, signs_read. > + * @param[out] UChar32 to store result. > + * > + * @retval 1 Success. > + * @retval 0 End of string. > + * @retval -1 Parse error. > + */ > +static inline int > +parser_read_sign(struct json_path_parser *parser, UChar32 *out) > +{ > + int rc; > + UErrorCode status = U_ZERO_ERROR; > + if (parser->pos == parser->end) > + return 0; > + *out = ucnv_getNextUChar(parser->utf8conv, &parser->pos, > parser->end, &status); > + parser->invalid_sign_off += (rc = U_SUCCESS(status)); > + return rc ? 1 : -1; > +} > + > +/** > + * Parse string and update parser's state. 6. The comment seems to be not for this function. > + * @param[out] parser JSON path parser. Upates pos, signs_read. > + * @param old parser read offset. > + * @param signs to drop. > + */ > +static inline void > +parser_reset_pos(struct json_path_parser *parser, const char > *old_pos, int signs) > +{ > + parser->pos = old_pos; > + parser->invalid_sign_off -= signs; 7. You can do not restore old pos - if the string is invalid somewhere, it is completely invalid and will not be parsed after. > +} > + > +static inline bool > +string_valid_sign(UChar32 c) > +{ > + int8_t type = u_charType(c); > + return !(c == REPLACEMENT_CHARACTER || > + type == U_UNASSIGNED || > + type == U_LINE_SEPARATOR || > + type == U_CONTROL_CHAR || > + type == U_PARAGRAPH_SEPARATOR); > +} 8. If a string is invalid, you will see it in output status output variable in parser_read_sign. I do not see a sense to check for errors here again. > @@ -81,7 +131,7 @@ json_parse_string(struct json_path_parser *parser, > struct json_path_node *node) > /** > * Parse digit sequence into integer until non-digit is met. > * Parser stops right after the last digit. > - * @param parser JSON parser. > + * @param[out] parser JSON parser. Updates signs_read field. 9. The parser is not [out] parameter here. A parameters is called [out], when it is its single mission. Here it is [in][out] parameter actually, but it can be omitted. > +static inline int > +error_sign_offset(struct json_path_parser *parser) > +{ > + return parser->invalid_sign_off; > +} 10. I see no sense in this wrapper. And looks like you too, because somewhere you continue to use invalid_sign_off with no wrapper. And on a future: when you add a new method for a struct, please, name it using <struct_name> prefix. > > diff --git a/src/lib/json/path.h b/src/lib/json/path.h > index 6e8db4c..0ff68c4 100644 > --- a/src/lib/json/path.h > +++ b/src/lib/json/path.h > @@ -33,6 +33,9 @@ > > #include <stdbool.h> > #include <stdint.h> > +#include <unicode/ucnv_err.h> 11. This header is unused here. > +#include <unicode/ucnv.h> 12. Since the UConverter is used inside path.c only, you can remove this header, and predeclare a converter using the code: struct UConverter; But in any case it will be deleted, when you starting to use U8_NEXT.
prev parent reply other threads:[~2018-04-04 11:30 UTC|newest] Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-03-29 14:22 [tarantool-patches] [PATCH v2 0/3] tuple field access via a json path Kirill Shcherbatov 2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 1/3] Introduce json_path_parser Kirill Shcherbatov 2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 2/3] lua: implement json path access to tuple fields Kirill Shcherbatov 2018-03-29 14:22 ` [tarantool-patches] [PATCH v2 3/3] Multibyte characters support Kirill Shcherbatov 2018-03-29 18:04 ` [tarantool-patches] " Kirill Shcherbatov 2018-03-30 10:24 ` v.shpilevoy 2018-03-30 10:25 ` v.shpilevoy 2018-04-02 19:19 ` Kirill Shcherbatov 2018-04-03 10:20 ` Vladislav Shpilevoy 2018-04-05 14:09 ` [tarantool-patches] [PATCH v2 1/1] ICU Unicode support for JSON parser Kirill Shcherbatov 2018-04-05 18:00 ` [tarantool-patches] " Kirill Shcherbatov 2018-04-05 23:32 ` Vladislav Shpilevoy 2018-04-04 10:37 ` [tarantool-patches] [PATCH v2 3/3] Multibyte characters support ICU Kirill Shcherbatov 2018-04-04 11:30 ` Vladislav Shpilevoy [this message]
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=0bd63589-d3c6-389d-aba4-b1d451f20021@tarantool.org \ --to=v.shpilevoy@tarantool.org \ --cc=kshcherbatov@tarantool.org \ --cc=tarantool-patches@freelists.org \ --subject='[tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support ICU' \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox