[tarantool-patches] Re: [PATCH v2 3/3] Multibyte characters support ICU

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Wed Apr 4 14:30:47 MSK 2018


Hello. See my 12 comments below. And seems, that you ignored my appeal 
to do not copy pase
diff into the letter. Again: indentation here is completely broken - all 
tabs are turned into 2 spaces,
it is unreadable.

04.04.2018 13:37, Kirill Shcherbatov пишет:
> ICU Implementation
>
> From 8703e465382ba05817e7703550694c3790972e54 Mon Sep 17 00:00:00 2001
> Message-Id: 
> <8703e465382ba05817e7703550694c3790972e54.1522838002.git.kshcherbatov at tarantool.org>
> In-Reply-To: <cover.1522838002.git.kshcherbatov at tarantool.org>
> References: <cover.1522838002.git.kshcherbatov at tarantool.org>
> From: Kirill Shcherbatov <kshcherbatov at tarantool.org>
> Date: Wed, 4 Apr 2018 13:06:22 +0300
> Subject: [PATCH v2 3/3] ICU Unicode parsing implementation
>
> ---
>  src/box/lua/tuple.c        |  31 +++++--
>  src/lib/json/path.c        | 211 
> +++++++++++++++++++++++++++++++++------------
>  src/lib/json/path.h        |  30 ++++---
>  test/engine/tuple.result   |  43 +++++++--
>  test/engine/tuple.test.lua |  13 ++-
>  test/unit/CMakeLists.txt   |   2 +-
>  test/unit/json_path.c      |   4 +-
>  7 files changed, 247 insertions(+), 87 deletions(-)
>
> diff --git a/src/box/lua/tuple.c b/src/box/lua/tuple.c
> index 99b9ff2..b89e1f9 100644
> --- a/src/box/lua/tuple.c
> +++ b/src/box/lua/tuple.c
> @@ -413,7 +413,6 @@ lbox_tuple_transform(struct lua_State *L)
>  static inline int
>  tuple_field_go_to_index(const char **field, uint64_t index)
>  {
> -    assert(index >= 0);
>      enum mp_type type = mp_typeof(**field);
>      if (type == MP_ARRAY) {
>          if (index == 0)
> @@ -497,6 +496,12 @@ tuple_field_go_to_key(const char **field, const 
> char *key, int len)
>  static int
>  lbox_tuple_field_by_path(struct lua_State *L)
>  {
> +    int err_pos = 0;
> +    struct json_path_parser parser;
> +    /* Need uninitialized structure to
> +     * json_path_parser_deinit on lua_isnumber */
> +    memset(&parser, 0, sizeof(parser));
> +    const char *path = NULL;
1. Please, rename init back to create. In Tarantool we do not use init name
for objects. We use create name to initialize an object. And new to 
allocate +
initialize.

2. Remove deinit method. If you need UConverter, you could create it in 
a global
static variable in path.c and initialize it in main() call via 
json_path_open() call or something.
Look other libraries for examples. It is too expensive to do 
ucnv_open/ucnv_close on each
parser creation. When update by JSON path will be implemented, the 
parser will be
being created and destroyed very often.

In any case, it is the comment on future, because here you can delete 
UConverter using
U8_NEXT. See
http://icu-project.org/apiref/icu4c/utf8_8h.html#a6d4c94e845b059fddba0c51e9bad87fd.
It 1) does not use external state like converter, 2) handle errors for you.

And please, consider how to remove errors raising from this function. It 
was written by me,
when old errors handling convention was accepted, and now it must be 
fixed. It would be very
good, if you will do it in a separate commit before introducing unicode.
> const char *field;
>      struct tuple *tuple = luaT_istuple(L, 1);
>      /* Is checked in Lua wrapper. */
> @@ -506,6 +511,18 @@ lbox_tuple_field_by_path(struct lua_State *L)
>          index -= TUPLE_INDEX_BASE;
>          if (index < 0) {
>  not_found:
3. Please, find a way to simplify the code. I have tried to do it for 
you, and the code becomes
already much simpler, if these gotos are moved into the end of the 
function. You can go off
this hit of.
> + if (!path)
> +                goto exit_not_found;
> +            uint32_t path_len = strlen(path);
> +            uint32_t path_hash = lua_hash(path, path_len);
> +            field = tuple_field_by_name(tuple, path,
> +                                        path_len, path_hash);
> +            if (field)
> +                goto push_value;
> +            if (err_pos || path_len == 0)
> +                luaL_error(L, "Error in path on position %d", err_pos);
> +exit_not_found:
> +            json_path_parser_deinit(&parser);
4. Deinit will be completely removed, when you will use U8_NEXT.
> diff --git a/src/lib/json/path.c b/src/lib/json/path.c
> index 4a6174e..4aadb3a 100644
> --- a/src/lib/json/path.c
> +++ b/src/lib/json/path.c
> @@ -31,8 +31,11 @@
>
>  #include "path.h"
>  #include <ctype.h>
> +#include <unicode/uchar.h>
>  #include "trivia/util.h"
>
> +#define REPLACEMENT_CHARACTER (0xFFFD)
5. What is replacement character?
> +
>  /** Same as strtoull(), but with limited length. */
>  static inline uint64_t
>  strntoull(const char *src, int len) {
> @@ -45,6 +48,51 @@ strntoull(const char *src, int len) {
>  }
>
>  /**
> + * Parse string and update parser's state.
> + * @param[out] parser JSON path parser. Upates pos, signs_read.
> + * @param[out] UChar32 to store result.
> + *
> + * @retval 1 Success.
> + * @retval 0 End of string.
> + * @retval -1 Parse error.
> + */
> +static inline int
> +parser_read_sign(struct json_path_parser *parser, UChar32 *out)
> +{
> +    int rc;
> +    UErrorCode status = U_ZERO_ERROR;
> +    if (parser->pos == parser->end)
> +        return 0;
> +    *out = ucnv_getNextUChar(parser->utf8conv, &parser->pos, 
> parser->end, &status);
> +    parser->invalid_sign_off += (rc = U_SUCCESS(status));
> +    return rc ? 1 : -1;
> +}
> +
> +/**
> + * Parse string and update parser's state.
6. The comment seems to be not for this function.
> + * @param[out] parser JSON path parser. Upates pos, signs_read.
> + * @param old parser read offset.
> + * @param signs to drop.
> + */
> +static inline void
> +parser_reset_pos(struct json_path_parser *parser, const char 
> *old_pos, int signs)
> +{
> +    parser->pos = old_pos;
> +    parser->invalid_sign_off -= signs;
7. You can do not restore old pos - if the string is invalid somewhere, 
it is completely
invalid and will not be parsed after.
> +}
> +
> +static inline bool
> +string_valid_sign(UChar32 c)
> +{
> +    int8_t type = u_charType(c);
> +    return !(c == REPLACEMENT_CHARACTER ||
> +             type == U_UNASSIGNED ||
> +             type == U_LINE_SEPARATOR ||
> +             type == U_CONTROL_CHAR ||
> +             type == U_PARAGRAPH_SEPARATOR);
> +}
8. If a string is invalid, you will see it in output status output 
variable in parser_read_sign. I
do not see a sense to check for errors here again.
> @@ -81,7 +131,7 @@ json_parse_string(struct json_path_parser *parser, 
> struct json_path_node *node)
>  /**
>   * Parse digit sequence into integer until non-digit is met.
>   * Parser stops right after the last digit.
> - * @param parser JSON parser.
> + * @param[out] parser JSON parser. Updates signs_read field.
9. The parser is not [out] parameter here. A parameters is called [out], 
when it is its single
mission. Here it is [in][out] parameter actually, but it can be omitted.
> +static inline int
> +error_sign_offset(struct json_path_parser *parser)
> +{
> +    return parser->invalid_sign_off;
> +}
10. I see no sense in this wrapper. And looks like you too, because 
somewhere you continue
to use invalid_sign_off with no wrapper. And on a future: when you add a 
new method
for a struct, please, name it using <struct_name> prefix.
>
> diff --git a/src/lib/json/path.h b/src/lib/json/path.h
> index 6e8db4c..0ff68c4 100644
> --- a/src/lib/json/path.h
> +++ b/src/lib/json/path.h
> @@ -33,6 +33,9 @@
>
>  #include <stdbool.h>
>  #include <stdint.h>
> +#include <unicode/ucnv_err.h>
11. This header is unused here.
> +#include <unicode/ucnv.h>
12. Since the UConverter is used inside path.c only, you can remove this
header, and predeclare a converter using the code:
struct UConverter;

But in any case it will be deleted, when you starting to use U8_NEXT.






More information about the Tarantool-patches mailing list