Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: Vladimir Davydov <vdavydov.dev@gmail.com>, kshcherbatov@tarantool.org
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] Re: [PATCH v3 2/3] Introduce json_path_parser with Unicode support.
Date: Tue, 10 Apr 2018 21:42:08 +0300	[thread overview]
Message-ID: <a3a42dfb-4a38-d4c1-5186-225cbdae6bc0@tarantool.org> (raw)
In-Reply-To: <20180410163619.egndfoanpazswmba@esperanza>



On 10/04/2018 19:36, Vladimir Davydov wrote:
> On Tue, Apr 10, 2018 at 11:31:45AM +0300, Vladislav Shpilevoy wrote:
>> diff --git a/src/lib/json/path.c b/src/lib/json/path.c
>> new file mode 100644
>> index 000000000..681419f90
>> --- /dev/null
>> +++ b/src/lib/json/path.c
> 
>> +/**
>> + * Read a single symbol from a string starting from an offset.
>> + * @param parser JSON path parser.
>> + * @param[out] UChar32 Read symbol.
>> + *
>> + * @retval   0 Success.
>> + * @retval > 0 1-based position of a syntax error.
>> + */
>> +static inline int
>> +json_read_symbol(struct json_path_parser *parser, UChar32 *out)
>> +{
>> +	if (parser->offset == parser->src_len) {
> 
>> +		*out = U_SENTINEL;
> 
> This is not necessary AFAICS.

One of compilers prints warning, that 'UChar32 c' is not initialized, 
where 'c' is argument of json_read_symbol.
>> +/** Fast forward when it is known that a symbol is 1-byte char. */
>> +static inline void
>> +json_propagate_char(struct json_path_parser *parser)
> 
> Bad name IMO, json_skip_char would be better.

As you wish.

diff --git a/src/lib/json/path.c b/src/lib/json/path.c
index 681419f90..1fccf82d0 100644
--- a/src/lib/json/path.c
+++ b/src/lib/json/path.c
@@ -72,7 +72,7 @@ json_revert_symbol(struct json_path_parser *parser, 
int offset)

  /** Fast forward when it is known that a symbol is 1-byte char. */
  static inline void
-json_propagate_char(struct json_path_parser *parser)
+json_skip_char(struct json_path_parser *parser)
  {
         ++parser->offset;
         ++parser->symbol_count;

>> +/**
>> + * Parse string identifier in quotes. Parser either stops right
>> + * after the closing quote, or returns an error position.
>> + * @param parser JSON path parser.
>> + * @param[out] node JSON node to store result.
>> + * @param quote_type Quote by that a string must be terminated.
>> + *
>> + * @retval   0 Success.
>> + * @retval > 0 1-based position of a syntax error.
>> + */
>> +static inline int
>> +json_parse_string(struct json_path_parser *parser, struct json_path_node *node,
>> +		  UChar32 quote_type)
>> +{
>> +	assert(parser->offset < parser->src_len);
>> +	assert(quote_type == json_current_char(parser));
> 
> Why pass quote_type if you know that it equals json_current_char?

I do not want to calculate *(src + offset) if it is already done in the 
caller function. Besides, I do not like json_current_char(), but now I 
can not find how to drop it with no simple inline.
>> +static inline int
>> +json_parse_integer(struct json_path_parser *parser, struct json_path_node *node)
>> +{
>> +	const char *end = parser->src + parser->src_len;
>> +	const char *pos = parser->src + parser->offset;
>> +	assert(pos < end);
>> +	int len = 0;
>> +	uint64_t value = 0;
>> +	char c = *pos;
>> +	if (! isdigit(c))
>> +		return parser->symbol_count + 1;
>> +	do {
>> +		value = value * 10 + c - (int)'0';
>> +		++len;
>> +	} while (++pos < end && isdigit((c = *pos)));
>> +	parser->offset += len;
>> +	parser->symbol_count += len;
>> +	node->type = JSON_PATH_NUM;
>> +	node->num = value;
> 
> strtol?

Now number value is calculated simultaneously with reading and offsets 
propagation in exactly N symbols, where N - is length of a number.

You propose to read N symbols and then read them again to build a number 
- it is 2N readings. Seems to be slower.

Moreover, strtol will check each symbol on is digit, but we already know 
that all len symbols are digits. And strtol does not have strntol 
version. So I can input invalid JSON like this: '[1234', and strtol can 
do not stop after 4, if next memory area byte looks like a digit.

First version of JSON path parser had strntoll() implementation and 2N 
readings, but the bench showed, then building a number on fly is faster.

> 
>> +	return 0;
>> +}
>> +
>> +/**
>> + * Check that a symbol can be part of a JSON path not inside
>> + * ["..."].
>> + */
>> +static inline bool
>> +json_is_valid_identifier_symbol(UChar32 c)
>> +{
>> +	return u_isUAlphabetic(c) || c == (UChar32)'_' || u_isdigit(c);
>> +}
>> +
>> +/**
>> + * Parse identifier out of quotes. It can contain only alphas,
>> + * digits and underscores. And can not contain digit at the first
>> + * position. Parser is stoped right after the last non-digit,
>> + * non-alpha and non-underscore symbol.
>> + * @param parser JSON parser. Updates signs_read field.
> 
> What is 'signs_read'?

Did not notice that on self-review(

Fixed.

@@ -166,7 +166,7 @@ json_is_valid_identifier_symbol(UChar32 c)
   * digits and underscores. And can not contain digit at the first
   * position. Parser is stoped right after the last non-digit,
   * non-alpha and non-underscore symbol.
- * @param parser JSON parser. Updates signs_read field.
+ * @param parser JSON parser.
   * @param[out] node JSON node to store result.
   *
   * @retval   0 Success.

>> +	/* First symbol can not be digit. */
>> +	if (!u_isalpha(c) && c != (UChar32)'_')
>> +		return parser->symbol_count;
>> +	int last_offset = parser->offset;
>> +	while ((rc = json_read_symbol(parser, &c)) == 0) {
> 
>> +		if (! json_is_valid_identifier_symbol(c)) {
> 
> I would fold this function, because we have to use u_isalpha right above
> anyway.

is_valid_identifier checks not only on alpha and '_' - it checks on 
digits. An identifier can contain digits on non-first place.

>> +}
>> +
>> +int
>> +json_path_next(struct json_path_parser *parser, struct json_path_node *node)
>> +{
>> +	UChar32 c;
>> +	int last_offset = parser->offset;
>> +	int rc = json_read_symbol(parser, &c);
>> +	if (rc != 0) {
>> +		if (parser->offset == parser->src_len) {
>> +			node->type = JSON_PATH_END;
>> +			return 0;
>> +		}
> 
> IMO it would be more straightforward to first check EOF and only then
> proceed to reading the string, not vice versa.

As you wish.

@@ -202,16 +202,15 @@ json_parse_identifier(struct json_path_parser *parser,
  int
  json_path_next(struct json_path_parser *parser, struct json_path_node 
*node)
  {
+       if (parser->offset == parser->src_len) {
+               node->type = JSON_PATH_END;
+               return 0;
+       }
         UChar32 c;
         int last_offset = parser->offset;
         int rc = json_read_symbol(parser, &c);
-       if (rc != 0) {
-               if (parser->offset == parser->src_len) {
-                       node->type = JSON_PATH_END;
-                       return 0;
-               }
+       if (rc != 0)
                 return rc;
-       }
         switch(c) {

>> +void
>> +test_basic()
>> +{
>> +	header();
>> +	plan(67);
>> +	const char *path;
>> +	int len;
>> +	struct json_path_parser parser;
>> +	struct json_path_node node;
>> +
>> +	reset_to_new_path("[0].field1.field2['field3'][5]");
>> +	is_next_index(3, 0);
>> +	is_next_key("field1");
>> +	is_next_key("field2");
>> +	is_next_key("field3");
>> +	is_next_index(3, 5);
>> +
>> +	reset_to_new_path("[3].field[2].field")
>> +	is_next_index(3, 3);
>> +	is_next_key("field");
>> +	is_next_index(3, 2);
>> +	is_next_key("field");
>> +
>> +	reset_to_new_path("[\"f1\"][\"f2'3'\"]");
>> +	is_next_key("f1");
>> +	is_next_key("f2'3'");
>> +
>> +	/* Support both '.field1...' and 'field1...'. */
> 
> The comment is misleading - there's no test case for 'field1...'

Thanks for the catch.

diff --git a/test/unit/json_path.c b/test/unit/json_path.c
index fb35bf898..1d7e7d372 100644
--- a/test/unit/json_path.c
+++ b/test/unit/json_path.c
@@ -26,7 +26,7 @@ void
  test_basic()
  {
         header();
-       plan(67);
+       plan(71);
         const char *path;
         int len;
         struct json_path_parser parser;
@@ -52,6 +52,8 @@ test_basic()
         /* Support both '.field1...' and 'field1...'. */
         reset_to_new_path(".field1");
         is_next_key("field1");
+       reset_to_new_path("field1");
+       is_next_key("field1");

> 
>> +	reset_to_new_path(".field1");
>> +	is_next_key("field1");
> 
> Shouldn't it return an empty string before 'field1' for such an input?

No, here everything is ok. My JSON path parser allows to start a path 
with '.' and it is not considered as an empty string + '.'. Exception is 
'.[...]' - it is an error (see tests in test_errors()).

I allowed '.identifier' syntax to lay emphasis that it is actually 
'tuple.identifier'. The JSON path is just a suffix for a tuple.

  reply	other threads:[~2018-04-10 18:42 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-10  8:31 [PATCH v3 0/3] Implement json path access to tuple fields Vladislav Shpilevoy
2018-04-10  8:31 ` [PATCH v3 1/3] Allow gcov on Mac Vladislav Shpilevoy
2018-04-10 10:49   ` Vladimir Davydov
     [not found] ` <f1302082b4457a03b5d2b3c44526815428de4a0e.1523349020.git.v.shpilevoy@tarantool.org>
2018-04-10 16:36   ` [PATCH v3 2/3] Introduce json_path_parser with Unicode support Vladimir Davydov
2018-04-10 18:42     ` Vladislav Shpilevoy [this message]
2018-04-11  9:20       ` [tarantool-patches] " Vladimir Davydov
     [not found] ` <c3b64b7b4e638970864995c895bbd5f736282560.1523349020.git.v.shpilevoy@tarantool.org>
2018-04-13 11:33   ` [PATCH v3 3/3] Lua: implement json path access to tuple fields Vladimir Davydov
2018-04-13 21:51     ` [tarantool-patches] " Vladislav Shpilevoy
2018-04-16  8:35       ` Vladimir Davydov
2018-04-16 10:02         ` Vladislav Shpilevoy
2018-04-22 14:21         ` Vladislav Shpilevoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a3a42dfb-4a38-d4c1-5186-225cbdae6bc0@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=kshcherbatov@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --cc=vdavydov.dev@gmail.com \
    --subject='Re: [tarantool-patches] Re: [PATCH v3 2/3] Introduce json_path_parser with Unicode support.' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox