From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 3148120E1E
	for <tarantool-patches@freelists.org>; Wed, 23 May 2018 06:29:41 -0400 (EDT)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id VghH9haq8yJL for <tarantool-patches@freelists.org>;
	Wed, 23 May 2018 06:29:41 -0400 (EDT)
Received: from smtp42.i.mail.ru (smtp42.i.mail.ru [94.100.177.102])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 745FC20D47
	for <tarantool-patches@freelists.org>; Wed, 23 May 2018 06:29:40 -0400 (EDT)
Subject: [tarantool-patches] Re: [PATCH] sql: allow any space symbols to
 be a white space
References: <d0e9be722970a5b5c5880c59962533da06da2e65.1527004167.git.kyukhin@tarantool.org>
 <e69e8e43-c483-8800-60b2-9f599bc2a397@tarantool.org>
 <20180523051500.3fx5uoetyzsjkwp5@tarantool.org>
 <20180523055408.6axbc53uucslcu4h@tarantool.org>
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Message-ID: <de0ebe1c-cdb6-e622-d02e-6a6a5163b679@tarantool.org>
Date: Wed, 23 May 2018 13:29:38 +0300
MIME-Version: 1.0
In-Reply-To: <20180523055408.6axbc53uucslcu4h@tarantool.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: tarantool-patches@freelists.org, Kirill Yukhin <kyukhin@tarantool.org>

Hello. Thanks for the fixes! I see, that you changed branch to
kyukhin/gh-2371-utf8-spaces-check, right?

And that you added a separate commit to link with ICU. But with no
linking the first commit does not work. Maybe you should squash them?

See 4 comments below.

> diff --git a/src/box/sql/tokenize.c b/src/box/sql/tokenize.c
> index c77aa9b..4c01066 100644
> --- a/src/box/sql/tokenize.c
> +++ b/src/box/sql/tokenize.c
> @@ -36,17 +36,21 @@
>    * individual tokens and sends those tokens one-by-one over to the
>    * parser for analysis.
>    */
> -#include "sqliteInt.h"
>   #include <stdlib.h>
> +#include <unicode/utf8.h>
> +#include <unicode/uchar.h>
> +
>   #include "say.h"
> +#include "sqliteInt.h"
>   
>   /* Character classes for tokenizing
>    *
> - * In the sqlite3GetToken() function, a switch() on aiClass[c] is implemented
> - * using a lookup table, whereas a switch() directly on c uses a binary search.
> - * The lookup table is much faster.  To maximize speed, and to ensure that
> - * a lookup table is used, all of the classes need to be small integers and
> - * all of them need to be used within the switch.
> + * In the sql_token() function, a switch() on sql_ascii[c] is

1. No sql_ascii.
> @@ -167,360 +123,295 @@ const unsigned char ebcdicToAscii[] = {
>    */
>   #include "keywordhash.h"
>   
> -/*
> - * If X is a character that can be used in an identifier then
> - * IdChar(X) will be true.  Otherwise it is false.
> - *
> - * For ASCII, any character with the high-order bit set is
> - * allowed in an identifier.  For 7-bit characters,
> - * sqlite3IsIdChar[X] must be 1.
> - *
> - * For EBCDIC, the rules are more complex but have the same
> - * end result.
> +#define maybe_utf8(c) ((sqlite3CtypeMap[c] & 0x40) != 0)
> +
> +/**
> + * Return true if current symbol is space.
>    *
> - * Ticket #1066.  the SQL standard does not allow '$' in the
> - * middle of identifiers.  But many SQL implementations do.
> - * SQLite will allow '$' in identifiers for compatibility.
> - * But the feature is undocumented.
> + * @param z Input stream.
> + * @retval True if current symbo1l space.

2. symbo1l -> symbol is.
> diff --git a/test/sql-tap/gh-2371-utf8-spaces.test.lua b/test/sql-tap/gh-2371-utf8-spaces.test.lua

3. Can you do not create a new test file specially for the issue? I believe we will
support unicode not only as white spaces, so maybe worth to create a more common
file test/sql/unicode.test.lua.

> new file mode 100755
> index 0000000..191cc1c
> --- /dev/null
> +++ b/test/sql-tap/gh-2371-utf8-spaces.test.lua
> @@ -0,0 +1,37 @@
> +#!/usr/bin/env tarantool
> +test = require("sqltester")
> +test:plan(23 * 3)
> +
> +-- 23 entities
> +local utf8_spaces = {"\u{0009}", "\u{000A}", "\u{000B}", "\u{000C}", "\u{000D}",
> +                     "\u{0085}", "\u{1680}", "\u{2000}", "\u{2001}", "\u{2002}",
> +                     "\u{2003}", "\u{2004}", "\u{2005}", "\u{2006}", "\u{2007}",
> +                     "\u{2008}", "\u{2009}", "\u{200A}", "\u{2028}", "\u{2029}",
> +                     "\u{202F}", "\u{205F}", "\u{3000}"}
> +local spaces_cnt = 23
> +
> +-- 1. Check UTF-8 single space
> +for i, v in pairs(utf8_spaces) do
> +    test:do_execsql_test(
> +    	"utf8-spaces-1."..i,
> +    	"select" .. v .. "1",
> +    	{ 1 })

4. Here and 2 same places below you use on the same line 4 spaces + 1 tab with 8 width.
Please, use only spaces in .lua files.

> +end
> +
> +-- 2. Check pair simple + UTF-8 space
> +for i, v in pairs(utf8_spaces) do
> +    test:do_execsql_test(
> +    	"utf8-spaces-2."..i,
> +    	"select" .. v .. "1",
> +    	{ 1 })
> +end
> +
> +-- 3. Sequence of spaces
> +for i, v in pairs(utf8_spaces) do
> +    test:do_execsql_test(
> +    	"utf8-spaces-3."..i,
> +    	"select" .. v .. " " .. utf8_spaces[spaces_cnt - i + 1] .. " 1",
> +    	{ 1 })
> +end
> +
> +test:finish_test()
> 
>