From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tarantool-patches-bounce@freelists.org>
Received: from localhost (localhost [127.0.0.1])
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 478132FA7D
	for <tarantool-patches@freelists.org>; Wed, 14 Nov 2018 06:52:22 -0500 (EST)
Received: from turing.freelists.org ([127.0.0.1])
	by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024)
	with ESMTP id 0pqK2LPDJAPt for <tarantool-patches@freelists.org>;
	Wed, 14 Nov 2018 06:52:22 -0500 (EST)
Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 6757C2FA7C
	for <tarantool-patches@freelists.org>; Wed, 14 Nov 2018 06:52:21 -0500 (EST)
Subject: [tarantool-patches] Re: [PATCH v2 2/3] Introduce "none" and
 "binary" collations
References: <cover.1542066357.git.korablev@tarantool.org>
 <8e9bca5d0f620a5cbd3bdffe521cecdfd1ad0ffd.1542066357.git.korablev@tarantool.org>
 <e1e6f1ad-a2bc-58f3-f6ce-6cbacd14dce0@tarantool.org>
 <03A13ECD-6BA4-448F-8952-EE8E5C753130@tarantool.org>
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Message-ID: <e86922fe-baa6-4559-d64e-44c6a352936e@tarantool.org>
Date: Wed, 14 Nov 2018 14:52:16 +0300
MIME-Version: 1.0
In-Reply-To: <03A13ECD-6BA4-448F-8952-EE8E5C753130@tarantool.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: tarantool-patches-bounce@freelists.org
Errors-to: tarantool-patches-bounce@freelists.org
Reply-To: tarantool-patches@freelists.org
List-help: <mailto:ecartis@freelists.org?Subject=help>
List-unsubscribe: <tarantool-patches-request@freelists.org?Subject=unsubscribe>
List-software: Ecartis version 1.0.0
List-Id: tarantool-patches <tarantool-patches.freelists.org>
List-subscribe: <tarantool-patches-request@freelists.org?Subject=subscribe>
List-owner: <mailto:>
List-post: <mailto:tarantool-patches@freelists.org>
List-archive: <http://www.freelists.org/archives/tarantool-patches>
To: "n.pettik" <korablev@tarantool.org>, tarantool-patches@freelists.org


>>> +	/*
>>> +	 * In SQL all identifiers should be uppercased, so
>>> +	 * to avoid mess lets simple search binary (since it is
>>> +	 * sort of "special" collation) ignoring case at all.
>>> +	 */
>>
>> 2. I am not sure if it should be so special. I think, we should
>> treat it just like any other collation. If tests have BINARY
>> uppercased, then they should be fixed, I guess.
> 
> Thing is in SQLite binary can be used without quotes, like:
> 
> … COLLATE BiNaRy …
> 
> And since ids are uppercased, this thing comes as “BINARY".
> If I forced only lower case “binary”, a lot (>80 separate sub-tests,
> 12 separate file tests) of tests would fail. Moreover, I see that similar
> things take place at Lua-land (schema.cc):
> 
> local function update_index_parts(format, parts)
> ...
> elseif k == 'collation' then
>      -- find ID by name
>      local coll = box.space._collation.index.name:get{v}
>      if not coll then
>          coll = box.space._collation.index.name:get{v:lower()}
>      end
> 
> If you still think that we should allow only “binary” format,
> I will do it alongside with tests in a separate commit.

I still think that since 'binary' is even in _collation space
now, it should be just a regular collation. If you do not want
to change tests, then you can try to search any collation firstly
as is, and if not found, then lowerify and search again, regardless
of is it binary or not.

My point is that we should not treat binary differently from
others. Please, ask in the server team chat or consult Peter.
Especially, if you like the idea of lowering any collation name
if it is not found (just like we do in Lua as you decently
highlighted).

By the way, what if I write SQL like this? :

     SELECT ... WHERE  ... COLLATE NONE ...

Looks like this code:

> +       if (strcasecmp(name, "binary") == 0)
> +               p = coll_by_name("binary", strlen("binary"));
> +       else
> +               p = coll_by_name(name, strlen(name));

will find it ok, but maybe should not, is it?

> diff --git a/test/sql/collation.test.lua b/test/sql/collation.test.lua
> index 61df33a95..05f666f5e 100644
> --- a/test/sql/collation.test.lua
> +++ b/test/sql/collation.test.lua
> @@ -321,8 +319,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
>          for (uint32_t i = 0; i < part_count; i++) {
>                  const struct key_part_def *part = &parts[i];
>                  int count = 2;
> -               if (part->type == FIELD_TYPE_STRING ||
> -                   part->type == FIELD_TYPE_SCALAR)
> +               if (part->coll_id != COLL_NONE)
>                          count++;
>                  if (part->is_nullable)
>                          count++;
> @@ -332,8 +329,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
>                  assert(part->type < field_type_MAX);
>                  size += mp_sizeof_str(strlen(PART_OPT_TYPE));
>                  size += mp_sizeof_str(strlen(field_type_strs[part->type]));
> -               if (part->type == FIELD_TYPE_STRING ||
> -                   part->type == FIELD_TYPE_SCALAR) {
> +               if (part->coll_id != COLL_NONE) {
>                          size += mp_sizeof_str(strlen(PART_OPT_COLLATION));
>                          size += mp_sizeof_uint(part->coll_id);
>                  }
> @@ -352,8 +348,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
>          for (uint32_t i = 0; i < part_count; i++) {
>                  const struct key_part_def *part = &parts[i];
>                  int count = 2;
> -               if (part->type == FIELD_TYPE_STRING ||
> -                   part->type == FIELD_TYPE_SCALAR)
> +               if (part->coll_id != COLL_NONE)
>                          count++;
>                  if (part->is_nullable)
>                          count++;
> @@ -366,8 +361,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
>                  assert(part->type < field_type_MAX);
>                  const char *type_str = field_type_strs[part->type];
>                  data = mp_encode_str(data, type_str, strlen(type_str));
> -               if (part->type == FIELD_TYPE_STRING ||
> -                   part->type == FIELD_TYPE_SCALAR) {
> +               if (part->coll_id != COLL_NONE) {
>                          data = mp_encode_str(data, PART_OPT_COLLATION,
>                                               strlen(PART_OPT_COLLATION));
>                          data = mp_encode_uint(data, part->coll_id);
> 
> diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
> index efaa10da0..7cae436f1 100644
> --- a/src/box/lua/space.cc
> +++ b/src/box/lua/space.cc
> @@ -299,15 +299,12 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i)
>                          lua_pushboolean(L, key_part_is_nullable(part));
>                          lua_setfield(L, -2, "is_nullable");
>   
> -                       if (part->type == FIELD_TYPE_STRING ||
> -                           part->type == FIELD_TYPE_SCALAR) {
> -                               if (! space_is_system(space)) {
> -                                       struct coll_id *coll_id =
> -                                               coll_by_id(part->coll_id);
> -                                       assert(coll_id != NULL);
> -                                       lua_pushstring(L, coll_id->name);
> -                                       lua_setfield(L, -2, "collation");
> -                               }
> +                       if (part->coll_id != COLL_NONE) {
> +                               struct coll_id *coll_id =
> +                                       coll_by_id(part->coll_id);
> +                               assert(coll_id != NULL);
> +                               lua_pushstring(L, coll_id->name);
> +                               lua_setfield(L, -2, "collation");
>                          }
> 
> In other words, each field with type of STRING/SCALAR and mb ANY
> would have real pointer to collation. However, in turn such change
> would break too many tests, without significant refactoring of SQL
> query planner (at least I failed to do it in several hours).
> On the other hand, in this case we can remove COLL_NONE
> and unify collation interface: each field or key part MUST feature
> collation. And I mean not only valid ID (it is how it works now),
> but also valid (i.e. != NULL) pointer to collation struct. So, it seems
> that we would be able to remove any checks like:
> 
> if (coll != NULL)
> 	*use collation*
> else
> 	*process without involving collations*
> 
> Would become:
> 
> If (type_is_string_like)
> 	*always use collation*
> 
> To be honest I was sure that you would notice this pickle…
> Or current approach is not dat bad?
> 

Yeah, I understand your aspiration, but I have some remarks
about the draft patch above and about the concept as it is:

* you said that you want to get rid of COLL_NONE, but in the draft
   above you still use it. Even if you had removed COLL_NONE
   usage from key_def builders, it would have stayed in alter.cc
   as a check that you can not drop 'none' collation. I think that
   COLL_NONE as a constant equal to 0 should not be removed from
   source code until it is not used at all. But it is a minor
   comment rather about coding style: I prefer code like
   'id != UNDERSTANDABLE_NAME' to 'id != <strange_constant>'.

* the main flaw in storing a collation for each string is that it
   will dramatically slow down comparators. The awful problem is in
   virtual functions calling - it is much slower than regular
   functions. Now we have two levels of avoiding coll->cmp() calling.
   First in *_compare_create functions. If a key has no collations,
   we can use highly specified template recursive comparators
   (struct TupleCompare, struct TupleCompareWithKey) which call
   memcmp. Second, we are trying to do not call coll->cmp() in other
   comparators (see mp_compare_str).

   Yes, we still can follow you, store a collation object in each
   key_part, tuple_field, field_def, but in tuple_compare.cc we
   still will need a way how to check that a collation comparator
   can be inlined into memcmp. This way is check coll->id == COLL_NONE.
   Now it is like

       use_fast_cmp = coll == NULL

   Will be like

       use_fast_cmp = coll->id == COLL_NONE

   As a summary: we can store a collation in each *_field/part object
   to simplify SQL source code, but it does not affect necessity of
   a special COLL_NONE constant.


Also, I see now a new minor problem:

> diff --git a/src/box/lua/upgrade.lua b/src/box/lua/upgrade.lua
> index 64f74f9d3..a9525058b 100644
> --- a/src/box/lua/upgrade.lua
> +++ b/src/box/lua/upgrade.lua
> @@ -396,8 +396,10 @@ local function create_collation_space()
>      box.space._index:insert{_collation.id, 1, 'name', 'tree', {unique = true}, {{1, 'string'}}}
>  
>      log.info("create predefined collations")
> +    box.space._collation:replace{0, "none", ADMIN, "BINARY", "", setmap{}}
>      box.space._collation:replace{1, "unicode", ADMIN, "ICU", "", setmap{}}
>      box.space._collation:replace{2, "unicode_ci", ADMIN, "ICU", "", {strength='primary'}}
> +    box.space._collation:replace{3, "binary", ADMIN, "BINARY", "", setmap{}}
>  
>      local _priv = box.space[box.schema.PRIV_ID]
>      _priv:insert{ADMIN, PUBLIC, 'space', _collation.id, box.priv.W}

Unfortunately, we can not change old update() functions. So the only way to add
these new collations is to add a new code to upgrade_to_2_1_0() with these two
new lines.