From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id 478132FA7D for ; Wed, 14 Nov 2018 06:52:22 -0500 (EST) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 0pqK2LPDJAPt for ; Wed, 14 Nov 2018 06:52:22 -0500 (EST) Received: from smtpng3.m.smailru.net (smtpng3.m.smailru.net [94.100.177.149]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 6757C2FA7C for ; Wed, 14 Nov 2018 06:52:21 -0500 (EST) Subject: [tarantool-patches] Re: [PATCH v2 2/3] Introduce "none" and "binary" collations References: <8e9bca5d0f620a5cbd3bdffe521cecdfd1ad0ffd.1542066357.git.korablev@tarantool.org> <03A13ECD-6BA4-448F-8952-EE8E5C753130@tarantool.org> From: Vladislav Shpilevoy Message-ID: Date: Wed, 14 Nov 2018 14:52:16 +0300 MIME-Version: 1.0 In-Reply-To: <03A13ECD-6BA4-448F-8952-EE8E5C753130@tarantool.org> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-subscribe: List-owner: List-post: List-archive: To: "n.pettik" , tarantool-patches@freelists.org >>> + /* >>> + * In SQL all identifiers should be uppercased, so >>> + * to avoid mess lets simple search binary (since it is >>> + * sort of "special" collation) ignoring case at all. >>> + */ >> >> 2. I am not sure if it should be so special. I think, we should >> treat it just like any other collation. If tests have BINARY >> uppercased, then they should be fixed, I guess. > > Thing is in SQLite binary can be used without quotes, like: > > … COLLATE BiNaRy … > > And since ids are uppercased, this thing comes as “BINARY". > If I forced only lower case “binary”, a lot (>80 separate sub-tests, > 12 separate file tests) of tests would fail. Moreover, I see that similar > things take place at Lua-land (schema.cc): > > local function update_index_parts(format, parts) > ... > elseif k == 'collation' then > -- find ID by name > local coll = box.space._collation.index.name:get{v} > if not coll then > coll = box.space._collation.index.name:get{v:lower()} > end > > If you still think that we should allow only “binary” format, > I will do it alongside with tests in a separate commit. I still think that since 'binary' is even in _collation space now, it should be just a regular collation. If you do not want to change tests, then you can try to search any collation firstly as is, and if not found, then lowerify and search again, regardless of is it binary or not. My point is that we should not treat binary differently from others. Please, ask in the server team chat or consult Peter. Especially, if you like the idea of lowering any collation name if it is not found (just like we do in Lua as you decently highlighted). By the way, what if I write SQL like this? : SELECT ... WHERE ... COLLATE NONE ... Looks like this code: > + if (strcasecmp(name, "binary") == 0) > + p = coll_by_name("binary", strlen("binary")); > + else > + p = coll_by_name(name, strlen(name)); will find it ok, but maybe should not, is it? > diff --git a/test/sql/collation.test.lua b/test/sql/collation.test.lua > index 61df33a95..05f666f5e 100644 > --- a/test/sql/collation.test.lua > +++ b/test/sql/collation.test.lua > @@ -321,8 +319,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count) > for (uint32_t i = 0; i < part_count; i++) { > const struct key_part_def *part = &parts[i]; > int count = 2; > - if (part->type == FIELD_TYPE_STRING || > - part->type == FIELD_TYPE_SCALAR) > + if (part->coll_id != COLL_NONE) > count++; > if (part->is_nullable) > count++; > @@ -332,8 +329,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count) > assert(part->type < field_type_MAX); > size += mp_sizeof_str(strlen(PART_OPT_TYPE)); > size += mp_sizeof_str(strlen(field_type_strs[part->type])); > - if (part->type == FIELD_TYPE_STRING || > - part->type == FIELD_TYPE_SCALAR) { > + if (part->coll_id != COLL_NONE) { > size += mp_sizeof_str(strlen(PART_OPT_COLLATION)); > size += mp_sizeof_uint(part->coll_id); > } > @@ -352,8 +348,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts, > for (uint32_t i = 0; i < part_count; i++) { > const struct key_part_def *part = &parts[i]; > int count = 2; > - if (part->type == FIELD_TYPE_STRING || > - part->type == FIELD_TYPE_SCALAR) > + if (part->coll_id != COLL_NONE) > count++; > if (part->is_nullable) > count++; > @@ -366,8 +361,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts, > assert(part->type < field_type_MAX); > const char *type_str = field_type_strs[part->type]; > data = mp_encode_str(data, type_str, strlen(type_str)); > - if (part->type == FIELD_TYPE_STRING || > - part->type == FIELD_TYPE_SCALAR) { > + if (part->coll_id != COLL_NONE) { > data = mp_encode_str(data, PART_OPT_COLLATION, > strlen(PART_OPT_COLLATION)); > data = mp_encode_uint(data, part->coll_id); > > diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc > index efaa10da0..7cae436f1 100644 > --- a/src/box/lua/space.cc > +++ b/src/box/lua/space.cc > @@ -299,15 +299,12 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i) > lua_pushboolean(L, key_part_is_nullable(part)); > lua_setfield(L, -2, "is_nullable"); > > - if (part->type == FIELD_TYPE_STRING || > - part->type == FIELD_TYPE_SCALAR) { > - if (! space_is_system(space)) { > - struct coll_id *coll_id = > - coll_by_id(part->coll_id); > - assert(coll_id != NULL); > - lua_pushstring(L, coll_id->name); > - lua_setfield(L, -2, "collation"); > - } > + if (part->coll_id != COLL_NONE) { > + struct coll_id *coll_id = > + coll_by_id(part->coll_id); > + assert(coll_id != NULL); > + lua_pushstring(L, coll_id->name); > + lua_setfield(L, -2, "collation"); > } > > In other words, each field with type of STRING/SCALAR and mb ANY > would have real pointer to collation. However, in turn such change > would break too many tests, without significant refactoring of SQL > query planner (at least I failed to do it in several hours). > On the other hand, in this case we can remove COLL_NONE > and unify collation interface: each field or key part MUST feature > collation. And I mean not only valid ID (it is how it works now), > but also valid (i.e. != NULL) pointer to collation struct. So, it seems > that we would be able to remove any checks like: > > if (coll != NULL) > *use collation* > else > *process without involving collations* > > Would become: > > If (type_is_string_like) > *always use collation* > > To be honest I was sure that you would notice this pickle… > Or current approach is not dat bad? > Yeah, I understand your aspiration, but I have some remarks about the draft patch above and about the concept as it is: * you said that you want to get rid of COLL_NONE, but in the draft above you still use it. Even if you had removed COLL_NONE usage from key_def builders, it would have stayed in alter.cc as a check that you can not drop 'none' collation. I think that COLL_NONE as a constant equal to 0 should not be removed from source code until it is not used at all. But it is a minor comment rather about coding style: I prefer code like 'id != UNDERSTANDABLE_NAME' to 'id != '. * the main flaw in storing a collation for each string is that it will dramatically slow down comparators. The awful problem is in virtual functions calling - it is much slower than regular functions. Now we have two levels of avoiding coll->cmp() calling. First in *_compare_create functions. If a key has no collations, we can use highly specified template recursive comparators (struct TupleCompare, struct TupleCompareWithKey) which call memcmp. Second, we are trying to do not call coll->cmp() in other comparators (see mp_compare_str). Yes, we still can follow you, store a collation object in each key_part, tuple_field, field_def, but in tuple_compare.cc we still will need a way how to check that a collation comparator can be inlined into memcmp. This way is check coll->id == COLL_NONE. Now it is like use_fast_cmp = coll == NULL Will be like use_fast_cmp = coll->id == COLL_NONE As a summary: we can store a collation in each *_field/part object to simplify SQL source code, but it does not affect necessity of a special COLL_NONE constant. Also, I see now a new minor problem: > diff --git a/src/box/lua/upgrade.lua b/src/box/lua/upgrade.lua > index 64f74f9d3..a9525058b 100644 > --- a/src/box/lua/upgrade.lua > +++ b/src/box/lua/upgrade.lua > @@ -396,8 +396,10 @@ local function create_collation_space() > box.space._index:insert{_collation.id, 1, 'name', 'tree', {unique = true}, {{1, 'string'}}} > > log.info("create predefined collations") > + box.space._collation:replace{0, "none", ADMIN, "BINARY", "", setmap{}} > box.space._collation:replace{1, "unicode", ADMIN, "ICU", "", setmap{}} > box.space._collation:replace{2, "unicode_ci", ADMIN, "ICU", "", {strength='primary'}} > + box.space._collation:replace{3, "binary", ADMIN, "BINARY", "", setmap{}} > > local _priv = box.space[box.schema.PRIV_ID] > _priv:insert{ADMIN, PUBLIC, 'space', _collation.id, box.priv.W} Unfortunately, we can not change old update() functions. So the only way to add these new collations is to add a new code to upgrade_to_2_1_0() with these two new lines.