[tarantool-patches] Re: [PATCH v2 2/3] Introduce "none" and "binary" collations

Wed Nov 14 01:37:37 MSK 2018

> On 13 Nov 2018, at 15:02, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> wrote:
> 
> Hi! Thanks for the fixes!
> 
> Please, remember to put branch/issue link
> into a covering letter next time.

I’m sorry, it was late at night...

> See 5 comments below, my fixes at the end
> of the email and on the branch.
> 
>> diff --git a/src/box/alter.cc b/src/box/alter.cc
>> index 6d2c59bbc..2eb9f53e8 100644
>> --- a/src/box/alter.cc
>> +++ b/src/box/alter.cc
>> @@ -403,7 +403,7 @@ field_def_decode(struct field_def *field, const char **data,
>>  				     "nullable action properties", fieldno +
>>  				     TUPLE_INDEX_BASE));
>>  	}
>> -	if (field->coll_id != COLL_NONE &&
>> +	if (field->coll_id != 0 &&
> 
> 1. Why not to leave enum COLL_NONE, but change its value to 0?
> I guess, it is a purpose of define/enum features - define a name
> for a constant to be able to change the value under the hood. It
> could have reduced the diff.

For me it looks more convenient. But I don’t mind applying your fixes.
But there is one more point to avoid using COLL_NONE.
See my comment at the end of review.

>>  	    field->type != FIELD_TYPE_STRING &&
>>  	    field->type != FIELD_TYPE_SCALAR &&
>>  	    field->type != FIELD_TYPE_ANY) {
>> diff --git a/src/box/sql/callback.c b/src/box/sql/callback.c
>> index 3cf3a835d..352745e0e 100644
>> --- a/src/box/sql/callback.c
>> +++ b/src/box/sql/callback.c
>> @@ -42,13 +42,22 @@
>>  struct coll *
>>  sql_get_coll_seq(Parse *parser, const char *name, uint32_t *coll_id)
>>  {
>> -	if (name == NULL || strcasecmp(name, "binary") == 0) {
>> -		*coll_id = COLL_NONE;
>> -		return NULL;
>> +	if (name == NULL) {
>> +		*coll_id = 0;
>> +		return coll_by_id(0)->coll;
>>  	}
>> -	struct coll_id *p = coll_by_name(name, strlen(name));
>> +	struct coll_id *p;
>> +	/*
>> +	 * In SQL all identifiers should be uppercased, so
>> +	 * to avoid mess lets simple search binary (since it is
>> +	 * sort of "special" collation) ignoring case at all.
>> +	 */
> 
> 2. I am not sure if it should be so special. I think, we should
> treat it just like any other collation. If tests have BINARY
> uppercased, then they should be fixed, I guess.

Thing is in SQLite binary can be used without quotes, like:

… COLLATE BiNaRy …

And since ids are uppercased, this thing comes as “BINARY".
If I forced only lower case “binary”, a lot (>80 separate sub-tests,
12 separate file tests) of tests would fail. Moreover, I see that similar
things take place at Lua-land (schema.cc):

local function update_index_parts(format, parts)
...
elseif k == 'collation' then
    -- find ID by name
    local coll = box.space._collation.index.name:get{v}
    if not coll then
        coll = box.space._collation.index.name:get{v:lower()}
    end

If you still think that we should allow only “binary” format,
I will do it alongside with tests in a separate commit.

>> +	if (strcasecmp(name, "binary") == 0)
>> +		p = coll_by_name("binary", strlen("binary"));
>> +	else
>> +		p = coll_by_name(name, strlen(name));
>>  	if (p == NULL) {
>> -		*coll_id = COLL_NONE;
>> +		*coll_id = 0;
>>  		sqlite3ErrorMsg(parser, "no such collation sequence: %s",
>>  				name);
>>  		return NULL;
>> diff --git a/src/box/sql/func.c b/src/box/sql/func.c
>> index 8c34cbb3d..580cf1e60 100644
>> --- a/src/box/sql/func.c
>> +++ b/src/box/sql/func.c
>> @@ -506,7 +506,7 @@ case_type##ICUFunc(sqlite3_context *context, int argc, sqlite3_value **argv)   \
>>  	UErrorCode status = U_ZERO_ERROR;                                      \
>>  	struct coll *coll = sqlite3GetFuncCollSeq(context);                    \
>>  	const char *locale = NULL;                                             \
>> -	if (coll != NULL) {                                                    \
>> +	if (coll != NULL && coll->collator != NULL) {                          \
> 
> 3. Why do you use coll->collator != 0 instead of coll->type == COLL_TYPE_ICU?
> I do not think it is safe since in future we can move collator into a union,
> which will not be NULL for a non-ICU collation.

Yep, you are absolutely right, I should use type of collation, instead of
checking existence of collator. Fixed:

diff --git a/src/box/sql/func.c b/src/box/sql/func.c
index 580cf1e60..a2ac9ba2d 100644
--- a/src/box/sql/func.c
+++ b/src/box/sql/func.c
@@ -506,7 +506,7 @@ case_type##ICUFunc(sqlite3_context *context, int argc, sqlite3_value **argv)   \
        UErrorCode status = U_ZERO_ERROR;                                      \
        struct coll *coll = sqlite3GetFuncCollSeq(context);                    \
        const char *locale = NULL;                                             \
-       if (coll != NULL && coll->collator != NULL) {                          \
+       if (coll != NULL && coll->type == COLL_TYPE_ICU) {                     \
                locale = ucol_getLocaleByType(coll->collator,                  \
                                              ULOC_VALID_LOCALE, &status);     \
        }                                                                      \

> 
>>  		locale = ucol_getLocaleByType(coll->collator,                  \
>>  					      ULOC_VALID_LOCALE, &status);     \
>>  	}        > diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
>> index 5bdf102e7..8e5aea51b 100644
>> --- a/src/box/tuple_format.c
>> +++ b/src/box/tuple_format.c
>> @@ -40,7 +40,7 @@ static uint32_t formats_size = 0, formats_capacity = 0;
>>  static const struct tuple_field tuple_field_default = {
>>        FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false,
>> -       ON_CONFLICT_ACTION_NONE, NULL, COLL_NONE,
>> +       ON_CONFLICT_ACTION_DEFAULT, NULL, 0,
>> };
> 
> 4. Are you sure that changing on_conflict_action belongs
> to this patch?

No, it doesn’t. I guess it is an artefact of rebasing.
Glad that you caught it. Fixed.

diff --git a/src/box/tuple_format.c b/src/box/tuple_format.c
index 8e5aea51b..badffc701 100644
--- a/src/box/tuple_format.c
+++ b/src/box/tuple_format.c
@@ -40,7 +40,7 @@ static uint32_t formats_size = 0, formats_capacity = 0;
 
 static const struct tuple_field tuple_field_default = {
        FIELD_TYPE_ANY, TUPLE_OFFSET_SLOT_NIL, false,
-       ON_CONFLICT_ACTION_DEFAULT, NULL, 0,
+       ON_CONFLICT_ACTION_NONE, NULL, 0,

> 
> 5. How about tests like this one?
> 
> box.schema.create_space('test', {format = {{1, 'unsigned'}, {2, 'string', collation = 'binary'}}})
> 
> I mean specifying collations via Lua. And what happens if I use collation = 'none'
> in a space format or index parts? I think, it should throw an error, but it is up
> to Kostja or Kirill.

Hmm, personally I see no reason why an error should be raised.
If you specify “none” collation, then it will act as an absence of collation.
That is simple. Anyway, added tests on binary collation:

diff --git a/test/sql/collation.result b/test/sql/collation.result
index 3df4cfa70..b42d3e034 100644
--- a/test/sql/collation.result
+++ b/test/sql/collation.result
@@ -123,6 +123,49 @@ box.space.T:format()[3]['collation']
 box.sql.execute("DROP TABLE t;")
 ---
 ...
+-- BINARY collation works in the same way as there is no collation
+-- at all.
+--
+t = box.schema.create_space('test', {format = {{'id', 'unsigned'}, {'a', 'string', collation = 'binary'}}})
+---
+...
+t:format()[2]['collation']
+---
+- 3
+...
+pk = t:create_index('primary', {parts = {1}})
+---
+...
+i = t:create_index('secondary', {parts = {2, 'str', collation='binary'}})
+---
+...
+t:insert({1, 'AsD'})
+---
+- [1, 'AsD']
+...
+t:insert({2, 'ASD'})
+---
+- [2, 'ASD']
+...
+t:insert({3, 'asd'})
+---
+- [3, 'asd']
+...
+i:select('asd')
+---
+- - [3, 'asd']
+...
+i:select('ASD')
+---
+- - [2, 'ASD']
+...
+i:select('AsD')
+---
+- - [1, 'AsD']
+...
+t:drop()
+---
+...
 -- Collation with id == 0 is "none". It used to unify interaction
 -- with collation interface. It also can't be dropped.
 --
diff --git a/test/sql/collation.test.lua b/test/sql/collation.test.lua
index 61df33a95..05f666f5e 100644
--- a/test/sql/collation.test.lua
+++ b/test/sql/collation.test.lua
@@ -50,6 +50,21 @@ box.space.T:format()[2]['collation']
 box.space.T:format()[3]['collation']
 box.sql.execute("DROP TABLE t;")
 
+-- BINARY collation works in the same way as there is no collation
+-- at all.
+--
+t = box.schema.create_space('test', {format = {{'id', 'unsigned'}, {'a', 'string', collation = 'binary'}}})
+t:format()[2]['collation']
+pk = t:create_index('primary', {parts = {1}})
+i = t:create_index('secondary', {parts = {2, 'str', collation='binary'}})
+t:insert({1, 'AsD'})
+t:insert({2, 'ASD'})
+t:insert({3, 'asd'})
+i:select('asd')
+i:select('ASD')
+i:select('AsD')
+t:drop()
+

I would like to reveal my main concern about this patch:
pointer to “none” collation is not set in key_def/field_def, even if “none”
collation is forced. The way I wanted to implement this patch is like:

@@ -174,17 +174,15 @@ key_def_new(const struct key_part_def *parts, uint32_t part_count)
        for (uint32_t i = 0; i < part_count; i++) {
                const struct key_part_def *part = &parts[i];
                struct coll *coll = NULL;
-               if (part->type == FIELD_TYPE_SCALAR ||
-                   part->type == FIELD_TYPE_STRING) {
+               if (part->coll_id != COLL_NONE) {
                        struct coll_id *coll_id = coll_by_id(part->coll_id);
-                       if (part->coll_id != 0 && coll_id == NULL) {
+                       if (coll_id == NULL) {
                                diag_set(ClientError, ER_WRONG_INDEX_OPTIONS,
                                         i + 1, "collation was not found by ID");
                                key_def_delete(def);
                                return NULL;
                        }
-                       if (coll_id != NULL)
-                               coll = coll_id->coll;
+                       coll = coll_id->coll;
                }
                key_def_set_part(def, i, part->fieldno, part->type,
                                 part->nullable_action, coll, part->coll_id,

@@ -321,8 +319,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
        for (uint32_t i = 0; i < part_count; i++) {
                const struct key_part_def *part = &parts[i];
                int count = 2;
-               if (part->type == FIELD_TYPE_STRING ||
-                   part->type == FIELD_TYPE_SCALAR)
+               if (part->coll_id != COLL_NONE)
                        count++;
                if (part->is_nullable)
                        count++;
@@ -332,8 +329,7 @@ key_def_sizeof_parts(const struct key_part_def *parts, uint32_t part_count)
                assert(part->type < field_type_MAX);
                size += mp_sizeof_str(strlen(PART_OPT_TYPE));
                size += mp_sizeof_str(strlen(field_type_strs[part->type]));
-               if (part->type == FIELD_TYPE_STRING ||
-                   part->type == FIELD_TYPE_SCALAR) {
+               if (part->coll_id != COLL_NONE) {
                        size += mp_sizeof_str(strlen(PART_OPT_COLLATION));
                        size += mp_sizeof_uint(part->coll_id);
                }
@@ -352,8 +348,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
        for (uint32_t i = 0; i < part_count; i++) {
                const struct key_part_def *part = &parts[i];
                int count = 2;
-               if (part->type == FIELD_TYPE_STRING ||
-                   part->type == FIELD_TYPE_SCALAR)
+               if (part->coll_id != COLL_NONE)
                        count++;
                if (part->is_nullable)
                        count++;
@@ -366,8 +361,7 @@ key_def_encode_parts(char *data, const struct key_part_def *parts,
                assert(part->type < field_type_MAX);
                const char *type_str = field_type_strs[part->type];
                data = mp_encode_str(data, type_str, strlen(type_str));
-               if (part->type == FIELD_TYPE_STRING ||
-                   part->type == FIELD_TYPE_SCALAR) {
+               if (part->coll_id != COLL_NONE) {
                        data = mp_encode_str(data, PART_OPT_COLLATION,
                                             strlen(PART_OPT_COLLATION));
                        data = mp_encode_uint(data, part->coll_id);

diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index efaa10da0..7cae436f1 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -299,15 +299,12 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i)
                        lua_pushboolean(L, key_part_is_nullable(part));
                        lua_setfield(L, -2, "is_nullable");
 
-                       if (part->type == FIELD_TYPE_STRING ||
-                           part->type == FIELD_TYPE_SCALAR) {
-                               if (! space_is_system(space)) {
-                                       struct coll_id *coll_id =
-                                               coll_by_id(part->coll_id);
-                                       assert(coll_id != NULL);
-                                       lua_pushstring(L, coll_id->name);
-                                       lua_setfield(L, -2, "collation");
-                               }
+                       if (part->coll_id != COLL_NONE) {
+                               struct coll_id *coll_id =
+                                       coll_by_id(part->coll_id);
+                               assert(coll_id != NULL);
+                               lua_pushstring(L, coll_id->name);
+                               lua_setfield(L, -2, "collation");
                        }

In other words, each field with type of STRING/SCALAR and mb ANY
would have real pointer to collation. However, in turn such change
would break too many tests, without significant refactoring of SQL
query planner (at least I failed to do it in several hours).
On the other hand, in this case we can remove COLL_NONE
and unify collation interface: each field or key part MUST feature
collation. And I mean not only valid ID (it is how it works now),
but also valid (i.e. != NULL) pointer to collation struct. So, it seems
that we would be able to remove any checks like:

if (coll != NULL)
	*use collation*
else
	*process without involving collations*

Would become:

If (type_is_string_like)
	*always use collation*

To be honest I was sure that you would notice this pickle…
Or current approach is not dat bad?