[Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features

Wed Jul 1 03:15:44 MSK 2020

(CCed all persons who is mentioned below.)

You all know it, but I cannot keep myself from saying again that I'm not
fan on the idea of the built-in/enabled-by-default reporting tool.
Especially when it is desined in the way that a user don't have a
guarantee that its IP address is not collected. It is not in the
opensource spirit and is not kindly to our users (to ones who care about
privacy).

I understand why the organization need it: it give some view on
community -- what is useful and what is not. It may help when we, say,
prioritize problems. But this fact does not change anything: garthering
IP addresses and some information without expicit agreement does not
become a friendly action.

And while we're on the topic, there was the idea from Mons that it would
feel more comfortable if we would offer something useful for a user
together with the reporting tool: say, provide information about
critical / security problems in a running version if it is affected. Or
feed news about important updates in functionality that is relevant for
a particular application. I would add from myself that this way the
reporting tool not even necessarily must be enabled default to be useful
for us: there is a motivation to enable it. Win-win. Collect only from
ones who explicitly said that is okay with it.

We also can structurize and improve our documentation and see what
topics are more popular and what are less. Collect download statistics
from repositories, rocks and git clones. Here a user explicitly asked an
action with a remote server and (s)he most likely know that the remote
server may count it.

Pasted actual version to comment it inline. Nothing critical in fact.

WBR, Alexander Turenko.

>       "feedback_type": "version",

Was it planned to report several maps with different types? If we
decided to report everything within one map, then this 'feedback_type'
field has no sense anymore and just confusing, right? If so, then maybe
it is the time to remove it and bump 'feedback_version'?

> +local function is_system_space(sp)
> +    return box.schema.SYSTEM_ID_MIN <= sp.id and
> +           sp.id <= box.schema.SYSTEM_ID_MAX
> +end

Nit: a space object is named 'space' below, but 'sp' above. I agree with
Max and generally prefer to avoid abbreviations.

> +local function is_jsonpath_index(idx)
> +    for _, part in pairs(idx.parts) do
> +        if part.path ~= nil then
> +            return true
> +        end
> +    end
> +
> +    return false
> +end

I agree with Oleg's point: it worth to count multikey indices too. I
don't know much, but what I understood from the commit message [1]:

- Any multikey index is JSON path index.
- A multikey index has '[*]' in 'path'.

Looks quite simple to count if I don't miss anything. Please, verify
and/or ask about it someone who knows more.

[1]: https://github.com/tarantool/tarantool/commit/f1d9f2575c11afcf3b0368d38f44ae399cd7e045

> +local function fill_in_schema_stats_impl(schema)
> +    local spaces = {
> +        memtx     = 0,
> +        vinyl     = 0,
> +        temporary = 0,
> +        ['local'] = 0,
> +    }
> +
> +    local indices = {
> +        hash       = 0,
> +        tree       = 0,
> +        rtree      = 0,
> +        bitset     = 0,
> +        jsonpath   = 0,
> +        functional = 0,
> +    }
> +
> +    for name, space in pairs(box.space) do
> +        local is_system = is_system_space(space)
> +        if not is_system and type(name) == 'number' then
> +            if space.engine == 'vinyl' then
> +                spaces.vinyl = spaces.vinyl + 1
> +            elseif space.engine == 'memtx' then
> +                if space.temporary then
> +                    spaces.temporary = spaces.temporary + 1
> +                end
> +                spaces.memtx = spaces.memtx + 1
> +            end
> +            if space.is_local then
> +                spaces['local'] = spaces['local'] + 1
> +            end
> +            fill_in_indices_stats(space, indices)
> +        end
> +        fiber.yield()

What if DDL (say, a space drop) will occur here? I think we should check
schema version right after exiting from yield and skip / repeat
collecting statistics if the schema version is changed.

I don't sure that iteration over `box.space` over DDL is safe -- what
if a current space is dropped? Is the behaviour will be like the
following?

 | tarantool> t = {a = 1, b = 2}
 | ---
 | ...
 | tarantool> for k, v in pairs(t) do if k == 'a' then t['a'] = nil end end
 | ---
 | ...

So, let's check it within the same iteration of the loop: just after
exiting from the yield.

> +    end
> +
> +    for k, v in pairs(spaces) do
> +        schema[k..'_spaces'] = v
> +    end
> +
> +    for k, v in pairs(indices) do
> +        schema[k..'_indices'] = v
> +    end

Nit: We collect those metrics into one structure and then transform it
into another one. Is one better and another worse? If so, it seems
logical to use one.

I was triggered on this, because in fact it is using of different names
for the same things: it usually does not improve code readability.

Anyway, if you have understanding why nested structuring is good in the
code, but bad in a report, proceed with your way. I don't have a strong
preference here (but I'm sympathetic to the nesting way).