[Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features

Fri Jul 3 15:05:34 MSK 2020

Hi,

Thanks for your review.

You also stated your concerns about the fact that user's IP may be tracked and it’s not good practice in open source community. It sounds valid to me. But I don’t understand how this patch is related to these concerns. It does not add IP tracking feature, nor the first feedback daemon version does. This problem should have been arisen when adding feedback daemon in the first place. If it happen to be noticed now, after feedback daemon have been merged, a possibility of ip tracking may be considered as a bug. Personally, I don’t have an idea, how to solve it in tarantool. It must be some kind of trusted external service which guarantees to hide original IP address, but I don’t know such yet. I believe the best way is to start a discussion in GitHub issues and gather peoples opinions on this.

Another question you bringing up is the design of feedback tool. You are saying we may want to rethink goals of the feedback tool (you are naming it a reporting tool) so a user can benefit from it too. This is fine idea but it is also a more broader topic to discuss and goes beyond of this patch changes. I don’t see this patch somehow interferes with you ideas.


As for another comments, see my answers below:


> On 1 Jul 2020, at 10:15, Alexander Turenko <alexander.turenko at tarantool.org> wrote:
> 
> (CCed all persons who is mentioned below.)
> 
> You all know it, but I cannot keep myself from saying again that I'm not
> fan on the idea of the built-in/enabled-by-default reporting tool.
> Especially when it is desined in the way that a user don't have a
> guarantee that its IP address is not collected. It is not in the
> opensource spirit and is not kindly to our users (to ones who care about
> privacy).
> 
> I understand why the organization need it: it give some view on
> community -- what is useful and what is not. It may help when we, say,
> prioritize problems. But this fact does not change anything: garthering
> IP addresses and some information without expicit agreement does not
> become a friendly action.
> 
> And while we're on the topic, there was the idea from Mons that it would
> feel more comfortable if we would offer something useful for a user
> together with the reporting tool: say, provide information about
> critical / security problems in a running version if it is affected. Or
> feed news about important updates in functionality that is relevant for
> a particular application. I would add from myself that this way the
> reporting tool not even necessarily must be enabled default to be useful
> for us: there is a motivation to enable it. Win-win. Collect only from
> ones who explicitly said that is okay with it.
> 
> We also can structurize and improve our documentation and see what
> topics are more popular and what are less. Collect download statistics
> from repositories, rocks and git clones. Here a user explicitly asked an
> action with a remote server and (s)he most likely know that the remote
> server may count it.
> 
> Pasted actual version to comment it inline. Nothing critical in fact.
> 
> WBR, Alexander Turenko.
> 
>>      "feedback_type": "version",
> 
> Was it planned to report several maps with different types? If we
> decided to report everything within one map, then this 'feedback_type'
> field has no sense anymore and just confusing, right? If so, then maybe
> it is the time to remove it and bump 'feedback_version'?

“feedback_type” was added as a possibility to split feedback into different launches. I.e. collect simple statistics once an hour, but “heavy” stats only once a day. This patch adds some stats collection logic, but it may only be heavy for rare corner cases. It also adds caching and yielding in order to avoid affecting performance, when feedback daemon wakes up. So I decided to not split it into different feedbacks yet. However, I agree on that “version” type became obsolete with new changes, so I removed it.

> 
>> +local function is_system_space(sp)
>> +    return box.schema.SYSTEM_ID_MIN <= sp.id and
>> +           sp.id <= box.schema.SYSTEM_ID_MAX
>> +end
> 
> Nit: a space object is named 'space' below, but 'sp' above. I agree with
> Max and generally prefer to avoid abbreviations.

Okay, since you both agree on this, I made it more verbose.

> 
>> +local function is_jsonpath_index(idx)
>> +    for _, part in pairs(idx.parts) do
>> +        if part.path ~= nil then
>> +            return true
>> +        end
>> +    end
>> +
>> +    return false
>> +end
> 
> I agree with Oleg's point: it worth to count multikey indices too. I
> don't know much, but what I understood from the commit message [1]:
> 
> - Any multikey index is JSON path index.
> - A multikey index has '[*]' in 'path'.
> 
> Looks quite simple to count if I don't miss anything. Please, verify
> and/or ask about it someone who knows more.
> 
> [1]: https://github.com/tarantool/tarantool/commit/f1d9f2575c11afcf3b0368d38f44ae399cd7e045

I've checked the code and realize, that multikey path is only valid if it contains “[*]” substring. So I've added a new metric for jsonpath multikey indices, which includes also counts as jsonpath index. Similarly, I added functional multikey index, which check if the function has "is_multikey = true” option.

> 
>> +local function fill_in_schema_stats_impl(schema)
>> +    local spaces = {
>> +        memtx     = 0,
>> +        vinyl     = 0,
>> +        temporary = 0,
>> +        ['local'] = 0,
>> +    }
>> +
>> +    local indices = {
>> +        hash       = 0,
>> +        tree       = 0,
>> +        rtree      = 0,
>> +        bitset     = 0,
>> +        jsonpath   = 0,
>> +        functional = 0,
>> +    }
>> +
>> +    for name, space in pairs(box.space) do
>> +        local is_system = is_system_space(space)
>> +        if not is_system and type(name) == 'number' then
>> +            if space.engine == 'vinyl' then
>> +                spaces.vinyl = spaces.vinyl + 1
>> +            elseif space.engine == 'memtx' then
>> +                if space.temporary then
>> +                    spaces.temporary = spaces.temporary + 1
>> +                end
>> +                spaces.memtx = spaces.memtx + 1
>> +            end
>> +            if space.is_local then
>> +                spaces['local'] = spaces['local'] + 1
>> +            end
>> +            fill_in_indices_stats(space, indices)
>> +        end
>> +        fiber.yield()
> 
> What if DDL (say, a space drop) will occur here? I think we should check
> schema version right after exiting from yield and skip / repeat
> collecting statistics if the schema version is changed.
> 
> I don't sure that iteration over `box.space` over DDL is safe -- what
> if a current space is dropped? Is the behaviour will be like the
> following?
> 
> | tarantool> t = {a = 1, b = 2}
> | ---
> | ...
> | tarantool> for k, v in pairs(t) do if k == 'a' then t['a'] = nil end end
> | ---
> | ...
> 
> So, let's check it within the same iteration of the loop: just after
> exiting from the yield.

Good point. Your suggestion has some pros, but I decided to solve this problem in other, imo more robust way. I collect all space into a table and then iterating over this table. If space is not found, just skip it. I accept the fact, that there is a little chance to have a bit inaccurate feedback numbers for spaces and indices, but more stable approach.

> 
>> +    end
>> +
>> +    for k, v in pairs(spaces) do
>> +        schema[k..'_spaces'] = v
>> +    end
>> +
>> +    for k, v in pairs(indices) do
>> +        schema[k..'_indices'] = v
>> +    end
> 
> Nit: We collect those metrics into one structure and then transform it
> into another one. Is one better and another worse? If so, it seems
> logical to use one.
> 
> I was triggered on this, because in fact it is using of different names
> for the same things: it usually does not improve code readability.
> 
> Anyway, if you have understanding why nested structuring is good in the
> code, but bad in a report, proceed with your way. I don't have a strong
> preference here (but I'm sympathetic to the nesting way).

I think this way is simpler. It’s not only readable but also more usable. This hunk simply merges two tables into one container which stores feedback schema statistics. These two tables (spaces and indices) are isolated, so name clashes won’t happen. Also if I defined final keys at the beginning, it would contain lines like "spaces.memtx_spaces = spaces.memtx_spaces + 1” and would look cumbersome as a whole.

---

diff --git a/src/box/lua/feedback_daemon.lua b/src/box/lua/feedback_daemon.lua
index ee58575a4..64fd71f86 100644
--- a/src/box/lua/feedback_daemon.lua
+++ b/src/box/lua/feedback_daemon.lua
@@ -50,14 +50,30 @@ local function detect_docker_environment()
     return cached_detect_docker_env
 end
 
-local function is_system_space(sp)
-    return box.schema.SYSTEM_ID_MIN <= sp.id and
-           sp.id <= box.schema.SYSTEM_ID_MAX
+local function is_system_space(space)
+    return box.schema.SYSTEM_ID_MIN <= space.id and
+           space.id <= box.schema.SYSTEM_ID_MAX
 end
 
-local function is_jsonpath_index(idx)
+local function jsonpaths_from_idx_parts(idx)
+    local paths = {}
+
     for _, part in pairs(idx.parts) do
-        if part.path ~= nil then
+        if type(part.path) == 'string' then
+            table.insert(paths, part.path)
+        end
+    end
+
+    return paths
+end
+
+local function is_jsonpath_index(idx)
+    return #jsonpaths_from_idx_parts(idx) > 0
+end
+
+local function is_jp_multikey_index(idx)
+    for _, path in pairs(jsonpaths_from_idx_parts(idx)) do
+        if path:find('[*]', 1, true) then
             return true
         end
     end
@@ -69,6 +85,16 @@ local function is_functional_index(idx)
     return idx.func ~= nil
 end
 
+local function is_func_multikey_index(idx)
+    if is_functional_index(idx) then
+        local fid = idx.func.fid
+        local func = fid and box.func[fid]
+        return func and func.is_multikey or false
+    end
+
+    return false
+end
+
 local function fill_in_base_info(feedback)
     if box.info.status ~= "running" then
         return nil, "not running"
@@ -91,8 +117,14 @@ local function fill_in_indices_stats(space, stats)
             if idx_type == 'TREE' then
                 if is_functional_index(idx) then
                     stats.functional = stats.functional + 1
+                    if is_func_multikey_index(idx) then
+                        stats.functional_multikey = stats.functional_multikey + 1
+                    end
                 elseif is_jsonpath_index(idx) then
                     stats.jsonpath = stats.jsonpath + 1
+                    if is_jp_multikey_index(idx) then
+                        stats.jsonpath_multikey = stats.jsonpath_multikey + 1
+                    end
                 end
                 stats.tree = stats.tree + 1
             elseif idx_type == 'HASH' then
@@ -115,31 +147,45 @@ local function fill_in_schema_stats_impl(schema)
     }
 
     local indices = {
-        hash       = 0,
-        tree       = 0,
-        rtree      = 0,
-        bitset     = 0,
-        jsonpath   = 0,
-        functional = 0,
+        hash                = 0,
+        tree                = 0,
+        rtree               = 0,
+        bitset              = 0,
+        jsonpath            = 0,
+        jsonpath_multikey   = 0,
+        functional          = 0,
+        functional_multikey = 0,
     }
 
+    local space_ids = {}
     for name, space in pairs(box.space) do
         local is_system = is_system_space(space)
         if not is_system and type(name) == 'number' then
-            if space.engine == 'vinyl' then
-                spaces.vinyl = spaces.vinyl + 1
-            elseif space.engine == 'memtx' then
-                if space.temporary then
-                    spaces.temporary = spaces.temporary + 1
-                end
-                spaces.memtx = spaces.memtx + 1
-            end
-            if space.is_local then
-                spaces['local'] = spaces['local'] + 1
+            table.insert(space_ids, name)
+        end
+    end
+
+    for _, id in pairs(space_ids) do
+        local space = box.space[id]
+        if space == nil then
+            goto continue;
+        end
+
+        if space.engine == 'vinyl' then
+            spaces.vinyl = spaces.vinyl + 1
+        elseif space.engine == 'memtx' then
+            if space.temporary then
+                spaces.temporary = spaces.temporary + 1
             end
-            fill_in_indices_stats(space, indices)
+            spaces.memtx = spaces.memtx + 1
+        end
+        if space.is_local then
+            spaces['local'] = spaces['local'] + 1
         end
+        fill_in_indices_stats(space, indices)
+
         fiber.yield()
+        ::continue::
     end
 
     for k, v in pairs(spaces) do
@@ -255,7 +301,7 @@ setmetatable(daemon, {
         end,
         -- this function is used in saving feedback in file
         generate_feedback = function()
-            return fill_in_feedback({ feedback_type = "version", feedback_version = 1 })
+            return fill_in_feedback({ feedback_version = 2 })
         end,
         start = function()
             start(daemon)
diff --git a/test/box-tap/feedback_daemon.test.lua b/test/box-tap/feedback_daemon.test.lua
index 330442f2e..1606af96d 100755
--- a/test/box-tap/feedback_daemon.test.lua
+++ b/test/box-tap/feedback_daemon.test.lua
@@ -148,7 +148,9 @@ local expected = {
     hash_indices = 0,
     bitset_indices = 0,
     jsonpath_indices = 0,
+    jsonpath_multikey_indices = 0,
     functional_indices = 0,
+    functional_multikey_indices = 0,
 }
 test:is_deeply(actual.features.schema, expected,
         'schema stats are empty at the moment')
@@ -165,27 +167,38 @@ box.space.features_memtx:create_index('memtx_rtree',
         {type = 'rtree', parts = {{field = 3, type = 'array'}}})
 box.space.features_memtx:create_index('memtx_jpath',
         {parts = {{field = 4, type = 'str', path = 'data.name'}}})
+box.space.features_memtx:create_index('memtx_multikey',
+        {parts = {{field = 5, type = 'str', path = 'data[*].name'}}})
 box.schema.func.create('features_func', {
     body = "function(tuple) return {string.sub(tuple[2], 1, 1)} end",
     is_deterministic = true,
     is_sandboxed = true})
-box.space.features_memtx:create_index('j',
+box.schema.func.create('features_func_multikey', {
+    body = "function(tuple) return {1, 2} end",
+    is_deterministic = true,
+    is_sandboxed = true,
+    opts = {is_multikey = true}})
+box.space.features_memtx:create_index('functional',
         {parts = {{field = 1, type = 'number'}}, func = 'features_func'})
+box.space.features_memtx:create_index('functional_multikey',
+        {parts = {{field = 1, type = 'number'}}, func = 'features_func_multikey'})
 
 actual = daemon.generate_feedback()
 local schema_stats = actual.features.schema
 test:test('features.schema', function(t)
-    t:plan(10)
+    t:plan(12)
     t:is(schema_stats.memtx_spaces, 2, 'memtx engine usage gathered')
     t:is(schema_stats.vinyl_spaces, 1, 'vinyl engine usage gathered')
     t:is(schema_stats.temporary_spaces, 1, 'temporary space usage gathered')
     t:is(schema_stats.local_spaces, 1, 'local space usage gathered')
-    t:is(schema_stats.tree_indices, 4, 'tree index gathered')
+    t:is(schema_stats.tree_indices, 6, 'tree index gathered')
     t:is(schema_stats.hash_indices, 1, 'hash index gathered')
     t:is(schema_stats.rtree_indices, 1, 'rtree index gathered')
     t:is(schema_stats.bitset_indices, 1, 'bitset index gathered')
-    t:is(schema_stats.jsonpath_indices, 1, 'jsonpath index gathered')
-    t:is(schema_stats.functional_indices, 1, 'functional index gathered')
+    t:is(schema_stats.jsonpath_indices, 2, 'jsonpath index gathered')
+    t:is(schema_stats.jsonpath_multikey_indices, 1, 'jsonpath multikey index gathered')
+    t:is(schema_stats.functional_indices, 2, 'functional index gathered')
+    t:is(schema_stats.functional_multikey_indices, 1, 'functional multikey index gathered')
 end)
 
 box.space.features_memtx:create_index('memtx_sec', {type = 'hash'})