Re: [Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features

From: "Илья Конюхов" <runsfor@gmail.com>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org, alexander.turenko@tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features
Date: Wed, 10 Jun 2020 02:06:18 +0300
Message-ID: <EE69E031-21CF-49AC-89CE-8E57CB006E84@gmail.com>
In-Reply-To: <67c75c01-8503-2355-e1f7-9644def2179c@tarantool.org>

Thanks for the detailed review!

I’ve addressed most of the comments you highlighted. You also mentioned that we may want to collect some internal statistics in C. That seems to be a broader topic. It is still not clear to me exactly which stats we need to collect, so we should discuss it further in the related GH issue.

Right now I would focus on the parts we can easily reach from Lua. I still think this patch is useful for tracking how spaces and indices are used. Performance should not be a major issue here, since the collection now uses caching and is intended to run once an hour.

In general, based on your feedback I’ve decided to refactor this patch a bit:
- it now caches space and index statistics, keyed by schema version;
- the redundant primary and secondary index flags are removed;
- flags are replaced by counters.
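
To illustrate the caching idea from the list above, here is a minimal sketch in plain Lua. It is not the patch itself: `get_schema_version` and `collect_features` are hypothetical stand-ins for `box.internal.schema_version()` and the real walk over spaces and indices, so the snippet can run outside Tarantool.

```lua
-- Hypothetical stand-ins so the sketch runs without Tarantool.
local schema_version = 1
local function get_schema_version()
    return schema_version
end

local collect_calls = 0
local function collect_features()
    -- Pretend this is the expensive scan over spaces and indices.
    collect_calls = collect_calls + 1
    return {scan_number = collect_calls}
end

-- Cache keyed by the schema version seen during the last scan.
local cached_version = nil
local cached_features = nil

local function get_features()
    local version = get_schema_version()
    if cached_version ~= version then
        cached_features = collect_features()
        cached_version = version
    end
    return cached_features
end

get_features()
get_features()          -- served from the cache, no second scan
schema_version = 2      -- simulate a schema change
get_features()          -- version changed, so the scan runs again
print(collect_calls)
```

As long as the schema does not change, repeated reports reuse the same table and the scan is skipped.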


Also, I’ve left detailed answers under your comments below.

> On 7 Jun 2020, at 19:45, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote:
> 
> Thanks for the patch!
> 
> Generally, I don't like having so much Lua code in
> the daemon, and system space full scans. Because it
> is slow and produces Lua garbage.
> 
> Also anyway it can't collect some internal things
> such as whether SQL is used (it is not exposed in
> any system spaces), popen, swim, etc. These things
> don't register self in any global place.
> 
> I was rather thinking about keeping track of all
> these modules and their statistics in C. So as collection
> of the statistics would be right when it changes, in
> a set of int counters. And statistics dump would cost O(1)
> by time, right into a JSON string, without Lua participation
> except that it would call this C dumper and put its result
> into an http request.
> 
> In other words, I am not sure this commit is needed at all,
> until we understand how to collect all the other features
> too.
> 
> See 6 comments below.
> 
> On 05/06/2020 10:35, Ilya Konyukhov wrote:
>> This patch adds basic db features to feedback report.
>> It collects info about what engine and which types of
>> indexes are setup by the user.
>> 
>> Here is how report may look like if all the features used:
>> 
>> ```json
>> {
>>  "arch": "x64",
>>  "features": {
>>    "has_bitset_index": true,
>>    "has_jsonpath_index": true,
>>    "vinyl": true,
>>    "has_tree_index": true,
>>    "has_primary_index": true,
>>    "has_hash_index": true,
>>    "memtx": true,
>>    "has_temporary_spaces": true,
>>    "has_local_spaces": true,
>>    "has_rtree_index": true,
>>    "has_secondary_index": true,
>>    "has_functional_index": true
>>  },
>>  "server_id": "7c8490f7-61c5-4e12-a7ff-d9fed05ad8ac",
>>  "is_docker": false,
>>  "os": "OSX",
>>  "feedback_type": "version",
>>  "cluster_id": "1eb7d98e-3344-4f15-a439-c287464f09e7",
>>  "tarantool_version": "2.5.0-90-g27fbe6ecd",
>>  "feedback_version": 1
>> }
>> ```
>> 
>> Part of #4943
>> ---
>> src/box/lua/feedback_daemon.lua       | 65 +++++++++++++++++++++++++++
>> test/box-tap/feedback_daemon.test.lua | 42 ++++++++++++++++-
>> 2 files changed, 106 insertions(+), 1 deletion(-)
>> 
>> diff --git a/src/box/lua/feedback_daemon.lua b/src/box/lua/feedback_daemon.lua
>> index 2ce49fb22..0fcd8ed87 100644
>> --- a/src/box/lua/feedback_daemon.lua
>> +++ b/src/box/lua/feedback_daemon.lua
>> @@ -41,6 +41,15 @@ local function detect_docker_environment()
>>     return true
>> end
>> 
>> +local function is_system_space(sp)
>> +    local sp_id = sp.id
>> +    if box.schema.SYSTEM_ID_MIN <= sp_id and sp_id <= box.schema.SYSTEM_ID_MAX then
>> +        return true
>> +    end
> 
> 1. Please, keep code lines inside 80 symbols border. Also this function
> return can be simplified to
> 
>    return box.schema.SYSTEM_ID_MIN <= sp_id and sp_id <= box.schema.SYSTEM_ID_MAX

Ok, collapsed that part

> 
>> +
>> +    return false
>> +end
>> +
>> local function fill_in_base_info(feedback)
>>     if box.info.status ~= "running" then
>>         return nil, "not running"
>> @@ -56,9 +65,65 @@ local function fill_in_platform_info(feedback)
>>     feedback.is_docker = detect_docker_environment()
>> end
>> 
>> +local function fill_in_space_indices(feedback, sp)
>> +    if not sp.index[0] then return end
>> +
>> +    feedback.features.has_primary_index = true
> 
> 2. What is a purpose of this field? Zero-index spaces always
> exist, at least because indexes are created in a separate DDL
> statement.
> 
> Besides, the function and spaces iteration may be really heavy,
> if space count is thousands. Or even hundreds, but with many
> indexes. And there is no a yield.
> 
> In addition to yields I ask you to add caching of this function
> results using schema version counter. Schema changes very rarely,
> so caching would make this function practically free almost
> always.

- removed the primary/secondary fields
- added caching; the cache is invalidated when the schema version changes
- added a yield after each space iteration
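
To show what yielding after each iteration buys us, here is a rough sketch using plain Lua coroutines. It is only an analogy: in the daemon the yield is `fiber.yield()`, and the toy scheduler loop and the `scan` helper below are made up for illustration.

```lua
-- Walk a list of items, handing control back after each one.
local function scan(items, visit)
    for _, item in ipairs(items) do
        visit(item)
        coroutine.yield()   -- stand-in for fiber.yield() in Tarantool
    end
end

local seen = {}
local co = coroutine.create(function()
    scan({'a', 'b', 'c'}, function(item)
        table.insert(seen, item)
    end)
end)

-- Toy scheduler: resume the scan one step at a time. Other work
-- could run between the resumes instead of waiting for the full scan.
local steps = 0
while coroutine.status(co) ~= 'dead' do
    coroutine.resume(co)
    steps = steps + 1
end
```

The scan is split into small steps, so a long list of spaces does not hold the thread for the whole walk.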

> 
>> +    local idx_count = 0
>> +    for _, idx in pairs(sp.index) do
>> +        for _, part in pairs(idx.parts) do
>> +            if part.path ~= nil then
>> +                feedback.features.has_jsonpath_index = true
>> +                break
>> +            end
>> +        end
>> +        if idx.func ~= nil then
>> +            feedback.features.has_functional_index = true
>> +        end
>> +        if idx.type == 'TREE' then
>> +            feedback.features.has_tree_index = true
>> +        elseif idx.type == 'HASH' then
>> +            feedback.features.has_hash_index = true
>> +        elseif idx.type == 'RTREE' then
>> +            feedback.features.has_rtree_index = true
>> +        elseif idx.type == 'BITSET' then
>> +            feedback.features.has_bitset_index = true
>> +        end
>> +        idx_count = idx_count + 1
>> +    end
>> +
>> +    if idx_count > 1 then
>> +        feedback.features.has_secondary_index = true
> 
> 3. This does not look really useful. What is this flag going
> to tell us? Secondary indexes exist almost always.
> 
> Besides, I agree with Dmitry's comment about counters instead
> of flags.

- removed secondary index tracking
- introduced counters for other indices.
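
A small sketch of the counting logic, runnable in plain Lua. The `indexes` table is a hypothetical stand-in for `space.index` with only the fields used here, and `has_json_path` is an illustrative helper name.

```lua
-- Hypothetical index descriptors; in Tarantool these would come from
-- space.index. Only the fields used below are filled in.
local indexes = {
    {type = 'TREE'},
    {type = 'TREE', func = 'features_func'},
    {type = 'TREE', parts = {{path = 'data.name'}}},
    {type = 'HASH'},
}

local stats = {tree = 0, hash = 0, jsonpath = 0, functional = 0}

local function has_json_path(idx)
    for _, part in ipairs(idx.parts or {}) do
        if part.path ~= nil then
            return true
        end
    end
    return false
end

for _, idx in ipairs(indexes) do
    if idx.type == 'TREE' then
        stats.tree = stats.tree + 1
        -- Functional and JSON path indexes are also counted as trees.
        if idx.func ~= nil then
            stats.functional = stats.functional + 1
        elseif has_json_path(idx) then
            stats.jsonpath = stats.jsonpath + 1
        end
    elseif idx.type == 'HASH' then
        stats.hash = stats.hash + 1
    end
end

print(stats.tree, stats.hash, stats.jsonpath, stats.functional)
```

Compared with a boolean flag, a counter distinguishes one tree index from three, which is what we want when looking at the distribution of usage.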

> 
>> +    end
>> +end
>> +
>> +local function fill_in_features(feedback)
>> +    feedback.features = feedback.features or {}
>> +
>> +    local is_memtx, is_vinyl, is_temporary, is_local
>> +    for _, sp in pairs(box.space) do
>> +        local is_system = is_system_space(sp)
>> +        if not is_system then
>> +            if sp.engine == 'vinyl' then is_vinyl = true end
>> +            if sp.engine == 'memtx' then
>> +                if sp.temporary ~= nil then is_temporary = true end
>> +                is_memtx = true
>> +            end
>> +            if sp.is_local ~= nil then is_local = true end
>> +            fill_in_space_indices(feedback, sp)
>> +        end
>> +    end
>> +
>> +    feedback.features.has_temporary_spaces = is_temporary
>> +    feedback.features.has_local_spaces = is_local
>> +    feedback.features.memtx = is_memtx
>> +    feedback.features.vinyl = is_vinyl
> 
> 4. Why do some flags have prefix 'has_', some have 'is_',
> and some are just nouns like 'memtx', 'vinyl'? Lets be
> consistent and use one name template. For that type of
> flags in C we would use 'has_'.

With counters, all keys now have suffixes such as "_spaces" and "_indices".

> 
>> +end
>> diff --git a/test/box-tap/feedback_daemon.test.lua b/test/box-tap/feedback_daemon.test.lua
>> index c36b2a694..e382af8e8 100755
>> --- a/test/box-tap/feedback_daemon.test.lua
>> +++ b/test/box-tap/feedback_daemon.test.lua
>> @@ -113,6 +113,46 @@ check("feedback after start")
>> daemon.send_test()
>> check("feedback after feedback send_test")
>> 
>> +local feedback_json = json.decode(feedback_save)
> 
> 5. When write a test for an issue, please, mention the
> issue in a comment and describe it shortly. Like this:
> 
> 	--
> 	-- gh-####: description.
> 	--
> 

Done

>> +test:is(type(feedback_json.features), 'table', 'features field is present')
>> +test:isnil(next(feedback_json.features), 'features are empty at the moment')
>> +
>> +box.schema.create_space('features_vinyl', {engine = 'vinyl'})
>> +box.schema.create_space('features_memtx', {engine = 'memtx', is_local = true, temporary = true})
>> +box.space.features_memtx:create_index('vinyl_pk', {type = 'tree'})
>> +box.space.features_memtx:create_index('memtx_pk', {type = 'hash'})
>> +box.space.features_memtx:create_index('memtx_bitset', {type = 'bitset'})
>> +box.space.features_memtx:create_index('memtx_rtree', {type = 'rtree', parts = {3, 'array'}})
>> +box.space.features_memtx:create_index('memtx_jpath',
>> +        {parts = {{field=4, type='str', path='data.name'}}})
> 
> 6. Please, be consistent in the code style. Surround '=' with whitespaces,
> add a whitespace after ',' (see your code below).
> 

Adjusted code style here. Thanks for pointing it out.

>> +box.schema.func.create('features_func', {
>> +    body = "function(tuple) return {string.sub(tuple[2],1,1)} end",
>> +    is_deterministic = true,
>> +    is_sandboxed = true})
>> +box.space.features_memtx:create_index('j',
>> +        {parts={{field = 1, type = 'number'}},func = 'features_func'})
>> +
>> +check('old feedback received')
>> +feedback_reset()
>> +check('feedback with db features received')
>> +
>> +feedback_json = json.decode(feedback_save)
>> +test:test('features', function(t)
>> +    t:plan(12)
>> +    t:ok(feedback_json.features.memtx, 'memtx engine usage gathered')
>> +    t:ok(feedback_json.features.vinyl, 'vinyl engine usage gathered')
>> +    t:ok(feedback_json.features.has_temporary_spaces, 'temporary space usage gathered')
>> +    t:ok(feedback_json.features.has_local_spaces, 'local space usage gathered')
>> +    t:ok(feedback_json.features.has_primary_index, 'primary index gathered')
>> +    t:ok(feedback_json.features.has_secondary_index, 'secondary index gathered')
>> +    t:ok(feedback_json.features.has_tree_index, 'tree index gathered')
>> +    t:ok(feedback_json.features.has_hash_index, 'hash index gathered')
>> +    t:ok(feedback_json.features.has_rtree_index, 'rtree index gathered')
>> +    t:ok(feedback_json.features.has_bitset_index, 'bitset index gathered')
>> +    t:ok(feedback_json.features.has_jsonpath_index, 'jsonpath index gathered')
>> +    t:ok(feedback_json.features.has_functional_index, 'functional index gathered')
>> +end)
>> +
>> daemon.stop()
>> 
>> box.feedback.save("feedback.json")
>> 

diff --git a/src/box/lua/feedback_daemon.lua b/src/box/lua/feedback_daemon.lua
index 21e69d511..1f177a204 100644
--- a/src/box/lua/feedback_daemon.lua
+++ b/src/box/lua/feedback_daemon.lua
@@ -50,6 +50,25 @@ local function detect_docker_environment()
     return cached_detect_docker_env
 end
 
+local function is_system_space(sp)
+    return box.schema.SYSTEM_ID_MIN <= sp.id and
+            sp.id <= box.schema.SYSTEM_ID_MAX
+end
+
+local function is_jsonpath_index(idx)
+    for _, part in pairs(idx.parts) do
+        if part.path ~= nil then
+            return true
+        end
+    end
+
+    return false
+end
+
+local function is_functional_index(idx)
+    return idx.func ~= nil
+end
+
 local function fill_in_base_info(feedback)
     if box.info.status ~= "running" then
         return nil, "not running"
@@ -65,9 +84,98 @@ local function fill_in_platform_info(feedback)
     feedback.is_docker = detect_docker_environment()
 end
 
+local function fill_in_indices_stats(space, stats)
+    if not space.index[0] then return end
+
+    for name, idx in pairs(space.index) do
+        if type(name) == 'number' then
+            local idx_type = idx.type
+            if idx_type == 'TREE' then
+                if is_functional_index(idx) then
+                    stats.functional = stats.functional + 1
+                elseif is_jsonpath_index(idx) then
+                    stats.jsonpath = stats.jsonpath + 1
+                end
+                stats.tree = stats.tree + 1
+            elseif idx_type == 'HASH' then
+                stats.hash = stats.hash + 1
+            elseif idx_type == 'RTREE' then
+                stats.rtree = stats.rtree + 1
+            elseif idx_type == 'BITSET' then
+                stats.bitset = stats.bitset + 1
+            end
+        end
+    end
+end
+
+local function fill_in_space_stats(features)
+    local spaces = {
+        memtx     = 0,
+        vinyl     = 0,
+        temporary = 0,
+        ['local'] = 0,
+    }
+
+    local indices = {
+        hash       = 0,
+        tree       = 0,
+        rtree      = 0,
+        bitset     = 0,
+        jsonpath   = 0,
+        functional = 0,
+    }
+
+    for name, space in pairs(box.space) do
+        local is_system = is_system_space(space)
+        if not is_system and type(name) == 'number' then
+            if space.engine == 'vinyl' then
+                spaces.vinyl = spaces.vinyl + 1
+            elseif space.engine == 'memtx' then
+                if space.temporary then
+                    spaces.temporary = spaces.temporary + 1
+                end
+                spaces.memtx = spaces.memtx + 1
+            end
+            if space.is_local then
+                spaces['local'] = spaces['local'] + 1
+            end
+            fill_in_indices_stats(space, indices)
+        end
+        fiber.yield()
+    end
+
+    for k, v in pairs(spaces) do
+        features[k..'_spaces'] = v
+    end
+
+    for k, v in pairs(indices) do
+        features[k..'_indices'] = v
+    end
+end
+
+local function fill_in_features_impl(features)
+    fill_in_space_stats(features)
+end
+
+local cached_schema_version = 0
+local cached_feedback_features = {}
+
+local function fill_in_features(feedback)
+    local schema_version = box.internal.schema_version()
+    if cached_schema_version < schema_version then
+        local features = {}
+        fill_in_features_impl(features)
+        cached_schema_version = schema_version
+        cached_feedback_features = features
+    end
+
+    feedback.features = cached_feedback_features
+end
+
 local function fill_in_feedback(feedback)
     fill_in_base_info(feedback)
     fill_in_platform_info(feedback)
+    fill_in_features(feedback)
 
     return feedback
 end
diff --git a/test/box-tap/feedback_daemon.test.lua b/test/box-tap/feedback_daemon.test.lua
index d4adb71f1..8ef20e0d0 100755
--- a/test/box-tap/feedback_daemon.test.lua
+++ b/test/box-tap/feedback_daemon.test.lua
@@ -67,7 +67,7 @@ if not ok then
     os.exit(0)
 end
 
-test:plan(11)
+test:plan(19)
 
 local function check(message)
     while feedback_count < 1 do
@@ -113,6 +113,71 @@ check("feedback after start")
 daemon.send_test()
 check("feedback after feedback send_test")
 
+--
+-- gh-4943: Collect engines and indices statistics.
+--
+
+local feedback_json = json.decode(feedback_save)
+test:is(type(feedback_json.features), 'table', 'features field is present')
+local expected = {
+    memtx_spaces = 0,
+    vinyl_spaces = 0,
+    temporary_spaces = 0,
+    local_spaces = 0,
+    tree_indices = 0,
+    rtree_indices = 0,
+    hash_indices = 0,
+    bitset_indices = 0,
+    jsonpath_indices = 0,
+    functional_indices = 0,
+}
+test:is_deeply(feedback_json.features, expected, 'features are empty at the moment')
+
+box.schema.create_space('features_vinyl', {engine = 'vinyl'})
+box.schema.create_space('features_memtx',
+        {engine = 'memtx', is_local = true, temporary = true})
+box.space.features_vinyl:create_index('vinyl_pk', {type = 'tree'})
+box.space.features_memtx:create_index('memtx_pk', {type = 'tree'})
+box.space.features_memtx:create_index('memtx_hash', {type = 'hash'})
+box.space.features_memtx:create_index('memtx_bitset', {type = 'bitset'})
+box.space.features_memtx:create_index('memtx_rtree',
+        {type = 'rtree', parts = {{field = 3, type = 'array'}}})
+box.space.features_memtx:create_index('memtx_jpath',
+        {parts = {{field = 4, type = 'str', path = 'data.name'}}})
+box.schema.func.create('features_func', {
+    body = "function(tuple) return {string.sub(tuple[2], 1, 1)} end",
+    is_deterministic = true,
+    is_sandboxed = true})
+box.space.features_memtx:create_index('j',
+        {parts = {{field = 1, type = 'number'}}, func = 'features_func'})
+
+check('old feedback received')
+feedback_reset()
+check('feedback with db features received')
+
+feedback_json = json.decode(feedback_save)
+test:test('features', function(t)
+    t:plan(10)
+    t:is(feedback_json.features.memtx_spaces, 1, 'memtx engine usage gathered')
+    t:is(feedback_json.features.vinyl_spaces, 1, 'vinyl engine usage gathered')
+    t:is(feedback_json.features.temporary_spaces, 1, 'temporary space usage gathered')
+    t:is(feedback_json.features.local_spaces, 1, 'local space usage gathered')
+    t:is(feedback_json.features.tree_indices, 4, 'tree index gathered')
+    t:is(feedback_json.features.hash_indices, 1, 'hash index gathered')
+    t:is(feedback_json.features.rtree_indices, 1, 'rtree index gathered')
+    t:is(feedback_json.features.bitset_indices, 1, 'bitset index gathered')
+    t:is(feedback_json.features.jsonpath_indices, 1, 'jsonpath index gathered')
+    t:is(feedback_json.features.functional_indices, 1, 'functional index gathered')
+end)
+
+box.space.features_memtx:create_index('memtx_sec', {type = 'hash'})
+
+check('old feedback received')
+feedback_reset()
+check('feedback with new db features received')
+feedback_json = json.decode(feedback_save)
+test:is(feedback_json.features.hash_indices, 2, 'internal cache invalidates when schema changes')
+
 daemon.stop()
 
 box.feedback.save("feedback.json")
