From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-lj1-f175.google.com (mail-lj1-f175.google.com [209.85.208.175]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 58C9942EF5C for ; Fri, 3 Jul 2020 15:06:41 +0300 (MSK) Received: by mail-lj1-f175.google.com with SMTP id q4so936102lji.2 for ; Fri, 03 Jul 2020 05:06:41 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 13.4 \(3608.80.23.2.2\)) From: =?utf-8?B?0JjQu9GM0Y8g0JrQvtC90Y7RhdC+0LI=?= In-Reply-To: <20200701001544.wxx275pojfj3su5o@tkn_work_nb> Date: Fri, 3 Jul 2020 22:05:34 +1000 Content-Transfer-Encoding: quoted-printable Message-Id: <2C0F8BEA-3419-4794-B17A-7BF7AFFF9E0C@gmail.com> References: <20200701001544.wxx275pojfj3su5o@tkn_work_nb> Subject: Re: [Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alexander Turenko Cc: Oleg Babin , tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy Hi, Thanks for your review. You also stated your concerns about the fact that user's IP may be = tracked and it=E2=80=99s not good practice in open source community. It = sounds valid to me. But I don=E2=80=99t understand how this patch is = related to these concerns. It does not add IP tracking feature, nor the = first feedback daemon version does. This problem should have been arisen = when adding feedback daemon in the first place. If it happen to be = noticed now, after feedback daemon have been merged, a possibility of ip = tracking may be considered as a bug. Personally, I don=E2=80=99t have an = idea, how to solve it in tarantool. It must be some kind of trusted = external service which guarantees to hide original IP address, but I = don=E2=80=99t know such yet. I believe the best way is to start a = discussion in GitHub issues and gather peoples opinions on this. Another question you bringing up is the design of feedback tool. You are = saying we may want to rethink goals of the feedback tool (you are naming = it a reporting tool) so a user can benefit from it too. This is fine = idea but it is also a more broader topic to discuss and goes beyond of = this patch changes. I don=E2=80=99t see this patch somehow interferes = with you ideas. As for another comments, see my answers below: > On 1 Jul 2020, at 10:15, Alexander Turenko = wrote: >=20 > (CCed all persons who is mentioned below.) >=20 > You all know it, but I cannot keep myself from saying again that I'm = not > fan on the idea of the built-in/enabled-by-default reporting tool. > Especially when it is desined in the way that a user don't have a > guarantee that its IP address is not collected. It is not in the > opensource spirit and is not kindly to our users (to ones who care = about > privacy). >=20 > I understand why the organization need it: it give some view on > community -- what is useful and what is not. It may help when we, say, > prioritize problems. But this fact does not change anything: = garthering > IP addresses and some information without expicit agreement does not > become a friendly action. >=20 > And while we're on the topic, there was the idea from Mons that it = would > feel more comfortable if we would offer something useful for a user > together with the reporting tool: say, provide information about > critical / security problems in a running version if it is affected. = Or > feed news about important updates in functionality that is relevant = for > a particular application. I would add from myself that this way the > reporting tool not even necessarily must be enabled default to be = useful > for us: there is a motivation to enable it. Win-win. Collect only from > ones who explicitly said that is okay with it. >=20 > We also can structurize and improve our documentation and see what > topics are more popular and what are less. Collect download statistics > from repositories, rocks and git clones. Here a user explicitly asked = an > action with a remote server and (s)he most likely know that the remote > server may count it. >=20 > Pasted actual version to comment it inline. Nothing critical in fact. >=20 > WBR, Alexander Turenko. >=20 >> "feedback_type": "version", >=20 > Was it planned to report several maps with different types? If we > decided to report everything within one map, then this 'feedback_type' > field has no sense anymore and just confusing, right? If so, then = maybe > it is the time to remove it and bump 'feedback_version'? =E2=80=9Cfeedback_type=E2=80=9D was added as a possibility to split = feedback into different launches. I.e. collect simple statistics once an = hour, but =E2=80=9Cheavy=E2=80=9D stats only once a day. This patch adds = some stats collection logic, but it may only be heavy for rare corner = cases. It also adds caching and yielding in order to avoid affecting = performance, when feedback daemon wakes up. So I decided to not split it = into different feedbacks yet. However, I agree on that =E2=80=9Cversion=E2= =80=9D type became obsolete with new changes, so I removed it. >=20 >> +local function is_system_space(sp) >> + return box.schema.SYSTEM_ID_MIN <=3D sp.id and >> + sp.id <=3D box.schema.SYSTEM_ID_MAX >> +end >=20 > Nit: a space object is named 'space' below, but 'sp' above. I agree = with > Max and generally prefer to avoid abbreviations. Okay, since you both agree on this, I made it more verbose. >=20 >> +local function is_jsonpath_index(idx) >> + for _, part in pairs(idx.parts) do >> + if part.path ~=3D nil then >> + return true >> + end >> + end >> + >> + return false >> +end >=20 > I agree with Oleg's point: it worth to count multikey indices too. I > don't know much, but what I understood from the commit message [1]: >=20 > - Any multikey index is JSON path index. > - A multikey index has '[*]' in 'path'. >=20 > Looks quite simple to count if I don't miss anything. Please, verify > and/or ask about it someone who knows more. >=20 > [1]: = https://github.com/tarantool/tarantool/commit/f1d9f2575c11afcf3b0368d38f44= ae399cd7e045 I've checked the code and realize, that multikey path is only valid if = it contains =E2=80=9C[*]=E2=80=9D substring. So I've added a new metric = for jsonpath multikey indices, which includes also counts as jsonpath = index. Similarly, I added functional multikey index, which check if the = function has "is_multikey =3D true=E2=80=9D option. >=20 >> +local function fill_in_schema_stats_impl(schema) >> + local spaces =3D { >> + memtx =3D 0, >> + vinyl =3D 0, >> + temporary =3D 0, >> + ['local'] =3D 0, >> + } >> + >> + local indices =3D { >> + hash =3D 0, >> + tree =3D 0, >> + rtree =3D 0, >> + bitset =3D 0, >> + jsonpath =3D 0, >> + functional =3D 0, >> + } >> + >> + for name, space in pairs(box.space) do >> + local is_system =3D is_system_space(space) >> + if not is_system and type(name) =3D=3D 'number' then >> + if space.engine =3D=3D 'vinyl' then >> + spaces.vinyl =3D spaces.vinyl + 1 >> + elseif space.engine =3D=3D 'memtx' then >> + if space.temporary then >> + spaces.temporary =3D spaces.temporary + 1 >> + end >> + spaces.memtx =3D spaces.memtx + 1 >> + end >> + if space.is_local then >> + spaces['local'] =3D spaces['local'] + 1 >> + end >> + fill_in_indices_stats(space, indices) >> + end >> + fiber.yield() >=20 > What if DDL (say, a space drop) will occur here? I think we should = check > schema version right after exiting from yield and skip / repeat > collecting statistics if the schema version is changed. >=20 > I don't sure that iteration over `box.space` over DDL is safe -- what > if a current space is dropped? Is the behaviour will be like the > following? >=20 > | tarantool> t =3D {a =3D 1, b =3D 2} > | --- > | ... > | tarantool> for k, v in pairs(t) do if k =3D=3D 'a' then t['a'] =3D = nil end end > | --- > | ... >=20 > So, let's check it within the same iteration of the loop: just after > exiting from the yield. Good point. Your suggestion has some pros, but I decided to solve this = problem in other, imo more robust way. I collect all space into a table = and then iterating over this table. If space is not found, just skip it. = I accept the fact, that there is a little chance to have a bit = inaccurate feedback numbers for spaces and indices, but more stable = approach. >=20 >> + end >> + >> + for k, v in pairs(spaces) do >> + schema[k..'_spaces'] =3D v >> + end >> + >> + for k, v in pairs(indices) do >> + schema[k..'_indices'] =3D v >> + end >=20 > Nit: We collect those metrics into one structure and then transform it > into another one. Is one better and another worse? If so, it seems > logical to use one. >=20 > I was triggered on this, because in fact it is using of different = names > for the same things: it usually does not improve code readability. >=20 > Anyway, if you have understanding why nested structuring is good in = the > code, but bad in a report, proceed with your way. I don't have a = strong > preference here (but I'm sympathetic to the nesting way). I think this way is simpler. It=E2=80=99s not only readable but also = more usable. This hunk simply merges two tables into one container which = stores feedback schema statistics. These two tables (spaces and indices) = are isolated, so name clashes won=E2=80=99t happen. Also if I defined = final keys at the beginning, it would contain lines like = "spaces.memtx_spaces =3D spaces.memtx_spaces + 1=E2=80=9D and would look = cumbersome as a whole. --- diff --git a/src/box/lua/feedback_daemon.lua = b/src/box/lua/feedback_daemon.lua index ee58575a4..64fd71f86 100644 --- a/src/box/lua/feedback_daemon.lua +++ b/src/box/lua/feedback_daemon.lua @@ -50,14 +50,30 @@ local function detect_docker_environment() return cached_detect_docker_env end =20 -local function is_system_space(sp) - return box.schema.SYSTEM_ID_MIN <=3D sp.id and - sp.id <=3D box.schema.SYSTEM_ID_MAX +local function is_system_space(space) + return box.schema.SYSTEM_ID_MIN <=3D space.id and + space.id <=3D box.schema.SYSTEM_ID_MAX end =20 -local function is_jsonpath_index(idx) +local function jsonpaths_from_idx_parts(idx) + local paths =3D {} + for _, part in pairs(idx.parts) do - if part.path ~=3D nil then + if type(part.path) =3D=3D 'string' then + table.insert(paths, part.path) + end + end + + return paths +end + +local function is_jsonpath_index(idx) + return #jsonpaths_from_idx_parts(idx) > 0 +end + +local function is_jp_multikey_index(idx) + for _, path in pairs(jsonpaths_from_idx_parts(idx)) do + if path:find('[*]', 1, true) then return true end end @@ -69,6 +85,16 @@ local function is_functional_index(idx) return idx.func ~=3D nil end =20 +local function is_func_multikey_index(idx) + if is_functional_index(idx) then + local fid =3D idx.func.fid + local func =3D fid and box.func[fid] + return func and func.is_multikey or false + end + + return false +end + local function fill_in_base_info(feedback) if box.info.status ~=3D "running" then return nil, "not running" @@ -91,8 +117,14 @@ local function fill_in_indices_stats(space, stats) if idx_type =3D=3D 'TREE' then if is_functional_index(idx) then stats.functional =3D stats.functional + 1 + if is_func_multikey_index(idx) then + stats.functional_multikey =3D = stats.functional_multikey + 1 + end elseif is_jsonpath_index(idx) then stats.jsonpath =3D stats.jsonpath + 1 + if is_jp_multikey_index(idx) then + stats.jsonpath_multikey =3D = stats.jsonpath_multikey + 1 + end end stats.tree =3D stats.tree + 1 elseif idx_type =3D=3D 'HASH' then @@ -115,31 +147,45 @@ local function fill_in_schema_stats_impl(schema) } =20 local indices =3D { - hash =3D 0, - tree =3D 0, - rtree =3D 0, - bitset =3D 0, - jsonpath =3D 0, - functional =3D 0, + hash =3D 0, + tree =3D 0, + rtree =3D 0, + bitset =3D 0, + jsonpath =3D 0, + jsonpath_multikey =3D 0, + functional =3D 0, + functional_multikey =3D 0, } =20 + local space_ids =3D {} for name, space in pairs(box.space) do local is_system =3D is_system_space(space) if not is_system and type(name) =3D=3D 'number' then - if space.engine =3D=3D 'vinyl' then - spaces.vinyl =3D spaces.vinyl + 1 - elseif space.engine =3D=3D 'memtx' then - if space.temporary then - spaces.temporary =3D spaces.temporary + 1 - end - spaces.memtx =3D spaces.memtx + 1 - end - if space.is_local then - spaces['local'] =3D spaces['local'] + 1 + table.insert(space_ids, name) + end + end + + for _, id in pairs(space_ids) do + local space =3D box.space[id] + if space =3D=3D nil then + goto continue; + end + + if space.engine =3D=3D 'vinyl' then + spaces.vinyl =3D spaces.vinyl + 1 + elseif space.engine =3D=3D 'memtx' then + if space.temporary then + spaces.temporary =3D spaces.temporary + 1 end - fill_in_indices_stats(space, indices) + spaces.memtx =3D spaces.memtx + 1 + end + if space.is_local then + spaces['local'] =3D spaces['local'] + 1 end + fill_in_indices_stats(space, indices) + fiber.yield() + ::continue:: end =20 for k, v in pairs(spaces) do @@ -255,7 +301,7 @@ setmetatable(daemon, { end, -- this function is used in saving feedback in file generate_feedback =3D function() - return fill_in_feedback({ feedback_type =3D "version", = feedback_version =3D 1 }) + return fill_in_feedback({ feedback_version =3D 2 }) end, start =3D function() start(daemon) diff --git a/test/box-tap/feedback_daemon.test.lua = b/test/box-tap/feedback_daemon.test.lua index 330442f2e..1606af96d 100755 --- a/test/box-tap/feedback_daemon.test.lua +++ b/test/box-tap/feedback_daemon.test.lua @@ -148,7 +148,9 @@ local expected =3D { hash_indices =3D 0, bitset_indices =3D 0, jsonpath_indices =3D 0, + jsonpath_multikey_indices =3D 0, functional_indices =3D 0, + functional_multikey_indices =3D 0, } test:is_deeply(actual.features.schema, expected, 'schema stats are empty at the moment') @@ -165,27 +167,38 @@ = box.space.features_memtx:create_index('memtx_rtree', {type =3D 'rtree', parts =3D {{field =3D 3, type =3D = 'array'}}}) box.space.features_memtx:create_index('memtx_jpath', {parts =3D {{field =3D 4, type =3D 'str', path =3D = 'data.name'}}}) +box.space.features_memtx:create_index('memtx_multikey', + {parts =3D {{field =3D 5, type =3D 'str', path =3D = 'data[*].name'}}}) box.schema.func.create('features_func', { body =3D "function(tuple) return {string.sub(tuple[2], 1, 1)} end", is_deterministic =3D true, is_sandboxed =3D true}) -box.space.features_memtx:create_index('j', +box.schema.func.create('features_func_multikey', { + body =3D "function(tuple) return {1, 2} end", + is_deterministic =3D true, + is_sandboxed =3D true, + opts =3D {is_multikey =3D true}}) +box.space.features_memtx:create_index('functional', {parts =3D {{field =3D 1, type =3D 'number'}}, func =3D = 'features_func'}) +box.space.features_memtx:create_index('functional_multikey', + {parts =3D {{field =3D 1, type =3D 'number'}}, func =3D = 'features_func_multikey'}) =20 actual =3D daemon.generate_feedback() local schema_stats =3D actual.features.schema test:test('features.schema', function(t) - t:plan(10) + t:plan(12) t:is(schema_stats.memtx_spaces, 2, 'memtx engine usage gathered') t:is(schema_stats.vinyl_spaces, 1, 'vinyl engine usage gathered') t:is(schema_stats.temporary_spaces, 1, 'temporary space usage = gathered') t:is(schema_stats.local_spaces, 1, 'local space usage gathered') - t:is(schema_stats.tree_indices, 4, 'tree index gathered') + t:is(schema_stats.tree_indices, 6, 'tree index gathered') t:is(schema_stats.hash_indices, 1, 'hash index gathered') t:is(schema_stats.rtree_indices, 1, 'rtree index gathered') t:is(schema_stats.bitset_indices, 1, 'bitset index gathered') - t:is(schema_stats.jsonpath_indices, 1, 'jsonpath index gathered') - t:is(schema_stats.functional_indices, 1, 'functional index = gathered') + t:is(schema_stats.jsonpath_indices, 2, 'jsonpath index gathered') + t:is(schema_stats.jsonpath_multikey_indices, 1, 'jsonpath multikey = index gathered') + t:is(schema_stats.functional_indices, 2, 'functional index = gathered') + t:is(schema_stats.functional_multikey_indices, 1, 'functional = multikey index gathered') end) =20 box.space.features_memtx:create_index('memtx_sec', {type =3D 'hash'})