From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp51.i.mail.ru (smtp51.i.mail.ru [94.100.177.111]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 22B9E4429E1 for ; Wed, 1 Jul 2020 03:16:36 +0300 (MSK) Date: Wed, 1 Jul 2020 03:15:44 +0300 From: Alexander Turenko Message-ID: <20200701001544.wxx275pojfj3su5o@tkn_work_nb> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: Subject: Re: [Tarantool-patches] [PATCH 2/2] feedback: collect db engines and index features List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ilya Konyukhov Cc: Oleg Babin , tarantool-patches@dev.tarantool.org, v.shpilevoy@tarantool.org (CCed all persons who is mentioned below.) You all know it, but I cannot keep myself from saying again that I'm not fan on the idea of the built-in/enabled-by-default reporting tool. Especially when it is desined in the way that a user don't have a guarantee that its IP address is not collected. It is not in the opensource spirit and is not kindly to our users (to ones who care about privacy). I understand why the organization need it: it give some view on community -- what is useful and what is not. It may help when we, say, prioritize problems. But this fact does not change anything: garthering IP addresses and some information without expicit agreement does not become a friendly action. And while we're on the topic, there was the idea from Mons that it would feel more comfortable if we would offer something useful for a user together with the reporting tool: say, provide information about critical / security problems in a running version if it is affected. Or feed news about important updates in functionality that is relevant for a particular application. I would add from myself that this way the reporting tool not even necessarily must be enabled default to be useful for us: there is a motivation to enable it. Win-win. Collect only from ones who explicitly said that is okay with it. We also can structurize and improve our documentation and see what topics are more popular and what are less. Collect download statistics from repositories, rocks and git clones. Here a user explicitly asked an action with a remote server and (s)he most likely know that the remote server may count it. Pasted actual version to comment it inline. Nothing critical in fact. WBR, Alexander Turenko. > "feedback_type": "version", Was it planned to report several maps with different types? If we decided to report everything within one map, then this 'feedback_type' field has no sense anymore and just confusing, right? If so, then maybe it is the time to remove it and bump 'feedback_version'? > +local function is_system_space(sp) > + return box.schema.SYSTEM_ID_MIN <= sp.id and > + sp.id <= box.schema.SYSTEM_ID_MAX > +end Nit: a space object is named 'space' below, but 'sp' above. I agree with Max and generally prefer to avoid abbreviations. > +local function is_jsonpath_index(idx) > + for _, part in pairs(idx.parts) do > + if part.path ~= nil then > + return true > + end > + end > + > + return false > +end I agree with Oleg's point: it worth to count multikey indices too. I don't know much, but what I understood from the commit message [1]: - Any multikey index is JSON path index. - A multikey index has '[*]' in 'path'. Looks quite simple to count if I don't miss anything. Please, verify and/or ask about it someone who knows more. [1]: https://github.com/tarantool/tarantool/commit/f1d9f2575c11afcf3b0368d38f44ae399cd7e045 > +local function fill_in_schema_stats_impl(schema) > + local spaces = { > + memtx = 0, > + vinyl = 0, > + temporary = 0, > + ['local'] = 0, > + } > + > + local indices = { > + hash = 0, > + tree = 0, > + rtree = 0, > + bitset = 0, > + jsonpath = 0, > + functional = 0, > + } > + > + for name, space in pairs(box.space) do > + local is_system = is_system_space(space) > + if not is_system and type(name) == 'number' then > + if space.engine == 'vinyl' then > + spaces.vinyl = spaces.vinyl + 1 > + elseif space.engine == 'memtx' then > + if space.temporary then > + spaces.temporary = spaces.temporary + 1 > + end > + spaces.memtx = spaces.memtx + 1 > + end > + if space.is_local then > + spaces['local'] = spaces['local'] + 1 > + end > + fill_in_indices_stats(space, indices) > + end > + fiber.yield() What if DDL (say, a space drop) will occur here? I think we should check schema version right after exiting from yield and skip / repeat collecting statistics if the schema version is changed. I don't sure that iteration over `box.space` over DDL is safe -- what if a current space is dropped? Is the behaviour will be like the following? | tarantool> t = {a = 1, b = 2} | --- | ... | tarantool> for k, v in pairs(t) do if k == 'a' then t['a'] = nil end end | --- | ... So, let's check it within the same iteration of the loop: just after exiting from the yield. > + end > + > + for k, v in pairs(spaces) do > + schema[k..'_spaces'] = v > + end > + > + for k, v in pairs(indices) do > + schema[k..'_indices'] = v > + end Nit: We collect those metrics into one structure and then transform it into another one. Is one better and another worse? If so, it seems logical to use one. I was triggered on this, because in fact it is using of different names for the same things: it usually does not improve code readability. Anyway, if you have understanding why nested structuring is good in the code, but bad in a report, proceed with your way. I don't have a strong preference here (but I'm sympathetic to the nesting way).