[Tarantool-patches] [tarantool-patches] [PATCH v2 2/2] lua: add fiber.top() listing fiber cpu consumption
Serge Petrenko
sergepetrenko at tarantool.org
Mon Oct 28 19:16:27 MSK 2019
Hi! Thanks for your answer!
I’ve updated the branch, please take a look at the incremental diff below.
> 26 окт. 2019 г., в 21:00, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> написал(а):
>
> Hi! Thanks for the fixes!
>
>>>> 115:
>>>> cpu average (%): 5.4571279061075
>>>> cpu instant (%): 5.9653973440576
>>>> 120:
>>>> cpu average (%): 21.944382148464
>>>> cpu instant (%): 23.849021825646
>>>> 116:
>>>> cpu average (%): 8.6603872318158
>>>> cpu instant (%): 9.6812031335093
>>>> 119:
>>>> cpu average (%): 21.933168871944
>>>> cpu instant (%): 20.007540530351
>>>> cpu misses: 0
>>>> 118:
>>>> cpu average (%): 19.342901995963
>>>> cpu instant (%): 16.932679820703
>>>> 117:
>>>> cpu average (%): 11.549674814981
>>>> cpu instant (%): 13.128901177161
>>>> ...
>>>> ```
>>>
>>> 2. This is super cool!
>>>
>>> Could we give names to predefined fibers? For
>>> example, make an alias top.scheduler = top[1].
>>>
>>> So as in the console output I would see:
>>>
>>> ---
>>> - scheduler:
>>> cpu average (%): 98.230148883023
>>> cpu instant (%): 100
>>>
>>> Instead of '1'. Because personally I didn't
>>> even know that 1 is always the scheduler, and
>>> I would be confused to see a regular fiber
>>> consumed 99% of time in a console.
>>
>> Hmm, what about this?
>>
>> --- a/src/lua/fiber.c
>> +++ b/src/lua/fiber.c
>> @@ -325,13 +325,18 @@ lbox_fiber_top_entry(struct fiber *f, void *cb_ctx)
>> {
>> struct lua_State *L = (struct lua_State *) cb_ctx;
>>
>> - lua_pushinteger(L, f->fid);
>> + /* Index system fibers by name instead of id */
>> + if (f->fid <= FIBER_ID_MAX_RESERVED) {
>> + lua_pushstring(L, f->name);
>> + } else {
>> + lua_pushinteger(L, f->fid);
>> + }
>> lua_newtable(L);
>>
>>
>
> What about always using a name, when it is not empty? When a
> system is running, it might be hard to find to which fiber
> an ID belongs. Or 'name:id' like in logs (or whatever format
> we use in the logs).
>
> For example I started a vshard recovery fiber, it somewhy
> consumes 99% CPU, but I don't know its ID. And I would need
> to use vshard.storage.internal.recovery_fiber:id() to find it.
>
> Up to you, and probably to Kirill since this is public API.
>
Let’s stick with `id/name` just like in the logs then.
>>>> #include "memory.h"
>>>> #include "trigger.h"
>>>> @@ -82,6 +85,34 @@ static int (*fiber_invoke)(fiber_func f, va_list ap);
>>>> err;\
>>>> })
>>>>
>>>> +#if ENABLE_FIBER_TOP
>>>> +/**
>>>> + * An action performed each time a context switch happens.
>>>> + * Used to count each fiber's processing time.
>>>> + * This is a macro rather than a function, since it is used
>>>> + * in scheduler too.
>>>
>>> 4. Hm. Not sure if I understand. Ok, it is used in the scheduler,
>>> so why does it prevent making it a function? Caller is always
>>> struct fiber *, so you can make it a function taking fiber *
>>> parameter, can't you?
>>
>> You cannot call functions from scheduler cos it doesn’t have a
>> stack AFAIU. I can make it an inline function if you want me to.
>>
>
> Does not have a stack? Then how does it work at all? You use
> stack in your macros - you declare a couple of variables. Also
> intermediate arithmetic results might be saved on the stack.
>
Yes, you’re right, thanks. It was a strange belief of mine.
> Talking of an 'inline' function - if that will work, then it
> would look better IMO. At least it won't raise questions why
> it is not a function.
Will do then.
>
>>> 5. Am I right, that we need to update CPU consumption each
>>> time a context switch happens?
>>
>> Yep.
>>
>>> If so, then by logic you
>>> would need to call clock_set_on_csw() on each fiber->csw++,
>>> correct?
>>
>> Right. I also call clock_set_on_csw() on each event loop iteration
>> end.
>>
>>>
>>> Currently that happens in 2 places: fiber_call_impl() and
>>> fiber_yield(). In both these places you already call
>>> clock_set_on_csw():
>>>
>>> - You always call clock_set_on_csw() right before
>>> fiber_call_impl();
>>>
>>> - You call clock_set_on_csw() in fiber_yield().
>>>
>>> I would then move fiber->cws++ logic into your macros too.
>>> So all the context switch and CPU time tracking logic
>>> would be in one place.
>>>
>>> The only problem is that your macros takes caller, but csw++
>>> is done for callee. Probably we could increment it for the
>>> caller instead, I don't anything depends on that.
>>>
>>> Also it would lead to increment of the scheduler's csw
>>> counter. Not sure whether it is ok.
>>
>> Loop iteration end isn’t a context switch, but I still update
>> clocks here.
>
> Yes, you update, but not via clock_set_on_csw(). Am I wrong?
> The only fiber which gets updated in the end of the loop via
> clock_set_on_csw() is the scheduler. For other fibers you
> manually update clock_acc, clock_delta_last, clock_delta in
> loop_on_iteration_end().
Yep, I was talking about the sched.
>
>> I don’t want to increment csw here as a side
>> effect, so maybe leave csw increment as it is?
>>
>
> If you are talking about the scheduler's csw counter, then
> I don't think it matters. 'Csw' is just statistics. It is
> not used for anything. Besides, why a switch to the scheduler
> does not count? End of the loop = switch to the scheduler, so
> it means its csw should be incremented, shouldn't it?
Ok
>
>>>> ev_idle_init(&cord->idle_event, fiber_schedule_idle);
>>>> +
>>>> +#if ENABLE_FIBER_TOP
>>>> +/* fiber.top <http://fiber.top>() currently works only for the main thread. */
>>>> +if (cord_is_main()) {
>>>> +fiber_top_init();
>>>> +fiber_top_enable();
>>>
>>> 7. I think, we need to enable top by default only
>>> when we are sure that it does not affect performance.
>>
>> Ok, disabled it by default. User can turn it on with
>> fiber.top_enable(). This also resolved the issue I had
>> with your swim test.
>>
>> Looking at perf difference, maybe we can turn it on by default?
>
> IMO, 10-16% looks quite significant taking into account,
> that 99% of users won't use it at all, and others only for
> debug of rare exceptional cases. For which they can call
> enable(), get top(), and disable() back. I think, we should
> ask Kirill. I am against default on.
Ok
diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index e3fe77e21..4c646b278 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -87,31 +87,32 @@ static int (*fiber_invoke)(fiber_func f, va_list ap);
})
#if ENABLE_FIBER_TOP
+static bool fiber_top_enabled = false;
+
/**
* An action performed each time a context switch happens.
* Used to count each fiber's processing time.
- * This is a macro rather than a function, since it is used
- * in scheduler too.
*/
-#define clock_set_on_csw(caller) \
-({ \
- if (fiber_top_enabled) { \
- uint64_t clock; \
- uint32_t cpu_id; \
- clock = __rdtscp(&cpu_id); \
- \
- if (cpu_id == cord()->cpu_id_last) { \
- (caller)->clock_delta += clock - cord()->clock_last; \
- cord()->clock_delta += clock - cord()->clock_last; \
- } else { \
- cord()->cpu_id_last = cpu_id; \
- cord()->cpu_miss_count++; \
- } \
- cord()->clock_last = clock; \
- } \
-})
+static inline void
+clock_set_on_csw(struct fiber *caller)
+{
+ caller->csw++;
+ if (!fiber_top_enabled)
+ return;
-static bool fiber_top_enabled = false;
+ uint64_t clock;
+ uint32_t cpu_id;
+ clock = __rdtscp(&cpu_id);
+
+ if (cpu_id == cord()->cpu_id_last) {
+ caller->clock_delta += clock - cord()->clock_last;
+ cord()->clock_delta += clock - cord()->clock_last;
+ } else {
+ cord()->cpu_id_last = cpu_id;
+ cord()->cpu_miss_count++;
+ }
+ cord()->clock_last = clock;
+}
#else
#define clock_set_on_csw(caller) ;
@@ -262,7 +263,6 @@ fiber_call_impl(struct fiber *callee)
cord->fiber = callee;
callee->flags &= ~FIBER_IS_READY;
- callee->csw++;
ASAN_START_SWITCH_FIBER(asan_state, 1,
callee->stack,
callee->stack_size);
@@ -515,7 +515,6 @@ fiber_yield(void)
assert(callee->flags & FIBER_IS_READY || callee == &cord->sched);
assert(! (callee->flags & FIBER_IS_DEAD));
cord->fiber = callee;
- callee->csw++;
callee->flags &= ~FIBER_IS_READY;
ASAN_START_SWITCH_FIBER(asan_state,
(caller->flags & FIBER_IS_DEAD) == 0,
diff --git a/src/lua/fiber.c b/src/lua/fiber.c
index c38bd886f..a30290bfa 100644
--- a/src/lua/fiber.c
+++ b/src/lua/fiber.c
@@ -324,13 +324,11 @@ static int
lbox_fiber_top_entry(struct fiber *f, void *cb_ctx)
{
struct lua_State *L = (struct lua_State *) cb_ctx;
+ char name_buf[64];
+
+ snprintf(name_buf, sizeof(name_buf), "%u/%s", f->fid, f->name);
+ lua_pushstring(L, name_buf);
- /* Index system fibers by name instead of id */
- if (f->fid <= FIBER_ID_MAX_RESERVED) {
- lua_pushstring(L, f->name);
- } else {
- lua_pushinteger(L, f->fid);
- }
lua_newtable(L);
lua_pushliteral(L, "average");
diff --git a/test/app/fiber.result b/test/app/fiber.result
index e45cd7639..f9904d465 100644
--- a/test/app/fiber.result
+++ b/test/app/fiber.result
@@ -1478,7 +1478,7 @@ type(a)
...
-- scheduler is present in fiber.top()
-- and is indexed by name
-a["sched"] ~= nil
+a["1/sched"] ~= nil
---
- true
...
@@ -1518,7 +1518,11 @@ sum_avg > 99 and sum_avg < 101 or sum_avg
tbl = nil
---
...
-f = fiber.new(function() for i = 1,1000 do end fiber.yield() tbl = fiber.top()[fiber.self().id()] end)
+f = fiber.new(function()\
+ for i = 1,1000 do end\
+ fiber.yield()\
+ tbl = fiber.top()[fiber.self().id()..'/'..fiber.self().name()]\
+end)
---
...
while f:status() ~= 'dead' do fiber.sleep(0.01) end
diff --git a/test/app/fiber.test.lua b/test/app/fiber.test.lua
index e30aba77e..7f320e318 100644
--- a/test/app/fiber.test.lua
+++ b/test/app/fiber.test.lua
@@ -638,7 +638,7 @@ a = fiber.top()
type(a)
-- scheduler is present in fiber.top()
-- and is indexed by name
-a["sched"] ~= nil
+a["1/sched"] ~= nil
type(a["cpu misses"]) == 'number'
sum_inst = 0
sum_avg = 0
@@ -658,7 +658,11 @@ sum_inst
-- not exact due to accumulated integer division errors
sum_avg > 99 and sum_avg < 101 or sum_avg
tbl = nil
-f = fiber.new(function() for i = 1,1000 do end fiber.yield() tbl = fiber.top()[fiber.self().id()] end)
+f = fiber.new(function()\
+ for i = 1,1000 do end\
+ fiber.yield()\
+ tbl = fiber.top()[fiber.self().id()..'/'..fiber.self().name()]\
+end)
while f:status() ~= 'dead' do fiber.sleep(0.01) end
tbl["average"] > 0
tbl["instant"] > 0
More information about the Tarantool-patches
mailing list