[Tarantool-patches] [tarantool-patches] [PATCH v2 2/2] lua: add fiber.top() listing fiber cpu consumption

Sat Oct 26 21:00:32 MSK 2019

Hi! Thanks for the fixes!

>>>      115:
>>>        cpu average (%): 5.4571279061075
>>>        cpu instant (%): 5.9653973440576
>>>      120:
>>>        cpu average (%): 21.944382148464
>>>        cpu instant (%): 23.849021825646
>>>      116:
>>>        cpu average (%): 8.6603872318158
>>>        cpu instant (%): 9.6812031335093
>>>      119:
>>>        cpu average (%): 21.933168871944
>>>        cpu instant (%): 20.007540530351
>>>      cpu misses: 0
>>>      118:
>>>        cpu average (%): 19.342901995963
>>>        cpu instant (%): 16.932679820703
>>>      117:
>>>        cpu average (%): 11.549674814981
>>>        cpu instant (%): 13.128901177161
>>>    ...
>>>    ```
>>
>> 2. This is super cool!
>>
>> Could we give names to predefined fibers? For
>> example, make an alias top.scheduler = top[1].
>>
>> So as in the console output I would see:
>>
>>    ---
>>    - scheduler:
>>        cpu average (%): 98.230148883023
>>        cpu instant (%): 100
>>
>> Instead of '1'. Because personally I didn't
>> even know that 1 is always the scheduler, and
>> I would be confused to see a regular fiber
>> consumed 99% of time in a console.
> 
> Hmm, what about this?
> 
> --- a/src/lua/fiber.c
> +++ b/src/lua/fiber.c
> @@ -325,13 +325,18 @@ lbox_fiber_top_entry(struct fiber *f, void *cb_ctx)
>  {
>         struct lua_State *L = (struct lua_State *) cb_ctx;
>  
> -       lua_pushinteger(L, f->fid);
> +       /* Index system fibers by name instead of id */
> +       if (f->fid <= FIBER_ID_MAX_RESERVED) {
> +               lua_pushstring(L, f->name);
> +       } else {
> +               lua_pushinteger(L, f->fid);
> +       }
>         lua_newtable(L);
> 
> 

What about always using a name, when it is not empty? When a
system is running, it might be hard to find to which fiber
an ID belongs. Or 'name:id' like in logs (or whatever format
we use in the logs).

For example I started a vshard recovery fiber, it somewhy
consumes 99% CPU, but I don't know its ID. And I would need
to use vshard.storage.internal.recovery_fiber:id() to find it.

Up to you, and probably to Kirill since this is public API.

>>> #include "memory.h"
>>> #include "trigger.h"
>>> @@ -82,6 +85,34 @@ static int (*fiber_invoke)(fiber_func f, va_list ap);
>>> err;\
>>> })
>>>
>>> +#if ENABLE_FIBER_TOP
>>> +/**
>>> + * An action performed each time a context switch happens.
>>> + * Used to count each fiber's processing time.
>>> + * This is a macro rather than a function, since it is used
>>> + * in scheduler too.
>>
>> 4. Hm. Not sure if I understand. Ok, it is used in the scheduler,
>> so why does it prevent making it a function? Caller is always
>> struct fiber *, so you can make it a function taking fiber *
>> parameter, can't you?
> 
> You cannot call functions from scheduler cos it doesn’t have a
> stack AFAIU. I can make it an inline function if you want me to.
> 

Does not have a stack? Then how does it work at all? You use
stack in your macros - you declare a couple of variables. Also
intermediate arithmetic results might be saved on the stack.

Talking of an 'inline' function - if that will work, then it
would look better IMO. At least it won't raise questions why
it is not a function.

>> 5. Am I right, that we need to update CPU consumption each
>> time a context switch happens? 
> 
> Yep.
> 
>> If so, then by logic you
>> would need to call clock_set_on_csw() on each fiber->csw++,
>> correct?
> 
> Right. I also call clock_set_on_csw() on each event loop iteration
> end.
> 
>>
>> Currently that happens in 2 places: fiber_call_impl() and
>> fiber_yield(). In both these places you already call
>> clock_set_on_csw():
>>
>>    - You always call clock_set_on_csw() right before
>>      fiber_call_impl();
>>
>>    - You call clock_set_on_csw() in fiber_yield().
>>
>> I would then move fiber->cws++ logic into your macros too.
>> So all the context switch and CPU time tracking logic
>> would be in one place.
>>
>> The only problem is that your macros takes caller, but csw++
>> is done for callee. Probably we could increment it for the
>> caller instead, I don't anything depends on that.
>>
>> Also it would lead to increment of the scheduler's csw
>> counter. Not sure whether it is ok.
> 
> Loop iteration end isn’t a context switch, but I still update
> clocks here.

Yes, you update, but not via clock_set_on_csw(). Am I wrong?
The only fiber which gets updated in the end of the loop via
clock_set_on_csw() is the scheduler. For other fibers you
manually update clock_acc, clock_delta_last, clock_delta in
loop_on_iteration_end().

> I don’t want to increment csw here as a side
> effect, so maybe leave csw increment as it is?
> 

If you are talking about the scheduler's csw counter, then
I don't think it matters. 'Csw' is just statistics. It is
not used for anything. Besides, why a switch to the scheduler
does not count? End of the loop = switch to the scheduler, so
it means its csw should be incremented, shouldn't it?

>>> ev_idle_init(&cord->idle_event, fiber_schedule_idle);
>>> +
>>> +#if ENABLE_FIBER_TOP
>>> +/* fiber.top <http://fiber.top>() currently works only for the main thread. */
>>> +if (cord_is_main()) {
>>> +fiber_top_init();
>>> +fiber_top_enable();
>>
>> 7. I think, we need to enable top by default only
>> when we are sure that it does not affect performance.
> 
> Ok, disabled it by default. User can turn it on with
> fiber.top_enable(). This also resolved the issue I had
> with your swim test.
> 
> Looking at perf difference, maybe we can turn it on by default?

IMO, 10-16% looks quite significant taking into account,
that 99% of users won't use it at all, and others only for
debug of rare exceptional cases. For which they can call
enable(), get top(), and disable() back. I think, we should
ask Kirill. I am against default on.