[Tarantool-patches] [tarantool-patches] [PATCH v2 2/2] lua: add fiber.top() listing fiber cpu consumption

Serge Petrenko sergepetrenko at tarantool.org
Mon Oct 28 19:16:27 MSK 2019


Hi! Thanks for your answer!

I’ve updated the branch, please take a look at the incremental diff below.

> 26 окт. 2019 г., в 21:00, Vladislav Shpilevoy <v.shpilevoy at tarantool.org> написал(а):
> 
> Hi! Thanks for the fixes!
> 
>>>>      115:
>>>>        cpu average (%): 5.4571279061075
>>>>        cpu instant (%): 5.9653973440576
>>>>      120:
>>>>        cpu average (%): 21.944382148464
>>>>        cpu instant (%): 23.849021825646
>>>>      116:
>>>>        cpu average (%): 8.6603872318158
>>>>        cpu instant (%): 9.6812031335093
>>>>      119:
>>>>        cpu average (%): 21.933168871944
>>>>        cpu instant (%): 20.007540530351
>>>>      cpu misses: 0
>>>>      118:
>>>>        cpu average (%): 19.342901995963
>>>>        cpu instant (%): 16.932679820703
>>>>      117:
>>>>        cpu average (%): 11.549674814981
>>>>        cpu instant (%): 13.128901177161
>>>>    ...
>>>>    ```
>>> 
>>> 2. This is super cool!
>>> 
>>> Could we give names to predefined fibers? For
>>> example, make an alias top.scheduler = top[1].
>>> 
>>> So as in the console output I would see:
>>> 
>>>    ---
>>>    - scheduler:
>>>        cpu average (%): 98.230148883023
>>>        cpu instant (%): 100
>>> 
>>> Instead of '1'. Because personally I didn't
>>> even know that 1 is always the scheduler, and
>>> I would be confused to see a regular fiber
>>> consumed 99% of time in a console.
>> 
>> Hmm, what about this?
>> 
>> --- a/src/lua/fiber.c
>> +++ b/src/lua/fiber.c
>> @@ -325,13 +325,18 @@ lbox_fiber_top_entry(struct fiber *f, void *cb_ctx)
>>  {
>>         struct lua_State *L = (struct lua_State *) cb_ctx;
>>  
>> -       lua_pushinteger(L, f->fid);
>> +       /* Index system fibers by name instead of id */
>> +       if (f->fid <= FIBER_ID_MAX_RESERVED) {
>> +               lua_pushstring(L, f->name);
>> +       } else {
>> +               lua_pushinteger(L, f->fid);
>> +       }
>>         lua_newtable(L);
>> 
>> 
> 
> What about always using a name, when it is not empty? When a
> system is running, it might be hard to find to which fiber
> an ID belongs. Or 'name:id' like in logs (or whatever format
> we use in the logs).
> 
> For example I started a vshard recovery fiber, it somewhy
> consumes 99% CPU, but I don't know its ID. And I would need
> to use vshard.storage.internal.recovery_fiber:id() to find it.
> 
> Up to you, and probably to Kirill since this is public API.
> 

Let’s stick with `id/name` just like in the logs then.

>>>> #include "memory.h"
>>>> #include "trigger.h"
>>>> @@ -82,6 +85,34 @@ static int (*fiber_invoke)(fiber_func f, va_list ap);
>>>> err;\
>>>> })
>>>> 
>>>> +#if ENABLE_FIBER_TOP
>>>> +/**
>>>> + * An action performed each time a context switch happens.
>>>> + * Used to count each fiber's processing time.
>>>> + * This is a macro rather than a function, since it is used
>>>> + * in scheduler too.
>>> 
>>> 4. Hm. Not sure if I understand. Ok, it is used in the scheduler,
>>> so why does it prevent making it a function? Caller is always
>>> struct fiber *, so you can make it a function taking fiber *
>>> parameter, can't you?
>> 
>> You cannot call functions from scheduler cos it doesn’t have a
>> stack AFAIU. I can make it an inline function if you want me to.
>> 
> 
> Does not have a stack? Then how does it work at all? You use
> stack in your macros - you declare a couple of variables. Also
> intermediate arithmetic results might be saved on the stack.
> 

Yes, you’re right, thanks. It was a strange belief of mine.

> Talking of an 'inline' function - if that will work, then it
> would look better IMO. At least it won't raise questions why
> it is not a function.

Will do then.

> 
>>> 5. Am I right, that we need to update CPU consumption each
>>> time a context switch happens? 
>> 
>> Yep.
>> 
>>> If so, then by logic you
>>> would need to call clock_set_on_csw() on each fiber->csw++,
>>> correct?
>> 
>> Right. I also call clock_set_on_csw() on each event loop iteration
>> end.
>> 
>>> 
>>> Currently that happens in 2 places: fiber_call_impl() and
>>> fiber_yield(). In both these places you already call
>>> clock_set_on_csw():
>>> 
>>>    - You always call clock_set_on_csw() right before
>>>      fiber_call_impl();
>>> 
>>>    - You call clock_set_on_csw() in fiber_yield().
>>> 
>>> I would then move fiber->cws++ logic into your macros too.
>>> So all the context switch and CPU time tracking logic
>>> would be in one place.
>>> 
>>> The only problem is that your macros takes caller, but csw++
>>> is done for callee. Probably we could increment it for the
>>> caller instead, I don't anything depends on that.
>>> 
>>> Also it would lead to increment of the scheduler's csw
>>> counter. Not sure whether it is ok.
>> 
>> Loop iteration end isn’t a context switch, but I still update
>> clocks here.
> 
> Yes, you update, but not via clock_set_on_csw(). Am I wrong?
> The only fiber which gets updated in the end of the loop via
> clock_set_on_csw() is the scheduler. For other fibers you
> manually update clock_acc, clock_delta_last, clock_delta in
> loop_on_iteration_end().

Yep, I was talking about the sched.

> 
>> I don’t want to increment csw here as a side
>> effect, so maybe leave csw increment as it is?
>> 
> 
> If you are talking about the scheduler's csw counter, then
> I don't think it matters. 'Csw' is just statistics. It is
> not used for anything. Besides, why a switch to the scheduler
> does not count? End of the loop = switch to the scheduler, so
> it means its csw should be incremented, shouldn't it?

Ok

> 
>>>> ev_idle_init(&cord->idle_event, fiber_schedule_idle);
>>>> +
>>>> +#if ENABLE_FIBER_TOP
>>>> +/* fiber.top <http://fiber.top>() currently works only for the main thread. */
>>>> +if (cord_is_main()) {
>>>> +fiber_top_init();
>>>> +fiber_top_enable();
>>> 
>>> 7. I think, we need to enable top by default only
>>> when we are sure that it does not affect performance.
>> 
>> Ok, disabled it by default. User can turn it on with
>> fiber.top_enable(). This also resolved the issue I had
>> with your swim test.
>> 
>> Looking at perf difference, maybe we can turn it on by default?
> 
> IMO, 10-16% looks quite significant taking into account,
> that 99% of users won't use it at all, and others only for
> debug of rare exceptional cases. For which they can call
> enable(), get top(), and disable() back. I think, we should
> ask Kirill. I am against default on.

Ok

diff --git a/src/lib/core/fiber.c b/src/lib/core/fiber.c
index e3fe77e21..4c646b278 100644
--- a/src/lib/core/fiber.c
+++ b/src/lib/core/fiber.c
@@ -87,31 +87,32 @@ static int (*fiber_invoke)(fiber_func f, va_list ap);
 })
 
 #if ENABLE_FIBER_TOP
+static bool fiber_top_enabled = false;
+
 /**
  * An action performed each time a context switch happens.
  * Used to count each fiber's processing time.
- * This is a macro rather than a function, since it is used
- * in scheduler too.
  */
-#define clock_set_on_csw(caller)					     \
-({									     \
-	if (fiber_top_enabled) {					     \
-		uint64_t clock;						     \
-		uint32_t cpu_id;		 			     \
-		clock = __rdtscp(&cpu_id);				     \
-									     \
-		if (cpu_id == cord()->cpu_id_last) {			     \
-			(caller)->clock_delta += clock - cord()->clock_last; \
-			cord()->clock_delta += clock - cord()->clock_last;   \
-		} else {						     \
-			cord()->cpu_id_last = cpu_id;			     \
-			cord()->cpu_miss_count++;			     \
-		}							     \
-		cord()->clock_last = clock;				     \
-	}								     \
-})
+static inline void
+clock_set_on_csw(struct fiber *caller)
+{
+	caller->csw++;
+	if (!fiber_top_enabled)
+		return;
 
-static bool fiber_top_enabled = false;
+	uint64_t clock;
+	uint32_t cpu_id;
+	clock = __rdtscp(&cpu_id);
+
+	if (cpu_id == cord()->cpu_id_last) {
+		caller->clock_delta += clock - cord()->clock_last;
+		cord()->clock_delta += clock - cord()->clock_last;
+	} else {
+		cord()->cpu_id_last = cpu_id;
+		cord()->cpu_miss_count++;
+	}
+	cord()->clock_last = clock;
+}
 
 #else
 #define clock_set_on_csw(caller) ;
@@ -262,7 +263,6 @@ fiber_call_impl(struct fiber *callee)
 	cord->fiber = callee;
 
 	callee->flags &= ~FIBER_IS_READY;
-	callee->csw++;
 	ASAN_START_SWITCH_FIBER(asan_state, 1,
 				callee->stack,
 				callee->stack_size);
@@ -515,7 +515,6 @@ fiber_yield(void)
 	assert(callee->flags & FIBER_IS_READY || callee == &cord->sched);
 	assert(! (callee->flags & FIBER_IS_DEAD));
 	cord->fiber = callee;
-	callee->csw++;
 	callee->flags &= ~FIBER_IS_READY;
 	ASAN_START_SWITCH_FIBER(asan_state,
 				(caller->flags & FIBER_IS_DEAD) == 0,
diff --git a/src/lua/fiber.c b/src/lua/fiber.c
index c38bd886f..a30290bfa 100644
--- a/src/lua/fiber.c
+++ b/src/lua/fiber.c
@@ -324,13 +324,11 @@ static int
 lbox_fiber_top_entry(struct fiber *f, void *cb_ctx)
 {
 	struct lua_State *L = (struct lua_State *) cb_ctx;
+	char name_buf[64];
+
+	snprintf(name_buf, sizeof(name_buf), "%u/%s", f->fid, f->name);
+	lua_pushstring(L, name_buf);
 
-	/* Index system fibers by name instead of id */
-	if (f->fid <= FIBER_ID_MAX_RESERVED) {
-		lua_pushstring(L, f->name);
-	} else {
-		lua_pushinteger(L, f->fid);
-	}
 	lua_newtable(L);
 
 	lua_pushliteral(L, "average");
diff --git a/test/app/fiber.result b/test/app/fiber.result
index e45cd7639..f9904d465 100644
--- a/test/app/fiber.result
+++ b/test/app/fiber.result
@@ -1478,7 +1478,7 @@ type(a)
 ...
 -- scheduler is present in fiber.top()
 -- and is indexed by name
-a["sched"] ~= nil
+a["1/sched"] ~= nil
 ---
 - true
 ...
@@ -1518,7 +1518,11 @@ sum_avg > 99 and sum_avg < 101 or sum_avg
 tbl = nil
 ---
 ...
-f = fiber.new(function() for i = 1,1000 do end fiber.yield() tbl = fiber.top()[fiber.self().id()] end)
+f = fiber.new(function()\
+    for i = 1,1000 do end\
+    fiber.yield()\
+    tbl = fiber.top()[fiber.self().id()..'/'..fiber.self().name()]\
+end)
 ---
 ...
 while f:status() ~= 'dead' do fiber.sleep(0.01) end
diff --git a/test/app/fiber.test.lua b/test/app/fiber.test.lua
index e30aba77e..7f320e318 100644
--- a/test/app/fiber.test.lua
+++ b/test/app/fiber.test.lua
@@ -638,7 +638,7 @@ a = fiber.top()
 type(a)
 -- scheduler is present in fiber.top()
 -- and is indexed by name
-a["sched"] ~= nil
+a["1/sched"] ~= nil
 type(a["cpu misses"]) == 'number'
 sum_inst = 0
 sum_avg = 0
@@ -658,7 +658,11 @@ sum_inst
 -- not exact due to accumulated integer division errors
 sum_avg > 99 and sum_avg < 101 or sum_avg
 tbl = nil
-f = fiber.new(function() for i = 1,1000 do end fiber.yield() tbl = fiber.top()[fiber.self().id()] end)
+f = fiber.new(function()\
+    for i = 1,1000 do end\
+    fiber.yield()\
+    tbl = fiber.top()[fiber.self().id()..'/'..fiber.self().name()]\
+end)
 while f:status() ~= 'dead' do fiber.sleep(0.01) end
 tbl["average"] > 0
 tbl["instant"] > 0




More information about the Tarantool-patches mailing list