From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp37.i.mail.ru (smtp37.i.mail.ru [94.100.177.97]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id D2B9B452566 for ; Wed, 20 Nov 2019 11:29:28 +0300 (MSK) From: Serge Petrenko Message-Id: Content-Type: multipart/alternative; boundary="Apple-Mail=_61CEA279-531F-4C41-A0A1-3063C277989B" Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\)) Date: Wed, 20 Nov 2019 11:29:26 +0300 In-Reply-To: <20191120014409.f7r2inhv5on4watf@tkn_work_nb> References: <9bea3bba-ead8-490e-c9f5-aaf385c2637f@tarantool.org> <20191120014409.f7r2inhv5on4watf@tkn_work_nb> Subject: Re: [Tarantool-patches] [PATCH v3 0/3] fiber.top(): minor fixup List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alexander Turenko Cc: tarantool-patches@dev.tarantool.org, Vladislav Shpilevoy --Apple-Mail=_61CEA279-531F-4C41-A0A1-3063C277989B Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Hi! Thanks for your reply! -- Serge Petrenko sergepetrenko@tarantool.org > 20 =D0=BD=D0=BE=D1=8F=D0=B1. 2019 =D0=B3., =D0=B2 4:44, Alexander = Turenko =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0= =B0=D0=BB(=D0=B0): >=20 > It still shows miscompares for me: >=20 > $ cat /proc/cpuinfo | grep processor | wc -l > 8 > $ (cd test && ./test-run.py $(yes fiber.test.lua | head -n 1000)) >=20 > commit aefc64faaf7bfbb58ac53bf5abea55d1faabcbfe >=20 > The miscompare looks so: >=20 > [006] Test failed! Result content mismatch: > [006] --- app/fiber.result Tue Nov 19 20:39:21 2019 > [006] +++ app/fiber.reject Wed Nov 20 04:33:39 2019 > [006] @@ -1525,7 +1525,7 @@ > [006] -- disable the test above. > [006] sum_avg <=3D 100 or sum_avg > [006] --- > [006] -- true > [006] +- 100 > [006] ... > [006] -- not exact due to accumulated integer division errors > [006] --sum_avg > 99 and sum_avg <=3D 100 or sum_avg >=20 > Maybe it is some rounding issue. >=20 > When I changed the condition to sum_avg <=3D 101, I got: >=20 > [015] Test failed! Result content mismatch: > [015] --- app/fiber.result Wed Nov 20 04:35:14 2019 > [015] +++ app/fiber.reject Wed Nov 20 04:35:36 2019 > [015] @@ -1512,7 +1512,7 @@ > [015] ... > [015] sum_inst > [015] --- > [015] -- 100 > [015] +- 45.802853351342 > [015] ... > [015] -- when a fiber dies, its impact on the thread moving average > [015] -- persists for a couple of ev loop iterations, but it is no >=20 > Sorry, but it does not look okay for me in the sense that it may lead = to > problems in testing: we already have enough ones. >=20 > Let's provide a test that will work stable or describe why it is not > possible. In the latter case we can extract an unstable test case and > mark it as fragile using suite.ini test-run's option. It will not > strictly close the problem, but will lower its probability. >=20 > Maybe we can even retry such tests or allows them to fail: > https://github.com/tarantool/test-run/issues/189 >=20 > Anyway, if those fixes descrease probability of fails, I don't mind. > Just noted that a further work may be needed. Well, yes, they do. Just try to run this test before the patch. Anyway, I guess you=E2=80=99re right that we shouldn=E2=80=99t leave = flaky tests if possible, so I commented the flaky pieces out and added a request to reenable them once #4625 (List dead fibers in fiber.top = () output) is implemented. The diff=E2=80=99s below. Now 1000 out of 1000 tests pass with 16 = concurrent jobs, at least on my machine. >=20 > WBR, Alexander Turenko. >=20 > On Tue, Nov 19, 2019 at 11:57:11PM +0100, Vladislav Shpilevoy wrote: >> Hi! Thanks for the fixes! >>=20 >> LGTM. >>=20 >> On 18/11/2019 17:05, Serge Petrenko wrote: >>> The first patch factors all the clock-related members of struct = fiber and struct >>> cord into appropriate structs and adds methods for their updates and = resets. >>>=20 >>> The second patch fixes exponential moving average calculation so = that we do not >>> experience huge numbers in average load percentage calculations. >>>=20 >>> The third patch alters fiber.top() test to wait for correct output = before >>> testing it.=20 >>>=20 >>> Follow-up https://github.com/tarantool/tarantool/issues/2694 >>> Branch = https://github.com/tarantool/tarantool/tree/sp/gh-2694-test-fixup >>>=20 >>> Changes in v3: >>> - introduce a new patch which refactors all the clock stat >>> handling. >>> - introduce a patch fixing EMA calculation >>> - review fixes as per review from Vladislav >>>=20 >>> Changes in v2: >>> - clean up all fibers clock stats on fiber.top_enable() >>> - push 0 instead of NaN when cord clock_delta_last is 0 >>> - review fixes as per review from Vladislav >>>=20 >>> Serge Petrenko (3): >>> fiber.top() refactor clock and cpu time calculation >>> fiber.top(): alter exponential moving average calculation >>> app/fiber: wait till a full event loop iteration ends >>>=20 >>> src/lib/core/fiber.c | 175 = ++++++++++++++++++++++++---------------- >>> src/lib/core/fiber.h | 109 +++++++++++++++++-------- >>> src/lua/fiber.c | 20 +++-- >>> test/app/fiber.result | 38 ++++++--- >>> test/app/fiber.test.lua | 36 +++++++-- >>> 5 files changed, 252 insertions(+), 126 deletions(-) >>>=20 diff --git a/test/app/fiber.result b/test/app/fiber.result index b767bbb59..6d9604ad8 100644 --- a/test/app/fiber.result +++ b/test/app/fiber.result @@ -1510,25 +1510,21 @@ for k, v in pairs(a) do\ end --- ... -sum_inst ---- -- 100 -... -- when a fiber dies, its impact on the thread moving average -- persists for a couple of ev loop iterations, but it is no -- longer listed in fiber.top(). So sum_avg may way smaller than --- 100%. See gh-4625 for details and reenable the test below as +-- 100%. See gh-4625 for details and reenable both tests below as -- soon as it is implemented. -- In rare cases when a fiber dies on the same event loop -- iteration as you issue fiber.top(), sum_inst will also be --- smaller than 100%, but it is so rare I don't even want to --- disable the test above. -sum_avg <=3D 100 or sum_avg +-- smaller than 100%. +-- sum_inst +sum_avg <=3D 100.1 or sum_avg --- - true ... -- not exact due to accumulated integer division errors ---sum_avg > 99 and sum_avg <=3D 100 or sum_avg +--sum_avg > 99 and sum_avg <=3D 100.1 or sum_avg tbl =3D nil --- ... diff --git a/test/app/fiber.test.lua b/test/app/fiber.test.lua index 7f7350acc..6df210d9c 100644 --- a/test/app/fiber.test.lua +++ b/test/app/fiber.test.lua @@ -658,19 +658,18 @@ for k, v in pairs(a) do\ sum_avg =3D sum_avg + v["average"]\ end =20 -sum_inst -- when a fiber dies, its impact on the thread moving average -- persists for a couple of ev loop iterations, but it is no -- longer listed in fiber.top(). So sum_avg may way smaller than --- 100%. See gh-4625 for details and reenable the test below as +-- 100%. See gh-4625 for details and reenable both tests below as -- soon as it is implemented. -- In rare cases when a fiber dies on the same event loop -- iteration as you issue fiber.top(), sum_inst will also be --- smaller than 100%, but it is so rare I don't even want to --- disable the test above. -sum_avg <=3D 100 or sum_avg +-- smaller than 100%. +-- sum_inst +sum_avg <=3D 100.1 or sum_avg -- not exact due to accumulated integer division errors ---sum_avg > 99 and sum_avg <=3D 100 or sum_avg +--sum_avg > 99 and sum_avg <=3D 100.1 or sum_avg tbl =3D nil f =3D fiber.new(function()\ local fiber_key =3D fiber.self().id()..'/'..fiber.self().name()\= --Apple-Mail=_61CEA279-531F-4C41-A0A1-3063C277989B Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=utf-8 Hi! = Thanks for your reply!



20 =D0=BD=D0=BE=D1=8F=D0=B1. 2019 =D0=B3., =D0=B2 4:44, = Alexander Turenko <alexander.turenko@tarantool.org> = =D0=BD=D0=B0=D0=BF=D0=B8=D1=81=D0=B0=D0=BB(=D0=B0):

It = still shows miscompares for me:

$ cat = /proc/cpuinfo | grep processor | wc -l
8
$ = (cd test && ./test-run.py $(yes fiber.test.lua | head -n = 1000))

commit = aefc64faaf7bfbb58ac53bf5abea55d1faabcbfe

The = miscompare looks so:

[006] Test failed! = Result content mismatch:
[006] --- app/fiber.result Tue Nov = 19 20:39:21 2019
[006] +++ app/fiber.reject Wed Nov = 20 04:33:39 2019
[006] @@ -1525,7 +1525,7 @@
[006]  -- disable the test above.
[006] =  sum_avg <=3D 100 or sum_avg
[006]  ---
[006] -- true
[006] +- 100
[006] =  ...
[006]  -- not exact due to accumulated = integer division errors
[006]  --sum_avg > 99 and = sum_avg <=3D 100 or sum_avg

Maybe it is = some rounding issue.

When I changed the = condition to sum_avg <=3D 101, I got:

[015] Test failed! Result content mismatch:
[015]= --- app/fiber.result Wed Nov 20 04:35:14 2019
[015] +++ app/fiber.reject Wed Nov 20 04:35:36 2019
[015] @@ -1512,7 +1512,7 @@
[015]  ...
[015]  sum_inst
[015]  ---
[015] -- 100
[015] +- 45.802853351342
[015]  ...
[015]  -- when a fiber = dies, its impact on the thread moving average
[015] =  -- persists for a couple of ev loop iterations, but it is no

Sorry, but it does not look okay for me in the = sense that it may lead to
problems in testing: we already = have enough ones.

Let's provide a test that = will work stable or describe why it is not
possible. In = the latter case we can extract an unstable test case and
mark it as fragile using suite.ini test-run's option. It will = not
strictly close the problem, but will lower its = probability.

Maybe we can even retry such = tests or allows them to fail:
https://github.com/tarantool/test-run/issues/189

Anyway, if those fixes descrease probability = of fails, I don't mind.
Just noted that a further work may = be needed.

Well, yes, they do. Just try to run this test = before the patch.
Anyway, I guess you=E2=80=99re right that we = shouldn=E2=80=99t leave flaky tests if
possible, so I = commented the flaky pieces out and added a request to
reenable = them once #4625 (List dead fibers in fiber.top() output) = is
implemented.

The = diff=E2=80=99s below. Now 1000 out of 1000 tests pass with 16 = concurrent
jobs, at least on my machine.



WBR, Alexander Turenko.

On Tue, Nov 19, 2019 at 11:57:11PM +0100, = Vladislav Shpilevoy wrote:
Hi! Thanks for the fixes!

LGTM.
On 18/11/2019 17:05, Serge Petrenko wrote:
The first patch factors = all the clock-related members of struct fiber and struct
cord into appropriate structs and adds methods for their = updates and resets.

The second patch fixes = exponential moving average calculation so that we do not
experience huge numbers in average load percentage = calculations.

The third patch alters fiber.top() test to wait for = correct output before
testing it.

Follow-up https://github.com/tarantool/tarantool/issues/2694
Branch https://github.com/tarantool/tarantool/tree/sp/gh-2694-test-fix= up

Changes in v3:
- = introduce a new patch which refactors all the clock stat
=   handling.
- introduce a patch fixing EMA = calculation
- review fixes as per review from = Vladislav

Changes in v2:
- = clean up all fibers clock stats on fiber.top_enable()
- = push 0 instead of NaN when cord clock_delta_last is 0
- = review fixes as per review from Vladislav

Serge Petrenko (3):
 fiber.top() refactor clock and = cpu time calculation
 fiber.top(): alter exponential moving average = calculation
 app/fiber: wait till a full event loop = iteration ends

src/lib/core/fiber.c =    | 175 ++++++++++++++++++++++++----------------
src/lib/core/fiber.h    | 109 = +++++++++++++++++--------
src/lua/fiber.c =         |  20 +++--
test/app/fiber.result   |  38 ++++++---
test/app/fiber.test.lua |  36 +++++++--
= 5 files changed, 252 insertions(+), 126 deletions(-)



diff = --git a/test/app/fiber.result b/test/app/fiber.result
index = b767bbb59..6d9604ad8 100644
--- = a/test/app/fiber.result
+++ = b/test/app/fiber.result
@@ -1510,25 +1510,21 @@ for k, v in = pairs(a) do\
 end
 ---
 ...
-sum_inst
----
-- 100
-...
 -- when a fiber dies, its impact on the thread moving = average
 -- persists for a couple of ev loop iterations, but it = is no
 -- longer listed in fiber.top(). So sum_avg may way smaller = than
--- 100%. See gh-4625 for details and reenable the test below = as
+-- 100%. See gh-4625 for details and reenable both tests = below as
 -- soon as it is implemented.
 -- = In rare cases when a fiber dies on the same event loop
 -- = iteration as you issue fiber.top(), sum_inst will also be
--- = smaller than 100%, but it is so rare I don't even want = to
--- disable the test above.
-sum_avg <=3D 100 or = sum_avg
+-- smaller than 100%.
+-- sum_inst
+sum_avg = <=3D 100.1 or sum_avg
 ---
 - = true
 ...
 -- not exact due to accumulated = integer division errors
---sum_avg > 99 and sum_avg <=3D = 100 or sum_avg
+--sum_avg > 99 and sum_avg <=3D 100.1 or = sum_avg
 tbl =3D nil
 ---
 ...
diff --git a/test/app/fiber.test.lua = b/test/app/fiber.test.lua
index 7f7350acc..6df210d9c = 100644
--- a/test/app/fiber.test.lua
+++ = b/test/app/fiber.test.lua
@@ -658,19 +658,18 @@ for k, v in = pairs(a) do\
     sum_avg =3D sum_avg + = v["average"]\
 end

 

-sum_inst
 -- = when a fiber dies, its impact on the thread moving = average
 -- persists for a couple of ev loop iterations, but it = is no
 -- longer listed in fiber.top(). So sum_avg may way smaller = than
--- 100%. See gh-4625 for details and reenable the test below = as
+-- 100%. See gh-4625 for details and reenable both tests = below as
 -- soon as it is implemented.
 -- = In rare cases when a fiber dies on the same event loop
 -- = iteration as you issue fiber.top(), sum_inst will also be
--- = smaller than 100%, but it is so rare I don't even want = to
--- disable the test above.
-sum_avg <=3D 100 or = sum_avg
+-- smaller than 100%.
+-- sum_inst
+sum_avg = <=3D 100.1 or sum_avg
 -- not exact due to accumulated = integer division errors
---sum_avg > 99 and sum_avg <=3D = 100 or sum_avg
+--sum_avg > 99 and sum_avg <=3D 100.1 or = sum_avg
 tbl =3D nil
 f =3D = fiber.new(function()\
     local fiber_key =3D = fiber.self().id()..'/'..fiber.self().name()\
= --Apple-Mail=_61CEA279-531F-4C41-A0A1-3063C277989B--