[Tarantool-patches] [PATCH v3 0/3] fiber.top(): minor fixup

Wed Nov 20 15:19:02 MSK 2019

On Wed, Nov 20, 2019 at 11:29:26AM +0300, Serge Petrenko wrote:
> Hi! Thanks for your reply!
> --
> Serge Petrenko
> sergepetrenko at tarantool.org
> 
> 
> 
> 
> > 20 нояб. 2019 г., в 4:44, Alexander Turenko <alexander.turenko at tarantool.org> написал(а):
> > 
> > It still shows miscompares for me:
> > 
> > $ cat /proc/cpuinfo | grep processor | wc -l
> > 8
> > $ (cd test && ./test-run.py $(yes fiber.test.lua | head -n 1000))
> > 
> > commit aefc64faaf7bfbb58ac53bf5abea55d1faabcbfe
> > 
> > The miscompare looks so:
> > 
> > [006] Test failed! Result content mismatch:
> > [006] --- app/fiber.result	Tue Nov 19 20:39:21 2019
> > [006] +++ app/fiber.reject	Wed Nov 20 04:33:39 2019
> > [006] @@ -1525,7 +1525,7 @@
> > [006]  -- disable the test above.
> > [006]  sum_avg <= 100 or sum_avg
> > [006]  ---
> > [006] -- true
> > [006] +- 100
> > [006]  ...
> > [006]  -- not exact due to accumulated integer division errors
> > [006]  --sum_avg > 99 and sum_avg <= 100 or sum_avg
> > 
> > Maybe it is some rounding issue.
> > 
> > When I changed the condition to sum_avg <= 101, I got:
> > 
> > [015] Test failed! Result content mismatch:
> > [015] --- app/fiber.result	Wed Nov 20 04:35:14 2019
> > [015] +++ app/fiber.reject	Wed Nov 20 04:35:36 2019
> > [015] @@ -1512,7 +1512,7 @@
> > [015]  ...
> > [015]  sum_inst
> > [015]  ---
> > [015] -- 100
> > [015] +- 45.802853351342
> > [015]  ...
> > [015]  -- when a fiber dies, its impact on the thread moving average
> > [015]  -- persists for a couple of ev loop iterations, but it is no
> > 
> > Sorry, but it does not look okay for me in the sense that it may lead to
> > problems in testing: we already have enough ones.
> > 
> > Let's provide a test that will work stable or describe why it is not
> > possible. In the latter case we can extract an unstable test case and
> > mark it as fragile using suite.ini test-run's option. It will not
> > strictly close the problem, but will lower its probability.
> > 
> > Maybe we can even retry such tests or allows them to fail:
> > https://github.com/tarantool/test-run/issues/189
> > 
> > Anyway, if those fixes descrease probability of fails, I don't mind.
> > Just noted that a further work may be needed.
> 
> Well, yes, they do. Just try to run this test before the patch.
> Anyway, I guess you’re right that we shouldn’t leave flaky tests if
> possible, so I commented the flaky pieces out and added a request to
> reenable them once #4625 (List dead fibers in fiber.top <http://fiber.top/>() output) is
> implemented.
> 
> The diff’s below. Now 1000 out of 1000 tests pass with 16 concurrent
> jobs, at least on my machine.

Verified 79579e399ff08017581ba58895d06bd7da54f67d. There are no more
fails on fiber top related cases. So, I'm okay now (it is not formal
LGTM, because I did not review for the whole patchset, just my personal
feeling).

NB: There are fails that are unrelated to fiber top:

[012] --- app/fiber.result	Wed Nov 20 14:39:10 2019
[012] +++ app/fiber.reject	Wed Nov 20 14:40:57 2019
[012] @@ -675,8 +675,8 @@
[012]  -- attempt to access local storage of dead fiber raises error
[012]  pcall(function(f) return f.storage end, f)
[012]  ---
[012] -- false
[012] -- '[string "return pcall(function(f) return f.storage end..."]:1: the fiber is dead'
[012] +- true
[012] +- key: some value
[012]  ...
[012]  --
[012]  -- Test that local storage is garbage collected when fiber is died

I run 1000 jobs in 16 threads three times and got 5, 8 and 6 fails. It
is not about fiber top, just noted.