Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH] test: fix flaky fails of box/iproto_stress
@ 2018-12-25 23:37 Alexander Turenko
  2018-12-27 10:26 ` Vladimir Davydov
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Turenko @ 2018-12-25 23:37 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Alexander Turenko, tarantool-patches

Fixes #3911.
---

https://github.com/tarantool/tarantool/issues/3911
https://github.com/tarantool/tarantool/tree/Totktonada/gh-3911-fix-box-iproto-stress

 test/box/iproto_stress.result   | 1 +
 test/box/iproto_stress.test.lua | 1 +
 2 files changed, 2 insertions(+)

diff --git a/test/box/iproto_stress.result b/test/box/iproto_stress.result
index 4239b49b8..4b7b41bd7 100644
--- a/test/box/iproto_stress.result
+++ b/test/box/iproto_stress.result
@@ -40,6 +40,7 @@ function worker(i)
             n_errors = n_errors + 1
             break
         end
+        conn:wait_state('active', 10)
         for k = 1,10 do
             conn.space.test:insert{i, j, k}
         end
diff --git a/test/box/iproto_stress.test.lua b/test/box/iproto_stress.test.lua
index 2f3071450..84a58641f 100644
--- a/test/box/iproto_stress.test.lua
+++ b/test/box/iproto_stress.test.lua
@@ -23,6 +23,7 @@ function worker(i)
             n_errors = n_errors + 1
             break
         end
+        conn:wait_state('active', 10)
         for k = 1,10 do
             conn.space.test:insert{i, j, k}
         end
-- 
2.20.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] test: fix flaky fails of box/iproto_stress
  2018-12-25 23:37 [PATCH] test: fix flaky fails of box/iproto_stress Alexander Turenko
@ 2018-12-27 10:26 ` Vladimir Davydov
  2018-12-28  3:11   ` Alexander Turenko
  0 siblings, 1 reply; 4+ messages in thread
From: Vladimir Davydov @ 2018-12-27 10:26 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches

On Wed, Dec 26, 2018 at 02:37:51AM +0300, Alexander Turenko wrote:
> Fixes #3911.
> ---
> 
> https://github.com/tarantool/tarantool/issues/3911
> https://github.com/tarantool/tarantool/tree/Totktonada/gh-3911-fix-box-iproto-stress
> 
>  test/box/iproto_stress.result   | 1 +
>  test/box/iproto_stress.test.lua | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/test/box/iproto_stress.result b/test/box/iproto_stress.result
> index 4239b49b8..4b7b41bd7 100644
> --- a/test/box/iproto_stress.result
> +++ b/test/box/iproto_stress.result
> @@ -40,6 +40,7 @@ function worker(i)
>              n_errors = n_errors + 1
>              break
>          end
> +        conn:wait_state('active', 10)

According to

  https://tarantool.io/en/doc/1.10/reference/reference_lua/net_box/#lua-function.net_box.new

net_box.connect() waits until connection is active unless it is passed
wait_connected = false.

A quick glance at the code confirms that:

  https://github.com/tarantool/tarantool/blob/911139e379666ddff6e448184d066384a085df1e/src/box/lua/net_box.lua#L958

The question is why it doesn't happen in the test. A bug in the
documentation or in net.box?

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] test: fix flaky fails of box/iproto_stress
  2018-12-27 10:26 ` Vladimir Davydov
@ 2018-12-28  3:11   ` Alexander Turenko
  2019-01-09 10:33     ` Vladimir Davydov
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Turenko @ 2018-12-28  3:11 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

On Thu, Dec 27, 2018 at 01:26:26PM +0300, Vladimir Davydov wrote:
> On Wed, Dec 26, 2018 at 02:37:51AM +0300, Alexander Turenko wrote:
> > Fixes #3911.
> > ---
> > 
> > https://github.com/tarantool/tarantool/issues/3911
> > https://github.com/tarantool/tarantool/tree/Totktonada/gh-3911-fix-box-iproto-stress
> > 
> >  test/box/iproto_stress.result   | 1 +
> >  test/box/iproto_stress.test.lua | 1 +
> >  2 files changed, 2 insertions(+)
> > 
> > diff --git a/test/box/iproto_stress.result b/test/box/iproto_stress.result
> > index 4239b49b8..4b7b41bd7 100644
> > --- a/test/box/iproto_stress.result
> > +++ b/test/box/iproto_stress.result
> > @@ -40,6 +40,7 @@ function worker(i)
> >              n_errors = n_errors + 1
> >              break
> >          end
> > +        conn:wait_state('active', 10)
> 
> According to
> 
>   https://tarantool.io/en/doc/1.10/reference/reference_lua/net_box/#lua-function.net_box.new
> 
> net_box.connect() waits until connection is active unless it is passed
> wait_connected = false.
> 
> A quick glance at the code confirms that:
> 
>   https://github.com/tarantool/tarantool/blob/911139e379666ddff6e448184d066384a085df1e/src/box/lua/net_box.lua#L958
> 
> The question is why it doesn't happen in the test. A bug in the
> documentation or in net.box?

It seems I wrongly determined the reason of the flaky fails. Thank you
for catching it up.

I found that I'm able to reproduce the issue (w/o net.box changes) when
compiling gcc in 16 threads in background :)

I guessed it is because attempts count at the end of test is lesser than
needed for parallel run on travis-ci and tried to increase it from 100
to 1000. But even in this case I catched the fail (once, to be honest)
with background CPU load and 1000 iproto_stress tests in 16 threads.

It seems it is rare thing. Now I don't sure we should change attempts
count. Maybe we should run the test in non-parallel mode, after all
parallel tests? We must move it into separate test suite to do so for
now.

Maybe we should leave it unchanged and check whether it is really rare.

(The branch was updated with attemps == 1000.)

Any suggestions?

WBR, Alexander Turenko.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] test: fix flaky fails of box/iproto_stress
  2018-12-28  3:11   ` Alexander Turenko
@ 2019-01-09 10:33     ` Vladimir Davydov
  0 siblings, 0 replies; 4+ messages in thread
From: Vladimir Davydov @ 2019-01-09 10:33 UTC (permalink / raw)
  To: Alexander Turenko; +Cc: tarantool-patches

On Fri, Dec 28, 2018 at 06:11:40AM +0300, Alexander Turenko wrote:
> On Thu, Dec 27, 2018 at 01:26:26PM +0300, Vladimir Davydov wrote:
> > On Wed, Dec 26, 2018 at 02:37:51AM +0300, Alexander Turenko wrote:
> > > Fixes #3911.
> > > ---
> > > 
> > > https://github.com/tarantool/tarantool/issues/3911
> > > https://github.com/tarantool/tarantool/tree/Totktonada/gh-3911-fix-box-iproto-stress
> > > 
> > >  test/box/iproto_stress.result   | 1 +
> > >  test/box/iproto_stress.test.lua | 1 +
> > >  2 files changed, 2 insertions(+)
> > > 
> > > diff --git a/test/box/iproto_stress.result b/test/box/iproto_stress.result
> > > index 4239b49b8..4b7b41bd7 100644
> > > --- a/test/box/iproto_stress.result
> > > +++ b/test/box/iproto_stress.result
> > > @@ -40,6 +40,7 @@ function worker(i)
> > >              n_errors = n_errors + 1
> > >              break
> > >          end
> > > +        conn:wait_state('active', 10)
> > 
> > According to
> > 
> >   https://tarantool.io/en/doc/1.10/reference/reference_lua/net_box/#lua-function.net_box.new
> > 
> > net_box.connect() waits until connection is active unless it is passed
> > wait_connected = false.
> > 
> > A quick glance at the code confirms that:
> > 
> >   https://github.com/tarantool/tarantool/blob/911139e379666ddff6e448184d066384a085df1e/src/box/lua/net_box.lua#L958
> > 
> > The question is why it doesn't happen in the test. A bug in the
> > documentation or in net.box?
> 
> It seems I wrongly determined the reason of the flaky fails. Thank you
> for catching it up.
> 
> I found that I'm able to reproduce the issue (w/o net.box changes) when
> compiling gcc in 16 threads in background :)
> 
> I guessed it is because attempts count at the end of test is lesser than
> needed for parallel run on travis-ci and tried to increase it from 100
> to 1000. But even in this case I catched the fail (once, to be honest)

1000 means waiting for 100 seconds, which is a bit of an overkill.

> with background CPU load and 1000 iproto_stress tests in 16 threads.
> 
> It seems it is rare thing. Now I don't sure we should change attempts
> count. Maybe we should run the test in non-parallel mode, after all
> parallel tests? We must move it into separate test suite to do so for
> now.

I don't think we need to add this workaround, because the failure is
pretty rare. Besides, it still isn't quite clear what's causing it so
occasional failures will remind us that we need to investigate the
problem deeper.

> 
> Maybe we should leave it unchanged and check whether it is really rare.
> 
> (The branch was updated with attemps == 1000.)
> 
> Any suggestions?

As a matter of fact, the error message mentioned in the issue
description

  E> LuajitError: [string "function worker(i)     n_workers = n_workers ..."]:1: attempt to index field 'space' (a nil value)

looks weird to me. After all, the test does nothing criminal - it simply
creates a net.box connection and tries to access an existing space and
for some reason fails. We need to find out the real cause of the
problem. May be a bug in net.box implementation...

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-01-09 10:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-25 23:37 [PATCH] test: fix flaky fails of box/iproto_stress Alexander Turenko
2018-12-27 10:26 ` Vladimir Davydov
2018-12-28  3:11   ` Alexander Turenko
2019-01-09 10:33     ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox