Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test
@ 2020-11-15 18:43 Alexander V. Tikhonov
  2020-11-16 21:06 ` Nikita Pettik
  0 siblings, 1 reply; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-15 18:43 UTC (permalink / raw)
  To: Kirill Yukhin, Nikita Pettik; +Cc: tarantool-patches

During testing process:

  https://gitlab.com/tarantool/tarantool/-/jobs/800863457#L5303

found that test vinyl/select_consistency.test.lua hanged:

  --- vinyl/select_consistency.result		Thu Jun 11 11:59:48 2020
  +++ var/152_vinyl/select_consistency.result	Tue Oct 20 09:13:02 2020
  @@ -137,16 +137,3 @@
   for i = 1, ch:size() do
       ch:get()
   end;
  ----
  -...
  -test_run:cmd("setopt delimiter ''");
  ----
  -- true
  -...

This issue stops the testing till the complete timeout of the test-run
will stop all the testing process.

Also found that real hang occured at box.snapshot command in snap_loop
loop, like:

  [001] Last 15 lines of Tarantool Log file [Instance "vinyl"][/source/test/var/001_vinyl/vinyl.log]:
  [001] 2020-11-15 18:13:49.928 [78887] main/105/gc I> removed /source/test/var/001_vinyl/vinyl/519/2/00000000000000011145.run
  [001] 2020-11-15 18:13:49.928 [78887] vinyl.dump.0/640/task I> writing `/source/test/var/001_vinyl/vinyl/519/0/00000000000000011538.index'
  [001] 2020-11-15 18:13:49.928 [78887] snapshot/101/main I> done
  [001] 2020-11-15 18:13:49.929 [78887] main/105/gc I> removed /source/test/var/001_vinyl/vinyl/519/2/00000000000000011157.index
  [001] 2020-11-15 18:13:49.929 [78887] main/108/vinyl.scheduler I> 519/0: dump completed
  [001] 2020-11-15 18:13:49.929 [78887] main/108/vinyl.scheduler I> dumped 0 bytes in 0.0 s, rate 0.0 MB/s
  [001] 2020-11-15 18:13:49.929 [78887] main/481/lua I> vinyl checkpoint completed
  [001] 2020-11-15 18:13:49.929 [78887] main/105/gc I> removed /source/test/var/001_vinyl/vinyl/519/2/00000000000000011157.run
  [001] 2020-11-15 18:13:49.929 [78887] main/108/vinyl.scheduler I> 519/0: started compacting range ([94]..inf), runs 2/2
  [001] 2020-11-15 18:13:49.930 [78887] wal I> removed /source/test/var/001_vinyl/vinyl/00000000000000093070.xlog
  [001] 2020-11-15 18:13:49.930 [78887] main/105/gc I> removed /source/test/var/001_vinyl/vinyl/00000000000000094030.snap.inprogress
  [001] 2020-11-15 18:13:49.930 [78887] main I> removed /source/test/var/001_vinyl/vinyl/00000000000000093306.snap
  [001] 2020-11-15 18:13:49.930 [78887] main I> removed /source/test/var/001_vinyl/vinyl/00000000000000093070.vylog
  [001] 2020-11-15 18:13:49.931 [78887] main/481/lua F> can't rename .snap.inprogress
  [001] 2020-11-15 18:13:49.931 [78887] main/481/lua F> can't rename .snap.inprogress

It happened because on heavy loaded hosts may occure the situation
when the previous snapshot was inprogress when the new snapshot came
with the same file name *.snap.inprogress. It happens before the
current snapshot completed and printed "dump completed" in log file.
Also this file *.snap.inprogress was seen left on manual debug, when
the test hanged before this patch. To resolve the test issue fiber
sleep delay after it can be increased, but we want to save the issue
reproducable. The current patch corrects the test to avoid of hang on
box snapshot, to be able to continue testing after it failed. Fiber
sleep was even decreased after adding fiber for box.snapshot to be
able to reproduce the issue.

Needed for #4385
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4385_hang_select_consistency
Issue: https://github.com/tarantool/tarantool/issues/4385

 test/vinyl/select_consistency.result   | 51 +++++++++++++++++++-------
 test/vinyl/select_consistency.test.lua | 45 ++++++++++++++++-------
 test/vinyl/suite.ini                   |  3 +-
 3 files changed, 71 insertions(+), 28 deletions(-)

diff --git a/test/vinyl/select_consistency.result b/test/vinyl/select_consistency.result
index f27d0ec68..cd99b84bf 100644
--- a/test/vinyl/select_consistency.result
+++ b/test/vinyl/select_consistency.result
@@ -4,6 +4,9 @@ test_run = require('test_run').new()
 fiber = require 'fiber'
 ---
 ...
+log = require 'log'
+---
+...
 math.randomseed(os.time())
 ---
 ...
@@ -52,6 +55,8 @@ function gen_update()
 end;
 ---
 ...
+failed = false
+
 function dml_loop()
     local insert = true
     while not stop do
@@ -75,8 +80,13 @@ end;
 ...
 function snap_loop()
     while not stop do
-        box.snapshot()
-        fiber.sleep(0.1)
+        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)
+        if ok == false then
+            log.info("error: box.snapshot failed with error " .. err)
+            failed = true
+            break
+        end
+        fiber.sleep(0.01)
     end
     ch:put(true)
 end;
@@ -97,10 +107,7 @@ _ = fiber.create(dml_loop);
 _ = fiber.create(snap_loop);
 ---
 ...
-failed = {};
----
-...
-for i = 1, 10000 do
+function run_iter()
     local val = math.random(MAX_VAL)
     box.begin()
     local res1 = s.index.i1:select({val})
@@ -117,25 +124,37 @@ for i = 1, 10000 do
                 end
             end
             if not found then
+                log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2)
                 equal = false
                 break
             end
         end
-    else
-        equal = false
-    end
-    if not equal then
-        table.insert(failed, {res1, res2})
     end
     fiber.sleep(0)
+    return equal
+end;
+---
+...
+for i = 1, 10000 do
+    if failed or not run_iter(i) then
+        log.error("error: failed on iteration " .. i)
+        failed = true
+        break
+    end
 end;
 ---
 ...
 stop = true;
 ---
 ...
-for i = 1, ch:size() do
-    ch:get()
+function check_get()
+    for i = 1, ch:size() do
+        if not test_run:wait_cond(function() return ch:get() ~= nil end) then
+            log.error("error: hanged on ch:get() on iteration " .. i)
+            return false
+	end
+    end
+    return true
 end;
 ---
 ...
@@ -143,10 +162,14 @@ test_run:cmd("setopt delimiter ''");
 ---
 - true
 ...
-#failed == 0 or failed
+test_run:wait_cond(function() return check_get() end)
 ---
 - true
 ...
+failed
+---
+- false
+...
 s:drop()
 ---
 ...
diff --git a/test/vinyl/select_consistency.test.lua b/test/vinyl/select_consistency.test.lua
index 73b443e89..e5e01b741 100644
--- a/test/vinyl/select_consistency.test.lua
+++ b/test/vinyl/select_consistency.test.lua
@@ -1,6 +1,7 @@
 test_run = require('test_run').new()
 
 fiber = require 'fiber'
+log = require 'log'
 
 math.randomseed(os.time())
 
@@ -33,6 +34,8 @@ function gen_update()
     pcall(s.update, s, math.random(MAX_KEY), {{'+', 5, 1}})
 end;
 
+failed = false
+
 function dml_loop()
     local insert = true
     while not stop do
@@ -55,8 +58,13 @@ end;
 
 function snap_loop()
     while not stop do
-        box.snapshot()
-        fiber.sleep(0.1)
+        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)
+        if ok == false then
+            log.info("error: box.snapshot failed with error " .. err)
+            failed = true
+            break
+        end
+        fiber.sleep(0.01)
     end
     ch:put(true)
 end;
@@ -68,9 +76,7 @@ _ = fiber.create(dml_loop);
 _ = fiber.create(dml_loop);
 _ = fiber.create(snap_loop);
 
-failed = {};
-
-for i = 1, 10000 do
+function run_iter()
     local val = math.random(MAX_VAL)
     box.begin()
     local res1 = s.index.i1:select({val})
@@ -87,26 +93,39 @@ for i = 1, 10000 do
                 end
             end
             if not found then
+                log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2)
                 equal = false
                 break
             end
         end
-    else
-        equal = false
-    end
-    if not equal then
-        table.insert(failed, {res1, res2})
     end
     fiber.sleep(0)
+    return equal
+end;
+
+for i = 1, 10000 do
+    if failed or not run_iter(i) then
+        log.error("error: failed on iteration " .. i)
+        failed = true
+        break
+    end
 end;
 
 stop = true;
-for i = 1, ch:size() do
-    ch:get()
+
+function check_get()
+    for i = 1, ch:size() do
+        if not test_run:wait_cond(function() return ch:get() ~= nil end) then
+            log.error("error: hanged on ch:get() on iteration " .. i)
+            return false
+	end
+    end
+    return true
 end;
 
 test_run:cmd("setopt delimiter ''");
 
-#failed == 0 or failed
+test_run:wait_cond(function() return check_get() end)
+failed
 
 s:drop()
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index 1fded2393..d55e7734c 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -23,7 +23,8 @@ fragile = {
             "issues": [ "gh-4346" ]
         },
         "select_consistency.test.lua": {
-            "issues": [ "gh-4385" ]
+            "issues": [ "gh-4385" ],
+            "checksums": [ "817493231b4ce2ebb8aa295303e857df" ]
         },
         "throttle.test.lua": {
             "issues": [ "gh-4168" ]
-- 
2.25.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test
  2020-11-15 18:43 [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test Alexander V. Tikhonov
@ 2020-11-16 21:06 ` Nikita Pettik
  2020-12-02 18:57   ` Alexander V. Tikhonov
  0 siblings, 1 reply; 5+ messages in thread
From: Nikita Pettik @ 2020-11-16 21:06 UTC (permalink / raw)
  To: Alexander V. Tikhonov; +Cc: tarantool-patches

On 15 Nov 21:43, Alexander V. Tikhonov wrote:
> 
> It happened because on heavy loaded hosts may occure the situation
> when the previous snapshot was inprogress when the new snapshot came
> with the same file name *.snap.inprogress. It happens before the
> current snapshot completed and printed "dump completed" in log file.
> Also this file *.snap.inprogress was seen left on manual debug, when
> the test hanged before this patch. To resolve the test issue fiber
> sleep delay after it can be increased, but we want to save the issue
> reproducable. The current patch corrects the test to avoid of hang on

I guess increasing sleep and fileing exact repro with 'bug' label is
enough to deal with this test.

> box snapshot, to be able to continue testing after it failed. Fiber
> sleep was even decreased after adding fiber for box.snapshot to be
> able to reproduce the issue.
> 
> Needed for #4385
> ---
> @@ -75,8 +80,13 @@ end;
>  ...
>  function snap_loop()
>      while not stop do
> -        box.snapshot()
> -        fiber.sleep(0.1)
> +        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)

Why not simply wrap box.snapshot() in pcall? Why do you need another
one separate fiber for it?

> @@ -97,10 +107,7 @@ _ = fiber.create(dml_loop);
>  _ = fiber.create(snap_loop);
>  ---
>  ...
> -failed = {};
> ----
> -...
> -for i = 1, 10000 do
> +function run_iter()

I'd call it select_loop() (as dml/snap loops above).

>      local val = math.random(MAX_VAL)
>      box.begin()
>      local res1 = s.index.i1:select({val})
> @@ -117,25 +124,37 @@ for i = 1, 10000 do
>                  end
>              end
>              if not found then
> +                log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2)

What's the point of printing number of elements in res1/res2?
Why not print the whole result set?

> +end;
> +---
> +...
> +for i = 1, 10000 do
> +    if failed or not run_iter(i) then
> +        log.error("error: failed on iteration " .. i)
> +        failed = true
> +        break
> +    end
>  end;
> +function check_get()
> +    for i = 1, ch:size() do
> +        if not test_run:wait_cond(function() return ch:get() ~= nil end) then

As I see from docs, ch:get() has optional timeout param. Why not use it?
I mean not only this particular test, but in general.

> +            log.error("error: hanged on ch:get() on iteration " .. i)
> +            return false
> +	end
> +    end
> +    return true
>  end;
>  ---
>  ...
> @@ -143,10 +162,14 @@ test_run:cmd("setopt delimiter ''");
>  ---
>  - true
>  ...
> -#failed == 0 or failed
> +test_run:wait_cond(function() return check_get() end)

Why do you need another one wait_cond over check_get?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test
  2020-11-16 21:06 ` Nikita Pettik
@ 2020-12-02 18:57   ` Alexander V. Tikhonov
  2020-12-07 12:26     ` Nikita Pettik
  0 siblings, 1 reply; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-12-02 18:57 UTC (permalink / raw)
  To: Nikita Pettik; +Cc: tarantool-patches

Hi Nikita, thanks for the review. I've made all of your suggestions,
please review below.

On Mon, Nov 16, 2020 at 09:06:54PM +0000, Nikita Pettik wrote:
> On 15 Nov 21:43, Alexander V. Tikhonov wrote:
> > 
> > It happened because on heavy loaded hosts may occure the situation
> > when the previous snapshot was inprogress when the new snapshot came
> > with the same file name *.snap.inprogress. It happens before the
> > current snapshot completed and printed "dump completed" in log file.
> > Also this file *.snap.inprogress was seen left on manual debug, when
> > the test hanged before this patch. To resolve the test issue fiber
> > sleep delay after it can be increased, but we want to save the issue
> > reproducable. The current patch corrects the test to avoid of hang on
> 
> I guess increasing sleep and fileing exact repro with 'bug' label is
> enough to deal with this test.
>

Ok, I've added instructions how to reproduce the issue in the issue and
increased the sleep from 0.1 to 0.5.

> > box snapshot, to be able to continue testing after it failed. Fiber
> > sleep was even decreased after adding fiber for box.snapshot to be
> > able to reproduce the issue.
> > 
> > Needed for #4385
> > ---
> > @@ -75,8 +80,13 @@ end;
> >  ...
> >  function snap_loop()
> >      while not stop do
> > -        box.snapshot()
> > -        fiber.sleep(0.1)
> > +        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)
> 
> Why not simply wrap box.snapshot() in pcall? Why do you need another
> one separate fiber for it?
> 

Found that the test hangs running without it. Also this part, I'd
corrected a bit.

> > @@ -97,10 +107,7 @@ _ = fiber.create(dml_loop);
> >  _ = fiber.create(snap_loop);
> >  ---
> >  ...
> > -failed = {};
> > ----
> > -...
> > -for i = 1, 10000 do
> > +function run_iter()
> 
> I'd call it select_loop() (as dml/snap loops above).
>

Sure, changed.

> >      local val = math.random(MAX_VAL)
> >      box.begin()
> >      local res1 = s.index.i1:select({val})
> > @@ -117,25 +124,37 @@ for i = 1, 10000 do
> >                  end
> >              end
> >              if not found then
> > +                log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2)
> 
> What's the point of printing number of elements in res1/res2?
> Why not print the whole result set?
> 

Right, added.

> > +end;
> > +---
> > +...
> > +for i = 1, 10000 do
> > +    if failed or not run_iter(i) then
> > +        log.error("error: failed on iteration " .. i)
> > +        failed = true
> > +        break
> > +    end
> >  end;
> > +function check_get()
> > +    for i = 1, ch:size() do
> > +        if not test_run:wait_cond(function() return ch:get() ~= nil end) then
> 
> As I see from docs, ch:get() has optional timeout param. Why not use it?
> I mean not only this particular test, but in general.
> 

Added timeout using local setup as define to make it easy to find all
the tests with the same changes in the future.

> > +            log.error("error: hanged on ch:get() on iteration " .. i)
> > +            return false
> > +	end
> > +    end
> > +    return true
> >  end;
> >  ---
> >  ...
> > @@ -143,10 +162,14 @@ test_run:cmd("setopt delimiter ''");
> >  ---
> >  - true
> >  ...
> > -#failed == 0 or failed
> > +test_run:wait_cond(function() return check_get() end)
> 
> Why do you need another one wait_cond over check_get?
> 

Right, no need for now.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test
  2020-12-02 18:57   ` Alexander V. Tikhonov
@ 2020-12-07 12:26     ` Nikita Pettik
  0 siblings, 0 replies; 5+ messages in thread
From: Nikita Pettik @ 2020-12-07 12:26 UTC (permalink / raw)
  To: Alexander V. Tikhonov; +Cc: tarantool-patches

On 02 Dec 21:57, Alexander V. Tikhonov wrote:
> Hi Nikita, thanks for the review. I've made all of your suggestions,
> please review below.
> 
> On Mon, Nov 16, 2020 at 09:06:54PM +0000, Nikita Pettik wrote:
> > On 15 Nov 21:43, Alexander V. Tikhonov wrote:
> > > 
> > > It happened because on heavy loaded hosts may occure the situation
> > > when the previous snapshot was inprogress when the new snapshot came
> > > with the same file name *.snap.inprogress. It happens before the
> > > current snapshot completed and printed "dump completed" in log file.
> > > Also this file *.snap.inprogress was seen left on manual debug, when
> > > the test hanged before this patch. To resolve the test issue fiber
> > > sleep delay after it can be increased, but we want to save the issue
> > > reproducable. The current patch corrects the test to avoid of hang on
> > 
> > I guess increasing sleep and fileing exact repro with 'bug' label is
> > enough to deal with this test.
> >
> 
> Ok, I've added instructions how to reproduce the issue in the issue and
> increased the sleep from 0.1 to 0.5.
> 
> > > box snapshot, to be able to continue testing after it failed. Fiber
> > > sleep was even decreased after adding fiber for box.snapshot to be
> > > able to reproduce the issue.
> > > 
> > > Needed for #4385
> > > ---
> > > @@ -75,8 +80,13 @@ end;
> > >  ...
> > >  function snap_loop()
> > >      while not stop do
> > > -        box.snapshot()
> > > -        fiber.sleep(0.1)
> > > +        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)
> > 
> > Why not simply wrap box.snapshot() in pcall? Why do you need another
> > one separate fiber for it?
> >

Could you please file new issue since it is not longer 'flaky test', but
rather vinyl bug? I mean this situation with simultanious instance
dumps seems to core issue. Also it would be nice to have reproducer out
of test-run environemt, so owner of the bug can run Tarantool, copy-paste
reproducer to the console and get mentioned errors.
 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test
@ 2020-11-09 12:57 Alexander V. Tikhonov
  0 siblings, 0 replies; 5+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-09 12:57 UTC (permalink / raw)
  To: Kirill Yukhin; +Cc: tarantool-patches

During testing process:

  https://gitlab.com/tarantool/tarantool/-/jobs/800863457#L5303

found that test vinyl/select_consistency.test.lua hanged:

  --- vinyl/select_consistency.result		Thu Jun 11 11:59:48 2020
  +++ var/152_vinyl/select_consistency.result	Tue Oct 20 09:13:02 2020
  @@ -137,16 +137,3 @@
   for i = 1, ch:size() do
       ch:get()
   end;
  ----
  -...
  -test_run:cmd("setopt delimiter ''");
  ----
  -- true
  -...

This issue stops the testing till the complete timeout of the test-run
will stop all the testing process. To avoid of it 'wait_cond()' routine
was added to 'ch:get()' command to set it's local timeout to avoid of
waiting for test-run global timeout.

Needed for #4385
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4385_hang_select_consistency
Issue: https://github.com/tarantool/tarantool/issues/4385

 test/vinyl/select_consistency.result   | 2 +-
 test/vinyl/select_consistency.test.lua | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/vinyl/select_consistency.result b/test/vinyl/select_consistency.result
index f27d0ec68..5a23a674b 100644
--- a/test/vinyl/select_consistency.result
+++ b/test/vinyl/select_consistency.result
@@ -135,7 +135,7 @@ stop = true;
 ---
 ...
 for i = 1, ch:size() do
-    ch:get()
+    test_run:wait_cond(function() return ch:get() end)
 end;
 ---
 ...
diff --git a/test/vinyl/select_consistency.test.lua b/test/vinyl/select_consistency.test.lua
index 73b443e89..6398d09cd 100644
--- a/test/vinyl/select_consistency.test.lua
+++ b/test/vinyl/select_consistency.test.lua
@@ -102,7 +102,7 @@ end;
 
 stop = true;
 for i = 1, ch:size() do
-    ch:get()
+    test_run:wait_cond(function() return ch:get() end)
 end;
 
 test_run:cmd("setopt delimiter ''");
-- 
2.25.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-12-07 12:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-15 18:43 [Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test Alexander V. Tikhonov
2020-11-16 21:06 ` Nikita Pettik
2020-12-02 18:57   ` Alexander V. Tikhonov
2020-12-07 12:26     ` Nikita Pettik
  -- strict thread matches above, loose matches on Subject: below --
2020-11-09 12:57 Alexander V. Tikhonov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox