[Tarantool-patches] [PATCH v2] test: fix hang of vinyl/select_consistency.* test

Nikita Pettik korablev at tarantool.org
Mon Dec 7 15:22:23 MSK 2020


On 02 Dec 21:09, Alexander V. Tikhonov wrote:
>   --- vinyl/select_consistency.result		Thu Jun 11 11:59:48 2020
>   +++ var/152_vinyl/select_consistency.result	Tue Oct 20 09:13:02 2020
>   @@ -137,16 +137,3 @@
>    for i = 1, ch:size() do
>        ch:get()
>    end;
>   ----
>   -...
>   -test_run:cmd("setopt delimiter ''");
>   ----
>   -- true
>   -...
> 
> This issue stops the testing till the complete timeout of the test-run
> will stop all the testing process.

Hm, could we patch test-run so that timeout of single test won't stop
the whole testing process? 
 
> It happened because on heavy loaded hosts may occure the situation
> when the previous snapshot was inprogress when the new snapshot came
> with the same file name *.snap.inprogress. It happens before the
> current snapshot completed and printed "dump completed" in log file.
> Also this file *.snap.inprogress was seen left on manual debug, when
> the test hanged before this patch. To resolve the test issue fiber
> sleep delay after it can be increased, but we want to save the issue
> reproducable. The current patch corrects the test to avoid of hang on
> box snapshot, to be able to continue testing after it failed.
> 
> Needed for #4385
> ---
> 
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/gh-4385_hang_select_consistency
> Issue: https://github.com/tarantool/tarantool/issues/4385
> 
>  test/vinyl/select_consistency.result   | 67 ++++++++++++++++++++------
>  test/vinyl/select_consistency.test.lua | 61 ++++++++++++++++++-----
>  test/vinyl/suite.ini                   |  3 +-
>  3 files changed, 103 insertions(+), 28 deletions(-)
> 
> diff --git a/test/vinyl/select_consistency.result b/test/vinyl/select_consistency.result
> index f27d0ec68..f6d96473d 100644
> --- a/test/vinyl/select_consistency.result
> +++ b/test/vinyl/select_consistency.result
> @@ -4,6 +4,9 @@ test_run = require('test_run').new()
>  fiber = require 'fiber'
>  ---
>  ...
> +log = require 'log'
> +---
> +...
>  math.randomseed(os.time())
>  ---
>  ...
> @@ -52,6 +55,8 @@ function gen_update()
>  end;
>  ---
>  ...
> +failed = false
> +
>  function dml_loop()
>      local insert = true
>      while not stop do
> @@ -75,8 +80,18 @@ end;
>  ...
>  function snap_loop()
>      while not stop do
> -        box.snapshot()
> -        fiber.sleep(0.1)
> +        local ok = fiber.create(function()

Still don't understand necessity of creating another one fiber.
snap_loop() is itself executed in a separate fiber. I suggest you
to dig a bit into this problem, it may hide another issue.

> +            local ok, err = pcall(box.snapshot)
> +            if ok == false then
> +                log.error("error: box.snapshot failed with error " .. err)
> +            end
> +            return ok
> +        end)
> +        if ok == false then
> +            failed = true
> +            break
> +        end
> +        fiber.sleep(0.5)
>      end
>      ch:put(true)
>  end;
> @@ -97,10 +112,7 @@ _ = fiber.create(dml_loop);
>  _ = fiber.create(snap_loop);
>  ---
>  ...
> -failed = {};
> ----
> -...
> -for i = 1, 10000 do
> +function select_loop()
>      local val = math.random(MAX_VAL)
>      box.begin()
>      local res1 = s.index.i1:select({val})
> @@ -117,25 +129,48 @@ for i = 1, 10000 do
>                  end
>              end
>              if not found then
> +                log.error("error: equal not found for res1 and res2:")
> +                local ts1 = ''
> +                local ts2 = ''
> +                for _, t1 in ipairs(res1) do
> +                    ts1 = ts1 .. " " .. t1[1]
> +                end
> +                log.info("error: equal not found for res1" .. ts1)

Why the first log is error meanwhile second and third are info?
I suggest to make them both error to make it consistent.

> +                for _, t2 in ipairs(res2) do
> +                    ts2 = ts2 .. " " .. t2[1]
> +                end
> +                log.info("error: equal not found for res2" .. ts2)
>                  equal = false
>  test_run:cmd("setopt delimiter ''");
>  
> -#failed == 0 or failed
> +test_run:wait_cond(function() return check_get() end)
> +failed

In previous review iter you approved that there's no need in :wait_cond()
here, but still it is presented.
 
> -            "issues": [ "gh-4385" ]
> +            "issues": [ "gh-4385" ],
> +            "checksums": [ "94d5359d451ee7bfdb35a037f80b14ff", "f7e8ebc72f4d59941b4ca10fde474963" ]

Why there are two checksums? How to point out corresponding to particular
checksum error?

Moreover, with increased timeout this test shouldn't be flacky anymore,
should it?



More information about the Tarantool-patches mailing list