[Tarantool-patches] [PATCH v1] test: fix hang of vinyl/select_consistency.* test

Tue Nov 17 00:06:54 MSK 2020

On 15 Nov 21:43, Alexander V. Tikhonov wrote:
> 
> It happened because on heavy loaded hosts may occure the situation
> when the previous snapshot was inprogress when the new snapshot came
> with the same file name *.snap.inprogress. It happens before the
> current snapshot completed and printed "dump completed" in log file.
> Also this file *.snap.inprogress was seen left on manual debug, when
> the test hanged before this patch. To resolve the test issue fiber
> sleep delay after it can be increased, but we want to save the issue
> reproducable. The current patch corrects the test to avoid of hang on

I guess increasing sleep and fileing exact repro with 'bug' label is
enough to deal with this test.

> box snapshot, to be able to continue testing after it failed. Fiber
> sleep was even decreased after adding fiber for box.snapshot to be
> able to reproduce the issue.
> 
> Needed for #4385
> ---
> @@ -75,8 +80,13 @@ end;
>  ...
>  function snap_loop()
>      while not stop do
> -        box.snapshot()
> -        fiber.sleep(0.1)
> +        local ok, err = fiber.create(function() local ok, err = pcall(box.snapshot) return ok, err end)

Why not simply wrap box.snapshot() in pcall? Why do you need another
one separate fiber for it?

> @@ -97,10 +107,7 @@ _ = fiber.create(dml_loop);
>  _ = fiber.create(snap_loop);
>  ---
>  ...
> -failed = {};
> ----
> -...
> -for i = 1, 10000 do
> +function run_iter()

I'd call it select_loop() (as dml/snap loops above).

>      local val = math.random(MAX_VAL)
>      box.begin()
>      local res1 = s.index.i1:select({val})
> @@ -117,25 +124,37 @@ for i = 1, 10000 do
>                  end
>              end
>              if not found then
> +                log.error("error: equal not found for #res1 = " .. #res1 .. ", #res2 = " .. #res2)

What's the point of printing number of elements in res1/res2?
Why not print the whole result set?

> +end;
> +---
> +...
> +for i = 1, 10000 do
> +    if failed or not run_iter(i) then
> +        log.error("error: failed on iteration " .. i)
> +        failed = true
> +        break
> +    end
>  end;
> +function check_get()
> +    for i = 1, ch:size() do
> +        if not test_run:wait_cond(function() return ch:get() ~= nil end) then

As I see from docs, ch:get() has optional timeout param. Why not use it?
I mean not only this particular test, but in general.

> +            log.error("error: hanged on ch:get() on iteration " .. i)
> +            return false
> +	end
> +    end
> +    return true
>  end;
>  ---
>  ...
> @@ -143,10 +162,14 @@ test_run:cmd("setopt delimiter ''");
>  ---
>  - true
>  ...
> -#failed == 0 or failed
> +test_run:wait_cond(function() return check_get() end)

Why do you need another one wait_cond over check_get?