Tarantool development patches archive
* [Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test
@ 2020-11-13  6:03 Alexander V. Tikhonov
  2020-11-13  9:18 ` Serge Petrenko
  0 siblings, 1 reply; 2+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-13  6:03 UTC (permalink / raw)
  To: Serge Petrenko, Kirill Yukhin; +Cc: tarantool-patches

Found that the replication/election_qsync_stress.test.lua test may fail on
restarting instances. It occurs on heavily loaded hosts when the local
call to stop an instance using SIGTERM fails to stop it. Decided to use
SIGKILL in the local stop call options to make sure the instance is
stopped.

Also found that, with the loop running inline, new hangs occurred on server start:

  --- replication/election_qsync_stress.result    Thu Nov 12 16:23:16 2020
  +++ var/128_replication/election_qsync_stress.result    Thu Nov 12 16:31:22 2020
  @@ -323,7 +323,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -380,7 +380,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -494,687 +494,3 @@
    | ---
    | ...
       test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
  - | ---
  - | - true
  - | …

but the test had already failed earlier on 'c.space.test:get{i}'.
To avoid the hang and make the test flow more robust, log.error
messages and early returns were added. The test was also changed to
wrap each loop iteration in a function, so that return values can be
checked and the loop broken right after a failure.
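
The loop-to-function refactoring is a generally useful pattern: a bare loop
body cannot return a status, so wrapping each iteration in a function gives a
natural way to report failure and stop early instead of hanging in later
steps. A sketch of the idea in Python (the `wait_cond` here is a simplified
stand-in for test-run's `test_run:wait_cond()`, and `broken_at` is an
artificial failure injection, not something in the real test):

```python
import time

def wait_cond(cond, timeout=2.0, interval=0.05):
    """Poll cond() until it is truthy or the timeout expires.

    Simplified stand-in for test-run's test_run:wait_cond(): returns
    True on success and False on timeout instead of hanging forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if cond():
            return True
        time.sleep(interval)
    return False

def run_iter(i, broken_at):
    """One stress iteration; returns False on the first failed check
    so the caller can break out instead of running the remaining steps."""
    if not wait_cond(lambda: i != broken_at, timeout=0.2):
        print(f"error: timed out waiting for row {i}")
        return False
    return True

completed = 0
for i in range(1, 11):
    if not run_iter(i, broken_at=7):
        break
    completed += 1
print(completed)  # 6: iterations 1-6 succeed, 7 fails and breaks the loop
```

This mirrors the patch's `for i = 1,10 do if not run_iter(i) then break end end`
driver: the first failing iteration logs an error and stops the stress loop.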

Needed for #5395
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/fix-election_qsync_stress
Issue: https://github.com/tarantool/tarantool/issues/5395

 test/replication/election_qsync_stress.result | 33 ++++++++++++++-----
 .../election_qsync_stress.test.lua            | 29 +++++++++++-----
 test/replication/suite.ini                    |  2 +-
 3 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/test/replication/election_qsync_stress.result b/test/replication/election_qsync_stress.result
index 9fab2f1d7..cdb00e34c 100644
--- a/test/replication/election_qsync_stress.result
+++ b/test/replication/election_qsync_stress.result
@@ -5,6 +5,9 @@ test_run = require('test_run').new()
 netbox = require('net.box')
  | ---
  | ...
+log = require('log')
+ | ---
+ | ...
 
 --
 -- gh-1146: Leader election + Qsync
@@ -83,23 +86,32 @@ test_run:cmd('setopt delimiter ";"')
  | ---
  | - true
  | ...
-for i = 1,10 do
+function run_iter(i)
     c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
     c.space.test:insert({i}, {is_async=true})
-    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
-    test_run:cmd('stop server '..old_leader)
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
+        return false
+    end
+    test_run:cmd('stop server '..old_leader..' with signal=KILL')
     nrs[old_leader_nr] = false
-    new_leader_nr = get_leader(nrs)
-    new_leader = 'election_replica'..new_leader_nr
-    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
+    local new_leader_nr = get_leader(nrs)
+    local new_leader = 'election_replica'..new_leader_nr
+    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
-    c.space.test:get{i}
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
+        return false
+    end
     test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
     nrs[old_leader_nr] = true
     old_leader_nr = new_leader_nr
     old_leader = new_leader
+    return true
 end;
  | ---
  | ...
@@ -107,8 +119,13 @@ test_run:cmd('setopt delimiter ""');
  | ---
  | - true
  | ...
+
+for i = 1,10 do if not run_iter(i) then break end end
+ | ---
+ | ...
+
 -- We're connected to some leader.
-#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
+#c.space.test:select{} == 10 or log.error(c.space.test:select{})
  | ---
  | - true
  | ...
diff --git a/test/replication/election_qsync_stress.test.lua b/test/replication/election_qsync_stress.test.lua
index 0ba15eef7..8b654d063 100644
--- a/test/replication/election_qsync_stress.test.lua
+++ b/test/replication/election_qsync_stress.test.lua
@@ -1,5 +1,6 @@
 test_run = require('test_run').new()
 netbox = require('net.box')
+log = require('log')
 
 --
 -- gh-1146: Leader election + Qsync
@@ -47,26 +48,38 @@ _ = c:eval('box.space.test:create_index("pk")')
 -- Insert some data to a synchronous space, then kill the leader before the
 -- confirmation is written. Check successful confirmation on the new leader.
 test_run:cmd('setopt delimiter ";"')
-for i = 1,10 do
+function run_iter(i)
     c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
     c.space.test:insert({i}, {is_async=true})
-    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
-    test_run:cmd('stop server '..old_leader)
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
+        return false
+    end
+    test_run:cmd('stop server '..old_leader..' with signal=KILL')
     nrs[old_leader_nr] = false
-    new_leader_nr = get_leader(nrs)
-    new_leader = 'election_replica'..new_leader_nr
-    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
+    local new_leader_nr = get_leader(nrs)
+    local new_leader = 'election_replica'..new_leader_nr
+    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
-    c.space.test:get{i}
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
+        return false
+    end
     test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
     nrs[old_leader_nr] = true
     old_leader_nr = new_leader_nr
     old_leader = new_leader
+    return true
 end;
 test_run:cmd('setopt delimiter ""');
+
+for i = 1,10 do if not run_iter(i) then break end end
+
 -- We're connected to some leader.
-#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
+#c.space.test:select{} == 10 or log.error(c.space.test:select{})
 
 test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 34ee32550..c456678c1 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -119,7 +119,7 @@ fragile = {
         },
         "election_qsync_stress.test.lua": {
             "issues": [ "gh-5395" ],
-            "checksums": [ "634bda94accdcdef7b1db3e14f28f445", "36bcdae426c18a60fd13025c09f197d0", "209c865525154a91435c63850f15eca0", "ca38ff2cdfa65b3defb26607b24454c6", "3b6c573d2aa06ee382d6f27b6a76657b", "bcf08e055cf3ccd9958af25d0fba29f8", "411e7462760bc3fc968b68b4b7267ea1", "37e671aea27e3396098f9d13c7fa7316" ]
+            "checksums": [ "0ace688f0a267c44c37a5f904fb5d6ce" ]
         },
         "gh-3711-misc-no-restart-on-same-configuration.test.lua": {
             "issues": [ "gh-5407" ],
-- 
2.25.1


* Re: [Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test
  2020-11-13  6:03 [Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test Alexander V. Tikhonov
@ 2020-11-13  9:18 ` Serge Petrenko
  0 siblings, 0 replies; 2+ messages in thread
From: Serge Petrenko @ 2020-11-13  9:18 UTC (permalink / raw)
  To: Alexander V. Tikhonov, Kirill Yukhin; +Cc: tarantool-patches


On 13.11.2020 09:03, Alexander V. Tikhonov wrote:
> Found that the replication/election_qsync_stress.test.lua test may fail on
> restarting instances. It occurs on heavily loaded hosts when the local
> call to stop an instance using SIGTERM fails to stop it. Decided to use
> SIGKILL in the local stop call options to make sure the instance is
> stopped.
>
> Also found that, with the loop running inline, new hangs occurred on server start:
>
>    --- replication/election_qsync_stress.result    Thu Nov 12 16:23:16 2020
>    +++ var/128_replication/election_qsync_stress.result    Thu Nov 12 16:31:22 2020
>    @@ -323,7 +323,7 @@
>      | ...
>         test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
>      | ---
>    - | - true
>    + | - false
>      | ...
>    @@ -380,7 +380,7 @@
>      | ...
>         test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
>      | ---
>    - | - true
>    + | - false
>      | ...
>    @@ -494,687 +494,3 @@
>      | ---
>      | ...
>         test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
>    - | ---
>    - | - true
>    - | …
>
> but the test had already failed earlier on 'c.space.test:get{i}'.
> To avoid the hang and make the test flow more robust, log.error
> messages and early returns were added. The test was also changed to
> wrap each loop iteration in a function, so that return values can be
> checked and the loop broken right after a failure.
>
> Needed for #5395


Hi! Thanks for the patch!

Please see my comments below.


> ---
>
> Github: https://github.com/tarantool/tarantool/tree/avtikhon/fix-election_qsync_stress
> Issue: https://github.com/tarantool/tarantool/issues/5395
>
>   test/replication/election_qsync_stress.result | 33 ++++++++++++++-----
>   .../election_qsync_stress.test.lua            | 29 +++++++++++-----
>   test/replication/suite.ini                    |  2 +-
>   3 files changed, 47 insertions(+), 17 deletions(-)
>
> diff --git a/test/replication/election_qsync_stress.result b/test/replication/election_qsync_stress.result
> index 9fab2f1d7..cdb00e34c 100644
> --- a/test/replication/election_qsync_stress.result
> +++ b/test/replication/election_qsync_stress.result
> @@ -5,6 +5,9 @@ test_run = require('test_run').new()
>   netbox = require('net.box')
>    | ---
>    | ...
> +log = require('log')
> + | ---
> + | ...
>   
>   --
>   -- gh-1146: Leader election + Qsync
> @@ -83,23 +86,32 @@ test_run:cmd('setopt delimiter ";"')
>    | ---
>    | - true
>    | ...
> -for i = 1,10 do
> +function run_iter(i)
>       c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
>       c.space.test:insert({i}, {is_async=true})
> -    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
> -    test_run:cmd('stop server '..old_leader)
> +    if not test_run:wait_cond(function() return c.space.test ~= nil
> +            and c.space.test:get{i} ~= nil end) then

It's pointless to wait for `c.space.test ~= nil` after
`c.space.test:insert`: if `c.space.test` is nil, `c.space.test:insert`
will fail; otherwise `c.space.test` already isn't nil.

> +        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
> +        return false
> +    end
> +    test_run:cmd('stop server '..old_leader..' with signal=KILL')
>       nrs[old_leader_nr] = false
> -    new_leader_nr = get_leader(nrs)
> -    new_leader = 'election_replica'..new_leader_nr
> -    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> +    local new_leader_nr = get_leader(nrs)
> +    local new_leader = 'election_replica'..new_leader_nr
> +    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
>       c = netbox.connect(leader_port)
>       c:eval('box.cfg{replication_synchro_timeout=1000}')
>       c.space._schema:replace{'smth'}
> -    c.space.test:get{i}
> +    if not test_run:wait_cond(function() return c.space.test ~= nil
> +            and c.space.test:get{i} ~= nil end) then
> +        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
> +        return false
> +    end

Same here. After a successful `space._schema:replace` on the new leader,
`c.space.test:get{i}` must succeed, because the insertion into _schema
can only happen after c.space.test has received row i from the old
leader. So if `c.space.test:get{i}` fails, say, because the old leader
died before sending the new row, then wait_cond(c.space.test:get{i})
will also fail: the row will never arrive.

In other words, plain `c.space.test:get{i}` is enough here.
And if `c.space.test:get{i} == nil`, log an error message, like you do.

>       test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
>       nrs[old_leader_nr] = true
>       old_leader_nr = new_leader_nr
>       old_leader = new_leader
> +    return true
>   end;
>    | ---
>    | ...
> @@ -107,8 +119,13 @@ test_run:cmd('setopt delimiter ""');
>    | ---
>    | - true
>    | ...
> +
> +for i = 1,10 do if not run_iter(i) then break end end
> + | ---
> + | ...
> +
>   -- We're connected to some leader.
> -#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
> +#c.space.test:select{} == 10 or log.error(c.space.test:select{})
>    | ---
>    | - true
>    | ...
> diff --git a/test/replication/election_qsync_stress.test.lua b/test/replication/election_qsync_stress.test.lua
> index 0ba15eef7..8b654d063 100644
> --- a/test/replication/election_qsync_stress.test.lua
> +++ b/test/replication/election_qsync_stress.test.lua
> @@ -1,5 +1,6 @@
>   test_run = require('test_run').new()
>   netbox = require('net.box')
> +log = require('log')
>   
>   --
>   -- gh-1146: Leader election + Qsync
> @@ -47,26 +48,38 @@ _ = c:eval('box.space.test:create_index("pk")')
>   -- Insert some data to a synchronous space, then kill the leader before the
>   -- confirmation is written. Check successful confirmation on the new leader.
>   test_run:cmd('setopt delimiter ";"')
> -for i = 1,10 do
> +function run_iter(i)
>       c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
>       c.space.test:insert({i}, {is_async=true})
> -    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
> -    test_run:cmd('stop server '..old_leader)
> +    if not test_run:wait_cond(function() return c.space.test ~= nil
> +            and c.space.test:get{i} ~= nil end) then
> +        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
> +        return false
> +    end
> +    test_run:cmd('stop server '..old_leader..' with signal=KILL')
>       nrs[old_leader_nr] = false
> -    new_leader_nr = get_leader(nrs)
> -    new_leader = 'election_replica'..new_leader_nr
> -    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
> +    local new_leader_nr = get_leader(nrs)
> +    local new_leader = 'election_replica'..new_leader_nr
> +    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
>       c = netbox.connect(leader_port)
>       c:eval('box.cfg{replication_synchro_timeout=1000}')
>       c.space._schema:replace{'smth'}
> -    c.space.test:get{i}
> +    if not test_run:wait_cond(function() return c.space.test ~= nil
> +            and c.space.test:get{i} ~= nil end) then
> +        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
> +        return false
> +    end
>       test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
>       nrs[old_leader_nr] = true
>       old_leader_nr = new_leader_nr
>       old_leader = new_leader
> +    return true
>   end;
>   test_run:cmd('setopt delimiter ""');
> +
> +for i = 1,10 do if not run_iter(i) then break end end
> +
>   -- We're connected to some leader.
> -#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
> +#c.space.test:select{} == 10 or log.error(c.space.test:select{})
>   
>   test_run:drop_cluster(SERVERS)
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index 34ee32550..c456678c1 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -119,7 +119,7 @@ fragile = {
>           },
>           "election_qsync_stress.test.lua": {
>               "issues": [ "gh-5395" ],
> -            "checksums": [ "634bda94accdcdef7b1db3e14f28f445", "36bcdae426c18a60fd13025c09f197d0", "209c865525154a91435c63850f15eca0", "ca38ff2cdfa65b3defb26607b24454c6", "3b6c573d2aa06ee382d6f27b6a76657b", "bcf08e055cf3ccd9958af25d0fba29f8", "411e7462760bc3fc968b68b4b7267ea1", "37e671aea27e3396098f9d13c7fa7316" ]
> +            "checksums": [ "0ace688f0a267c44c37a5f904fb5d6ce" ]
>           },
>           "gh-3711-misc-no-restart-on-same-configuration.test.lua": {
>               "issues": [ "gh-5407" ],

-- 
Serge Petrenko

