Tarantool development patches archive
 help / color / mirror / Atom feed
From: "Alexander V. Tikhonov" <avtikhon@tarantool.org>
To: Serge Petrenko <sergepetrenko@tarantool.org>,
	Kirill Yukhin <kyukhin@tarantool.org>
Cc: tarantool-patches@dev.tarantool.org
Subject: [Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test
Date: Fri, 13 Nov 2020 09:03:57 +0300	[thread overview]
Message-ID: <6174ccd36e833128fa4da137a25e2be3ca5ffde3.1605247383.git.avtikhon@tarantool.org> (raw)

Found that replication/election_qsync_stress.test.lua test may fail on
restating instances. It occures on heavy loaded hosts when its local
call to stop instance using SIGTERM fails to stop it. Decided to use
SIGKILL in local stop call options to be sure that the instance will
be stopped.

Also found that running loop inline new hangs occured on server start:

  --- replication/election_qsync_stress.result    Thu Nov 12 16:23:16 2020
  +++ var/128_replication/election_qsync_stress.result    Thu Nov 12 16:31:22 2020
  @@ -323,7 +323,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -380,7 +380,7 @@
    | ...
       test_run:wait_cond(function() return c.space.test ~= nil and c.space.test:get{i} ~= nil end)
    | ---
  - | - true
  + | - false
    | ...
  @@ -494,687 +494,3 @@
    | ---
    | ...
       test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
  - | ---
  - | - true
  - | …

but the test already failed before on getting 'c.space.test:get{i}'.
To avoid of the hang and make test code more correct running way it
were added log.error messages and return calls. Also the test was
changed to use function for each loop iteration to be able to check
return values and break the loop just after the fail.

Needed for #5395
---

Github: https://github.com/tarantool/tarantool/tree/avtikhon/fix-election_qsync_stress
Issue: https://github.com/tarantool/tarantool/issues/5395

 test/replication/election_qsync_stress.result | 33 ++++++++++++++-----
 .../election_qsync_stress.test.lua            | 29 +++++++++++-----
 test/replication/suite.ini                    |  2 +-
 3 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/test/replication/election_qsync_stress.result b/test/replication/election_qsync_stress.result
index 9fab2f1d7..cdb00e34c 100644
--- a/test/replication/election_qsync_stress.result
+++ b/test/replication/election_qsync_stress.result
@@ -5,6 +5,9 @@ test_run = require('test_run').new()
 netbox = require('net.box')
  | ---
  | ...
+log = require('log')
+ | ---
+ | ...
 
 --
 -- gh-1146: Leader election + Qsync
@@ -83,23 +86,32 @@ test_run:cmd('setopt delimiter ";"')
  | ---
  | - true
  | ...
-for i = 1,10 do
+function run_iter(i)
     c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
     c.space.test:insert({i}, {is_async=true})
-    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
-    test_run:cmd('stop server '..old_leader)
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
+        return false
+    end
+    test_run:cmd('stop server '..old_leader..' with signal=KILL')
     nrs[old_leader_nr] = false
-    new_leader_nr = get_leader(nrs)
-    new_leader = 'election_replica'..new_leader_nr
-    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
+    local new_leader_nr = get_leader(nrs)
+    local new_leader = 'election_replica'..new_leader_nr
+    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
-    c.space.test:get{i}
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
+        return false
+    end
     test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
     nrs[old_leader_nr] = true
     old_leader_nr = new_leader_nr
     old_leader = new_leader
+    return true
 end;
  | ---
  | ...
@@ -107,8 +119,13 @@ test_run:cmd('setopt delimiter ""');
  | ---
  | - true
  | ...
+
+for i = 1,10 do if not run_iter(i) then break end end
+ | ---
+ | ...
+
 -- We're connected to some leader.
-#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
+#c.space.test:select{} == 10 or log.error(c.space.test:select{})
  | ---
  | - true
  | ...
diff --git a/test/replication/election_qsync_stress.test.lua b/test/replication/election_qsync_stress.test.lua
index 0ba15eef7..8b654d063 100644
--- a/test/replication/election_qsync_stress.test.lua
+++ b/test/replication/election_qsync_stress.test.lua
@@ -1,5 +1,6 @@
 test_run = require('test_run').new()
 netbox = require('net.box')
+log = require('log')
 
 --
 -- gh-1146: Leader election + Qsync
@@ -47,26 +48,38 @@ _ = c:eval('box.space.test:create_index("pk")')
 -- Insert some data to a synchronous space, then kill the leader before the
 -- confirmation is written. Check successful confirmation on the new leader.
 test_run:cmd('setopt delimiter ";"')
-for i = 1,10 do
+function run_iter(i)
     c:eval('box.cfg{replication_synchro_quorum=4, replication_synchro_timeout=1000}')
     c.space.test:insert({i}, {is_async=true})
-    test_run:wait_cond(function() return c.space.test:get{i} ~= nil end)
-    test_run:cmd('stop server '..old_leader)
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on first call to c.space.test:get(' .. i .. ')')
+        return false
+    end
+    test_run:cmd('stop server '..old_leader..' with signal=KILL')
     nrs[old_leader_nr] = false
-    new_leader_nr = get_leader(nrs)
-    new_leader = 'election_replica'..new_leader_nr
-    leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
+    local new_leader_nr = get_leader(nrs)
+    local new_leader = 'election_replica'..new_leader_nr
+    local leader_port = test_run:eval(new_leader, 'box.cfg.listen')[1]
     c = netbox.connect(leader_port)
     c:eval('box.cfg{replication_synchro_timeout=1000}')
     c.space._schema:replace{'smth'}
-    c.space.test:get{i}
+    if not test_run:wait_cond(function() return c.space.test ~= nil
+            and c.space.test:get{i} ~= nil end) then
+        log.error('error: hanged on second call to c.space.test:get(' .. i .. ')')
+        return false
+    end
     test_run:cmd('start server '..old_leader..' with wait=True, wait_load=True, args="2 0.4"')
     nrs[old_leader_nr] = true
     old_leader_nr = new_leader_nr
     old_leader = new_leader
+    return true
 end;
 test_run:cmd('setopt delimiter ""');
+
+for i = 1,10 do if not run_iter(i) then break end end
+
 -- We're connected to some leader.
-#c.space.test:select{} == 10 or require('log').error(c.space.test:select{})
+#c.space.test:select{} == 10 or log.error(c.space.test:select{})
 
 test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 34ee32550..c456678c1 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -119,7 +119,7 @@ fragile = {
         },
         "election_qsync_stress.test.lua": {
             "issues": [ "gh-5395" ],
-            "checksums": [ "634bda94accdcdef7b1db3e14f28f445", "36bcdae426c18a60fd13025c09f197d0", "209c865525154a91435c63850f15eca0", "ca38ff2cdfa65b3defb26607b24454c6", "3b6c573d2aa06ee382d6f27b6a76657b", "bcf08e055cf3ccd9958af25d0fba29f8", "411e7462760bc3fc968b68b4b7267ea1", "37e671aea27e3396098f9d13c7fa7316" ]
+            "checksums": [ "0ace688f0a267c44c37a5f904fb5d6ce" ]
         },
         "gh-3711-misc-no-restart-on-same-configuration.test.lua": {
             "issues": [ "gh-5407" ],
-- 
2.25.1

             reply	other threads:[~2020-11-13  6:03 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-13  6:03 Alexander V. Tikhonov [this message]
2020-11-13  9:18 ` Serge Petrenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6174ccd36e833128fa4da137a25e2be3ca5ffde3.1605247383.git.avtikhon@tarantool.org \
    --to=avtikhon@tarantool.org \
    --cc=kyukhin@tarantool.org \
    --cc=sergepetrenko@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v1] test: fix flaky election_qsync_stress test' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox