Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
To: tarantool-patches@dev.tarantool.org, avtikhon@tarantool.org
Subject: [Tarantool-patches] [PATCH 1/2] test: fix flaky qsync_advanced.test.lua
Date: Wed, 15 Jul 2020 00:44:45 +0200	[thread overview]
Message-ID: <739ef9d3bdd3ee995a9e220aec8c51db4db60e48.1594766457.git.v.shpilevoy@tarantool.org> (raw)
In-Reply-To: <cover.1594766457.git.v.shpilevoy@tarantool.org>

There were multiple problems:

- Some timeouts were too small. Timeout 0.1 is a very small value,
  which leads to flakiness in 100% cases sooner or later.

- One timeout was too big - 5 second waiting, whereas it could
  easily be less than a second. Anyway it was expected to fail. No
  need to wait so long.

- To check if timeout really passed whole, was used os.time(),
  which is incorrect: precision is seconds. Also the passed time
  was checked using equation duration == timeout, but it is also
  wrong. When something is blocked on a timeout, if the system is
  not real-time, the really passed time is always >= timeout. Not
  == timeout.

- In the failover test there was no fullmesh. As a result, when a
  replica was promoted and wrote something into the sync space, it
  wasn't replicated to master. But the test passed because
  1) The incorrect behaviour was in .result file;
  2) On the replica the quorum was default, i.e. 1. So the replica
   didn't wait master, and successfully wrote data into the sync
   space.

The initial problem of the test was that in the last case one of
the test jobs somehow got the old master seeing the replica's
data. But it is impossible, there was no replication from the
replica to master. Anyway now the test case is reworked, and even
if it would fail, it would be a new fail.

Closes #5168
---
 test/replication/qsync_advanced.result   | 42 +++++++++++++++++-------
 test/replication/qsync_advanced.test.lua | 32 +++++++++++-------
 2 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/test/replication/qsync_advanced.result b/test/replication/qsync_advanced.result
index 3a288e0ca..94b19b1f2 100644
--- a/test/replication/qsync_advanced.result
+++ b/test/replication/qsync_advanced.result
@@ -63,7 +63,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -100,7 +100,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.001}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -138,7 +138,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -191,7 +191,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=orig_synchro_timeout}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.001}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -201,14 +201,17 @@ _ = box.space.sync:create_index('pk')
  | ---
  | ...
 -- Testcase body.
-start = os.time()
+start = fiber.clock()
  | ---
  | ...
 box.space.sync:insert{1}
  | ---
  | - error: Quorum collection for a synchronous transaction is timed out
  | ...
-(os.time() - start) == box.cfg.replication_synchro_timeout -- true
+duration = fiber.clock() - start
+ | ---
+ | ...
+duration >= box.cfg.replication_synchro_timeout or duration -- true
  | ---
  | - true
  | ...
@@ -298,7 +301,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -326,7 +329,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -428,7 +431,15 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+test_run:cmd("set variable replica_url to 'replica.listen'")
+ | ---
+ | - true
+ | ...
+box.cfg{                                                                        \
+    replication_synchro_quorum = NUM_INSTANCES,                                 \
+    replication_synchro_timeout = 1000,                                         \
+    replication = replica_url,                                                  \
+}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -468,6 +479,9 @@ test_run:switch('replica')
  | ---
  | - true
  | ...
+box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000}
+ | ---
+ | ...
 box.space.sync:insert{2}
  | ---
  | - [2]
@@ -484,6 +498,7 @@ test_run:switch('default')
 box.space.sync:select{} -- 1, 2
  | ---
  | - - [1]
+ |   - [2]
  | ...
 -- Revert cluster configuration.
 test_run:switch('default')
@@ -508,6 +523,9 @@ test_run:switch('default')
 box.space.sync:drop()
  | ---
  | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
 
 -- Check behaviour with failed write to WAL on master (ERRINJ_WAL_IO).
 -- Testcase setup.
@@ -515,7 +533,7 @@ test_run:switch('default')
  | ---
  | - true
  | ...
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
@@ -699,14 +717,14 @@ disable_sync_mode()
  | ---
  | ...
 -- Space is in sync mode now.
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
  | ---
  | ...
 box.space.sync:insert{2} -- success
  | ---
  | - [2]
  | ...
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=1000}
  | ---
  | ...
 box.space.sync:insert{3} -- success
diff --git a/test/replication/qsync_advanced.test.lua b/test/replication/qsync_advanced.test.lua
index 4b62c6fb4..058ece602 100644
--- a/test/replication/qsync_advanced.test.lua
+++ b/test/replication/qsync_advanced.test.lua
@@ -27,7 +27,7 @@ test_run:cmd('start server replica with wait=True, wait_load=True')
 -- Successful write.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -41,7 +41,7 @@ box.space.sync:drop()
 -- Unsuccessfull write.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.001}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -56,7 +56,7 @@ box.space.sync:drop()
 -- same order as on client in case of achieved quorum.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -73,13 +73,14 @@ box.space.sync:drop()
 -- Synchro timeout is not bigger than replication_synchro_timeout value.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=orig_synchro_timeout}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.001}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
-start = os.time()
+start = fiber.clock()
 box.space.sync:insert{1}
-(os.time() - start) == box.cfg.replication_synchro_timeout -- true
+duration = fiber.clock() - start
+duration >= box.cfg.replication_synchro_timeout or duration -- true
 -- Testcase cleanup.
 test_run:switch('default')
 box.space.sync:drop()
@@ -108,7 +109,7 @@ box.cfg.replication_synchro_timeout -- old value
 -- TX is in synchronous replication.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -121,7 +122,7 @@ box.space.sync:drop()
 -- data consistency on a leader and replicas.
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -155,7 +156,12 @@ box.cfg{replication_synchro_quorum=BROKEN_QUORUM} -- warning
 -- success and data consistency on a leader and replicas (gh-5124).
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+test_run:cmd("set variable replica_url to 'replica.listen'")
+box.cfg{                                                                        \
+    replication_synchro_quorum = NUM_INSTANCES,                                 \
+    replication_synchro_timeout = 1000,                                         \
+    replication = replica_url,                                                  \
+}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -167,6 +173,7 @@ box.cfg{read_only=false} -- promote replica to master
 test_run:switch('default')
 box.cfg{read_only=true} -- demote master to replica
 test_run:switch('replica')
+box.cfg{replication_synchro_quorum = 2, replication_synchro_timeout = 1000}
 box.space.sync:insert{2}
 box.space.sync:select{} -- 1, 2
 test_run:switch('default')
@@ -179,11 +186,12 @@ box.cfg{read_only=true}
 -- Testcase cleanup.
 test_run:switch('default')
 box.space.sync:drop()
+box.cfg{replication = {}}
 
 -- Check behaviour with failed write to WAL on master (ERRINJ_WAL_IO).
 -- Testcase setup.
 test_run:switch('default')
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 _ = box.schema.space.create('sync', {is_sync=true, engine=engine})
 _ = box.space.sync:create_index('pk')
 -- Testcase body.
@@ -250,9 +258,9 @@ test_run:switch('default')
 -- Enable synchronous mode.
 disable_sync_mode()
 -- Space is in sync mode now.
-box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=NUM_INSTANCES, replication_synchro_timeout=1000}
 box.space.sync:insert{2} -- success
-box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=0.1}
+box.cfg{replication_synchro_quorum=BROKEN_QUORUM, replication_synchro_timeout=1000}
 box.space.sync:insert{3} -- success
 box.space.sync:select{} -- 1, 2, 3
 test_run:cmd('switch replica')
-- 
2.21.1 (Apple Git-122.3)

  reply	other threads:[~2020-07-14 22:44 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-14 22:44 [Tarantool-patches] [PATCH 0/2] Qsync flaky tests, next iteration Vladislav Shpilevoy
2020-07-14 22:44 ` Vladislav Shpilevoy [this message]
2020-07-14 22:44 ` [Tarantool-patches] [PATCH 2/2] test: fix flaky qsync_snapshots.test.lua Vladislav Shpilevoy
2020-07-17 10:56 ` [Tarantool-patches] [PATCH 0/2] Qsync flaky tests, next iteration Sergey Bronnikov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=739ef9d3bdd3ee995a9e220aec8c51db4db60e48.1594766457.git.v.shpilevoy@tarantool.org \
    --to=v.shpilevoy@tarantool.org \
    --cc=avtikhon@tarantool.org \
    --cc=tarantool-patches@dev.tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 1/2] test: fix flaky qsync_advanced.test.lua' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox