[Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Thu Nov 19 00:24:41 MSK 2020


swim_test_indirect_ping() failed with random seed 1605651752.

The test created a cluster with 3 swim nodes, and broke network
connection between node-1 and node-2. Then it run the cluster for
10 seconds, and ensured, that both node-1 and node-2 are
eventually alive despite they are suspected sometimes.

    node1 <-> node3 <-> node2

'Alive' means that a node is considered alive on all the other
nodes.

The test spun for 10 seconds giving the nodes a chance to become
suspected. Then it checked that node-1 is either still alive, or
it is suspected, but will be restored in at most 3 seconds. The
same was checked for node-2. They were supposed to interact via
node-3.

3 seconds was used assuming that the worst what could happen is
that it is suspected from the beginning of this three-second
interval on node-3, because it was suspected by node-2 and
disseminated to node-3.

Then node-3 might need 1 second to finish its current
dissemination round by sending a ping to node-2, 1 second to start
new round randomly again from node-2, and only then send a ping to
node-1. So 3 seconds total.

But it could also happen, that in the beginning of the
three-second interval node-1 is already suspected on node-2. On
the next step node-2 shares the suspicion with node-3. And then
the scenario above happens. So the test case needed at least 4
seconds.

And actually it could happen infinitely, because while the test
waits for 3 seconds of gossip refutation about node-1 on node-3,
node-2 can suspect it again. And so on.

Also the test would pass even without indirect pings. Because
node-3 has access to node-1 and node-2. So even if, say, node-1
suspects node-2, then it will tell node-3 about it. Node-3 will
ping node-2, get ack, and will refute the gossip. The refutation
will be then sent to node-1 back. It means indirect pings don't
matter here.

The patch makes a new test, which won't pass without indirect
pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY,
which allows to turn off all the SWIM components except failure
detection. So only pings and acks are being sent.

Then without proper indirect pings node-1 and node-2 would suspect
each other and declare dead eventually. The new test checks it
does not happen.

Closes #5399
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5399-swim-flaky
Issue: https://github.com/tarantool/tarantool/issues/5399

 test/unit/suite.ini          |  8 -------
 test/unit/swim.c             | 26 +----------------------
 test/unit/swim.result        | 20 ++++++------------
 test/unit/swim_errinj.c      | 41 +++++++++++++++++++++++++++++++++++-
 test/unit/swim_errinj.result |  8 ++++++-
 5 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/test/unit/suite.ini b/test/unit/suite.ini
index fe871b287..d16ba413b 100644
--- a/test/unit/suite.ini
+++ b/test/unit/suite.ini
@@ -4,11 +4,3 @@ description = unit tests
 disabled = snap_quorum_delay.test
 release_disabled = fiber_stack.test swim_errinj.test
 is_parallel = True
-fragile = {
-    "retries": 10,
-    "tests": {
-        "swim.test": {
-            "issues": [ "gh-5399" ]
-        }
-    }
-  }
diff --git a/test/unit/swim.c b/test/unit/swim.c
index bb12baf8d..4e9dce8ce 100644
--- a/test/unit/swim.c
+++ b/test/unit/swim.c
@@ -735,29 +735,6 @@ swim_test_payload_basic(void)
 	swim_finish_test();
 }
 
-static void
-swim_test_indirect_ping(void)
-{
-	swim_start_test(2);
-	uint16_t cluster_size = 3;
-	struct swim_cluster *cluster = swim_cluster_new(cluster_size);
-	swim_cluster_set_ack_timeout(cluster, 1);
-	for (int i = 0; i < cluster_size; ++i) {
-		for (int j = i + 1; j < cluster_size; ++j)
-			swim_cluster_interconnect(cluster, i, j);
-	}
-	swim_cluster_set_drop_channel(cluster, 0, 1, true);
-	swim_cluster_set_drop_channel(cluster, 1, 0, true);
-	swim_run_for(10);
-	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_ALIVE, 3),
-	   0, "S1 is still alive everywhere");
-	is(swim_cluster_wait_status_everywhere(cluster, 1, MEMBER_ALIVE, 3),
-	   0, "as well as S2 - they communicated via S3");
-
-	swim_cluster_delete(cluster);
-	swim_finish_test();
-}
-
 static void
 swim_test_encryption(void)
 {
@@ -1097,7 +1074,7 @@ swim_test_suspect_new_members(void)
 static int
 main_f(va_list ap)
 {
-	swim_start_test(23);
+	swim_start_test(22);
 
 	(void) ap;
 	swim_test_ev_init();
@@ -1119,7 +1096,6 @@ main_f(va_list ap)
 	swim_test_uri_update();
 	swim_test_broadcast();
 	swim_test_payload_basic();
-	swim_test_indirect_ping();
 	swim_test_encryption();
 	swim_test_slow_net();
 	swim_test_triggers();
diff --git a/test/unit/swim.result b/test/unit/swim.result
index 07ea6a1a0..d4b1dba49 100644
--- a/test/unit/swim.result
+++ b/test/unit/swim.result
@@ -1,5 +1,5 @@
 	*** main_f ***
-1..23
+1..22
 	*** swim_test_one_link ***
     1..6
     ok 1 - no rounds - no fullmesh
@@ -169,23 +169,17 @@ ok 15 - subtests
     ok 11 - third payload is disseminated via anti-entropy
 ok 16 - subtests
 	*** swim_test_payload_basic: done ***
-	*** swim_test_indirect_ping ***
-    1..2
-    ok 1 - S1 is still alive everywhere
-    ok 2 - as well as S2 - they communicated via S3
-ok 17 - subtests
-	*** swim_test_indirect_ping: done ***
 	*** swim_test_encryption ***
     1..3
     ok 1 - cluster works with encryption
     ok 2 - different encryption keys - can't interact
     ok 3 - cluster works after encryption has been disabled
-ok 18 - subtests
+ok 17 - subtests
 	*** swim_test_encryption: done ***
 	*** swim_test_slow_net ***
     1..0
     # slow network leads to idle round steps, they should not produce a new message
-ok 19 - subtests
+ok 18 - subtests
 	*** swim_test_slow_net: done ***
 	*** swim_test_triggers ***
     1..20
@@ -210,25 +204,25 @@ ok 19 - subtests
     # now all the triggers are done and deleted
     ok 19 - local URI update warns about version update
     ok 20 - version is a part of incarnation, so the latter is updated too
-ok 20 - subtests
+ok 19 - subtests
 	*** swim_test_triggers: done ***
 	*** swim_test_generation ***
     1..3
     ok 1 - S1 disseminated its payload to S2
     ok 2 - S1 restarted and set another payload. Without generation it could lead to never disseminated new payload.
     ok 3 - S2 sees new generation of S1
-ok 21 - subtests
+ok 20 - subtests
 	*** swim_test_generation: done ***
 	*** swim_test_dissemination_speed ***
     1..2
     ok 1 - dissemination work in log time even at the very start of a cluster
     ok 2 - dissemination can withstand an event storm
-ok 22 - subtests
+ok 21 - subtests
 	*** swim_test_dissemination_speed: done ***
 	*** swim_test_suspect_new_members ***
     1..2
     ok 1 - S2 dropped S1 as dead
     ok 2 - S3 didn't add S1 from S2's messages, because S1 didn't answer on a ping
-ok 23 - subtests
+ok 22 - subtests
 	*** swim_test_suspect_new_members: done ***
 	*** main_f: done ***
diff --git a/test/unit/swim_errinj.c b/test/unit/swim_errinj.c
index 2e89f1821..0f8718a33 100644
--- a/test/unit/swim_errinj.c
+++ b/test/unit/swim_errinj.c
@@ -171,17 +171,56 @@ swim_test_payload_refutation(void)
 	swim_finish_test();
 }
 
+static void
+swim_test_indirect_ping(void)
+{
+	swim_start_test(2);
+	uint16_t cluster_size = 3;
+	struct swim_cluster *cluster = swim_cluster_new(cluster_size);
+	swim_cluster_set_ack_timeout(cluster, 0.5);
+	for (int i = 0; i < cluster_size; ++i) {
+		for (int j = i + 1; j < cluster_size; ++j)
+			swim_cluster_interconnect(cluster, i, j);
+	}
+	swim_cluster_set_drop_channel(cluster, 0, 1, true);
+	swim_cluster_set_drop_channel(cluster, 1, 0, true);
+	/*
+	 * There are alive channels S1 <-> S3 and S2 <-> S3. Things are not
+	 * interesting when S3 can send dissemination, because even if S1 and S2
+	 * suspect each other, they will share it with S3, and S3 will refute it
+	 * via dissemination, because has access to both of them.
+	 * When only failure detection works, a suspicion can only be refuted by
+	 * pings and acks. In this test pings-acks between S1 and S2 will need
+	 * to be indirect.
+	 */
+	struct errinj *errinj = &errinjs[ERRINJ_SWIM_FD_ONLY];
+	errinj->bparam = true;
+	/*
+	 * No nodes are ever suspected. Because when a direct ping fails, the
+	 * nodes simply will use indirect pings and they will work.
+	 */
+	isnt(swim_cluster_wait_status_anywhere(cluster, 0,
+					       MEMBER_SUSPECTED, 10),
+	     0, "S1 is never suspected");
+	isnt(swim_cluster_wait_status_anywhere(cluster, 1,
+					       MEMBER_SUSPECTED, 10),
+	     0, "S2 is never suspected");
+	errinj->bparam = false;
+	swim_cluster_delete(cluster);
+	swim_finish_test();
+}
 
 static int
 main_f(va_list ap)
 {
-	swim_start_test(1);
+	swim_start_test(2);
 
 	(void) ap;
 	swim_test_ev_init();
 	swim_test_transport_init();
 
 	swim_test_payload_refutation();
+	swim_test_indirect_ping();
 
 	swim_test_transport_free();
 	swim_test_ev_free();
diff --git a/test/unit/swim_errinj.result b/test/unit/swim_errinj.result
index c60389d7f..03e357e13 100644
--- a/test/unit/swim_errinj.result
+++ b/test/unit/swim_errinj.result
@@ -1,5 +1,5 @@
 	*** main_f ***
-1..1
+1..2
 	*** swim_test_payload_refutation ***
     1..11
     ok 1 - S2 sees new version of S1
@@ -15,4 +15,10 @@
     ok 11 - S3 learns S1's payload from S2
 ok 1 - subtests
 	*** swim_test_payload_refutation: done ***
+	*** swim_test_indirect_ping ***
+    1..2
+    ok 1 - S1 is never suspected
+    ok 2 - S2 is never suspected
+ok 2 - subtests
+	*** swim_test_indirect_ping: done ***
 	*** main_f: done ***
-- 
2.24.3 (Apple Git-128)



More information about the Tarantool-patches mailing list