From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtpng1.m.smailru.net (smtpng1.m.smailru.net [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 0C2E0469710 for ; Thu, 19 Nov 2020 00:24:44 +0300 (MSK) From: Vladislav Shpilevoy Date: Wed, 18 Nov 2020 22:24:41 +0100 Message-Id: <62fe7bdd3888e60c740a2ed704de85a4f19ad140.1605734663.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Subject: [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: tarantool-patches@dev.tarantool.org, avtikhon@tarantool.org swim_test_indirect_ping() failed with random seed 1605651752. The test created a cluster with 3 swim nodes, and broke network connection between node-1 and node-2. Then it run the cluster for 10 seconds, and ensured, that both node-1 and node-2 are eventually alive despite they are suspected sometimes. node1 <-> node3 <-> node2 'Alive' means that a node is considered alive on all the other nodes. The test spun for 10 seconds giving the nodes a chance to become suspected. Then it checked that node-1 is either still alive, or it is suspected, but will be restored in at most 3 seconds. The same was checked for node-2. They were supposed to interact via node-3. 3 seconds was used assuming that the worst what could happen is that it is suspected from the beginning of this three-second interval on node-3, because it was suspected by node-2 and disseminated to node-3. Then node-3 might need 1 second to finish its current dissemination round by sending a ping to node-2, 1 second to start new round randomly again from node-2, and only then send a ping to node-1. So 3 seconds total. But it could also happen, that in the beginning of the three-second interval node-1 is already suspected on node-2. On the next step node-2 shares the suspicion with node-3. And then the scenario above happens. So the test case needed at least 4 seconds. And actually it could happen infinitely, because while the test waits for 3 seconds of gossip refutation about node-1 on node-3, node-2 can suspect it again. And so on. Also the test would pass even without indirect pings. Because node-3 has access to node-1 and node-2. So even if, say, node-1 suspects node-2, then it will tell node-3 about it. Node-3 will ping node-2, get ack, and will refute the gossip. The refutation will be then sent to node-1 back. It means indirect pings don't matter here. The patch makes a new test, which won't pass without indirect pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY, which allows to turn off all the SWIM components except failure detection. So only pings and acks are being sent. Then without proper indirect pings node-1 and node-2 would suspect each other and declare dead eventually. The new test checks it does not happen. Closes #5399 --- Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5399-swim-flaky Issue: https://github.com/tarantool/tarantool/issues/5399 test/unit/suite.ini | 8 ------- test/unit/swim.c | 26 +---------------------- test/unit/swim.result | 20 ++++++------------ test/unit/swim_errinj.c | 41 +++++++++++++++++++++++++++++++++++- test/unit/swim_errinj.result | 8 ++++++- 5 files changed, 55 insertions(+), 48 deletions(-) diff --git a/test/unit/suite.ini b/test/unit/suite.ini index fe871b287..d16ba413b 100644 --- a/test/unit/suite.ini +++ b/test/unit/suite.ini @@ -4,11 +4,3 @@ description = unit tests disabled = snap_quorum_delay.test release_disabled = fiber_stack.test swim_errinj.test is_parallel = True -fragile = { - "retries": 10, - "tests": { - "swim.test": { - "issues": [ "gh-5399" ] - } - } - } diff --git a/test/unit/swim.c b/test/unit/swim.c index bb12baf8d..4e9dce8ce 100644 --- a/test/unit/swim.c +++ b/test/unit/swim.c @@ -735,29 +735,6 @@ swim_test_payload_basic(void) swim_finish_test(); } -static void -swim_test_indirect_ping(void) -{ - swim_start_test(2); - uint16_t cluster_size = 3; - struct swim_cluster *cluster = swim_cluster_new(cluster_size); - swim_cluster_set_ack_timeout(cluster, 1); - for (int i = 0; i < cluster_size; ++i) { - for (int j = i + 1; j < cluster_size; ++j) - swim_cluster_interconnect(cluster, i, j); - } - swim_cluster_set_drop_channel(cluster, 0, 1, true); - swim_cluster_set_drop_channel(cluster, 1, 0, true); - swim_run_for(10); - is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_ALIVE, 3), - 0, "S1 is still alive everywhere"); - is(swim_cluster_wait_status_everywhere(cluster, 1, MEMBER_ALIVE, 3), - 0, "as well as S2 - they communicated via S3"); - - swim_cluster_delete(cluster); - swim_finish_test(); -} - static void swim_test_encryption(void) { @@ -1097,7 +1074,7 @@ swim_test_suspect_new_members(void) static int main_f(va_list ap) { - swim_start_test(23); + swim_start_test(22); (void) ap; swim_test_ev_init(); @@ -1119,7 +1096,6 @@ main_f(va_list ap) swim_test_uri_update(); swim_test_broadcast(); swim_test_payload_basic(); - swim_test_indirect_ping(); swim_test_encryption(); swim_test_slow_net(); swim_test_triggers(); diff --git a/test/unit/swim.result b/test/unit/swim.result index 07ea6a1a0..d4b1dba49 100644 --- a/test/unit/swim.result +++ b/test/unit/swim.result @@ -1,5 +1,5 @@ *** main_f *** -1..23 +1..22 *** swim_test_one_link *** 1..6 ok 1 - no rounds - no fullmesh @@ -169,23 +169,17 @@ ok 15 - subtests ok 11 - third payload is disseminated via anti-entropy ok 16 - subtests *** swim_test_payload_basic: done *** - *** swim_test_indirect_ping *** - 1..2 - ok 1 - S1 is still alive everywhere - ok 2 - as well as S2 - they communicated via S3 -ok 17 - subtests - *** swim_test_indirect_ping: done *** *** swim_test_encryption *** 1..3 ok 1 - cluster works with encryption ok 2 - different encryption keys - can't interact ok 3 - cluster works after encryption has been disabled -ok 18 - subtests +ok 17 - subtests *** swim_test_encryption: done *** *** swim_test_slow_net *** 1..0 # slow network leads to idle round steps, they should not produce a new message -ok 19 - subtests +ok 18 - subtests *** swim_test_slow_net: done *** *** swim_test_triggers *** 1..20 @@ -210,25 +204,25 @@ ok 19 - subtests # now all the triggers are done and deleted ok 19 - local URI update warns about version update ok 20 - version is a part of incarnation, so the latter is updated too -ok 20 - subtests +ok 19 - subtests *** swim_test_triggers: done *** *** swim_test_generation *** 1..3 ok 1 - S1 disseminated its payload to S2 ok 2 - S1 restarted and set another payload. Without generation it could lead to never disseminated new payload. ok 3 - S2 sees new generation of S1 -ok 21 - subtests +ok 20 - subtests *** swim_test_generation: done *** *** swim_test_dissemination_speed *** 1..2 ok 1 - dissemination work in log time even at the very start of a cluster ok 2 - dissemination can withstand an event storm -ok 22 - subtests +ok 21 - subtests *** swim_test_dissemination_speed: done *** *** swim_test_suspect_new_members *** 1..2 ok 1 - S2 dropped S1 as dead ok 2 - S3 didn't add S1 from S2's messages, because S1 didn't answer on a ping -ok 23 - subtests +ok 22 - subtests *** swim_test_suspect_new_members: done *** *** main_f: done *** diff --git a/test/unit/swim_errinj.c b/test/unit/swim_errinj.c index 2e89f1821..0f8718a33 100644 --- a/test/unit/swim_errinj.c +++ b/test/unit/swim_errinj.c @@ -171,17 +171,56 @@ swim_test_payload_refutation(void) swim_finish_test(); } +static void +swim_test_indirect_ping(void) +{ + swim_start_test(2); + uint16_t cluster_size = 3; + struct swim_cluster *cluster = swim_cluster_new(cluster_size); + swim_cluster_set_ack_timeout(cluster, 0.5); + for (int i = 0; i < cluster_size; ++i) { + for (int j = i + 1; j < cluster_size; ++j) + swim_cluster_interconnect(cluster, i, j); + } + swim_cluster_set_drop_channel(cluster, 0, 1, true); + swim_cluster_set_drop_channel(cluster, 1, 0, true); + /* + * There are alive channels S1 <-> S3 and S2 <-> S3. Things are not + * interesting when S3 can send dissemination, because even if S1 and S2 + * suspect each other, they will share it with S3, and S3 will refute it + * via dissemination, because has access to both of them. + * When only failure detection works, a suspicion can only be refuted by + * pings and acks. In this test pings-acks between S1 and S2 will need + * to be indirect. + */ + struct errinj *errinj = &errinjs[ERRINJ_SWIM_FD_ONLY]; + errinj->bparam = true; + /* + * No nodes are ever suspected. Because when a direct ping fails, the + * nodes simply will use indirect pings and they will work. + */ + isnt(swim_cluster_wait_status_anywhere(cluster, 0, + MEMBER_SUSPECTED, 10), + 0, "S1 is never suspected"); + isnt(swim_cluster_wait_status_anywhere(cluster, 1, + MEMBER_SUSPECTED, 10), + 0, "S2 is never suspected"); + errinj->bparam = false; + swim_cluster_delete(cluster); + swim_finish_test(); +} static int main_f(va_list ap) { - swim_start_test(1); + swim_start_test(2); (void) ap; swim_test_ev_init(); swim_test_transport_init(); swim_test_payload_refutation(); + swim_test_indirect_ping(); swim_test_transport_free(); swim_test_ev_free(); diff --git a/test/unit/swim_errinj.result b/test/unit/swim_errinj.result index c60389d7f..03e357e13 100644 --- a/test/unit/swim_errinj.result +++ b/test/unit/swim_errinj.result @@ -1,5 +1,5 @@ *** main_f *** -1..1 +1..2 *** swim_test_payload_refutation *** 1..11 ok 1 - S2 sees new version of S1 @@ -15,4 +15,10 @@ ok 11 - S3 learns S1's payload from S2 ok 1 - subtests *** swim_test_payload_refutation: done *** + *** swim_test_indirect_ping *** + 1..2 + ok 1 - S1 is never suspected + ok 2 - S2 is never suspected +ok 2 - subtests + *** swim_test_indirect_ping: done *** *** main_f: done *** -- 2.24.3 (Apple Git-128)