From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp16.mail.ru (smtp16.mail.ru [94.100.176.153]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id C4683469710 for ; Thu, 19 Nov 2020 08:38:26 +0300 (MSK) Date: Thu, 19 Nov 2020 08:38:24 +0300 From: "Alexander V. Tikhonov" Message-ID: <20201119053824.GA76626@hpalx> References: <62fe7bdd3888e60c740a2ed704de85a4f19ad140.1605734663.git.v.shpilevoy@tarantool.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <62fe7bdd3888e60c740a2ed704de85a4f19ad140.1605734663.git.v.shpilevoy@tarantool.org> Subject: Re: [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy Cc: tarantool-patches@dev.tarantool.org Hi Vlad, thanks for the patch, as I see no new degradation found in gitlab-ci testing commit criteria pipeline [1], patch LGTM. [1] - https://gitlab.com/tarantool/tarantool/-/pipelines/218165816 On Wed, Nov 18, 2020 at 10:24:41PM +0100, Vladislav Shpilevoy wrote: > swim_test_indirect_ping() failed with random seed 1605651752. > > The test created a cluster with 3 swim nodes, and broke network > connection between node-1 and node-2. Then it run the cluster for > 10 seconds, and ensured, that both node-1 and node-2 are > eventually alive despite they are suspected sometimes. > > node1 <-> node3 <-> node2 > > 'Alive' means that a node is considered alive on all the other > nodes. > > The test spun for 10 seconds giving the nodes a chance to become > suspected. Then it checked that node-1 is either still alive, or > it is suspected, but will be restored in at most 3 seconds. The > same was checked for node-2. They were supposed to interact via > node-3. > > 3 seconds was used assuming that the worst what could happen is > that it is suspected from the beginning of this three-second > interval on node-3, because it was suspected by node-2 and > disseminated to node-3. > > Then node-3 might need 1 second to finish its current > dissemination round by sending a ping to node-2, 1 second to start > new round randomly again from node-2, and only then send a ping to > node-1. So 3 seconds total. > > But it could also happen, that in the beginning of the > three-second interval node-1 is already suspected on node-2. On > the next step node-2 shares the suspicion with node-3. And then > the scenario above happens. So the test case needed at least 4 > seconds. > > And actually it could happen infinitely, because while the test > waits for 3 seconds of gossip refutation about node-1 on node-3, > node-2 can suspect it again. And so on. > > Also the test would pass even without indirect pings. Because > node-3 has access to node-1 and node-2. So even if, say, node-1 > suspects node-2, then it will tell node-3 about it. Node-3 will > ping node-2, get ack, and will refute the gossip. The refutation > will be then sent to node-1 back. It means indirect pings don't > matter here. > > The patch makes a new test, which won't pass without indirect > pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY, > which allows to turn off all the SWIM components except failure > detection. So only pings and acks are being sent. > > Then without proper indirect pings node-1 and node-2 would suspect > each other and declare dead eventually. The new test checks it > does not happen. > > Closes #5399 > --- > Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5399-swim-flaky > Issue: https://github.com/tarantool/tarantool/issues/5399 > > test/unit/suite.ini | 8 ------- > test/unit/swim.c | 26 +---------------------- > test/unit/swim.result | 20 ++++++------------ > test/unit/swim_errinj.c | 41 +++++++++++++++++++++++++++++++++++- > test/unit/swim_errinj.result | 8 ++++++- > 5 files changed, 55 insertions(+), 48 deletions(-) > > diff --git a/test/unit/suite.ini b/test/unit/suite.ini > index fe871b287..d16ba413b 100644 > --- a/test/unit/suite.ini > +++ b/test/unit/suite.ini > @@ -4,11 +4,3 @@ description = unit tests > disabled = snap_quorum_delay.test > release_disabled = fiber_stack.test swim_errinj.test > is_parallel = True > -fragile = { > - "retries": 10, > - "tests": { > - "swim.test": { > - "issues": [ "gh-5399" ] > - } > - } > - } > diff --git a/test/unit/swim.c b/test/unit/swim.c > index bb12baf8d..4e9dce8ce 100644 > --- a/test/unit/swim.c > +++ b/test/unit/swim.c > @@ -735,29 +735,6 @@ swim_test_payload_basic(void) > swim_finish_test(); > } > > -static void > -swim_test_indirect_ping(void) > -{ > - swim_start_test(2); > - uint16_t cluster_size = 3; > - struct swim_cluster *cluster = swim_cluster_new(cluster_size); > - swim_cluster_set_ack_timeout(cluster, 1); > - for (int i = 0; i < cluster_size; ++i) { > - for (int j = i + 1; j < cluster_size; ++j) > - swim_cluster_interconnect(cluster, i, j); > - } > - swim_cluster_set_drop_channel(cluster, 0, 1, true); > - swim_cluster_set_drop_channel(cluster, 1, 0, true); > - swim_run_for(10); > - is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_ALIVE, 3), > - 0, "S1 is still alive everywhere"); > - is(swim_cluster_wait_status_everywhere(cluster, 1, MEMBER_ALIVE, 3), > - 0, "as well as S2 - they communicated via S3"); > - > - swim_cluster_delete(cluster); > - swim_finish_test(); > -} > - > static void > swim_test_encryption(void) > { > @@ -1097,7 +1074,7 @@ swim_test_suspect_new_members(void) > static int > main_f(va_list ap) > { > - swim_start_test(23); > + swim_start_test(22); > > (void) ap; > swim_test_ev_init(); > @@ -1119,7 +1096,6 @@ main_f(va_list ap) > swim_test_uri_update(); > swim_test_broadcast(); > swim_test_payload_basic(); > - swim_test_indirect_ping(); > swim_test_encryption(); > swim_test_slow_net(); > swim_test_triggers(); > diff --git a/test/unit/swim.result b/test/unit/swim.result > index 07ea6a1a0..d4b1dba49 100644 > --- a/test/unit/swim.result > +++ b/test/unit/swim.result > @@ -1,5 +1,5 @@ > *** main_f *** > -1..23 > +1..22 > *** swim_test_one_link *** > 1..6 > ok 1 - no rounds - no fullmesh > @@ -169,23 +169,17 @@ ok 15 - subtests > ok 11 - third payload is disseminated via anti-entropy > ok 16 - subtests > *** swim_test_payload_basic: done *** > - *** swim_test_indirect_ping *** > - 1..2 > - ok 1 - S1 is still alive everywhere > - ok 2 - as well as S2 - they communicated via S3 > -ok 17 - subtests > - *** swim_test_indirect_ping: done *** > *** swim_test_encryption *** > 1..3 > ok 1 - cluster works with encryption > ok 2 - different encryption keys - can't interact > ok 3 - cluster works after encryption has been disabled > -ok 18 - subtests > +ok 17 - subtests > *** swim_test_encryption: done *** > *** swim_test_slow_net *** > 1..0 > # slow network leads to idle round steps, they should not produce a new message > -ok 19 - subtests > +ok 18 - subtests > *** swim_test_slow_net: done *** > *** swim_test_triggers *** > 1..20 > @@ -210,25 +204,25 @@ ok 19 - subtests > # now all the triggers are done and deleted > ok 19 - local URI update warns about version update > ok 20 - version is a part of incarnation, so the latter is updated too > -ok 20 - subtests > +ok 19 - subtests > *** swim_test_triggers: done *** > *** swim_test_generation *** > 1..3 > ok 1 - S1 disseminated its payload to S2 > ok 2 - S1 restarted and set another payload. Without generation it could lead to never disseminated new payload. > ok 3 - S2 sees new generation of S1 > -ok 21 - subtests > +ok 20 - subtests > *** swim_test_generation: done *** > *** swim_test_dissemination_speed *** > 1..2 > ok 1 - dissemination work in log time even at the very start of a cluster > ok 2 - dissemination can withstand an event storm > -ok 22 - subtests > +ok 21 - subtests > *** swim_test_dissemination_speed: done *** > *** swim_test_suspect_new_members *** > 1..2 > ok 1 - S2 dropped S1 as dead > ok 2 - S3 didn't add S1 from S2's messages, because S1 didn't answer on a ping > -ok 23 - subtests > +ok 22 - subtests > *** swim_test_suspect_new_members: done *** > *** main_f: done *** > diff --git a/test/unit/swim_errinj.c b/test/unit/swim_errinj.c > index 2e89f1821..0f8718a33 100644 > --- a/test/unit/swim_errinj.c > +++ b/test/unit/swim_errinj.c > @@ -171,17 +171,56 @@ swim_test_payload_refutation(void) > swim_finish_test(); > } > > +static void > +swim_test_indirect_ping(void) > +{ > + swim_start_test(2); > + uint16_t cluster_size = 3; > + struct swim_cluster *cluster = swim_cluster_new(cluster_size); > + swim_cluster_set_ack_timeout(cluster, 0.5); > + for (int i = 0; i < cluster_size; ++i) { > + for (int j = i + 1; j < cluster_size; ++j) > + swim_cluster_interconnect(cluster, i, j); > + } > + swim_cluster_set_drop_channel(cluster, 0, 1, true); > + swim_cluster_set_drop_channel(cluster, 1, 0, true); > + /* > + * There are alive channels S1 <-> S3 and S2 <-> S3. Things are not > + * interesting when S3 can send dissemination, because even if S1 and S2 > + * suspect each other, they will share it with S3, and S3 will refute it > + * via dissemination, because has access to both of them. > + * When only failure detection works, a suspicion can only be refuted by > + * pings and acks. In this test pings-acks between S1 and S2 will need > + * to be indirect. > + */ > + struct errinj *errinj = &errinjs[ERRINJ_SWIM_FD_ONLY]; > + errinj->bparam = true; > + /* > + * No nodes are ever suspected. Because when a direct ping fails, the > + * nodes simply will use indirect pings and they will work. > + */ > + isnt(swim_cluster_wait_status_anywhere(cluster, 0, > + MEMBER_SUSPECTED, 10), > + 0, "S1 is never suspected"); > + isnt(swim_cluster_wait_status_anywhere(cluster, 1, > + MEMBER_SUSPECTED, 10), > + 0, "S2 is never suspected"); > + errinj->bparam = false; > + swim_cluster_delete(cluster); > + swim_finish_test(); > +} > > static int > main_f(va_list ap) > { > - swim_start_test(1); > + swim_start_test(2); > > (void) ap; > swim_test_ev_init(); > swim_test_transport_init(); > > swim_test_payload_refutation(); > + swim_test_indirect_ping(); > > swim_test_transport_free(); > swim_test_ev_free(); > diff --git a/test/unit/swim_errinj.result b/test/unit/swim_errinj.result > index c60389d7f..03e357e13 100644 > --- a/test/unit/swim_errinj.result > +++ b/test/unit/swim_errinj.result > @@ -1,5 +1,5 @@ > *** main_f *** > -1..1 > +1..2 > *** swim_test_payload_refutation *** > 1..11 > ok 1 - S2 sees new version of S1 > @@ -15,4 +15,10 @@ > ok 11 - S3 learns S1's payload from S2 > ok 1 - subtests > *** swim_test_payload_refutation: done *** > + *** swim_test_indirect_ping *** > + 1..2 > + ok 1 - S1 is never suspected > + ok 2 - S2 is never suspected > +ok 2 - subtests > + *** swim_test_indirect_ping: done *** > *** main_f: done *** > -- > 2.24.3 (Apple Git-128) >