* [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test
@ 2020-11-18 21:24 Vladislav Shpilevoy
2020-11-19 5:38 ` Alexander V. Tikhonov
2020-11-20 18:10 ` Vladislav Shpilevoy
0 siblings, 2 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-18 21:24 UTC (permalink / raw)
To: tarantool-patches, avtikhon
swim_test_indirect_ping() failed with random seed 1605651752.
The test created a cluster with 3 swim nodes, and broke network
connection between node-1 and node-2. Then it run the cluster for
10 seconds, and ensured, that both node-1 and node-2 are
eventually alive despite they are suspected sometimes.
node1 <-> node3 <-> node2
'Alive' means that a node is considered alive on all the other
nodes.
The test spun for 10 seconds giving the nodes a chance to become
suspected. Then it checked that node-1 is either still alive, or
it is suspected, but will be restored in at most 3 seconds. The
same was checked for node-2. They were supposed to interact via
node-3.
3 seconds was used assuming that the worst what could happen is
that it is suspected from the beginning of this three-second
interval on node-3, because it was suspected by node-2 and
disseminated to node-3.
Then node-3 might need 1 second to finish its current
dissemination round by sending a ping to node-2, 1 second to start
new round randomly again from node-2, and only then send a ping to
node-1. So 3 seconds total.
But it could also happen, that in the beginning of the
three-second interval node-1 is already suspected on node-2. On
the next step node-2 shares the suspicion with node-3. And then
the scenario above happens. So the test case needed at least 4
seconds.
And actually it could happen infinitely, because while the test
waits for 3 seconds of gossip refutation about node-1 on node-3,
node-2 can suspect it again. And so on.
Also the test would pass even without indirect pings. Because
node-3 has access to node-1 and node-2. So even if, say, node-1
suspects node-2, then it will tell node-3 about it. Node-3 will
ping node-2, get ack, and will refute the gossip. The refutation
will be then sent to node-1 back. It means indirect pings don't
matter here.
The patch makes a new test, which won't pass without indirect
pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY,
which allows to turn off all the SWIM components except failure
detection. So only pings and acks are being sent.
Then without proper indirect pings node-1 and node-2 would suspect
each other and declare dead eventually. The new test checks it
does not happen.
Closes #5399
---
Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5399-swim-flaky
Issue: https://github.com/tarantool/tarantool/issues/5399
test/unit/suite.ini | 8 -------
test/unit/swim.c | 26 +----------------------
test/unit/swim.result | 20 ++++++------------
test/unit/swim_errinj.c | 41 +++++++++++++++++++++++++++++++++++-
test/unit/swim_errinj.result | 8 ++++++-
5 files changed, 55 insertions(+), 48 deletions(-)
diff --git a/test/unit/suite.ini b/test/unit/suite.ini
index fe871b287..d16ba413b 100644
--- a/test/unit/suite.ini
+++ b/test/unit/suite.ini
@@ -4,11 +4,3 @@ description = unit tests
disabled = snap_quorum_delay.test
release_disabled = fiber_stack.test swim_errinj.test
is_parallel = True
-fragile = {
- "retries": 10,
- "tests": {
- "swim.test": {
- "issues": [ "gh-5399" ]
- }
- }
- }
diff --git a/test/unit/swim.c b/test/unit/swim.c
index bb12baf8d..4e9dce8ce 100644
--- a/test/unit/swim.c
+++ b/test/unit/swim.c
@@ -735,29 +735,6 @@ swim_test_payload_basic(void)
swim_finish_test();
}
-static void
-swim_test_indirect_ping(void)
-{
- swim_start_test(2);
- uint16_t cluster_size = 3;
- struct swim_cluster *cluster = swim_cluster_new(cluster_size);
- swim_cluster_set_ack_timeout(cluster, 1);
- for (int i = 0; i < cluster_size; ++i) {
- for (int j = i + 1; j < cluster_size; ++j)
- swim_cluster_interconnect(cluster, i, j);
- }
- swim_cluster_set_drop_channel(cluster, 0, 1, true);
- swim_cluster_set_drop_channel(cluster, 1, 0, true);
- swim_run_for(10);
- is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_ALIVE, 3),
- 0, "S1 is still alive everywhere");
- is(swim_cluster_wait_status_everywhere(cluster, 1, MEMBER_ALIVE, 3),
- 0, "as well as S2 - they communicated via S3");
-
- swim_cluster_delete(cluster);
- swim_finish_test();
-}
-
static void
swim_test_encryption(void)
{
@@ -1097,7 +1074,7 @@ swim_test_suspect_new_members(void)
static int
main_f(va_list ap)
{
- swim_start_test(23);
+ swim_start_test(22);
(void) ap;
swim_test_ev_init();
@@ -1119,7 +1096,6 @@ main_f(va_list ap)
swim_test_uri_update();
swim_test_broadcast();
swim_test_payload_basic();
- swim_test_indirect_ping();
swim_test_encryption();
swim_test_slow_net();
swim_test_triggers();
diff --git a/test/unit/swim.result b/test/unit/swim.result
index 07ea6a1a0..d4b1dba49 100644
--- a/test/unit/swim.result
+++ b/test/unit/swim.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..23
+1..22
*** swim_test_one_link ***
1..6
ok 1 - no rounds - no fullmesh
@@ -169,23 +169,17 @@ ok 15 - subtests
ok 11 - third payload is disseminated via anti-entropy
ok 16 - subtests
*** swim_test_payload_basic: done ***
- *** swim_test_indirect_ping ***
- 1..2
- ok 1 - S1 is still alive everywhere
- ok 2 - as well as S2 - they communicated via S3
-ok 17 - subtests
- *** swim_test_indirect_ping: done ***
*** swim_test_encryption ***
1..3
ok 1 - cluster works with encryption
ok 2 - different encryption keys - can't interact
ok 3 - cluster works after encryption has been disabled
-ok 18 - subtests
+ok 17 - subtests
*** swim_test_encryption: done ***
*** swim_test_slow_net ***
1..0
# slow network leads to idle round steps, they should not produce a new message
-ok 19 - subtests
+ok 18 - subtests
*** swim_test_slow_net: done ***
*** swim_test_triggers ***
1..20
@@ -210,25 +204,25 @@ ok 19 - subtests
# now all the triggers are done and deleted
ok 19 - local URI update warns about version update
ok 20 - version is a part of incarnation, so the latter is updated too
-ok 20 - subtests
+ok 19 - subtests
*** swim_test_triggers: done ***
*** swim_test_generation ***
1..3
ok 1 - S1 disseminated its payload to S2
ok 2 - S1 restarted and set another payload. Without generation it could lead to never disseminated new payload.
ok 3 - S2 sees new generation of S1
-ok 21 - subtests
+ok 20 - subtests
*** swim_test_generation: done ***
*** swim_test_dissemination_speed ***
1..2
ok 1 - dissemination work in log time even at the very start of a cluster
ok 2 - dissemination can withstand an event storm
-ok 22 - subtests
+ok 21 - subtests
*** swim_test_dissemination_speed: done ***
*** swim_test_suspect_new_members ***
1..2
ok 1 - S2 dropped S1 as dead
ok 2 - S3 didn't add S1 from S2's messages, because S1 didn't answer on a ping
-ok 23 - subtests
+ok 22 - subtests
*** swim_test_suspect_new_members: done ***
*** main_f: done ***
diff --git a/test/unit/swim_errinj.c b/test/unit/swim_errinj.c
index 2e89f1821..0f8718a33 100644
--- a/test/unit/swim_errinj.c
+++ b/test/unit/swim_errinj.c
@@ -171,17 +171,56 @@ swim_test_payload_refutation(void)
swim_finish_test();
}
+static void
+swim_test_indirect_ping(void)
+{
+ swim_start_test(2);
+ uint16_t cluster_size = 3;
+ struct swim_cluster *cluster = swim_cluster_new(cluster_size);
+ swim_cluster_set_ack_timeout(cluster, 0.5);
+ for (int i = 0; i < cluster_size; ++i) {
+ for (int j = i + 1; j < cluster_size; ++j)
+ swim_cluster_interconnect(cluster, i, j);
+ }
+ swim_cluster_set_drop_channel(cluster, 0, 1, true);
+ swim_cluster_set_drop_channel(cluster, 1, 0, true);
+ /*
+ * There are alive channels S1 <-> S3 and S2 <-> S3. Things are not
+ * interesting when S3 can send dissemination, because even if S1 and S2
+ * suspect each other, they will share it with S3, and S3 will refute it
+ * via dissemination, because has access to both of them.
+ * When only failure detection works, a suspicion can only be refuted by
+ * pings and acks. In this test pings-acks between S1 and S2 will need
+ * to be indirect.
+ */
+ struct errinj *errinj = &errinjs[ERRINJ_SWIM_FD_ONLY];
+ errinj->bparam = true;
+ /*
+ * No nodes are ever suspected. Because when a direct ping fails, the
+ * nodes simply will use indirect pings and they will work.
+ */
+ isnt(swim_cluster_wait_status_anywhere(cluster, 0,
+ MEMBER_SUSPECTED, 10),
+ 0, "S1 is never suspected");
+ isnt(swim_cluster_wait_status_anywhere(cluster, 1,
+ MEMBER_SUSPECTED, 10),
+ 0, "S2 is never suspected");
+ errinj->bparam = false;
+ swim_cluster_delete(cluster);
+ swim_finish_test();
+}
static int
main_f(va_list ap)
{
- swim_start_test(1);
+ swim_start_test(2);
(void) ap;
swim_test_ev_init();
swim_test_transport_init();
swim_test_payload_refutation();
+ swim_test_indirect_ping();
swim_test_transport_free();
swim_test_ev_free();
diff --git a/test/unit/swim_errinj.result b/test/unit/swim_errinj.result
index c60389d7f..03e357e13 100644
--- a/test/unit/swim_errinj.result
+++ b/test/unit/swim_errinj.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..1
+1..2
*** swim_test_payload_refutation ***
1..11
ok 1 - S2 sees new version of S1
@@ -15,4 +15,10 @@
ok 11 - S3 learns S1's payload from S2
ok 1 - subtests
*** swim_test_payload_refutation: done ***
+ *** swim_test_indirect_ping ***
+ 1..2
+ ok 1 - S1 is never suspected
+ ok 2 - S2 is never suspected
+ok 2 - subtests
+ *** swim_test_indirect_ping: done ***
*** main_f: done ***
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test
2020-11-18 21:24 [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test Vladislav Shpilevoy
@ 2020-11-19 5:38 ` Alexander V. Tikhonov
2020-11-20 18:10 ` Vladislav Shpilevoy
1 sibling, 0 replies; 3+ messages in thread
From: Alexander V. Tikhonov @ 2020-11-19 5:38 UTC (permalink / raw)
To: Vladislav Shpilevoy; +Cc: tarantool-patches
Hi Vlad, thanks for the patch, as I see no new degradation found in
gitlab-ci testing commit criteria pipeline [1], patch LGTM.
[1] - https://gitlab.com/tarantool/tarantool/-/pipelines/218165816
On Wed, Nov 18, 2020 at 10:24:41PM +0100, Vladislav Shpilevoy wrote:
> swim_test_indirect_ping() failed with random seed 1605651752.
>
> The test created a cluster with 3 swim nodes, and broke network
> connection between node-1 and node-2. Then it run the cluster for
> 10 seconds, and ensured, that both node-1 and node-2 are
> eventually alive despite they are suspected sometimes.
>
> node1 <-> node3 <-> node2
>
> 'Alive' means that a node is considered alive on all the other
> nodes.
>
> The test spun for 10 seconds giving the nodes a chance to become
> suspected. Then it checked that node-1 is either still alive, or
> it is suspected, but will be restored in at most 3 seconds. The
> same was checked for node-2. They were supposed to interact via
> node-3.
>
> 3 seconds was used assuming that the worst what could happen is
> that it is suspected from the beginning of this three-second
> interval on node-3, because it was suspected by node-2 and
> disseminated to node-3.
>
> Then node-3 might need 1 second to finish its current
> dissemination round by sending a ping to node-2, 1 second to start
> new round randomly again from node-2, and only then send a ping to
> node-1. So 3 seconds total.
>
> But it could also happen, that in the beginning of the
> three-second interval node-1 is already suspected on node-2. On
> the next step node-2 shares the suspicion with node-3. And then
> the scenario above happens. So the test case needed at least 4
> seconds.
>
> And actually it could happen infinitely, because while the test
> waits for 3 seconds of gossip refutation about node-1 on node-3,
> node-2 can suspect it again. And so on.
>
> Also the test would pass even without indirect pings. Because
> node-3 has access to node-1 and node-2. So even if, say, node-1
> suspects node-2, then it will tell node-3 about it. Node-3 will
> ping node-2, get ack, and will refute the gossip. The refutation
> will be then sent to node-1 back. It means indirect pings don't
> matter here.
>
> The patch makes a new test, which won't pass without indirect
> pings. It uses the existing error injection ERRINJ_SWIM_FD_ONLY,
> which allows to turn off all the SWIM components except failure
> detection. So only pings and acks are being sent.
>
> Then without proper indirect pings node-1 and node-2 would suspect
> each other and declare dead eventually. The new test checks it
> does not happen.
>
> Closes #5399
> ---
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-5399-swim-flaky
> Issue: https://github.com/tarantool/tarantool/issues/5399
>
> test/unit/suite.ini | 8 -------
> test/unit/swim.c | 26 +----------------------
> test/unit/swim.result | 20 ++++++------------
> test/unit/swim_errinj.c | 41 +++++++++++++++++++++++++++++++++++-
> test/unit/swim_errinj.result | 8 ++++++-
> 5 files changed, 55 insertions(+), 48 deletions(-)
>
> diff --git a/test/unit/suite.ini b/test/unit/suite.ini
> index fe871b287..d16ba413b 100644
> --- a/test/unit/suite.ini
> +++ b/test/unit/suite.ini
> @@ -4,11 +4,3 @@ description = unit tests
> disabled = snap_quorum_delay.test
> release_disabled = fiber_stack.test swim_errinj.test
> is_parallel = True
> -fragile = {
> - "retries": 10,
> - "tests": {
> - "swim.test": {
> - "issues": [ "gh-5399" ]
> - }
> - }
> - }
> diff --git a/test/unit/swim.c b/test/unit/swim.c
> index bb12baf8d..4e9dce8ce 100644
> --- a/test/unit/swim.c
> +++ b/test/unit/swim.c
> @@ -735,29 +735,6 @@ swim_test_payload_basic(void)
> swim_finish_test();
> }
>
> -static void
> -swim_test_indirect_ping(void)
> -{
> - swim_start_test(2);
> - uint16_t cluster_size = 3;
> - struct swim_cluster *cluster = swim_cluster_new(cluster_size);
> - swim_cluster_set_ack_timeout(cluster, 1);
> - for (int i = 0; i < cluster_size; ++i) {
> - for (int j = i + 1; j < cluster_size; ++j)
> - swim_cluster_interconnect(cluster, i, j);
> - }
> - swim_cluster_set_drop_channel(cluster, 0, 1, true);
> - swim_cluster_set_drop_channel(cluster, 1, 0, true);
> - swim_run_for(10);
> - is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_ALIVE, 3),
> - 0, "S1 is still alive everywhere");
> - is(swim_cluster_wait_status_everywhere(cluster, 1, MEMBER_ALIVE, 3),
> - 0, "as well as S2 - they communicated via S3");
> -
> - swim_cluster_delete(cluster);
> - swim_finish_test();
> -}
> -
> static void
> swim_test_encryption(void)
> {
> @@ -1097,7 +1074,7 @@ swim_test_suspect_new_members(void)
> static int
> main_f(va_list ap)
> {
> - swim_start_test(23);
> + swim_start_test(22);
>
> (void) ap;
> swim_test_ev_init();
> @@ -1119,7 +1096,6 @@ main_f(va_list ap)
> swim_test_uri_update();
> swim_test_broadcast();
> swim_test_payload_basic();
> - swim_test_indirect_ping();
> swim_test_encryption();
> swim_test_slow_net();
> swim_test_triggers();
> diff --git a/test/unit/swim.result b/test/unit/swim.result
> index 07ea6a1a0..d4b1dba49 100644
> --- a/test/unit/swim.result
> +++ b/test/unit/swim.result
> @@ -1,5 +1,5 @@
> *** main_f ***
> -1..23
> +1..22
> *** swim_test_one_link ***
> 1..6
> ok 1 - no rounds - no fullmesh
> @@ -169,23 +169,17 @@ ok 15 - subtests
> ok 11 - third payload is disseminated via anti-entropy
> ok 16 - subtests
> *** swim_test_payload_basic: done ***
> - *** swim_test_indirect_ping ***
> - 1..2
> - ok 1 - S1 is still alive everywhere
> - ok 2 - as well as S2 - they communicated via S3
> -ok 17 - subtests
> - *** swim_test_indirect_ping: done ***
> *** swim_test_encryption ***
> 1..3
> ok 1 - cluster works with encryption
> ok 2 - different encryption keys - can't interact
> ok 3 - cluster works after encryption has been disabled
> -ok 18 - subtests
> +ok 17 - subtests
> *** swim_test_encryption: done ***
> *** swim_test_slow_net ***
> 1..0
> # slow network leads to idle round steps, they should not produce a new message
> -ok 19 - subtests
> +ok 18 - subtests
> *** swim_test_slow_net: done ***
> *** swim_test_triggers ***
> 1..20
> @@ -210,25 +204,25 @@ ok 19 - subtests
> # now all the triggers are done and deleted
> ok 19 - local URI update warns about version update
> ok 20 - version is a part of incarnation, so the latter is updated too
> -ok 20 - subtests
> +ok 19 - subtests
> *** swim_test_triggers: done ***
> *** swim_test_generation ***
> 1..3
> ok 1 - S1 disseminated its payload to S2
> ok 2 - S1 restarted and set another payload. Without generation it could lead to never disseminated new payload.
> ok 3 - S2 sees new generation of S1
> -ok 21 - subtests
> +ok 20 - subtests
> *** swim_test_generation: done ***
> *** swim_test_dissemination_speed ***
> 1..2
> ok 1 - dissemination work in log time even at the very start of a cluster
> ok 2 - dissemination can withstand an event storm
> -ok 22 - subtests
> +ok 21 - subtests
> *** swim_test_dissemination_speed: done ***
> *** swim_test_suspect_new_members ***
> 1..2
> ok 1 - S2 dropped S1 as dead
> ok 2 - S3 didn't add S1 from S2's messages, because S1 didn't answer on a ping
> -ok 23 - subtests
> +ok 22 - subtests
> *** swim_test_suspect_new_members: done ***
> *** main_f: done ***
> diff --git a/test/unit/swim_errinj.c b/test/unit/swim_errinj.c
> index 2e89f1821..0f8718a33 100644
> --- a/test/unit/swim_errinj.c
> +++ b/test/unit/swim_errinj.c
> @@ -171,17 +171,56 @@ swim_test_payload_refutation(void)
> swim_finish_test();
> }
>
> +static void
> +swim_test_indirect_ping(void)
> +{
> + swim_start_test(2);
> + uint16_t cluster_size = 3;
> + struct swim_cluster *cluster = swim_cluster_new(cluster_size);
> + swim_cluster_set_ack_timeout(cluster, 0.5);
> + for (int i = 0; i < cluster_size; ++i) {
> + for (int j = i + 1; j < cluster_size; ++j)
> + swim_cluster_interconnect(cluster, i, j);
> + }
> + swim_cluster_set_drop_channel(cluster, 0, 1, true);
> + swim_cluster_set_drop_channel(cluster, 1, 0, true);
> + /*
> + * There are alive channels S1 <-> S3 and S2 <-> S3. Things are not
> + * interesting when S3 can send dissemination, because even if S1 and S2
> + * suspect each other, they will share it with S3, and S3 will refute it
> + * via dissemination, because has access to both of them.
> + * When only failure detection works, a suspicion can only be refuted by
> + * pings and acks. In this test pings-acks between S1 and S2 will need
> + * to be indirect.
> + */
> + struct errinj *errinj = &errinjs[ERRINJ_SWIM_FD_ONLY];
> + errinj->bparam = true;
> + /*
> + * No nodes are ever suspected. Because when a direct ping fails, the
> + * nodes simply will use indirect pings and they will work.
> + */
> + isnt(swim_cluster_wait_status_anywhere(cluster, 0,
> + MEMBER_SUSPECTED, 10),
> + 0, "S1 is never suspected");
> + isnt(swim_cluster_wait_status_anywhere(cluster, 1,
> + MEMBER_SUSPECTED, 10),
> + 0, "S2 is never suspected");
> + errinj->bparam = false;
> + swim_cluster_delete(cluster);
> + swim_finish_test();
> +}
>
> static int
> main_f(va_list ap)
> {
> - swim_start_test(1);
> + swim_start_test(2);
>
> (void) ap;
> swim_test_ev_init();
> swim_test_transport_init();
>
> swim_test_payload_refutation();
> + swim_test_indirect_ping();
>
> swim_test_transport_free();
> swim_test_ev_free();
> diff --git a/test/unit/swim_errinj.result b/test/unit/swim_errinj.result
> index c60389d7f..03e357e13 100644
> --- a/test/unit/swim_errinj.result
> +++ b/test/unit/swim_errinj.result
> @@ -1,5 +1,5 @@
> *** main_f ***
> -1..1
> +1..2
> *** swim_test_payload_refutation ***
> 1..11
> ok 1 - S2 sees new version of S1
> @@ -15,4 +15,10 @@
> ok 11 - S3 learns S1's payload from S2
> ok 1 - subtests
> *** swim_test_payload_refutation: done ***
> + *** swim_test_indirect_ping ***
> + 1..2
> + ok 1 - S1 is never suspected
> + ok 2 - S2 is never suspected
> +ok 2 - subtests
> + *** swim_test_indirect_ping: done ***
> *** main_f: done ***
> --
> 2.24.3 (Apple Git-128)
>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test
2020-11-18 21:24 [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test Vladislav Shpilevoy
2020-11-19 5:38 ` Alexander V. Tikhonov
@ 2020-11-20 18:10 ` Vladislav Shpilevoy
1 sibling, 0 replies; 3+ messages in thread
From: Vladislav Shpilevoy @ 2020-11-20 18:10 UTC (permalink / raw)
To: tarantool-patches, avtikhon
Pushed to master, 2.6, 2.5.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2020-11-20 18:11 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-18 21:24 [Tarantool-patches] [PATCH 1/1] test: fix flaky unit/swim test Vladislav Shpilevoy
2020-11-19 5:38 ` Alexander V. Tikhonov
2020-11-20 18:10 ` Vladislav Shpilevoy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox