From: Konstantin Osipov <kostja@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>
Cc: tarantool-patches@freelists.org
Subject: [tarantool-patches] Re: [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps
Date: Wed, 3 Jul 2019 22:06:45 +0300 [thread overview]
Message-ID: <20190703190645.GB17318@atlas> (raw)
In-Reply-To: <e3914ca128214a29e4fd45f42ef2aef4c1f34326.1561924738.git.v.shpilevoy@tarantool.org>
* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [19/07/01 10:04]:
lgtm
> Before the patch there was a problem of events and anti-entropy
> starvation, when a cluster generates so many events, that they
> consume the whole UDP packet. A packet fits up to 26 events. If
> during the event storm something important happens, that event is
> likely to be lost, and not disseminated until the storm is over.
>
> Sadly, there is no way to prevent a storm, but it can be made
> much shorter. For that the patch makes TTD of events logarithmic
> instead of linear of cluster size.
>
> According to the SWIM paper and to experiments the logarithm is
> really enough. Linear TTD was a redundant overkill.
>
> When events live shorter, it does not solve a problem of the
> events starvation - still some of them can be lost in case of a
> storm. But it frees some space for anti-entropy, which can finish
> dissemination of lost events.
>
> Experiments in a simulation of a cluster with 100 nodes showed,
> that a failure dissemination happened in ~110 steps if there is
> a storm. Basically, no dissemination at all.
>
> After the patch it is ~20 steps. So it is logarithmic as it
> should be, although with a bigger constant than without a storm.
>
> However the patch can't be pushed as is. Several tests are broken
> now, because due to short TTD 'dead' members are dropped from
> member tables too early. It creates a possibility, that another
> node resurrects them back via anti-entropy.
>
> Part of #4253
> ---
> src/lib/swim/swim.c | 19 ++++++++++++-
> test/unit/swim.c | 68 ++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
> index bb9e9f519..9647094f5 100644
> --- a/src/lib/swim/swim.c
> +++ b/src/lib/swim/swim.c
> @@ -616,7 +616,24 @@ swim_register_event(struct swim *swim, struct swim_member *member)
> rlist_add_tail_entry(&swim->dissemination_queue, member,
> in_dissemination_queue);
> }
> - member->status_ttd = mh_size(swim->members);
> + /*
> + * Logarithm is a perfect number of disseminations of an
> + * event.
> + *
> + * Firstly, it matches the dissemination speed.
> + *
> + * Secondly, bigger number of disseminations (for example,
> + * linear) causes events and anti-entropy starvation in
> + * big clusters, when lots of events occupy the whole UDP
> + * packet, and factually the same packet content is being
> + * sent for quite a long time. No randomness. Anti-entropy
> + * does not get a chance to disseminate something new and
> + * random. Bigger orders are redundant and harmful.
> + *
> + * Thirdly, logarithm is proved by the original
> + * SWIM paper as the best option.
> + */
> + member->status_ttd = ceil(log2(mh_size(swim->members)));
> swim_cached_round_msg_invalidate(swim);
> }
>
> diff --git a/test/unit/swim.c b/test/unit/swim.c
> index 3486d3f73..d776a8231 100644
> --- a/test/unit/swim.c
> +++ b/test/unit/swim.c
> @@ -43,6 +43,7 @@
> #include "swim_test_utils.h"
> #include "trigger.h"
> #include <fcntl.h>
> +#include <math.h>
>
> /**
> * Test result is a real returned value of main_f. Fiber_join can
> @@ -1152,10 +1153,74 @@ swim_test_generation(void)
> swim_finish_test();
> }
>
> +static void
> +swim_test_dissemination_speed(void)
> +{
> + swim_start_test(2);
> +
> + int size = 100;
> + double ack_timeout = 0.1;
> + struct swim_cluster *cluster = swim_cluster_new(size);
> + swim_cluster_set_ack_timeout(cluster, ack_timeout);
> + swim_cluster_set_gc(cluster, SWIM_GC_OFF);
> + for (int i = 0; i < size; ++i) {
> + for (int j = i + 1; j < size; ++j)
> + swim_cluster_interconnect(cluster, i, j);
> + }
> + swim_cluster_set_drop(cluster, 0, 100);
> + fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> + MEMBER_DEAD, size) != 0);
> + /*
> + * Not a trivial problem - at start of a cluster there are
> + * so many events, that they occupy a UDP packet fully.
> + * All these events are 'a new member is added'. And
> + * because of that other much more important events
> + * starve. In this concrete test a new event 'member is
> + * dead' starves. To beat that problem SWIM says that
> + * events should be disseminated not longer than for a
> + * O(log) round steps. In such a case all the events are
> + * expired quite fast, and anti-entropy swiftly finishes
> + * the job. Usually this test works in log * 2, log * 3
> + * steps. Here it is log * 6 to avoid flakiness in some
> + * extra rare and slow random cases, but to still check
> + * for O(log) speed.
> + */
> + is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> + log2(size) * 6), 0,
> + "dissemination work in log time even at the very start of a cluster");
> + swim_cluster_set_drop(cluster, 0, 0);
> + fail_if(swim_cluster_wait_status_everywhere(cluster, 0,
> + MEMBER_ALIVE, size) != 0);
> + /*
> + * Another big-cluster case. Assume, that something
> + * happened and all the members generated an event. For
> + * example, changed their payload. It creates a storm of
> + * events, among which some important ones can be lost.
> + * Such as a failure detection. The only solution again -
> + * make the events as short living as possible in order to
> + * faster free space in a UDP packet for other events and
> + * for anti-entropy. The test below proves that even when
> + * there is an event storm, failure dissemination still
> + * works for O(log) time.
> + */
> + swim_cluster_set_drop(cluster, 0, 100);
> + fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> + MEMBER_DEAD, size) != 0);
> + for (int i = 0; i < size; ++i)
> + swim_cluster_member_set_payload(cluster, i, "", 0);
> + is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> + log2(size) * 6), 0,
> + "dissemination can withstand an event storm");
> +
> + swim_cluster_delete(cluster);
> +
> + swim_finish_test();
> +}
> +
> static int
> main_f(va_list ap)
> {
> - swim_start_test(22);
> + swim_start_test(23);
>
> (void) ap;
> swim_test_ev_init();
> @@ -1183,6 +1248,7 @@ main_f(va_list ap)
> swim_test_slow_net();
> swim_test_triggers();
> swim_test_generation();
> + swim_test_dissemination_speed();
>
> swim_test_transport_free();
> swim_test_ev_free();
> --
> 2.20.1 (Apple Git-117)
>
--
Konstantin Osipov, Moscow, Russia
next prev parent reply other threads:[~2019-07-03 19:06 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-06-30 20:01 [tarantool-patches] [PATCH v2 0/2] SWIM log TTD, and suspicious anti-entropy Vladislav Shpilevoy
2019-06-30 20:01 ` [tarantool-patches] [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps Vladislav Shpilevoy
2019-07-03 19:06 ` Konstantin Osipov [this message]
2019-07-03 23:30 ` [tarantool-patches] " Vladislav Shpilevoy
2019-07-04 8:10 ` Konstantin Osipov
2019-06-30 20:01 ` [tarantool-patches] [PATCH v2 2/2] swim: be suspicious when add new member Vladislav Shpilevoy
2019-07-03 19:07 ` [tarantool-patches] " Konstantin Osipov
2019-07-03 23:32 ` Vladislav Shpilevoy
2019-07-04 8:10 ` Konstantin Osipov
2019-07-05 22:45 ` [tarantool-patches] Re: [PATCH v2 0/2] SWIM log TTD, and suspicious anti-entropy Vladislav Shpilevoy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190703190645.GB17318@atlas \
--to=kostja@tarantool.org \
--cc=tarantool-patches@freelists.org \
--cc=v.shpilevoy@tarantool.org \
--subject='[tarantool-patches] Re: [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps' \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox