From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 3 Jul 2019 22:06:45 +0300
From: Konstantin Osipov
Subject: [tarantool-patches] Re: [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps
Message-ID: <20190703190645.GB17318@atlas>
Reply-To: tarantool-patches@freelists.org
List-Id: tarantool-patches
To: Vladislav Shpilevoy
Cc: tarantool-patches@freelists.org

* Vladislav Shpilevoy [19/07/01 10:04]:

lgtm

> Before this patch there was a problem of event and anti-entropy
> starvation: a cluster could generate so many events that they
> consumed the whole UDP packet. A packet fits at most 26 events.
> If something important happened during such an event storm, that
> event was likely to be lost and not disseminated until the storm
> was over.
>
> Sadly, there is no way to prevent a storm, but it can be made
> much shorter. For that the patch makes the TTD of events
> logarithmic in the cluster size instead of linear.
>
> According to the SWIM paper and to experiments, a logarithm is
> really enough. Linear TTD was redundant overkill.
>
> Shorter-lived events do not solve the starvation problem by
> themselves - some of them can still be lost during a storm. But
> they free space for anti-entropy, which can finish the
> dissemination of lost events.
>
> Experiments in a simulation of a 100-node cluster showed that
> failure dissemination took ~110 steps during a storm. Basically,
> no dissemination at all.
>
> After the patch it takes ~20 steps. So it is logarithmic, as it
> should be, although with a bigger constant than without a storm.
>
> However, the patch can't be pushed as is. Several tests are
> broken now, because due to the short TTD 'dead' members are
> dropped from member tables too early. That creates a possibility
> that another node resurrects them via anti-entropy.
>
> Part of #4253
> ---
>  src/lib/swim/swim.c | 19 ++++++++++++-
>  test/unit/swim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 85 insertions(+), 2 deletions(-)
>
> diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
> index bb9e9f519..9647094f5 100644
> --- a/src/lib/swim/swim.c
> +++ b/src/lib/swim/swim.c
> @@ -616,7 +616,24 @@ swim_register_event(struct swim *swim, struct swim_member *member)
>  		rlist_add_tail_entry(&swim->dissemination_queue, member,
>  				     in_dissemination_queue);
>  	}
> -	member->status_ttd = mh_size(swim->members);
> +	/*
> +	 * A logarithm is the perfect number of disseminations of
> +	 * an event.
> +	 *
> +	 * Firstly, it matches the dissemination speed.
> +	 *
> +	 * Secondly, a bigger number of disseminations (for
> +	 * example, linear) causes event and anti-entropy
> +	 * starvation in big clusters: lots of events occupy the
> +	 * whole UDP packet, and effectively the same packet is
> +	 * sent for quite a long time, with no randomness.
> +	 * Anti-entropy does not get a chance to disseminate
> +	 * something new. Bigger orders are redundant and harmful.
> +	 *
> +	 * Thirdly, the original SWIM paper proves the logarithm
> +	 * to be the best option.
> +	 */
> +	member->status_ttd = ceil(log2(mh_size(swim->members)));
>  	swim_cached_round_msg_invalidate(swim);
>  }
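To make the effect concrete, here is a standalone sketch of the new
arithmetic (my own toy program, not swim.c code - the loop variable
stands in for mh_size(swim->members)):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	/* Compare the old linear TTD with the new logarithmic one. */
	for (int size = 2; size <= 1024; size *= 2) {
		int linear_ttd = size;                      /* before the patch */
		int log_ttd = (int) ceil(log2(size));       /* after the patch */
		printf("members: %4d  linear ttd: %4d  log ttd: %2d\n",
		       size, linear_ttd, log_ttd);
	}
	return 0;
}

For the 100-node cluster from the commit message that is
ceil(log2(100)) = 7 disseminations per event instead of 100.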
> diff --git a/test/unit/swim.c b/test/unit/swim.c
> index 3486d3f73..d776a8231 100644
> --- a/test/unit/swim.c
> +++ b/test/unit/swim.c
> @@ -43,6 +43,7 @@
>  #include "swim_test_utils.h"
>  #include "trigger.h"
>  #include <fiber.h>
> +#include <math.h>
>
>  /**
>   * Test result is a real returned value of main_f. Fiber_join can
> @@ -1152,10 +1153,74 @@ swim_test_generation(void)
>  	swim_finish_test();
>  }
>
> +static void
> +swim_test_dissemination_speed(void)
> +{
> +	swim_start_test(2);
> +
> +	int size = 100;
> +	double ack_timeout = 0.1;
> +	struct swim_cluster *cluster = swim_cluster_new(size);
> +	swim_cluster_set_ack_timeout(cluster, ack_timeout);
> +	swim_cluster_set_gc(cluster, SWIM_GC_OFF);
> +	for (int i = 0; i < size; ++i) {
> +		for (int j = i + 1; j < size; ++j)
> +			swim_cluster_interconnect(cluster, i, j);
> +	}
> +	swim_cluster_set_drop(cluster, 0, 100);
> +	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> +						  MEMBER_DEAD, size) != 0);
> +	/*
> +	 * Not a trivial problem - at the start of a cluster there
> +	 * are so many events that they fully occupy a UDP packet.
> +	 * All these events are 'a new member is added'. Because of
> +	 * that, other, much more important events starve. In this
> +	 * concrete test the new event 'member is dead' starves. To
> +	 * beat that problem SWIM says that events should be
> +	 * disseminated for no more than O(log) round steps. Then
> +	 * all the events expire quite fast, and anti-entropy
> +	 * swiftly finishes the job. Usually this test passes in
> +	 * log * 2 or log * 3 steps. Here it is log * 6 to avoid
> +	 * flakiness in some extra rare and slow random cases,
> +	 * while still checking for O(log) speed.
> +	 */
> +	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> +					       log2(size) * 6), 0,
> +	   "dissemination works in log time even at the very start of a cluster");
> +	swim_cluster_set_drop(cluster, 0, 0);
> +	fail_if(swim_cluster_wait_status_everywhere(cluster, 0,
> +						    MEMBER_ALIVE, size) != 0);
> +	/*
> +	 * Another big-cluster case. Assume that something happened
> +	 * and all the members generated an event, for example,
> +	 * changed their payload. That creates a storm of events,
> +	 * among which important ones, such as a failure detection,
> +	 * can be lost. The only solution, again, is to make the
> +	 * events as short-living as possible, so as to free space
> +	 * in a UDP packet for other events and for anti-entropy
> +	 * faster. The test below proves that even during an event
> +	 * storm failure dissemination still works in O(log) time.
> +	 */
> +	swim_cluster_set_drop(cluster, 0, 100);
> +	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> +						  MEMBER_DEAD, size) != 0);
> +	for (int i = 0; i < size; ++i)
> +		swim_cluster_member_set_payload(cluster, i, "", 0);
> +	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> +					       log2(size) * 6), 0,
> +	   "dissemination can withstand an event storm");
> +
> +	swim_cluster_delete(cluster);
> +
> +	swim_finish_test();
> +}
> +
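The log * 6 bound leaves good headroom. A back-of-the-envelope model
of this storm, assuming the 26-events-per-packet figure from the
commit message and that a packet keeps carrying the same batch of
events until their TTD expires (both are assumptions of this toy
program, not claims about the real queue):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	int n = 100;                      /* cluster size in the test */
	int per_packet = 26;              /* events fitting one UDP packet */
	int ttd = (int) ceil(log2(n));    /* rounds each event survives */
	/*
	 * Each batch of per_packet events blocks the packet for at
	 * most ttd rounds before expiring, so a backlog of n events
	 * drains in about ttd * ceil(n / per_packet) steps, versus
	 * up to n steps with linear TTD.
	 */
	int steps = ttd * ((n + per_packet - 1) / per_packet);
	printf("storm of %d events drains in ~%d steps (linear: ~%d)\n",
	       n, steps, n);
	return 0;
}

That gives ~28 steps for 100 members, under the log2(100) * 6 ~= 40
steps the test allows and in the same ballpark as the ~20 steps
observed in the simulation.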
>  static int
>  main_f(va_list ap)
>  {
> -	swim_start_test(22);
> +	swim_start_test(23);
>
>  	(void) ap;
>  	swim_test_ev_init();
> @@ -1183,6 +1248,7 @@
>  	swim_test_slow_net();
>  	swim_test_triggers();
>  	swim_test_generation();
> +	swim_test_dissemination_speed();
>
>  	swim_test_transport_free();
>  	swim_test_ev_free();
> --
> 2.20.1 (Apple Git-117)

-- 
Konstantin Osipov, Moscow, Russia