From: Vladislav Shpilevoy
Subject: [tarantool-patches] [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps
Date: Sun, 30 Jun 2019 22:01:55 +0200
To: tarantool-patches@freelists.org
Cc: kostja@tarantool.org

Before the patch there was a problem of event and anti-entropy
starvation: a cluster can generate so many events that they consume
the whole UDP packet. A packet fits up to 26 events. If something
important happens during such an event storm, that event is likely
to be lost and not disseminated until the storm is over.

Sadly, there is no way to prevent a storm, but it can be made much
shorter. For that the patch makes the TTD of events logarithmic in
cluster size instead of linear. According to the SWIM paper and to
experiments, the logarithm is really enough. Linear TTD was
redundant overkill.
Shorter event lifetime does not by itself solve the starvation
problem: some events can still be lost during a storm. But it frees
space for anti-entropy, which can finish disseminating the lost
events. Experiments in a simulation of a 100-node cluster showed
that failure dissemination took ~110 steps during a storm, which is
basically no dissemination at all. After the patch it takes ~20
steps. So it is logarithmic, as it should be, although with a
bigger constant than without a storm.

However, the patch can't be pushed as is. Several tests are broken
now, because due to the short TTD 'dead' members are dropped from
member tables too early. That creates a possibility that another
node resurrects them via anti-entropy.

Part of #4253
---
 src/lib/swim/swim.c | 19 ++++++++++++-
 test/unit/swim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 85 insertions(+), 2 deletions(-)

diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
index bb9e9f519..9647094f5 100644
--- a/src/lib/swim/swim.c
+++ b/src/lib/swim/swim.c
@@ -616,7 +616,24 @@ swim_register_event(struct swim *swim, struct swim_member *member)
 		rlist_add_tail_entry(&swim->dissemination_queue, member,
 				     in_dissemination_queue);
 	}
-	member->status_ttd = mh_size(swim->members);
+	/*
+	 * Logarithm is a perfect number of disseminations of an
+	 * event.
+	 *
+	 * Firstly, it matches the dissemination speed.
+	 *
+	 * Secondly, a bigger number of disseminations (for
+	 * example, linear) causes event and anti-entropy
+	 * starvation in big clusters, when lots of events occupy
+	 * the whole UDP packet, and factually the same packet
+	 * content is being sent for quite a long time. No
+	 * randomness. Anti-entropy does not get a chance to
+	 * disseminate something new and random. Bigger orders are
+	 * redundant and harmful.
+	 *
+	 * Thirdly, logarithm is proved by the original SWIM paper
+	 * as the best option.
+	 */
+	member->status_ttd = ceil(log2(mh_size(swim->members)));
 	swim_cached_round_msg_invalidate(swim);
 }
 
diff --git a/test/unit/swim.c b/test/unit/swim.c
index 3486d3f73..d776a8231 100644
--- a/test/unit/swim.c
+++ b/test/unit/swim.c
@@ -43,6 +43,7 @@
 #include "swim_test_utils.h"
 #include "trigger.h"
 #include
+#include
 
 /**
  * Test result is a real returned value of main_f. Fiber_join can
@@ -1152,10 +1153,74 @@ swim_test_generation(void)
 	swim_finish_test();
 }
 
+static void
+swim_test_dissemination_speed(void)
+{
+	swim_start_test(2);
+
+	int size = 100;
+	double ack_timeout = 0.1;
+	struct swim_cluster *cluster = swim_cluster_new(size);
+	swim_cluster_set_ack_timeout(cluster, ack_timeout);
+	swim_cluster_set_gc(cluster, SWIM_GC_OFF);
+	for (int i = 0; i < size; ++i) {
+		for (int j = i + 1; j < size; ++j)
+			swim_cluster_interconnect(cluster, i, j);
+	}
+	swim_cluster_set_drop(cluster, 0, 100);
+	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
+						  MEMBER_DEAD, size) != 0);
+	/*
+	 * Not a trivial problem - at the start of a cluster there
+	 * are so many events that they fully occupy a UDP packet.
+	 * All these events are 'a new member is added'. Because of
+	 * that, other much more important events starve. In this
+	 * concrete test a new event 'member is dead' starves. To
+	 * beat that problem SWIM says that events should be
+	 * disseminated for no longer than O(log) round steps. In
+	 * such a case all the events expire quite fast, and
+	 * anti-entropy swiftly finishes the job. Usually this test
+	 * works in log * 2, log * 3 steps. Here it is log * 6 to
+	 * avoid flakiness in some extra rare and slow random
+	 * cases, but to still check for O(log) speed.
+	 */
+	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
+					       log2(size) * 6), 0,
+	   "dissemination works in log time even at the very start of a cluster");
+	swim_cluster_set_drop(cluster, 0, 0);
+	fail_if(swim_cluster_wait_status_everywhere(cluster, 0,
+						    MEMBER_ALIVE, size) != 0);
+	/*
+	 * Another big-cluster case. Assume that something happened
+	 * and all the members generated an event. For example,
+	 * changed their payload. It creates a storm of events,
+	 * among which some important ones can be lost, such as a
+	 * failure detection. The only solution again is to make
+	 * the events as short-living as possible in order to free
+	 * space in a UDP packet faster for other events and for
+	 * anti-entropy. The test below proves that even when there
+	 * is an event storm, failure dissemination still works in
+	 * O(log) time.
+	 */
+	swim_cluster_set_drop(cluster, 0, 100);
+	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
+						  MEMBER_DEAD, size) != 0);
+	for (int i = 0; i < size; ++i)
+		swim_cluster_member_set_payload(cluster, i, "", 0);
+	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
+					       log2(size) * 6), 0,
+	   "dissemination can withstand an event storm");
+
+	swim_cluster_delete(cluster);
+
+	swim_finish_test();
+}
+
 static int
 main_f(va_list ap)
 {
-	swim_start_test(22);
+	swim_start_test(23);
 
 	(void) ap;
 	swim_test_ev_init();
@@ -1183,6 +1248,7 @@ main_f(va_list ap)
 	swim_test_slow_net();
 	swim_test_triggers();
 	swim_test_generation();
+	swim_test_dissemination_speed();
 
 	swim_test_transport_free();
 	swim_test_ev_free();
-- 
2.20.1 (Apple Git-117)