[tarantool-patches] Re: [PATCH v2 1/2] swim: disseminate event for log(cluster_size) steps

Konstantin Osipov kostja at tarantool.org
Wed Jul 3 22:06:45 MSK 2019


* Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [19/07/01 10:04]:

lgtm

> Before the patch there was a problem of events and anti-entropy
> starvation, when a cluster generates so many events, that they
> consume the whole UDP packet. A packet fits up to 26 events. If
> during the event storm something important happens, that event is
> likely to be lost, and not disseminated until the storm is over.
> 
> Sadly, there is no way to prevent a storm, but it can be made
> much shorter. For that the patch makes TTD of events logarithmic
> instead of linear of cluster size.
> 
> According to the SWIM paper and to experiments the logarithm is
> really enough. Linear TTD was a redundant overkill.
> 
> When events live shorter, it does not solve a problem of the
> events starvation - still some of them can be lost in case of a
> storm. But it frees some space for anti-entropy, which can finish
> dissemination of lost events.
> 
> Experiments in a simulation of a cluster with 100 nodes showed,
> that a failure dissemination happened in ~110 steps if there is
> a storm. Basically, no dissemination at all.
> 
> After the patch it is ~20 steps. So it is logarithmic as it
> should be, although with a bigger constant than without a storm.
> 
> However the patch can't be pushed as is. Several tests are broken
> now, because due to short TTD 'dead' members are dropped from
> member tables too early. It creates a possibility, that another
> node resurrects them back via anti-entropy.
> 
> Part of #4253
> ---
>  src/lib/swim/swim.c | 19 ++++++++++++-
>  test/unit/swim.c    | 68 ++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 85 insertions(+), 2 deletions(-)
> 
> diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
> index bb9e9f519..9647094f5 100644
> --- a/src/lib/swim/swim.c
> +++ b/src/lib/swim/swim.c
> @@ -616,7 +616,24 @@ swim_register_event(struct swim *swim, struct swim_member *member)
>  		rlist_add_tail_entry(&swim->dissemination_queue, member,
>  				     in_dissemination_queue);
>  	}
> -	member->status_ttd = mh_size(swim->members);
> +	/*
> +	 * Logarithm is a perfect number of disseminations of an
> +	 * event.
> +	 *
> +	 * Firstly, it matches the dissemination speed.
> +	 *
> +	 * Secondly, bigger number of disseminations (for example,
> +	 * linear) causes events and anti-entropy starvation in
> +	 * big clusters, when lots of events occupy the whole UDP
> +	 * packet, and factually the same packet content is being
> +	 * sent for quite a long time. No randomness. Anti-entropy
> +	 * does not get a chance to disseminate something new and
> +	 * random. Bigger orders are redundant and harmful.
> +	 *
> +	 * Thirdly, logarithm is proved by the original
> +	 * SWIM paper as the best option.
> +	 */
> +	member->status_ttd = ceil(log2(mh_size(swim->members)));
>  	swim_cached_round_msg_invalidate(swim);
>  }
>  
> diff --git a/test/unit/swim.c b/test/unit/swim.c
> index 3486d3f73..d776a8231 100644
> --- a/test/unit/swim.c
> +++ b/test/unit/swim.c
> @@ -43,6 +43,7 @@
>  #include "swim_test_utils.h"
>  #include "trigger.h"
>  #include <fcntl.h>
> +#include <math.h>
>  
>  /**
>   * Test result is a real returned value of main_f. Fiber_join can
> @@ -1152,10 +1153,74 @@ swim_test_generation(void)
>  	swim_finish_test();
>  }
>  
> +static void
> +swim_test_dissemination_speed(void)
> +{
> +	swim_start_test(2);
> +
> +	int size = 100;
> +	double ack_timeout = 0.1;
> +	struct swim_cluster *cluster = swim_cluster_new(size);
> +	swim_cluster_set_ack_timeout(cluster, ack_timeout);
> +	swim_cluster_set_gc(cluster, SWIM_GC_OFF);
> +	for (int i = 0; i < size; ++i) {
> +		for (int j = i + 1; j < size; ++j)
> +			swim_cluster_interconnect(cluster, i, j);
> +	}
> +	swim_cluster_set_drop(cluster, 0, 100);
> +	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> +						  MEMBER_DEAD, size) != 0);
> +	/*
> +	 * Not a trivial problem - at start of a cluster there are
> +	 * so many events, that they occupy a UDP packet fully.
> +	 * All these events are 'a new member is added'. And
> +	 * because of that other much more important events
> +	 * starve. In this concrete test a new event 'member is
> +	 * dead' starves. To beat that problem SWIM says that
> +	 * events should be disseminated not longer than for a
> +	 * O(log) round steps. In such a case all the events are
> +	 * expired quite fast, and anti-entropy swiftly finishes
> +	 * the job. Usually this test works in log * 2, log * 3
> +	 * steps. Here it is log * 6 to avoid flakiness in some
> +	 * extra rare and slow random cases, but to still check
> +	 * for O(log) speed.
> +	 */
> +	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> +					       log2(size) * 6), 0,
> +	   "dissemination work in log time even at the very start of a cluster");
> +	swim_cluster_set_drop(cluster, 0, 0);
> +	fail_if(swim_cluster_wait_status_everywhere(cluster, 0,
> +						    MEMBER_ALIVE, size) != 0);
> +	/*
> +	 * Another big-cluster case. Assume, that something
> +	 * happened and all the members generated an event. For
> +	 * example, changed their payload. It creates a storm of
> +	 * events, among which some important ones can be lost.
> +	 * Such as a failure detection. The only solution again -
> +	 * make the events as short living as possible in order to
> +	 * faster free space in a UDP packet for other events and
> +	 * for anti-entropy. The test below proves that even when
> +	 * there is an event storm, failure dissemination still
> +	 * works for O(log) time.
> +	 */
> +	swim_cluster_set_drop(cluster, 0, 100);
> +	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> +						  MEMBER_DEAD, size) != 0);
> +	for (int i = 0; i < size; ++i)
> +		swim_cluster_member_set_payload(cluster, i, "", 0);
> +	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> +					       log2(size) * 6), 0,
> +	   "dissemination can withstand an event storm");
> +
> +	swim_cluster_delete(cluster);
> +
> +	swim_finish_test();
> +}
> +
>  static int
>  main_f(va_list ap)
>  {
> -	swim_start_test(22);
> +	swim_start_test(23);
>  
>  	(void) ap;
>  	swim_test_ev_init();
> @@ -1183,6 +1248,7 @@ main_f(va_list ap)
>  	swim_test_slow_net();
>  	swim_test_triggers();
>  	swim_test_generation();
> +	swim_test_dissemination_speed();
>  
>  	swim_test_transport_free();
>  	swim_test_ev_free();
> -- 
> 2.20.1 (Apple Git-117)
> 

-- 
Konstantin Osipov, Moscow, Russia




More information about the Tarantool-patches mailing list