Tarantool development patches archive
 help / color / mirror / Atom feed
* [tarantool-patches] [PATCH 0/2] SWIM big cluster improvements, part 1
@ 2019-06-29 23:45 Vladislav Shpilevoy
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 1/2] swim: prefer new events for dissemination over old Vladislav Shpilevoy
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 2/2] swim: disseminate event for log(cluster_size) steps Vladislav Shpilevoy
  0 siblings, 2 replies; 6+ messages in thread
From: Vladislav Shpilevoy @ 2019-06-29 23:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

This is the first patchset aimed at making SWIM more usable in big clusters.
Most of the problems arise from an incorrect interpretation of the original
SWIM paper and from the adaptation of its extensions.

The two commits improve the current version of dissemination without any
dramatic changes. The first one is quite trivial.

The second one is open for discussion. To understand the proposals below,
please read that commit first, and then return here.

Ok, assume you know the problem the second commit solves. There is another
option - to protect against spurious resurrections, treat the anti-entropy
section with suspicion and do not add new members from it as is. Instead, when
we see a new member in the anti-entropy section, send it a ping: just call the
probe_member() function on its URI, and don't add it to the member table yet.
If it is really alive, it will respond with an ACK, and then we add it to the
table.

Basically, we would add a new member to the table in two cases only: a user
called add_member(), or an ACK was received.

A problem is that this produces additional load on the member if it is really
alive - everyone will start pinging it. We could wait for a random timeout
before sending the ping, or send it with a certain probability, but I am not
sure that would be simpler than the current solution with a limbo queue.
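
To make the idea concrete, below is a minimal sketch of how a receiver could
treat an unknown member met in a received anti-entropy section. The wrapper
function and the 25% gate are illustrative assumptions; swim_member_by_uuid()
and swim_probe_member() (the probe_member() mentioned above) are the existing
public API:

	/* A sketch, not a patch. rand() is from stdlib.h. */
	static void
	swim_on_new_anti_entropy_member(struct swim *swim,
					const struct tt_uuid *uuid,
					const char *uri)
	{
		if (swim_member_by_uuid(swim, uuid) != NULL)
			return; /* Known member - update as usual. */
		/*
		 * An unknown member - don't add it to the member
		 * table yet. Ping it with some probability, so
		 * that not everyone pings it at once, and add it
		 * only when its ACK arrives.
		 */
		if (rand() % 4 == 0)
			swim_probe_member(swim, uri);
	}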

Besides, even with the solution above we can't protect against spurious
resurrections 100%. This is because of UDP - we can receive a very old ACK
message from a member which was deleted a long time ago, and we will add it
back as a new, alive member. So is it worth trying to ping each new member
before adding it? After all, in the most frequent case a new member really is
alive.

Note that the way we choose to prevent resurrections does not affect
performance - this is just a caveat.

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-4253-dissemination-speed
Issue: https://github.com/tarantool/tarantool/issues/4253

Vladislav Shpilevoy (2):
  swim: prefer new events for dissemination over old
  swim: disseminate event for log(cluster_size) steps

 src/lib/swim/swim.c   | 130 +++++++++++++++++++++++++++++++++++++++---
 test/unit/swim.c      |  72 ++++++++++++++++++++++-
 test/unit/swim.result |   8 ++-
 3 files changed, 199 insertions(+), 11 deletions(-)

-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tarantool-patches] [PATCH 1/2] swim: prefer new events for dissemination over old
  2019-06-29 23:45 [tarantool-patches] [PATCH 0/2] SWIM big cluster improvements, part 1 Vladislav Shpilevoy
@ 2019-06-29 23:45 ` Vladislav Shpilevoy
  2019-06-30 16:25   ` [tarantool-patches] " Vladislav Shpilevoy
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 2/2] swim: disseminate event for log(cluster_size) steps Vladislav Shpilevoy
  1 sibling, 1 reply; 6+ messages in thread
From: Vladislav Shpilevoy @ 2019-06-29 23:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

Before the patch old events were disseminated before new ones
until their TTD became 0. It was considered fair and not very
important. But after some experiments with a closed-source
version of SWIM it turned out that this was badly wrong.

Firstly, the original SWIM paper explicitly says that newer
events should be sent first. Secondly, that rule has a real
purpose. For example, when a big cluster is just started, there
is a huge number of events like 'a new member is added'. They
consume the whole UDP packet, and newer events like 'a member
is dead' starve. In fact, dissemination just doesn't work at
the start of a big cluster.

This patch makes SWIM prefer newer events for dissemination.
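
For example, if the dissemination queue already holds old 'a new
member is added' events A, B, C, and a fresh 'a member is dead'
event D is registered, then with tail insertion the round message
encoder sees A, B, C, D and may run out of packet space before
reaching D. With head insertion it sees D, A, B, C, so D gets
into the very next round message.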

Part of #4253
---
 src/lib/swim/swim.c   | 13 ++++++++++---
 test/unit/swim.c      | 44 ++++++++++++++++++++++++++++++++++++++++++-
 test/unit/swim.result |  7 ++++++-
 3 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
index bb9e9f519..8edf0495d 100644
--- a/src/lib/swim/swim.c
+++ b/src/lib/swim/swim.c
@@ -367,7 +367,7 @@ struct swim_member {
 	int payload_ttd;
 	/**
 	 * All created events are put into a queue sorted by event
-	 * time.
+	 * time in descending order.
 	 */
 	struct rlist in_dissemination_queue;
 	/**
@@ -608,13 +608,20 @@ swim_wait_ack(struct swim *swim, struct swim_member *member,
  * The status change itself occupies only 2 bytes in a packet, so
  * it is cheap to send it on any update, while does reduce
  * entropy.
+ *
+ * It is important to put new events at the head of the queue.
+ * Otherwise, if there are lots of old events, new ones starve.
+ * It is especially notable when a big cluster is just started, and
+ * the whole UDP packet is occupied by events like 'a new member
+ * is added'. More important events like 'a member is dead' won't
+ * get a chance to be disseminated.
  */
 static inline void
 swim_register_event(struct swim *swim, struct swim_member *member)
 {
 	if (rlist_empty(&member->in_dissemination_queue)) {
-		rlist_add_tail_entry(&swim->dissemination_queue, member,
-				     in_dissemination_queue);
+		rlist_add_entry(&swim->dissemination_queue, member,
+				in_dissemination_queue);
 	}
 	member->status_ttd = mh_size(swim->members);
 	swim_cached_round_msg_invalidate(swim);
diff --git a/test/unit/swim.c b/test/unit/swim.c
index 3486d3f73..07cc0f06f 100644
--- a/test/unit/swim.c
+++ b/test/unit/swim.c
@@ -43,6 +43,7 @@
 #include "swim_test_utils.h"
 #include "trigger.h"
 #include <fcntl.h>
+#include <math.h>
 
 /**
  * Test result is a real returned value of main_f. Fiber_join can
@@ -1152,10 +1153,50 @@ swim_test_generation(void)
 	swim_finish_test();
 }
 
+static void
+swim_test_dissemination_speed(void)
+{
+	swim_start_test(1);
+
+	int size = 100;
+	double ack_timeout = 0.1;
+	struct swim_cluster *cluster = swim_cluster_new(size);
+	swim_cluster_set_ack_timeout(cluster, ack_timeout);
+	swim_cluster_set_gc(cluster, SWIM_GC_OFF);
+	for (int i = 0; i < size; ++i) {
+		for (int j = i + 1; j < size; ++j)
+			swim_cluster_interconnect(cluster, i, j);
+	}
+	swim_cluster_set_drop(cluster, 0, 100);
+	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
+						  MEMBER_DEAD, size) != 0);
+	/*
+	 * Not a trivial problem: at the start of a cluster there
+	 * are so many events that they fully occupy a UDP packet.
+	 * All these events are 'a new member is added', and
+	 * because of them other, much more important events
+	 * starve. In this concrete test the new event 'member is
+	 * dead' starves. To beat that problem SWIM prefers newer
+	 * events. Then the number of old events does not matter
+	 * anymore, and new events disseminate in O(log) steps.
+	 * Usually this test finishes in log * 2 or log * 3 steps.
+	 * Here it is log * 5 to avoid flakiness in some extra
+	 * rare and slow random cases, but to still check for
+	 * O(log) speed.
+	 */
+	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
+					       log2(size) * 5), 0,
+	   "dissemination works in log time even at the very start of a cluster");
+
+	swim_cluster_delete(cluster);
+
+	swim_finish_test();
+}
+
 static int
 main_f(va_list ap)
 {
-	swim_start_test(22);
+	swim_start_test(23);
 
 	(void) ap;
 	swim_test_ev_init();
@@ -1183,6 +1224,7 @@ main_f(va_list ap)
 	swim_test_slow_net();
 	swim_test_triggers();
 	swim_test_generation();
+	swim_test_dissemination_speed();
 
 	swim_test_transport_free();
 	swim_test_ev_free();
diff --git a/test/unit/swim.result b/test/unit/swim.result
index 04a2778e6..83b6bca64 100644
--- a/test/unit/swim.result
+++ b/test/unit/swim.result
@@ -1,5 +1,5 @@
 	*** main_f ***
-1..22
+1..23
 	*** swim_test_one_link ***
     1..6
     ok 1 - no rounds - no fullmesh
@@ -237,4 +237,9 @@ ok 21 - subtests
     ok 3 - S2 sees new generation of S1
 ok 22 - subtests
 	*** swim_test_generation: done ***
+	*** swim_test_dissemination_speed ***
+    1..1
+    ok 1 - dissemination works in log time even at the very start of a cluster
+ok 23 - subtests
+	*** swim_test_dissemination_speed: done ***
 	*** main_f: done ***
-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tarantool-patches] [PATCH 2/2] swim: disseminate event for log(cluster_size) steps
  2019-06-29 23:45 [tarantool-patches] [PATCH 0/2] SWIM big cluster improvements, part 1 Vladislav Shpilevoy
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 1/2] swim: prefer new events for dissemination over old Vladislav Shpilevoy
@ 2019-06-29 23:45 ` Vladislav Shpilevoy
  2019-06-30  6:55   ` [tarantool-patches] " Konstantin Osipov
  1 sibling, 1 reply; 6+ messages in thread
From: Vladislav Shpilevoy @ 2019-06-29 23:45 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

Before the patch there was a problem of event and anti-entropy
starvation: a cluster can generate so many events that they
consume the whole UDP packet. If something important happens
during an event storm, that event is likely to be lost and not
disseminated until the storm is over.

Sadly, there is no way to prevent a storm, but it can be made
much shorter. For that the patch makes the TTD of events
logarithmic instead of linear in the cluster size.

According to the SWIM paper and to the experiments, a logarithm
is really enough. Linear TTD was a redundant overkill.

Shorter-living events do not solve the starvation problem
completely - some events can still be lost during a storm. But
a shorter TTD frees some space for anti-entropy, which can
finish the dissemination of the lost events.

Experiments in a simulation of a cluster with 100 nodes showed
that during a storm a failure was disseminated in ~110 steps -
effectively linear time, the worst case. After the patch it
takes ~20 steps, so it is logarithmic as it should be, although
with a bigger constant than without a storm.
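
To give the orders of magnitude: with 100 members the old TTD
was mh_size(swim->members) = 100 round steps per event, while
log2(100) ~= 6.64, i.e. ~7 steps after the patch. So each event
occupies packet space for roughly 15 times fewer steps.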

Part of #4253
---
 src/lib/swim/swim.c   | 117 ++++++++++++++++++++++++++++++++++++++++--
 test/unit/swim.c      |  30 +++++++++--
 test/unit/swim.result |   3 +-
 3 files changed, 142 insertions(+), 8 deletions(-)

diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
index 8edf0495d..936f6049a 100644
--- a/src/lib/swim/swim.c
+++ b/src/lib/swim/swim.c
@@ -327,6 +327,19 @@ struct swim_member {
 	 * allow other members to learn the dead status.
 	 */
 	int status_ttd;
+	/**
+	 * Time To Live. When a member is dead, it can't be
+	 * deleted from the member table immediately. Otherwise
+	 * there is a real possibility, that other members, not
+	 * there is a real possibility that other members, not
+	 * back via anti-entropy. TTL is a way to keep a dead
+	 * member in the member table until the whole cluster
+	 * knows about the death with a very high probability,
+	 * almost 100%. Such members are stored in a limbo queue
+	 * for TTL round steps, and deleted afterwards. The
+	 * same applies to 'left' members as well.
+	 */
+	int ttl;
 	/** Arbitrary user data, disseminated on each change. */
 	char *payload;
 	/** Payload size, in bytes. */
@@ -370,6 +383,13 @@ struct swim_member {
 	 * time in descending order.
 	 */
 	struct rlist in_dissemination_queue;
+	/**
+	 * Members sentenced for deletion. They are kept here in
+	 * order to wait until the whole cluster knows about their
+	 * deletion, and nobody will try to resurrect them via
+	 * anti-entropy.
+	 */
+	struct rlist in_limbo_queue;
 	/**
 	 * Each time a member is updated, or created, or dropped,
 	 * it is added to an event queue. Members from this queue
@@ -513,6 +533,17 @@ struct swim {
 	 * as long as the event TTD is non-zero.
 	 */
 	struct rlist dissemination_queue;
+	/**
+	 * Queue of members waiting for their deletion. The
+	 * members 'dead' and 'left' can't be deleted immediately
+	 * after their new status is discovered and the event
+	 * about it is disseminated, as otherwise they may be
+	 * resurrected back by those who didn't know about their
+	 * death, via anti-entropy. Here the members wait until
+	 * the whole cluster is likely to know about their death,
+	 * and only then are deleted.
+	 */
+	struct rlist limbo_queue;
 	/**
 	 * Queue of updated, new, and dropped members to deliver
 	 * the events to triggers. Dropped members are also kept
@@ -619,11 +650,36 @@ swim_wait_ack(struct swim *swim, struct swim_member *member,
 static inline void
 swim_register_event(struct swim *swim, struct swim_member *member)
 {
+	/*
+	 * If a member scheduled for deletion got an event, then
+	 * it is alive: somebody noticed an ack/ping from it, or
+	 * got its new payload, or anything else. Deletion of such
+	 * members shall be canceled.
+	 */
+	if (! rlist_empty(&member->in_limbo_queue))
+		rlist_del_entry(member, in_limbo_queue);
 	if (rlist_empty(&member->in_dissemination_queue)) {
 		rlist_add_entry(&swim->dissemination_queue, member,
 				in_dissemination_queue);
 	}
-	member->status_ttd = mh_size(swim->members);
+	/*
+	 * A logarithmic number of disseminations of an event is
+	 * perfect.
+	 *
+	 * Firstly, it matches the dissemination speed.
+	 *
+	 * Secondly, a bigger number of disseminations (linear,
+	 * for example) causes event and anti-entropy starvation
+	 * in big clusters, when lots of events occupy the whole
+	 * UDP packet, and effectively the same packet content is
+	 * sent for quite a long time. No randomness. Anti-entropy
+	 * does not get a chance to disseminate something new and
+	 * random. Higher orders are redundant and harmful.
+	 *
+	 * Thirdly, the original SWIM paper proves the logarithm
+	 * to be the best option.
+	 */
+	member->status_ttd = log2(mh_size(swim->members));
 	swim_cached_round_msg_invalidate(swim);
 }
 
@@ -805,6 +861,7 @@ swim_member_delete(struct swim_member *member)
 
 	/* Dissemination component. */
 	assert(rlist_empty(&member->in_dissemination_queue));
+	assert(rlist_empty(&member->in_limbo_queue));
 
 	swim_member_unref(member);
 }
@@ -837,10 +894,40 @@ swim_member_new(const struct sockaddr_in *addr, const struct tt_uuid *uuid,
 
 	/* Dissemination component. */
 	rlist_create(&member->in_dissemination_queue);
+	rlist_create(&member->in_limbo_queue);
 
 	return member;
 }
 
+/** Schedule member deletion after some number of steps. */
+static void
+swim_send_member_to_limbo(struct swim *swim, struct swim_member *member)
+{
+	assert(rlist_empty(&member->in_dissemination_queue));
+	assert(rlist_empty(&member->in_limbo_queue));
+	rlist_add_entry(&swim->limbo_queue, member, in_limbo_queue);
+	int size = mh_size(swim->members);
+	/*
+	 * A delay linear in the cluster size would be perfect in
+	 * terms of protection against sudden resurrection of dead
+	 * members via anti-entropy. Keeping GC off would be even
+	 * better. But these values would lead to a problem when
+	 * too many round steps are spent on sending pings to
+	 * already 100% dead members, and when they occupy too big
+	 * a part of the anti-entropy section. As a compromise,
+	 * the linear TTL is used for small clusters, and a value
+	 * of the same order as the event dissemination speed for
+	 * big clusters, but with a much bigger constant.
+	 *
+	 * According to simulations, in a cluster of size 250 a
+	 * death event is disseminated in ~13 steps, rarely in
+	 * ~15 steps, and extremely rarely in ~30-40 steps.
+	 * Log2(250) is ~8, so 10 * log2(N) covers all these
+	 * cases, even the super rare ones.
+	 */
+	member->ttl = MIN(size, 10 * log2(size));
+}
+
 /**
  * Remove the member from all queues, hashes, destroy it and free
  * the memory.
@@ -863,6 +950,7 @@ swim_delete_member(struct swim *swim, struct swim_member *member)
 	/* Dissemination component. */
 	swim_on_member_update(swim, member, SWIM_EV_DROP);
 	rlist_del_entry(member, in_dissemination_queue);
+	rlist_del_entry(member, in_limbo_queue);
 
 	swim_member_delete(member);
 }
@@ -1154,6 +1242,22 @@ swim_encode_round_msg(struct swim *swim)
 	swim->is_round_packet_valid = true;
 }
 
+/**
+ * Drop members whose TTL has expired. It should be done once
+ * after each round step.
+ */
+static void
+swim_decrease_member_ttl(struct swim *swim)
+{
+	struct swim_member *member, *tmp;
+	rlist_foreach_entry_safe(member, &swim->limbo_queue,
+				 in_limbo_queue, tmp) {
+		assert(member->ttl > 0);
+		if (--member->ttl == 0)
+			swim_delete_member(swim, member);
+	}
+}
+
 /**
  * Decrement TTDs of all events. It is done after each round step.
  * Note, since we decrement TTD of all events, even those which
@@ -1181,7 +1285,7 @@ swim_decrease_event_ttd(struct swim *swim)
 			rlist_del_entry(member, in_dissemination_queue);
 			swim_cached_round_msg_invalidate(swim);
 			if (member->status == MEMBER_LEFT)
-				swim_delete_member(swim, member);
+				swim_send_member_to_limbo(swim, member);
 		}
 	}
 }
@@ -1262,6 +1366,7 @@ swim_complete_step(struct swim_task *task,
 			 * sections.
 			 */
 			swim_wait_ack(swim, m, false);
+			swim_decrease_member_ttl(swim);
 			swim_decrease_event_ttd(swim);
 		}
 	}
@@ -1433,8 +1538,10 @@ swim_check_acks(struct ev_loop *loop, struct ev_timer *t, int events)
 			break;
 		case MEMBER_DEAD:
 			if (m->unacknowledged_pings >= NO_ACKS_TO_GC &&
-			    swim->gc_mode == SWIM_GC_ON && m->status_ttd == 0) {
-				swim_delete_member(swim, m);
+			    swim->gc_mode == SWIM_GC_ON &&
+			    m->status_ttd == 0 &&
+			    rlist_empty(&m->in_limbo_queue)) {
+				swim_send_member_to_limbo(swim, m);
 				continue;
 			}
 			break;
@@ -1892,6 +1999,7 @@ swim_new(uint64_t generation)
 
 	/* Dissemination component. */
 	rlist_create(&swim->dissemination_queue);
+	rlist_create(&swim->limbo_queue);
 	rlist_create(&swim->on_member_event);
 	stailq_create(&swim->event_queue);
 	swim->event_handler = fiber_new("SWIM event handler",
@@ -2205,6 +2313,7 @@ swim_delete(struct swim *swim)
 		if (! heap_node_is_stray(&m->in_wait_ack_heap))
 			wait_ack_heap_delete(&swim->wait_ack_heap, m);
 		rlist_del_entry(m, in_dissemination_queue);
+		rlist_del_entry(m, in_limbo_queue);
 		swim_member_delete(m);
 	}
 	/*
diff --git a/test/unit/swim.c b/test/unit/swim.c
index 07cc0f06f..6e537e95b 100644
--- a/test/unit/swim.c
+++ b/test/unit/swim.c
@@ -115,7 +115,7 @@ swim_test_uuid_update(void)
 	is(swim_cluster_update_uuid(cluster, 0, &new_uuid), 0, "UUID update");
 	is(swim_member_status(swim_member_by_uuid(s, &old_uuid)), MEMBER_LEFT,
 	   "old UUID is marked as 'left'");
-	swim_run_for(5);
+	swim_run_for(7);
 	is(swim_member_by_uuid(s, &old_uuid), NULL,
 	   "old UUID is dropped after a while");
 	ok(swim_cluster_is_fullmesh(cluster), "dropped everywhere");
@@ -1061,7 +1061,7 @@ swim_test_triggers(void)
 	is(tctx.ctx.events, SWIM_EV_NEW_STATUS, "status dead");
 
 	fail_if(swim_cluster_wait_status(cluster, 0, 1,
-					 swim_member_status_MAX, 2) != 0);
+					 swim_member_status_MAX, 4) != 0);
 	swim_cluster_run_triggers(cluster);
 	is(tctx.counter, 6, "drop fired a trigger");
 	is(tctx.ctx.events, SWIM_EV_DROP, "status dropped");
@@ -1156,7 +1156,7 @@ swim_test_generation(void)
 static void
 swim_test_dissemination_speed(void)
 {
-	swim_start_test(1);
+	swim_start_test(2);
 
 	int size = 100;
 	double ack_timeout = 0.1;
@@ -1187,6 +1187,30 @@ swim_test_dissemination_speed(void)
 	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
 					       log2(size) * 5), 0,
 	   "dissemination works in log time even at the very start of a cluster");
+	swim_cluster_set_drop(cluster, 0, 0);
+	fail_if(swim_cluster_wait_status_everywhere(cluster, 0,
+						    MEMBER_ALIVE, size) != 0);
+	/*
+	 * Another big-cluster case. Assume that something
+	 * happened and all the members generated an event - for
+	 * example, changed their payload. That creates a storm
+	 * of events, among which some important ones, such as a
+	 * failure detection, can be lost. Event age preferences
+	 * do not help here. The only solution is to make events
+	 * as short-lived as possible in order to free up space in
+	 * a UDP packet for other events and for anti-entropy
+	 * faster. The test below proves that even during an event
+	 * storm, failure dissemination still works in O(log)
+	 * time, although with a slightly bigger constant.
+	 */
+	swim_cluster_set_drop(cluster, 0, 100);
+	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
+						  MEMBER_DEAD, 5) != 0);
+	for (int i = 0; i < size; ++i)
+		swim_cluster_member_set_payload(cluster, i, "", 0);
+	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
+					       log2(size) * 6), 0,
+	   "dissemination can withstand an event storm");
 
 	swim_cluster_delete(cluster);
 
diff --git a/test/unit/swim.result b/test/unit/swim.result
index 83b6bca64..5f66200b0 100644
--- a/test/unit/swim.result
+++ b/test/unit/swim.result
@@ -238,8 +238,9 @@ ok 21 - subtests
 ok 22 - subtests
 	*** swim_test_generation: done ***
 	*** swim_test_dissemination_speed ***
-    1..1
+    1..2
     ok 1 - dissemination works in log time even at the very start of a cluster
+    ok 2 - dissemination can withstand an event storm
 ok 23 - subtests
 	*** swim_test_dissemination_speed: done ***
 	*** main_f: done ***
-- 
2.20.1 (Apple Git-117)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] swim: disseminate event for log(cluster_size) steps
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 2/2] swim: disseminate event for log(cluster_size) steps Vladislav Shpilevoy
@ 2019-06-30  6:55   ` Konstantin Osipov
  2019-06-30 16:24     ` Vladislav Shpilevoy
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Osipov @ 2019-06-30  6:55 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

* Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [19/06/30 09:04]:

> Before the patch there was a problem of event and anti-entropy
> starvation: a cluster can generate so many events that they
> consume the whole UDP packet. If something important happens
> during an event storm, that event is likely to be lost and not
> disseminated until the storm is over.
> 
> Sadly, there is no way to prevent a storm, but it can be made
> much shorter. For that the patch makes the TTD of events
> logarithmic instead of linear in the cluster size.
> 
> According to the SWIM paper and to the experiments, a logarithm
> is really enough. Linear TTD was a redundant overkill.
> 
> Shorter-living events do not solve the starvation problem
> completely - some events can still be lost during a storm. But
> a shorter TTD frees some space for anti-entropy, which can
> finish the dissemination of the lost events.
> 
> Experiments in a simulation of a cluster with 100 nodes showed
> that during a storm a failure was disseminated in ~110 steps -
> effectively linear time, the worst case. After the patch it
> takes ~20 steps, so it is logarithmic as it should be, although
> with a bigger constant than without a storm.

You say nothing in this commit message about the limbo queue. I
have serious doubts about your manipulations with it. The patch
needs to be split into pieces, each addressing its own problem
and having a test. Now I only see 1 test for so many changes.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tarantool-patches] Re: [PATCH 2/2] swim: disseminate event for log(cluster_size) steps
  2019-06-30  6:55   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-30 16:24     ` Vladislav Shpilevoy
  0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Shpilevoy @ 2019-06-30 16:24 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches



On 30/06/2019 08:55, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy@tarantool.org> [19/06/30 09:04]:
> 
>> Before the patch there was a problem of event and anti-entropy
>> starvation: a cluster can generate so many events that they
>> consume the whole UDP packet. If something important happens
>> during an event storm, that event is likely to be lost and not
>> disseminated until the storm is over.
>>
>> Sadly, there is no way to prevent a storm, but it can be made
>> much shorter. For that the patch makes the TTD of events
>> logarithmic instead of linear in the cluster size.
>>
>> According to the SWIM paper and to the experiments, a logarithm
>> is really enough. Linear TTD was a redundant overkill.
>>
>> Shorter-living events do not solve the starvation problem
>> completely - some events can still be lost during a storm. But
>> a shorter TTD frees some space for anti-entropy, which can
>> finish the dissemination of the lost events.
>>
>> Experiments in a simulation of a cluster with 100 nodes showed
>> that during a storm a failure was disseminated in ~110 steps -
>> effectively linear time, the worst case. After the patch it
>> takes ~20 steps, so it is logarithmic as it should be, although
>> with a bigger constant than without a storm.
> 
> You say nothing in this commit message about the limbo queue. I
> have serious doubts about your manipulations with it. The patch
> needs to be split into pieces, each addressing its own problem
> and having a test. Now I only see 1 test for so many changes.
> 

The limbo queue makes no sense before the logarithmic TTD. I can
add it in a separate commit, but then the first commit, with the
log TTD, will fail some tests, and they will only be fixed in the
second one, which adds the limbo queue.

If you don't like the limbo queue idea, then reply to my cover
letter in this thread, where I explain what alternatives to the
limbo queue exist.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [tarantool-patches] Re: [PATCH 1/2] swim: prefer new events for dissemination over old
  2019-06-29 23:45 ` [tarantool-patches] [PATCH 1/2] swim: prefer new events for dissemination over old Vladislav Shpilevoy
@ 2019-06-30 16:25   ` Vladislav Shpilevoy
  0 siblings, 0 replies; 6+ messages in thread
From: Vladislav Shpilevoy @ 2019-06-30 16:25 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

This commit is dropped. It was discussed verbally that it would
be better to implement a smarter priority queue for events.

On 30/06/2019 01:45, Vladislav Shpilevoy wrote:
> Before the patch old events were disseminated before new ones
> until their TTD became 0. It was considered fair and not very
> important. But after some experiments with a closed-source
> version of SWIM it turned out that this was badly wrong.
> 
> Firstly, the original SWIM paper explicitly says that newer
> events should be sent first. Secondly, that rule has a real
> purpose. For example, when a big cluster is just started, there
> is a huge number of events like 'a new member is added'. They
> consume the whole UDP packet, and newer events like 'a member
> is dead' starve. In fact, dissemination just doesn't work at
> the start of a big cluster.
> 
> This patch makes SWIM prefer newer events for dissemination.
> 
> Part of #4253
> ---
>  src/lib/swim/swim.c   | 13 ++++++++++---
>  test/unit/swim.c      | 44 ++++++++++++++++++++++++++++++++++++++++++-
>  test/unit/swim.result |  7 ++++++-
>  3 files changed, 59 insertions(+), 5 deletions(-)
> 
> diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
> index bb9e9f519..8edf0495d 100644
> --- a/src/lib/swim/swim.c
> +++ b/src/lib/swim/swim.c
> @@ -367,7 +367,7 @@ struct swim_member {
>  	int payload_ttd;
>  	/**
>  	 * All created events are put into a queue sorted by event
> -	 * time.
> +	 * time in descending order.
>  	 */
>  	struct rlist in_dissemination_queue;
>  	/**
> @@ -608,13 +608,20 @@ swim_wait_ack(struct swim *swim, struct swim_member *member,
>   * The status change itself occupies only 2 bytes in a packet, so
>   * it is cheap to send it on any update, while does reduce
>   * entropy.
> + *
> + * It is important to put new events at the head of the queue.
> + * Otherwise, if there are lots of old events, new ones starve.
> + * It is especially notable when a big cluster is just started, and
> + * the whole UDP packet is occupied by events like 'a new member
> + * is added'. More important events like 'a member is dead' won't
> + * get a chance to be disseminated.
>   */
>  static inline void
>  swim_register_event(struct swim *swim, struct swim_member *member)
>  {
>  	if (rlist_empty(&member->in_dissemination_queue)) {
> -		rlist_add_tail_entry(&swim->dissemination_queue, member,
> -				     in_dissemination_queue);
> +		rlist_add_entry(&swim->dissemination_queue, member,
> +				in_dissemination_queue);
>  	}
>  	member->status_ttd = mh_size(swim->members);
>  	swim_cached_round_msg_invalidate(swim);
> diff --git a/test/unit/swim.c b/test/unit/swim.c
> index 3486d3f73..07cc0f06f 100644
> --- a/test/unit/swim.c
> +++ b/test/unit/swim.c
> @@ -43,6 +43,7 @@
>  #include "swim_test_utils.h"
>  #include "trigger.h"
>  #include <fcntl.h>
> +#include <math.h>
>  
>  /**
>   * Test result is a real returned value of main_f. Fiber_join can
> @@ -1152,10 +1153,50 @@ swim_test_generation(void)
>  	swim_finish_test();
>  }
>  
> +static void
> +swim_test_dissemination_speed(void)
> +{
> +	swim_start_test(1);
> +
> +	int size = 100;
> +	double ack_timeout = 0.1;
> +	struct swim_cluster *cluster = swim_cluster_new(size);
> +	swim_cluster_set_ack_timeout(cluster, ack_timeout);
> +	swim_cluster_set_gc(cluster, SWIM_GC_OFF);
> +	for (int i = 0; i < size; ++i) {
> +		for (int j = i + 1; j < size; ++j)
> +			swim_cluster_interconnect(cluster, i, j);
> +	}
> +	swim_cluster_set_drop(cluster, 0, 100);
> +	fail_if(swim_cluster_wait_status_anywhere(cluster, 0,
> +						  MEMBER_DEAD, size) != 0);
> +	/*
> +	 * Not a trivial problem: at the start of a cluster there
> +	 * are so many events that they fully occupy a UDP packet.
> +	 * All these events are 'a new member is added', and
> +	 * because of them other, much more important events
> +	 * starve. In this concrete test the new event 'member is
> +	 * dead' starves. To beat that problem SWIM prefers newer
> +	 * events. Then the number of old events does not matter
> +	 * anymore, and new events disseminate in O(log) steps.
> +	 * Usually this test finishes in log * 2 or log * 3 steps.
> +	 * Here it is log * 5 to avoid flakiness in some extra
> +	 * rare and slow random cases, but to still check for
> +	 * O(log) speed.
> +	 */
> +	is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD,
> +					       log2(size) * 5), 0,
> +	   "dissemination works in log time even at the very start of a cluster");
> +
> +	swim_cluster_delete(cluster);
> +
> +	swim_finish_test();
> +}
> +
>  static int
>  main_f(va_list ap)
>  {
> -	swim_start_test(22);
> +	swim_start_test(23);
>  
>  	(void) ap;
>  	swim_test_ev_init();
> @@ -1183,6 +1224,7 @@ main_f(va_list ap)
>  	swim_test_slow_net();
>  	swim_test_triggers();
>  	swim_test_generation();
> +	swim_test_dissemination_speed();
>  
>  	swim_test_transport_free();
>  	swim_test_ev_free();
> diff --git a/test/unit/swim.result b/test/unit/swim.result
> index 04a2778e6..83b6bca64 100644
> --- a/test/unit/swim.result
> +++ b/test/unit/swim.result
> @@ -1,5 +1,5 @@
>  	*** main_f ***
> -1..22
> +1..23
>  	*** swim_test_one_link ***
>      1..6
>      ok 1 - no rounds - no fullmesh
> @@ -237,4 +237,9 @@ ok 21 - subtests
>      ok 3 - S2 sees new generation of S1
>  ok 22 - subtests
>  	*** swim_test_generation: done ***
> +	*** swim_test_dissemination_speed ***
> +    1..1
> +    ok 1 - dissemination works in log time even at the very start of a cluster
> +ok 23 - subtests
> +	*** swim_test_dissemination_speed: done ***
>  	*** main_f: done ***
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-06-30 16:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-29 23:45 [tarantool-patches] [PATCH 0/2] SWIM big cluster improvements, part 1 Vladislav Shpilevoy
2019-06-29 23:45 ` [tarantool-patches] [PATCH 1/2] swim: prefer new events for dissemination over old Vladislav Shpilevoy
2019-06-30 16:25   ` [tarantool-patches] " Vladislav Shpilevoy
2019-06-29 23:45 ` [tarantool-patches] [PATCH 2/2] swim: disseminate event for log(cluster_size) steps Vladislav Shpilevoy
2019-06-30  6:55   ` [tarantool-patches] " Konstantin Osipov
2019-06-30 16:24     ` Vladislav Shpilevoy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox