From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from localhost (localhost [127.0.0.1]) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTP id DAC9E211DB for ; Sun, 30 Jun 2019 12:25:55 -0400 (EDT) Received: from turing.freelists.org ([127.0.0.1]) by localhost (turing.freelists.org [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rsdtjkoIxara for ; Sun, 30 Jun 2019 12:25:55 -0400 (EDT) Received: from smtp32.i.mail.ru (smtp32.i.mail.ru [94.100.177.92]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by turing.freelists.org (Avenir Technologies Mail Multiplex) with ESMTPS id 9943F21101 for ; Sun, 30 Jun 2019 12:25:55 -0400 (EDT) Subject: [tarantool-patches] Re: [PATCH 1/2] swim: prefer new events for dissemination over old From: Vladislav Shpilevoy References: <677a123b62745d661c1695ce49cfa88a501eed44.1561851087.git.v.shpilevoy@tarantool.org> Message-ID: <97708c04-d92a-f15f-8a91-dcdfe110e5f0@tarantool.org> Date: Sun, 30 Jun 2019 18:25:01 +0200 MIME-Version: 1.0 In-Reply-To: <677a123b62745d661c1695ce49cfa88a501eed44.1561851087.git.v.shpilevoy@tarantool.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: tarantool-patches-bounce@freelists.org Errors-to: tarantool-patches-bounce@freelists.org Reply-To: tarantool-patches@freelists.org List-Help: List-Unsubscribe: List-software: Ecartis version 1.0.0 List-Id: tarantool-patches List-Subscribe: List-Owner: List-post: List-Archive: To: tarantool-patches@freelists.org Cc: kostja@tarantool.org This commit is dropped. It was discussed verbally, that it would be better to implement more smart priority queue for events. On 30/06/2019 01:45, Vladislav Shpilevoy wrote: > Before the patch old events were disseminated before new ones > until their TTD becomes 0. It was considered fair, and not much > important. But after some experiments with closed-source version > of SWIM it was found, that I was wrong as never. > > Firstly, SWIM in the original paper explicitly says, that newer > events should be sent first. Secondly, it really has a goal. For > example, when a big cluster is just started, there is huge number > of events like 'a new member is added'. They consume the whole > UDP packet, and newer events like 'a member is dead' starve. In > fact, dissemination just doesn't work at start of a big cluster. > > This patch make SWIM prefer newer events for dissemination. > > Part of #4253 > --- > src/lib/swim/swim.c | 13 ++++++++++--- > test/unit/swim.c | 44 ++++++++++++++++++++++++++++++++++++++++++- > test/unit/swim.result | 7 ++++++- > 3 files changed, 59 insertions(+), 5 deletions(-) > > diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c > index bb9e9f519..8edf0495d 100644 > --- a/src/lib/swim/swim.c > +++ b/src/lib/swim/swim.c > @@ -367,7 +367,7 @@ struct swim_member { > int payload_ttd; > /** > * All created events are put into a queue sorted by event > - * time. > + * time in descending order. > */ > struct rlist in_dissemination_queue; > /** > @@ -608,13 +608,20 @@ swim_wait_ack(struct swim *swim, struct swim_member *member, > * The status change itself occupies only 2 bytes in a packet, so > * it is cheap to send it on any update, while does reduce > * entropy. > + * > + * It is important to put new events in the head of the queue. > + * Otherwise if there are lots of old events, new ones starve. It > + * is especially notable when a big cluster is just started, and > + * the whole UDP packet is occupied by events like 'a new member > + * is added'. More important events like 'a member is dead' won't > + * get a chance to be disseminated. > */ > static inline void > swim_register_event(struct swim *swim, struct swim_member *member) > { > if (rlist_empty(&member->in_dissemination_queue)) { > - rlist_add_tail_entry(&swim->dissemination_queue, member, > - in_dissemination_queue); > + rlist_add_entry(&swim->dissemination_queue, member, > + in_dissemination_queue); > } > member->status_ttd = mh_size(swim->members); > swim_cached_round_msg_invalidate(swim); > diff --git a/test/unit/swim.c b/test/unit/swim.c > index 3486d3f73..07cc0f06f 100644 > --- a/test/unit/swim.c > +++ b/test/unit/swim.c > @@ -43,6 +43,7 @@ > #include "swim_test_utils.h" > #include "trigger.h" > #include > +#include > > /** > * Test result is a real returned value of main_f. Fiber_join can > @@ -1152,10 +1153,50 @@ swim_test_generation(void) > swim_finish_test(); > } > > +static void > +swim_test_dissemination_speed(void) > +{ > + swim_start_test(1); > + > + int size = 100; > + double ack_timeout = 0.1; > + struct swim_cluster *cluster = swim_cluster_new(size); > + swim_cluster_set_ack_timeout(cluster, ack_timeout); > + swim_cluster_set_gc(cluster, SWIM_GC_OFF); > + for (int i = 0; i < size; ++i) { > + for (int j = i + 1; j < size; ++j) > + swim_cluster_interconnect(cluster, i, j); > + } > + swim_cluster_set_drop(cluster, 0, 100); > + fail_if(swim_cluster_wait_status_anywhere(cluster, 0, > + MEMBER_DEAD, size) != 0); > + /* > + * Not a trivial problem - at start of a cluster there are > + * so many events, that they occupy a UDP packet fully. > + * All these events are 'a new member is added'. And > + * because of that other much more important events > + * starve. In this concrete test a new event 'member is > + * dead' starves. To beat that problem SWIM says that new > + * events should be preferred. In such a case number of > + * old events does not matter anymore, and new events > + * disseminate in log. Usually this test works in log * 2, > + * log * 3 steps. Here it is log * 5 to avoid flakiness in > + * some extra rare and slow random cases, but to still > + * check for O(log) speed. > + */ > + is(swim_cluster_wait_status_everywhere(cluster, 0, MEMBER_DEAD, > + log2(size) * 5), 0, > + "dissemination work in log time even at the very start of a cluster"); > + > + swim_cluster_delete(cluster); > + > + swim_finish_test(); > +} > + > static int > main_f(va_list ap) > { > - swim_start_test(22); > + swim_start_test(23); > > (void) ap; > swim_test_ev_init(); > @@ -1183,6 +1224,7 @@ main_f(va_list ap) > swim_test_slow_net(); > swim_test_triggers(); > swim_test_generation(); > + swim_test_dissemination_speed(); > > swim_test_transport_free(); > swim_test_ev_free(); > diff --git a/test/unit/swim.result b/test/unit/swim.result > index 04a2778e6..83b6bca64 100644 > --- a/test/unit/swim.result > +++ b/test/unit/swim.result > @@ -1,5 +1,5 @@ > *** main_f *** > -1..22 > +1..23 > *** swim_test_one_link *** > 1..6 > ok 1 - no rounds - no fullmesh > @@ -237,4 +237,9 @@ ok 21 - subtests > ok 3 - S2 sees new generation of S1 > ok 22 - subtests > *** swim_test_generation: done *** > + *** swim_test_dissemination_speed *** > + 1..1 > + ok 1 - dissemination work in log time even at the very start of a cluster > +ok 23 - subtests > + *** swim_test_dissemination_speed: done *** > *** main_f: done *** >