From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id CE2BB6ECE3; Fri, 21 Jan 2022 02:02:06 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org CE2BB6ECE3 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1642719726; bh=ky4LiwsapoKRJGqBw/HGalvesrNZ/gA76grnwtqrX10=; h=Date:To:References:In-Reply-To:Subject:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=L/1W0bCizYScuZUBkZomVZN1U423qgMtlu+Ce0C5D6FLqdNtgfapfr+mBI0CS+DYS 1BK4dMbtQhoSp1w/ju01rOziiwYns3Gqa1lCduJ8Nv0RyZDA6pmnGa8FS6dDix9CO5 tHeMKwFG7tSmNyd66m8Bc6NIWgDbDrSOcfI/z2Fk= Received: from smtpng1.i.mail.ru (smtpng1.i.mail.ru [94.100.181.251]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 765726ECE3 for ; Fri, 21 Jan 2022 02:02:04 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 765726ECE3 Received: by smtpng1.m.smailru.net with esmtpa (envelope-from ) id 1nAgRP-0007gf-DN; Fri, 21 Jan 2022 02:02:04 +0300 Message-ID: <71e416b7-3532-897d-a49b-0dcb925a88f6@tarantool.org> Date: Fri, 21 Jan 2022 00:02:02 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.5.0 Content-Language: en-US To: Serge Petrenko , tarantool-patches@dev.tarantool.org References: <72798befd5d6e32f4386aaeeb3209cc93c0e44b4.1642639079.git.v.shpilevoy@tarantool.org> <33c8217d-c839-1b1e-7595-db44c640eeac@tarantool.org> In-Reply-To: <33c8217d-c839-1b1e-7595-db44c640eeac@tarantool.org> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-4EC0790: 10 X-7564579A: B8F34718100C35BD X-77F55803: 4F1203BC0FB41BD9AA78FDF62ECAE61F7D831E2382AA7A6C0481E357CE3D3DF1182A05F5380850402122F9C78A5EEE25ECF5EC6F71309FC9ADDE12B0D64CD5050A4CDEB1E1E56819 X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE70760526B31D64A36EA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637E31772672C957B658638F802B75D45FF36EB9D2243A4F8B5A6FCA7DBDB1FC311F39EFFDF887939037866D6147AF826D8792BD731AA06F75EE5B4F84B2C743DEB117882F4460429724CE54428C33FAD305F5C1EE8F4F765FCAA867293B0326636D2E47CDBA5A96583BD4B6F7A4D31EC0BC014FD901B82EE079FA2833FD35BB23D27C277FBC8AE2E8B8C7ADC89C2F0B2A5A471835C12D1D977C4224003CC8364762BB6847A3DEAEFB0F43C7A68FF6260569E8FC8737B5C2249EC8D19AE6D49635B68655334FD4449CB9ECD01F8117BC8BEAAAE862A0553A39223F8577A6DFFEA7CFA80D66F452D417A43847C11F186F3C59DAA53EE0834AAEE X-B7AD71C0: 1B70FBA5C9BEEE72C9761FC34675ADEB871C96603B655635EE9D5CB6078CC77C8EB096CD86619708EFFFE7C7C1A70394 X-C1DE0DAB: 0D63561A33F958A5D0130609B572F83A038826F8533A2630A6A3965E9DC1B388D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA75B7BFB303F1C7DB4D8E8E86DC7131B365E7726E8460B7C23C X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D3430FC2BC1B917F01878859BC95197F47BE859E84ABE8F7A510F59BD4B3C1E8AE57AAD0C40380F9AE31D7E09C32AA3244CFCBE74967C1F3F772D862641DC8C119169B6CAE0477E908D729B2BEF169E0186 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2biojeO2NaHb0kQvNSw8jGw2XLQ== X-Mailru-Sender: 689FA8AB762F739339CABD9B3CA9A7D67FF7FE1734A6125496EBFD63B29BFBAC3841015FED1DE5223CC9A89AB576DD93FB559BB5D741EB963CF37A108A312F5C27E8A8C3839CE0E25FEEDEB644C299C0ED14614B50AE0675 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH v2 4/5] raft: introduce split vote detection X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Vladislav Shpilevoy via Tarantool-patches Reply-To: Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" >> +static void >> +raft_check_split_vote(struct raft *raft) >> +{ >> +    /* When leader is known, there is no election. Thus no vote to split. */ >> +    if (raft->leader != 0) >> +        return; >> +    /* Not a candidate = can't trigger term bump anyway. */ >> +    if (!raft->is_candidate) >> +        return; >> +    /* >> +     * WAL write in progress means the state is changing. All is rechecked >> +     * when it is done. >> +     */ >> +    if (raft->is_write_in_progress) >> +        return; >> +    if (!raft_has_split_vote(raft)) >> +        return; >> +    assert(raft_ev_is_active(&raft->timer)); >> +    /* >> +     * Could be already detected before. The timeout would be updated by now >> +     * then. >> +     */ >> +    if (raft->timer.repeat < raft->election_timeout) >> +        return; > > I don't think you should decrease timer.repeat. > This 'vote speedup' is for a single term only. > > Besides the check below about delay >= remaining is enough > to test if split vote detection was already triggered. I update timer.repeat as a flag that the split vote was already seen during this term. If I drop this check, each next vote after the first detection of split vote will lead to potential timeout decrease depending on random. I wanted to decrease it only once - when split vote is seen first time. Even managed to add a test for it. The alternative was to store a flag 'has_split_vote' somewhere in struct raft, but I didn't want to add it. >> + >> +    assert(raft->state == RAFT_STATE_FOLLOWER || >> +           raft->state == RAFT_STATE_CANDIDATE); >> +    struct ev_loop *loop = raft_loop(); >> +    struct ev_timer *timer = &raft->timer; >> +    double delay = raft_new_random_election_shift(raft); >> +    /* >> +     * Could be too late to speed up anything - probably the term is almost >> +     * over anyway. >> +     */ >> +    double remaining = raft_ev_timer_remaining(loop, timer); >> +    if (delay >= remaining) >> +        delay = remaining; >> +    say_info("RAFT: split vote is discovered - %s, new term in %lf sec", >> +         raft_scores_str(raft), delay); >> +    raft_ev_timer_stop(loop, timer); >> +    raft_ev_timer_set(timer, delay, delay); > > > ... >> diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h >> index c68dc3b22..2138a829e 100644 >> --- a/test/unit/raft_test_utils.h >> +++ b/test/unit/raft_test_utils.h >> @@ -32,6 +32,7 @@ >>   #include "fakesys/fakeev.h" >>   #include "fiber.h" >>   #include "raft/raft.h" >> +#include "raft/raft_ev.h" > > Why do you need it here? The tests now use raft_ev_is_active() which is defined in this header. I decided to add it here instead of unit/raft.c because the idea of raft_test_utils.h is, among other things, to do all the boilerplate work such as necessary inclusions. New version of the commit after the comment from v1 about new raft members: ==================== raft: introduce split vote detection Split vote is a situation when during election nobody can win and will not win in this term for sure. Not a single node could get enough votes. For example, with 4 nodes one could get 2 votes and other also 2 votes. Nobody will get quorum 3 in this term. The patch makes raft able to notice that situation and speed up the term bump. It is not bumped immediately though, because nodes might do that simultaneously and will get the split vote again after voting for self. There is a random delay. But it is just max 10% of election timeout, so it should speed up split vote resolution on 90% at least. Part of #5285 diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c index 8182e73e3..90ed01ca4 100644 --- a/src/lib/raft/raft.c +++ b/src/lib/raft/raft.c @@ -152,20 +152,76 @@ raft_can_vote_for(const struct raft *raft, const struct vclock *v) return cmp == 0 || cmp == 1; } -static inline void +static inline bool raft_add_vote(struct raft *raft, int src, int dst) { struct raft_vote *v = &raft->votes[src]; if (v->did_vote) - return; + return false; v->did_vote = true; - ++raft->votes[dst].count; + ++raft->voted_count; + int count = ++raft->votes[dst].count; + if (count > raft->max_vote) + raft->max_vote = count; + return true; +} + +static bool +raft_has_split_vote(const struct raft *raft) +{ + int vote_vac = raft->cluster_size; + int quorum = raft->election_quorum; + /* + * Quorum > cluster is either a misconfiguration or some instances + * didn't register yet. Anyway, speeding the elections up won't help. + * The same when more nodes voted than there are nodes configured. + */ + if (vote_vac < quorum) + return false; + vote_vac -= raft->voted_count; + if (vote_vac < 0) + return false; + return raft->max_vote + vote_vac < quorum; +} + +static int +raft_scores_snprintf(const struct raft *raft, char *buf, int size) +{ + int total = 0; + bool is_empty = true; + SNPRINT(total, snprintf, buf, size, "{"); + for (int i = 0; i < VCLOCK_MAX; ++i) { + int count = raft->votes[i].count; + if (count == 0) + continue; + if (!is_empty) + SNPRINT(total, snprintf, buf, size, ", "); + else + is_empty = false; + SNPRINT(total, snprintf, buf, size, "%d: %d", i, count); + } + SNPRINT(total, snprintf, buf, size, "}"); + return total; +} + +static const char * +raft_scores_str(const struct raft *raft) +{ + char *buf = tt_static_buf(); + int rc = raft_scores_snprintf(raft, buf, TT_STATIC_BUF_LEN); + assert(rc >= 0); + (void)rc; + return buf; } /** Schedule broadcast of the complete Raft state to all the followers. */ static void raft_schedule_broadcast(struct raft *raft); +/** If there is split vote, the node might reduce the next term delay. */ +static void +raft_check_split_vote(struct raft *raft); + /** Raft state machine methods. 'sm' stands for State Machine. */ /** @@ -351,7 +407,8 @@ raft_process_msg(struct raft *raft, const struct raft_msg *req, uint32_t source) * persisted long time ago and still broadcasted. Or a vote response. */ if (req->vote != 0) { - raft_add_vote(raft, source, req->vote); + if (raft_add_vote(raft, source, req->vote)) + raft_check_split_vote(raft); switch (raft->state) { case RAFT_STATE_FOLLOWER: @@ -679,6 +736,8 @@ raft_sm_schedule_new_term(struct raft *raft, uint64_t new_term) raft->leader = 0; raft->state = RAFT_STATE_FOLLOWER; memset(raft->votes, 0, sizeof(raft->votes)); + raft->voted_count = 0; + raft->max_vote = 0; /* * The instance could be promoted for the previous term. But promotion * has no effect on following terms. @@ -771,6 +830,11 @@ raft_sm_wait_election_end(struct raft *raft) raft_new_random_election_shift(raft); raft_ev_timer_set(&raft->timer, election_timeout, election_timeout); raft_ev_timer_start(raft_loop(), &raft->timer); + /* + * Could start the waiting after a WAL write during which the split vote + * could happen. + */ + raft_check_split_vote(raft); } static void @@ -977,6 +1041,8 @@ raft_cfg_election_quorum(struct raft *raft, int election_quorum) if (raft->state == RAFT_STATE_CANDIDATE && raft_vote_count(raft) >= raft->election_quorum) raft_sm_become_leader(raft); + else + raft_check_split_vote(raft); } void @@ -1024,6 +1090,13 @@ raft_cfg_vclock(struct raft *raft, const struct vclock *vclock) raft->vclock = vclock; } +void +raft_cfg_cluster_size(struct raft *raft, int size) +{ + raft->cluster_size = size; + raft_check_split_vote(raft); +} + void raft_new_term(struct raft *raft) { @@ -1048,6 +1121,50 @@ raft_schedule_broadcast(struct raft *raft) raft_schedule_async(raft); } +static void +raft_check_split_vote(struct raft *raft) +{ + /* When leader is known, there is no election. Thus no vote to split. */ + if (raft->leader != 0) + return; + /* Not a candidate = can't trigger term bump anyway. */ + if (!raft->is_candidate) + return; + /* + * WAL write in progress means the state is changing. All is rechecked + * when it is done. + */ + if (raft->is_write_in_progress) + return; + if (!raft_has_split_vote(raft)) + return; + assert(raft_ev_is_active(&raft->timer)); + /* + * Could be already detected before. The timeout would be updated by now + * then. + */ + if (raft->timer.repeat < raft->election_timeout) + return; + + assert(raft->state == RAFT_STATE_FOLLOWER || + raft->state == RAFT_STATE_CANDIDATE); + struct ev_loop *loop = raft_loop(); + struct ev_timer *timer = &raft->timer; + double delay = raft_new_random_election_shift(raft); + /* + * Could be too late to speed up anything - probably the term is almost + * over anyway. + */ + double remaining = raft_ev_timer_remaining(loop, timer); + if (delay >= remaining) + delay = remaining; + say_info("RAFT: split vote is discovered - %s, new term in %lf sec", + raft_scores_str(raft), delay); + raft_ev_timer_stop(loop, timer); + raft_ev_timer_set(timer, delay, delay); + raft_ev_timer_start(loop, timer); +} + void raft_create(struct raft *raft, const struct raft_vtab *vtab) { @@ -1058,6 +1175,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab) .election_quorum = 1, .election_timeout = 5, .death_timeout = 5, + .cluster_size = VCLOCK_MAX, .vtab = vtab, }; raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h index 1d064bef0..05e373254 100644 --- a/src/lib/raft/raft.h +++ b/src/lib/raft/raft.h @@ -199,6 +199,10 @@ struct raft { uint32_t vote; /** Statistics which node voted for who. */ struct raft_vote votes[VCLOCK_MAX]; + /** How many nodes voted in the current term. */ + int voted_count; + /** Max vote count given to any node in the current term. */ + int max_vote; /** Number of votes necessary for successful election. */ int election_quorum; /** @@ -219,6 +223,8 @@ struct raft { * elections can be started. */ double death_timeout; + /** Number of instances registered in the cluster. */ + int cluster_size; /** Virtual table to perform application-specific actions. */ const struct raft_vtab *vtab; /** @@ -333,6 +339,10 @@ raft_cfg_instance_id(struct raft *raft, uint32_t instance_id); void raft_cfg_vclock(struct raft *raft, const struct vclock *vclock); +/** Configure number of registered instances in the cluster. */ +void +raft_cfg_cluster_size(struct raft *raft, int size); + /** * Bump the term. When it is persisted, the node checks if there is a leader, * and if there is not, a new election is started. That said, this function can diff --git a/test/unit/raft.c b/test/unit/raft.c index 9f18f4d8e..2fe6354cd 100644 --- a/test/unit/raft.c +++ b/test/unit/raft.c @@ -1497,10 +1497,308 @@ raft_test_promote_restore(void) raft_finish_test(); } +static void +raft_test_split_vote(void) +{ + raft_start_test(58); + struct raft_node node; + raft_node_create(&node); + + /* + * Normal split vote. + */ + raft_node_cfg_cluster_size(&node, 4); + raft_node_cfg_election_quorum(&node, 3); + raft_run_next_event(); + + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 1}" /* Vclock. */ + ), "elections with a new term"); + + /* Make so node 1 has votes 1 and 2. Node 3 has votes 3 and 4. */ + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 1 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 1 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "term " + "timeout >= election timeout normally"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 4 /* Source. */ + ), 0, "vote response for 3 from 4"); + + ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote " + "reduced the term timeout"); + + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "a new term"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "timeout is " + "normal again"); + + /* + * Cluster size change can make split vote. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + raft_run_next_event(); + is(node.raft.state, RAFT_STATE_CANDIDATE, "is candidate"); + is(node.raft.vote, 1, "voted for self"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + ok(node.raft.timer.repeat >= node.raft.election_timeout, "the vote is " + "not split yet"); + + raft_node_cfg_cluster_size(&node, 2); + ok(node.raft.timer.repeat < node.raft.election_timeout, "cluster size " + "change makes split vote"); + + raft_run_next_event(); + ok(raft_node_check_full_state(&node, + RAFT_STATE_CANDIDATE /* State. */, + 0 /* Leader. */, + 3 /* Term. */, + 1 /* Vote. */, + 3 /* Volatile term. */, + 1 /* Volatile vote. */, + "{0: 2}" /* Vclock. */ + ), "a new term"); + + /* + * Split vote can be even when the leader is already known - then + * nothing to do. Votes are just left from the beginning of the term and + * then probably cluster size reduced a bit. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + /* There is also a vote from 3 for 2. But it wasn't delivered to 1. */ + is(raft_node_send_leader(&node, + 2 /* Term. */, + 2 /* Source. */ + ), 0, "message is accepted"); + is(node.raft.leader, 2, "other node's leadership is accepted"); + is(raft_vote_count(&node.raft), 1, "the only own vote was from self"); + + raft_node_cfg_cluster_size(&node, 2); + ok(node.raft.timer.repeat >= node.raft.death_timeout, "cluster change " + "does not affect the leader's death waiting"); + + /* + * Non-candidate should ignore split vote. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 3); + raft_node_cfg_is_candidate(&node, false); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + ok(!raft_ev_is_active(&node.raft.timer), "voter couldn't schedule " + "new term"); + + /* + * Split vote can get worse, but it shouldn't lead to new term delay + * restart. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 3); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + double delay = node.raft.timer.repeat; + ok(delay < node.raft.election_timeout, "split vote is already " + "inevitable"); + + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 3 from 3"); + + is(delay, node.raft.timer.repeat, "split vote got worse, but delay " + "didn't change"); + + /* + * Handle split vote when WAL write is in progress. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 2); + raft_node_cfg_election_quorum(&node, 2); + + raft_node_block(&node); + raft_run_next_event(); + is(node.raft.term, 1, "old term"); + is(node.raft.vote, 0, "unused vote"); + is(node.raft.volatile_term, 2, "new volatile term"); + is(node.raft.volatile_vote, 1, "new volatile vote"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + raft_node_unblock(&node); + is(node.raft.term, 2, "new term"); + is(node.raft.vote, 1, "voted for self"); + is(node.raft.volatile_term, 2, "volatile term"); + is(node.raft.volatile_vote, 1, "volatile vote"); + ok(node.raft.timer.repeat < node.raft.election_timeout, "found split " + "vote after WAL write"); + + raft_run_next_event(); + is(node.raft.term, 3, "bump term"); + is(node.raft.vote, 1, "vote for self"); + + /* + * Split vote check is disabled when cluster size < quorum. Makes no + * sense to speed the elections up. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 1); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " + "is not checked for cluster < quorum"); + + /* + * Split vote check is disabled when vote count > cluster size. The + * reason is the same as with quorum > cluster size - something is odd, + * more term bumps won't help. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 3 /* Source. */ + ), 0, "vote response for 2 from 3"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 3 /* Vote. */, + 4 /* Source. */ + ), 0, "vote response for 3 from 4"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 4 /* Vote. */, + 5 /* Source. */ + ), 0, "vote response for 4 from 5"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "split vote " + "is not checked when vote count > cluster size"); + + /* + * Split vote can happen if quorum was suddenly increased. + */ + raft_node_destroy(&node); + raft_node_create(&node); + raft_node_cfg_cluster_size(&node, 3); + raft_node_cfg_election_quorum(&node, 2); + + raft_run_next_event(); + is(node.raft.term, 2, "bump term"); + is(node.raft.vote, 1, "vote for self"); + is(raft_node_send_vote_response(&node, + 2 /* Term. */, + 2 /* Vote. */, + 2 /* Source. */ + ), 0, "vote response for 2 from 2"); + + ok(node.raft.timer.repeat >= node.raft.election_timeout, "not split " + "vote yet"); + + raft_node_cfg_election_quorum(&node, 3); + ok(node.raft.timer.repeat < node.raft.election_timeout, "split vote " + "after quorum increase"); + + raft_run_next_event(); + is(node.raft.term, 3, "bump term"); + is(node.raft.vote, 1, "vote for self"); + + raft_node_destroy(&node); + raft_finish_test(); +} + static int main_f(va_list ap) { - raft_start_test(14); + raft_start_test(15); (void) ap; fakeev_init(); @@ -1519,6 +1817,7 @@ main_f(va_list ap) raft_test_enable_disable(); raft_test_too_long_wal_write(); raft_test_promote_restore(); + raft_test_split_vote(); fakeev_free(); diff --git a/test/unit/raft.result b/test/unit/raft.result index 3fcdf8180..cfc9be733 100644 --- a/test/unit/raft.result +++ b/test/unit/raft.result @@ -1,5 +1,5 @@ *** main_f *** -1..14 +1..15 *** raft_test_leader_election *** 1..24 ok 1 - 1 pending message at start @@ -262,4 +262,66 @@ ok 13 - subtests ok 12 - not a candidate ok 14 - subtests *** raft_test_promote_restore: done *** + *** raft_test_split_vote *** + 1..58 + ok 1 - elections with a new term + ok 2 - vote response for 1 from 2 + ok 3 - vote response for 3 from 3 + ok 4 - term timeout >= election timeout normally + ok 5 - vote response for 3 from 4 + ok 6 - split vote reduced the term timeout + ok 7 - a new term + ok 8 - timeout is normal again + ok 9 - is candidate + ok 10 - voted for self + ok 11 - vote response for 2 from 2 + ok 12 - the vote is not split yet + ok 13 - cluster size change makes split vote + ok 14 - a new term + ok 15 - vote response for 2 from 2 + ok 16 - message is accepted + ok 17 - other node's leadership is accepted + ok 18 - the only own vote was from self + ok 19 - cluster change does not affect the leader's death waiting + ok 20 - vote response for 2 from 2 + ok 21 - vote response for 3 from 3 + ok 22 - voter couldn't schedule new term + ok 23 - bump term + ok 24 - vote for self + ok 25 - vote response for 2 from 2 + ok 26 - split vote is already inevitable + ok 27 - vote response for 3 from 3 + ok 28 - split vote got worse, but delay didn't change + ok 29 - old term + ok 30 - unused vote + ok 31 - new volatile term + ok 32 - new volatile vote + ok 33 - vote response for 2 from 2 + ok 34 - new term + ok 35 - voted for self + ok 36 - volatile term + ok 37 - volatile vote + ok 38 - found split vote after WAL write + ok 39 - bump term + ok 40 - vote for self + ok 41 - bump term + ok 42 - vote for self + ok 43 - vote response for 2 from 2 + ok 44 - split vote is not checked for cluster < quorum + ok 45 - bump term + ok 46 - vote for self + ok 47 - vote response for 2 from 2 + ok 48 - vote response for 2 from 3 + ok 49 - vote response for 3 from 4 + ok 50 - vote response for 4 from 5 + ok 51 - split vote is not checked when vote count > cluster size + ok 52 - bump term + ok 53 - vote for self + ok 54 - vote response for 2 from 2 + ok 55 - not split vote yet + ok 56 - split vote after quorum increase + ok 57 - bump term + ok 58 - vote for self +ok 15 - subtests + *** raft_test_split_vote: done *** *** main_f: done *** diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c index b5754cf78..1d7dd01ba 100644 --- a/test/unit/raft_test_utils.c +++ b/test/unit/raft_test_utils.c @@ -194,6 +194,7 @@ raft_node_create(struct raft_node *node) node->cfg_election_quorum = 3; node->cfg_death_timeout = 5; node->cfg_instance_id = 1; + node->cfg_cluster_size = VCLOCK_MAX; node->cfg_vclock = &node->journal.vclock; raft_journal_create(&node->journal, node->cfg_instance_id); raft_node_start(node); @@ -365,6 +366,7 @@ raft_node_start(struct raft_node *node) raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum); raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout); raft_cfg_instance_id(&node->raft, node->cfg_instance_id); + raft_cfg_cluster_size(&node->raft, node->cfg_cluster_size); raft_cfg_vclock(&node->raft, node->cfg_vclock); raft_run_async_work(); } @@ -423,6 +425,16 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value) } } +void +raft_node_cfg_cluster_size(struct raft_node *node, int value) +{ + node->cfg_cluster_size = value; + if (raft_node_is_started(node)) { + raft_cfg_cluster_size(&node->raft, value); + raft_run_async_work(); + } +} + void raft_node_cfg_election_timeout(struct raft_node *node, double value) { diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h index c68dc3b22..2138a829e 100644 --- a/test/unit/raft_test_utils.h +++ b/test/unit/raft_test_utils.h @@ -32,6 +32,7 @@ #include "fakesys/fakeev.h" #include "fiber.h" #include "raft/raft.h" +#include "raft/raft_ev.h" #include "unit.h" /** WAL simulation. It stores a list of rows which raft wanted to persist. */ @@ -105,6 +106,7 @@ struct raft_node { int cfg_election_quorum; double cfg_death_timeout; uint32_t cfg_instance_id; + int cfg_cluster_size; struct vclock *cfg_vclock; }; @@ -227,6 +229,9 @@ raft_node_cfg_is_enabled(struct raft_node *node, bool value); void raft_node_cfg_is_candidate(struct raft_node *node, bool value); +void +raft_node_cfg_cluster_size(struct raft_node *node, int value); + void raft_node_cfg_election_timeout(struct raft_node *node, double value);