From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 0B0BB6EC5F; Sun, 18 Apr 2021 11:49:59 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 0B0BB6EC5F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1618735799; bh=LSotkKhfSEggJ9JL682hId/+ZWfnfSz2s5uq/YfxDNU=; h=To:Cc:References:Date:In-Reply-To:Subject:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe: From:Reply-To:From; b=xPnfM+/oK/sWHcbkEw9YLSmlAcegZHzeleROSC21NGv8Bkd6Wa3VMCQOkkYEyjYWf IJUbyPKySQnoVgkT7xQpEKYMhOJiRd1NCmwMJnyoDwHmGraR6RlLSysGkW/Lg4J3nd RoIbAFRyn8ag4xQiANrUgcKf/mdyWY1IgaELCJAY= Received: from smtp31.i.mail.ru (smtp31.i.mail.ru [94.100.177.91]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id EDC5B6EC5F for ; Sun, 18 Apr 2021 11:49:56 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org EDC5B6EC5F Received: by smtp31.i.mail.ru with esmtpa (envelope-from ) id 1lY37s-0004CQ-6v; Sun, 18 Apr 2021 11:49:56 +0300 To: Vladislav Shpilevoy , gorcunov@gmail.com Cc: tarantool-patches@dev.tarantool.org References: Message-ID: <76899832-2a8a-c47b-9e40-32310d9e9e38@tarantool.org> Date: Sun, 18 Apr 2021 11:49:55 +0300 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.9.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Content-Language: en-GB X-7564579A: 646B95376F6C166E X-77F55803: 4F1203BC0FB41BD92FFCB8E6708E7480257C85EA0BB7A95D0F00AE41BB9A5343182A05F5380850407243598DBE437E5695C5B038F8A996167B50D36238C55BEFAA943C57B6A3407B X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE71D16D29A965DB9EEEA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F79006373D332FFE8BBF4EB58638F802B75D45FF914D58D5BE9E6BC1A93B80C6DEB9DEE97C6FB206A91F05B22129550FB5DEE1344FCBD069AAF481568CCF7685675CC1A7D2E47CDBA5A96583C09775C1D3CA48CFCA5A41EBD8A3A0199FA2833FD35BB23D2EF20D2F80756B5F868A13BD56FB6657A471835C12D1D977725E5C173C3A84C317B107DEF921CE79117882F4460429728AD0CFFFB425014E868A13BD56FB6657D81D268191BDAD3DC09775C1D3CA48CF7C6E5A447658E3FFBA3038C0950A5D36C8A9BA7A39EFB766EC990983EF5C0329BA3038C0950A5D36D5E8D9A59859A8B6D4DCF96145C64DAE76E601842F6C81A1F004C906525384307823802FF610243D2EB15956EA79C166A417C69337E82CC275ECD9A6C639B01B78DA827A17800CE7D151390FFDBF6399731C566533BA786AA5CC5B56E945C8DA X-C1DE0DAB: 0D63561A33F958A504784B1822FD41A8A39BC030BA3291E4DD7982F0A6668005D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA7502E6951B79FF9A3F410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D347D10A9FCB2A62DFED5B9FBE8AF408089D1A5F023A2D08080B4ED3F154D465EC668DFE4214C192BA01D7E09C32AA3244C068CB2D69DC4C02E1274E1C934F0EB3160759606DA2E136AFACE5A9C96DEB163 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2bioj1t4H7vLuVFXGrVLQPYmG8A== X-Mailru-Sender: 583F1D7ACE8F49BDD2846D59FC20E9F8C624ED8E63334BDC78AD4E7F416A42F646A66B0BC572A284424AE0EB1F3D1D21E2978F233C3FAE6EE63DB1732555E4A8EE80603BA4A5B0BC112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Serge Petrenko via Tarantool-patches Reply-To: Serge Petrenko Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" 17.04.2021 01:21, Vladislav Shpilevoy пишет: > I appreciate the work you did here! > >> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h >> index e447f6634..a5f7e08d9 100644 >> --- a/src/lib/raft/raft.h >> +++ b/src/lib/raft/raft.h >> @@ -207,6 +207,19 @@ struct raft { >> * subsystems, such as Raft. >> */ >> const struct vclock *vclock; >> + /** >> + * The biggest term seen by this instance and persisted in WAL as part >> + * of a PROMOTE request. May be smaller than @a term, while there are >> + * ongoing elections, or the leader is already known, but this instance >> + * hasn't read its PROMOTE request yet. >> + * During other times must be equal to @a term. >> + */ >> + uint64_t greatest_term; >> + /** >> + * Latest terms received with PROMOTE entries from remote instances. >> + * Raft uses them to determine data from which sources may be applied. >> + */ >> + struct vclock term_map; > I am sorry for not noticing this first time, but I realized the > names are still not perfect - they give an impression the terms are > collected on any term bump. But they are only for promotions. So > they should probably be greatest_promote_term, and promote_term_map. > > Another issue I see after that rename - they depend on something not > related to raft. Raft does write PROMOTEs. You can see that these > 2 members are not used in raft code at all. Only in the limbo and > box. On the other hand, they don't remove terms dependency from the > limbo, because they are part of PROMOTE, which is part of the limbo. > > That means, we introduced an explicit dependency on raft in the > limbo just to store some numbers in struct raft. > > Maybe move these 2 members to the limbo? They have nothing to do with > the leader election as we can see, and our lib/raft is only about that. > > They are for filtering once the leader is elected already, which is > synchronous replication's job, and which in turn is the limbo. > > This also makes us closer to the idea I mentioned about lsn map > and promote term map merged into something new inside of the limbo. > > I tried to deal with that idea myself, and it resulted into a commit > I pushed on top of your branch, and pasted below. > > I made so the limbo does not depend on raft anymore (on its API). It > only uses term numbers. Box is the link between raft and limbo - it > passes the raft terms to the new promote entries in box.ctl.promote(). > > If you agree, please, squash. Otherwise lets discuss. I didn't delete > the unit test about this new map yet, only commented it out. You would > need to drop it if squash. I see what you mean. It was hard to see that these methods belong to txn_limbo, at first. But I agree with you now. Your version looks good. Thanks for the help! Squashed. > ==================== > diff --git a/src/box/applier.cc b/src/box/applier.cc > index 61d53fdec..b0e8fbba7 100644 > --- a/src/box/applier.cc > +++ b/src/box/applier.cc > @@ -967,6 +967,59 @@ apply_final_join_tx(struct stailq *rows) > return rc; > } > > +/* > + * When elections are enabled we must filter out synchronous rows coming > + * from an instance that fell behind the current leader. This includes > + * both synchronous tx rows and rows for txs following unconfirmed > + * synchronous transactions. > + * The rows are replaced with NOPs to preserve the vclock consistency. > + */ > +static void > +applier_synchro_filter_tx(struct applier *applier, struct stailq *rows) > +{ > + /* > + * XXX: in case raft is disabled, synchronous replication still works > + * but without any filtering. That might lead to issues with > + * unpredictable confirms after rollbacks which are supposed to be > + * fixed by the filtering. > + */ > + if (!raft_is_enabled(box_raft())) > + return; > + if (!txn_limbo_is_replica_outdated(&txn_limbo, applier->instance_id)) > + return; > + > + struct xrow_header *row; > + row = &stailq_last_entry(rows, struct applier_tx_row, next)->row; > + if (row->wait_sync) > + goto nopify; > + > + row = &stailq_first_entry(rows, struct applier_tx_row, next)->row; > + /* > + * Not waiting for sync and not a synchro request - this make it already > + * NOP or an asynchronous transaction not depending on any synchronous > + * ones - let it go as is. > + */ > + if (!iproto_type_is_synchro_request(row->type)) > + return; > + /* > + * Do not NOPify promotion, otherwise won't even know who is the limbo > + * owner now. > + */ > + if (iproto_type_is_promote_request(row->type)) > + return; > +nopify:; > + struct applier_tx_row *item; > + stailq_foreach_entry(item, rows, next) { > + row = &item->row; > + row->type = IPROTO_NOP; > + /* > + * Row body is saved to fiber's region and will be freed > + * on next fiber_gc() call. > + */ > + row->bodycnt = 0; > + } > +} > + > /** > * Apply all rows in the rows queue as a single transaction. > * > @@ -1026,29 +1079,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows) > } > } > } > - > - /* > - * When elections are enabled we must filter out synchronous rows coming > - * from an instance that fell behind the current leader. This includes > - * both synchronous tx rows and rows for txs following unconfirmed > - * synchronous transactions. > - * The rows are replaced with NOPs to preserve the vclock consistency. > - */ > - struct applier_tx_row *item; > - if (raft_is_node_outdated(box_raft(), applier->instance_id) && > - (last_row->wait_sync || > - (iproto_type_is_synchro_request(first_row->type) && > - !iproto_type_is_promote_request(first_row->type)))) { > - stailq_foreach_entry(item, rows, next) { > - struct xrow_header *row = &item->row; > - row->type = IPROTO_NOP; > - /* > - * Row body is saved to fiber's region and will be freed > - * on next fiber_gc() call. > - */ > - row->bodycnt = 0; > - } > - } > + applier_synchro_filter_tx(applier, rows); > if (unlikely(iproto_type_is_synchro_request(first_row->type))) { > /* > * Synchro messages are not transactions, in terms > diff --git a/src/box/box.cc b/src/box/box.cc > index 70cb2bd53..cc68f0168 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -1516,10 +1516,12 @@ box_promote(void) > > /* > * Do nothing when box isn't configured and when PROMOTE was already > - * written for this term. > + * written for this term (synchronous replication and leader election > + * are in sync, and both chose this node as a leader). > */ > - if (!is_box_configured || > - raft_node_term(box_raft(), instance_id) == box_raft()->term) > + if (!is_box_configured) > + return 0; > + if (txn_limbo_replica_term(&txn_limbo, instance_id) == box_raft()->term) > return 0; > > bool run_elections = false; > diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c > index 0726b5a04..bafb47aaa 100644 > --- a/src/box/txn_limbo.c > +++ b/src/box/txn_limbo.c > @@ -34,7 +34,6 @@ > #include "iproto_constants.h" > #include "journal.h" > #include "box.h" > -#include "raft.h" > > struct txn_limbo txn_limbo; > > @@ -46,6 +45,8 @@ txn_limbo_create(struct txn_limbo *limbo) > limbo->owner_id = REPLICA_ID_NIL; > fiber_cond_create(&limbo->wait_cond); > vclock_create(&limbo->vclock); > + vclock_create(&limbo->promote_term_map); > + limbo->promote_greatest_term = 0; > limbo->confirmed_lsn = 0; > limbo->rollback_count = 0; > limbo->is_in_rollback = false; > @@ -644,8 +645,13 @@ complete: > void > txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req) > { > - /* It's ok to process an empty term. It'll just get ignored. */ > - raft_process_term(box_raft(), req->origin_id, req->term); > + uint64_t term = req->term; > + uint32_t origin = req->origin_id; > + if (txn_limbo_replica_term(limbo, origin) < term) { > + vclock_follow(&limbo->promote_term_map, origin, term); > + if (term > limbo->promote_greatest_term) > + limbo->promote_greatest_term = term; > + } > if (req->replica_id != limbo->owner_id) { > /* > * Ignore CONFIRM/ROLLBACK messages for a foreign master. > diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h > index f35771dc9..e409ac657 100644 > --- a/src/box/txn_limbo.h > +++ b/src/box/txn_limbo.h > @@ -129,6 +129,24 @@ struct txn_limbo { > * transactions, created on the limbo's owner node. > */ > struct vclock vclock; > + /** > + * Latest terms received with PROMOTE entries from remote instances. > + * Limbo uses them to filter out the transactions coming not from the > + * limbo owner, but so outdated that they are rolled back everywhere > + * except outdated nodes. > + */ > + struct vclock promote_term_map; > + /** > + * The biggest PROMOTE term seen by the instance and persisted in WAL. > + * It is related to raft term, but not the same. Synchronous replication > + * represented by the limbo is interested only in the won elections > + * ended with PROMOTE request. > + * It means the limbo's term might be smaller than the raft term, while > + * there are ongoing elections, or the leader is already known and this > + * instance hasn't read its PROMOTE request yet. During other times the > + * limbo and raft are in sync and the terms are the same. > + */ > + uint64_t promote_greatest_term; > /** > * Maximal LSN gathered quorum and either already confirmed in WAL, or > * whose confirmation is in progress right now. Any attempt to confirm > @@ -193,6 +211,28 @@ txn_limbo_last_entry(struct txn_limbo *limbo) > in_queue); > } > > +/** > + * Return the latest term as seen in PROMOTE requests from instance with id > + * @a replica_id. > + */ > +static inline uint64_t > +txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id) > +{ > + return vclock_get(&limbo->promote_term_map, replica_id); > +} > + > +/** > + * Check whether replica with id @a source_id is too old to apply synchronous > + * data from it. The check is only valid when elections are enabled. > + */ > +static inline bool > +txn_limbo_is_replica_outdated(const struct txn_limbo *limbo, > + uint32_t replica_id) > +{ > + return txn_limbo_replica_term(limbo, replica_id) < > + limbo->promote_greatest_term; > +} > + > /** > * Return the last synchronous transaction in the limbo or NULL when it is > * empty. > diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c > index b21693642..874e9157e 100644 > --- a/src/lib/raft/raft.c > +++ b/src/lib/raft/raft.c > @@ -1012,7 +1012,6 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab) > .death_timeout = 5, > .vtab = vtab, > }; > - vclock_create(&raft->term_map); > raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb, > 0, 0); > raft->timer.data = raft; > diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h > index 69dec63c6..f7bc205d2 100644 > --- a/src/lib/raft/raft.h > +++ b/src/lib/raft/raft.h > @@ -207,19 +207,6 @@ struct raft { > * subsystems, such as Raft. > */ > const struct vclock *vclock; > - /** > - * The biggest term seen by this instance and persisted in WAL as part > - * of a PROMOTE request. May be smaller than @a term, while there are > - * ongoing elections, or the leader is already known, but this instance > - * hasn't read its PROMOTE request yet. > - * During other times must be equal to @a term. > - */ > - uint64_t greatest_term; > - /** > - * Latest terms received with PROMOTE entries from remote instances. > - * Raft uses them to determine data from which sources may be applied. > - */ > - struct vclock term_map; > /** State machine timed event trigger. */ > struct ev_timer timer; > /** Configured election timeout in seconds. */ > @@ -256,39 +243,6 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id) > return !raft->is_enabled || raft->leader == source_id; > } > > -/** > - * Return the latest term as seen in PROMOTE requests from instance with id > - * @a source_id. > - */ > -static inline uint64_t > -raft_node_term(const struct raft *raft, uint32_t source_id) > -{ > - assert(source_id < VCLOCK_MAX); > - return vclock_get(&raft->term_map, source_id); > -} > - > -/** > - * Check whether replica with id @a source_id is too old to apply synchronous > - * data from it. The check is only valid when elections are enabled. > - */ > -static inline bool > -raft_is_node_outdated(const struct raft *raft, uint32_t source_id) > -{ > - uint64_t source_term = raft_node_term(raft, source_id); > - return raft->is_enabled && source_term < raft->greatest_term; > -} > - > -/** Remember the last term seen for replica with id @a source_id. */ > -static inline void > -raft_process_term(struct raft *raft, uint32_t source_id, uint64_t term) > -{ > - if (raft_node_term(raft, source_id) >= term) > - return; > - vclock_follow(&raft->term_map, source_id, term); > - if (term > raft->greatest_term) > - raft->greatest_term = term; > -} > - > /** Check if Raft is enabled. */ > static inline bool > raft_is_enabled(const struct raft *raft) > diff --git a/test/unit/raft.c b/test/unit/raft.c > index 575886932..4214dbc4c 100644 > --- a/test/unit/raft.c > +++ b/test/unit/raft.c > @@ -1267,38 +1267,38 @@ raft_test_too_long_wal_write(void) > raft_finish_test(); > } > > -static void > -raft_test_term_filter(void) > -{ > - raft_start_test(9); > - struct raft_node node; > - raft_node_create(&node); > - > - is(raft_node_term(&node.raft, 1), 0, "empty node term"); > - ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially"); > - > - raft_process_term(&node.raft, 1, 1); > - is(raft_node_term(&node.raft, 1), 1, "node term updated"); > - ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated"); > - > - raft_process_term(&node.raft, 2, 100); > - ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others " > - "have greater term"); > - ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term " > - "isn't outdated"); > - > - raft_process_term(&node.raft, 3, 100); > - ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when " > - "others have the same term"); > - > - raft_process_term(&node.raft, 3, 99); > - is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased"); > - ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become " > - "outdated"); > - > - raft_node_destroy(&node); > - raft_finish_test(); > -} > +// static void > +// raft_test_term_filter(void) > +// { > +// raft_start_test(9); > +// struct raft_node node; > +// raft_node_create(&node); > + > +// is(raft_node_term(&node.raft, 1), 0, "empty node term"); > +// ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially"); > + > +// raft_process_term(&node.raft, 1, 1); > +// is(raft_node_term(&node.raft, 1), 1, "node term updated"); > +// ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated"); > + > +// raft_process_term(&node.raft, 2, 100); > +// ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others " > +// "have greater term"); > +// ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term " > +// "isn't outdated"); > + > +// raft_process_term(&node.raft, 3, 100); > +// ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when " > +// "others have the same term"); > + > +// raft_process_term(&node.raft, 3, 99); > +// is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased"); > +// ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become " > +// "outdated"); > + > +// raft_node_destroy(&node); > +// raft_finish_test(); > +// } > > static void > raft_test_start_stop_candidate(void) > @@ -1332,7 +1332,7 @@ raft_test_start_stop_candidate(void) > static int > main_f(va_list ap) > { > - raft_start_test(15); > + raft_start_test(14); > > (void) ap; > fakeev_init(); > @@ -1350,7 +1350,7 @@ main_f(va_list ap) > raft_test_death_timeout(); > raft_test_enable_disable(); > raft_test_too_long_wal_write(); > - raft_test_term_filter(); > + //raft_test_term_filter(); > raft_test_start_stop_candidate(); > > fakeev_free(); > diff --git a/test/unit/raft.result b/test/unit/raft.result > index bb799936b..f9a8f249b 100644 > --- a/test/unit/raft.result > +++ b/test/unit/raft.result > @@ -1,5 +1,5 @@ > *** main_f *** > -1..15 > +1..14 > *** raft_test_leader_election *** > 1..24 > ok 1 - 1 pending message at start > @@ -220,25 +220,12 @@ ok 12 - subtests > ok 8 - became candidate > ok 13 - subtests > *** raft_test_too_long_wal_write: done *** > - *** raft_test_term_filter *** > - 1..9 > - ok 1 - empty node term > - ok 2 - not outdated initially > - ok 3 - node term updated > - ok 4 - other nodes are outdated > - ok 5 - node outdated when others have greater term > - ok 6 - node with greatest term isn't outdated > - ok 7 - node not outdated when others have the same term > - ok 8 - node term isn't decreased > - ok 9 - node doesn't become outdated > -ok 14 - subtests > - *** raft_test_term_filter: done *** > *** raft_test_start_stop_candidate *** > 1..4 > ok 1 - became leader after start_candidate > ok 2 - remain leader after stop_candidate > ok 3 - vote request from 2 > ok 4 - demote once new election starts > -ok 15 - subtests > +ok 14 - subtests > *** raft_test_start_stop_candidate: done *** > *** main_f: done *** > -- Serge Petrenko