From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from [87.239.111.99] (localhost [127.0.0.1]) by dev.tarantool.org (Postfix) with ESMTP id 1B38F6FC8F; Fri, 16 Apr 2021 02:27:44 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org 1B38F6FC8F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=tarantool.org; s=dev; t=1618529264; bh=8wfW3SKWZ6Mi/OJaQtw2nWAhDBrWNX0v2b3QSMtc+dI=; h=To:Cc:References:Date:In-Reply-To:Subject:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe: From:Reply-To:From; b=H0jGSaF+kaCAureeuYJWHoxNK4xWbnQRXgNWO3iA11izaUE71a93UU+yflXAL14f6 g6H7HB0TLMN70wfSXzYJU4qh1QfXOVvtxjWN6qLzautFfFZEoIun+SfyoLuKaty7oJ MC0mBoA34jhYOLRRDQpv/OZ/L9XogD0Jsizfdc1I= Received: from smtp56.i.mail.ru (smtp56.i.mail.ru [217.69.128.36]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id ACD9A6FC8F for ; Fri, 16 Apr 2021 02:27:42 +0300 (MSK) DKIM-Filter: OpenDKIM Filter v2.11.0 dev.tarantool.org ACD9A6FC8F Received: by smtp56.i.mail.ru with esmtpa (envelope-from ) id 1lXBOf-0000qG-QW; Fri, 16 Apr 2021 02:27:42 +0300 To: Serge Petrenko , gorcunov@gmail.com Cc: tarantool-patches@dev.tarantool.org References: <578222e82d897fd042d73afefa14766a0518de09.1618409665.git.sergepetrenko@tarantool.org> Message-ID: <72b60d0e-f657-1102-9129-e9f719af338b@tarantool.org> Date: Fri, 16 Apr 2021 01:27:40 +0200 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.9.1 MIME-Version: 1.0 In-Reply-To: <578222e82d897fd042d73afefa14766a0518de09.1618409665.git.sergepetrenko@tarantool.org> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-7564579A: B8F34718100C35BD X-77F55803: 4F1203BC0FB41BD92FFCB8E6708E74806859AC5FE18436AEED970E897805ADA4182A05F53808504077A05FC442975E1A07CD83BF1E4A9D4D821943C5197BFB83FE2868065E0F6FFB X-7FA49CB5: FF5795518A3D127A4AD6D5ED66289B5278DA827A17800CE781EBC5BFC74B682DEA1F7E6F0F101C67BD4B6F7A4D31EC0BCC500DACC3FED6E28638F802B75D45FF8AA50765F7900637CE3A619BB4CB99268638F802B75D45FF914D58D5BE9E6BC1A93B80C6DEB9DEE97C6FB206A91F05B2AE399226DFA3C6EDD139D215FF6EAAB4262F3407DF4BB9EED2E47CDBA5A96583C09775C1D3CA48CFCA5A41EBD8A3A0199FA2833FD35BB23D2EF20D2F80756B5F868A13BD56FB6657A471835C12D1D977725E5C173C3A84C3A12191B5F2BB8629117882F4460429728AD0CFFFB425014E868A13BD56FB6657D81D268191BDAD3DC09775C1D3CA48CF3F23D99D50CED82F76E601842F6C81A12EF20D2F80756B5F7E9C4E3C761E06A776E601842F6C81A127C277FBC8AE2E8B9DF15F3491E5012F3AA81AA40904B5D9DBF02ECDB25306B2201CA6A4E26CD07C68655334FD4449CB9ECD01F8117BC8BEAAAE862A0553A39223F8577A6DFFEA7CAA44A86D94E7BBB043847C11F186F3C59DAA53EE0834AAEE X-B7AD71C0: AC4F5C86D027EB782CDD5689AFBDA7A2368A440D3B0F6089093C9A16E5BC824A2A04A2ABAA09D25379311020FFC8D4AD6C30AE2B5E62355BD86AD6B8E2BEC178 X-C1DE0DAB: 0D63561A33F958A53BE7F3729796E832B9DDEC7E9758BC11E7E920B3DA1C5722D59269BC5F550898D99A6476B3ADF6B47008B74DF8BB9EF7333BD3B22AA88B938A852937E12ACA7502E6951B79FF9A3F410CA545F18667F91A7EA1CDA0B5A7A0 X-C8649E89: 4E36BF7865823D7055A7F0CF078B5EC49A30900B95165D34C0DDD40374D3501759335186D70913BC750BD7FCB60F2CF315ECC55E3E7D55EAC115ABA28A3A6CC91D7E09C32AA3244C2496235087B9BDF8982AA73A0D95FA4AE646F07CC2D4F3D8FACE5A9C96DEB163 X-D57D3AED: 3ZO7eAau8CL7WIMRKs4sN3D3tLDjz0dLbV79QFUyzQ2Ujvy7cMT6pYYqY16iZVKkSc3dCLJ7zSJH7+u4VD18S7Vl4ZUrpaVfd2+vE6kuoey4m4VkSEu530nj6fImhcD4MUrOEAnl0W826KZ9Q+tr5ycPtXkTV4k65bRjmOUUP8cvGozZ33TWg5HZplvhhXbhDGzqmQDTd6OAevLeAnq3Ra9uf7zvY2zzsIhlcp/Y7m53TZgf2aB4JOg4gkr2bioj3S6P1v0GIqTF43wOiii0VA== X-Mailru-Sender: 504CC1E875BF3E7D9BC0E5172ADA3110165A010A0C836DBF1B06215E3ADF2749298B7C1D8D93B6DF07784C02288277CA03E0582D3806FB6A5317862B1921BA260ED6CFD6382C13A6112434F685709FCF0DA7A0AF5A3A8387 X-Mras: Ok Subject: Re: [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that X-BeenThere: tarantool-patches@dev.tarantool.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Vladislav Shpilevoy via Tarantool-patches Reply-To: Vladislav Shpilevoy Errors-To: tarantool-patches-bounces@dev.tarantool.org Sender: "Tarantool-patches" Good job on the patch! Please, try to reduce length of the lines in the commit message, or at least its title. It is suuuper long now. See 10 comments below. > diff --git a/src/box/applier.cc b/src/box/applier.cc > index 4898f9f7b..3fb864686 100644 > --- a/src/box/applier.cc > +++ b/src/box/applier.cc > @@ -790,6 +790,12 @@ apply_synchro_row_cb(struct journal_entry *entry) > applier_rollback_by_wal_io(); > } else { > txn_limbo_process(&txn_limbo, synchro_entry->req); > + if (iproto_type_is_promote_request(synchro_entry->req->type)) { > + raft_source_update_term(box_raft(), > + synchro_entry->req->origin_id, > + synchro_entry->req->term); 1. How about moving that to txn_limbo_read_promote()? What do you think? I see you do it in 3 places where txn_limbo_process() or txn_limbo_read_promote() are called on PROMOTE rows. > + > + } > trigger_run(&replicaset.applier.on_wal_write, NULL); > } > /* The fiber is the same on final join. */ > @@ -1027,6 +1033,28 @@ applier_apply_tx(struct applier *applier, struct stailq *rows) > } > } > > + /* > + * When elections are enabled we must filter out synchronous rows coming > + * from an instance that fell behind the current leader. This includes > + * both synchronous tx rows and rows for txs following unconfirmed > + * synchronous transactions. > + * The rows are replaced with NOPs to preserve the vclock consistency. > + */ > + struct applier_tx_row *item; > + if (raft_source_has_outdated_term(box_raft(), applier->instance_id) && 2. The names are too long IMO. I would propose raft_is_node_outdated(raft, id) // Check if behind raft_process_term(raft, id, term) // Set term for a node or skip if the same or older raft_node_term(raft, id) // Get term 'source' is not really a perfect name, because raft nodes send messages to each other. There are no one-directional channels AFAIR like we have with upstream and downstream in the replication. I used 'source' in raft_process_heartbeat() as like a source of the heartbeat message. Note like the nodes are called sources everywhere. Also I used 'process' for the new term, because we already have raft_process_heartbeat() to handle info from a node with a given ID, and I thought it makes sense to keep them similar. 3. Why does raft_is_source_allowed() still exist when we have this wonder? > + (last_row->wait_sync || > + (iproto_type_is_synchro_request(first_row->type) && > + !iproto_type_is_promote_request(first_row->type)))) { > + stailq_foreach_entry(item, rows, next) { > + struct xrow_header *row = &item->row; > + row->type = IPROTO_NOP; > + /* > + * Row body is saved to fiber's region and will be freed > + * on next fiber_gc() call. > + */ > + row->bodycnt = 0; > + } > + } > diff --git a/src/box/box.cc b/src/box/box.cc > index 722fc23b7..f44dd0e54 100644 > --- a/src/box/box.cc > +++ b/src/box/box.cc > @@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row) > return -1; > } > txn_limbo_process(&txn_limbo, &syn_req); > + if (syn_req.type == IPROTO_PROMOTE) { > + raft_source_update_term(box_raft(), syn_req.origin_id, > + syn_req.term); 4. Misaligned. Also see the first comment. > @@ -1558,20 +1567,21 @@ box_clear_synchro_queue(bool try_wait) > rc = -1; > } else { > promote: > - /* > - * Term parameter is unused now, We'll pass > - * box_raft()->term there later. > - */ > - txn_limbo_write_promote(&txn_limbo, wait_lsn, 0); > + /* We cannot possibly get here in a volatile state. */ > + assert(box_raft()->volatile_term == box_raft()->term); > + txn_limbo_write_promote(&txn_limbo, wait_lsn, > + box_raft()->term); > struct synchro_request req = { > .type = 0, /* unused */ > .replica_id = 0, /* unused */ > .origin_id = instance_id, > .lsn = wait_lsn, > - .term = 0, /* unused */ > + .term = box_raft()->term, > }; > txn_limbo_read_promote(&txn_limbo, &req); > assert(txn_limbo_is_empty(&txn_limbo)); > + raft_source_update_term(box_raft(), req.origin_id, > + req.term); 5. See the first comment. > } > } > diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h > index e447f6634..01f548fee 100644 > --- a/src/lib/raft/raft.h > +++ b/src/lib/raft/raft.h > @@ -207,6 +207,19 @@ struct raft { > * subsystems, such as Raft. > */ > const struct vclock *vclock; > + /** > + * The biggest term seen by this instance and persisted in WAL as part > + * of a PROMOTE request. May be smaller than @a term, while there are > + * ongoing elections, or the leader is already known, but this instance > + * hasn't read its PROMOTE request yet. > + * During other times must be equal to @a term. > + */ > + uint64_t greatest_known_term; 6. Maybe omit 'known'. There can't be 'greatest_unknown_term'. > + /** > + * Latest terms received with PROMOTE entries from remote instances. > + * Raft uses them to determine data from which sources may be applied. > + */ > + struct vclock term_map; 7. I have a feeling it is similar to the limbo's LSN map. Like they should be merged into something one. Can't formulate that properly. I hope we will see it more clear when will move all that to the WAL thread someday. > /** State machine timed event trigger. */ > struct ev_timer timer; > /** Configured election timeout in seconds. */ > @@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id) > return !raft->is_enabled || raft->leader == source_id; > } > > +/** > + * Return the latest term as seen in PROMOTE requests from instance with id > + * @a source_id. > + */ > +static inline uint64_t > +raft_source_term(const struct raft *raft, uint32_t source_id) > +{ > + assert(source_id != 0 && source_id < VCLOCK_MAX); > + return vclock_get(&raft->term_map, source_id); > +} > + > +/** > + * Check whether replica with id @a source_id is too old to apply synchronous > + * data from it. The check is only valid when elections are enabled. > + */ > +static inline bool > +raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id) > +{ > + uint64_t source_term = vclock_get(&raft->term_map, source_id); > + return raft->is_enabled && source_term < raft->greatest_known_term; > +} > + > +/** Remember the last term seen for replica with id @a source_id. */ > +static inline void > +raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term) > +{ > + if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term) 8. Probably having the term as uint64_t was a mistake from the beginning. Feel free to change it to int64_t if you want, in a separate commit. > + return; > + vclock_follow(&raft->term_map, source_id, term); > + if (term > raft->greatest_known_term) > + raft->greatest_known_term = term; > +} 9. I see these are not used in the raft code at all. Did you think about moving it all to box/raft.h and box/raft.c? Or about covering this all with unit tests in unit/raft.c if you decide to keep it here? > + > /** Check if Raft is enabled. */ > static inline bool > raft_is_enabled(const struct raft *raft) > diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result > new file mode 100644 > index 000000000..ff3104de5 > --- /dev/null > +++ b/test/replication/gh-5445-leader-inconsistency.result > @@ -0,0 +1,291 @@ > +-- test-run result file version 2 > +test_run = require("test_run").new() > + | --- > + | ... > + > +is_leader_cmd = "return box.info.election.state == 'leader'" > + | --- > + | ... > + > +-- Auxiliary. > +test_run:cmd('setopt delimiter ";"') > + | --- > + | - true > + | ... > +function get_leader(nrs) > + local leader_nr = 0 > + test_run:wait_cond(function() > + for nr, do_check in pairs(nrs) do > + if do_check then > + local is_leader = test_run:eval('election_replica'..nr, > + is_leader_cmd)[1] > + if is_leader then > + leader_nr = nr > + return true > + end > + end > + end > + return false > + end) > + assert(leader_nr ~= 0) > + return leader_nr > +end; > + | --- > + | ... > + > +function name(id) > + return 'election_replica'..id > +end; 10. You can move this function above get_leader() and use it in there.