From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from smtp46.i.mail.ru (smtp46.i.mail.ru [94.100.177.106]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dev.tarantool.org (Postfix) with ESMTPS id 7768E469719 for ; Wed, 9 Sep 2020 11:04:55 +0300 (MSK) References: <90367ede-7d37-4ed4-82b7-86a4bcd507de@tarantool.org> From: Serge Petrenko Message-ID: <7e5fee9d-d032-de31-e1eb-970ffeb5ce37@tarantool.org> Date: Wed, 9 Sep 2020 11:04:54 +0300 MIME-Version: 1.0 In-Reply-To: <90367ede-7d37-4ed4-82b7-86a4bcd507de@tarantool.org> Content-Type: text/plain; charset="utf-8"; format="flowed" Content-Transfer-Encoding: 8bit Content-Language: en-GB Subject: Re: [Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine List-Id: Tarantool development patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Vladislav Shpilevoy , tarantool-patches@dev.tarantool.org, gorcunov@gmail.com 08.09.2020 01:57, Vladislav Shpilevoy пишет: > Thanks for the review! Hi! Thanks for your reply! > >>> src/box/applier.cc | 18 +- >>> src/box/box.cc | 7 +- >>> src/box/raft.c | 711 +++++++++++++++++++++++++++++++++++++++++++-- >>> src/box/raft.h | 118 ++++++++ >>> src/box/relay.cc | 19 ++ >>> 5 files changed, 847 insertions(+), 26 deletions(-) >>> >>> diff --git a/src/box/applier.cc b/src/box/applier.cc >>> index 5459a1dc1..c7c486ee4 100644 >>> --- a/src/box/applier.cc >>> +++ b/src/box/applier.cc >>> @@ -916,8 +916,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row) >>> * Return 0 for success or -1 in case of an error. >>> */ >>> static int >>> -applier_apply_tx(struct stailq *rows) >>> +applier_apply_tx(struct applier *applier, struct stailq *rows) >>> { >>> + /* >>> + * Rows received not directly from a leader are ignored. That is a >>> + * protection against the case when an old leader keeps sending data >>> + * around not knowing yet that it is not a leader anymore. >>> + * >>> + * XXX: it may be that this can be fine to apply leader transactions by >>> + * looking at their replica_id field if it is equal to leader id. That >>> + * can be investigated as an 'optimization'. Even though may not give >>> + * anything, because won't change total number of rows sent in the >>> + * network anyway. >>> + */ >>> + if (!raft_is_source_allowed(applier->instance_id)) >>> + return 0; >> Have we abandoned Rule 66? > Yes, it was officially deprecated. > >>> diff --git a/src/box/raft.h b/src/box/raft.h >>> index db64cf933..111a9c16e 100644 >>> --- a/src/box/raft.h >>> +++ b/src/box/raft.h >>> @@ -31,34 +31,141 @@ >>> struct raft { >>> + /** Instance ID of leader of the current term. */ >>> uint32_t leader; >>> + /** State of the instance. */ >>> enum raft_state state; >>> + /** >>> + * Volatile part of the Raft state, whose WAL write may be still >>> + * in-progress, and yet the state may be already used. Volatile state is >>> + * never sent to anywhere, but the state machine makes decisions based >>> + * on it. That is vital. >>> + * As an example, volatile vote needs to be used to reject votes inside >>> + * a term, where the instance already voted (even if the vote WAL write >>> + * is not finished yet). Otherwise the instance would try to write >>> + * several votes inside one term. >>> + */ >>> + uint64_t volatile_term; >>> + uint32_t volatile_vote; >>> + /** >>> + * Flag whether Raft is enabled. When disabled, it still persists terms >>> + * so as to quickly enroll into the cluster when (if) it is enabled. In >>> + * everything else disabled Raft does not affect instance work. >>> + */ >>> bool is_enabled; >>> + /** >>> + * Flag whether the node can become a leader. It is an accumulated value >>> + * of configuration options Raft enabled and Raft candidate. If at least >>> + * one is false - the instance is not a candidate. >>> + */ >>> bool is_candidate; >>> + /** Flag whether the instance is allowed to be a leader. */ >>> + bool is_cfg_candidate; >>> + /** >>> + * Flag whether Raft currently tries to write something into WAL. It >>> + * happens asynchronously, not right after Raft state is updated. >>> + */ >>> + bool is_write_in_progress; >>> + /** >>> + * Persisted Raft state. These values are used when need to tell current >>> + * Raft state to other nodes. >>> + */ >>> uint64_t term; >>> uint32_t vote; >>> + /** >>> + * Bit 1 on position N means that a vote from instance with ID = N was >>> + * obtained. >>> + */ >>> + uint32_t vote_mask; >>> + /** Number of votes for this instance. Valid only in candidate state. */ >>> + int vote_count; >>> + /** State machine timed event trigger. */ >>> + struct ev_timer timer; >>> + /** >>> + * Dump of Raft state in the end of event loop, when it is changed. >>> + */ >>> + struct ev_check io; >> You probably need the ev_prepare watcher. >> >> Here's what the libev doc  says: >> >> All |ev_prepare| watchers are invoked just /before/ |ev_run| starts >> to gather new events, and all |ev_check| watchers are queued (not >> invoked) just after |ev_run| has gathered them, but before it >> queues any callbacks for any received events. That means >> |ev_prepare| watchers are the last watchers invoked before the >> event loop sleeps or polls for new events, and |ev_check| watchers >> will be invoked before any other watchers of the same or lower >> priority within an event loop iteration. > I am not so sure. The documentation is hard to read. From what I understand, > this is how it works: ev_prepare is invoked, then events are collected, then > ev_check is invoked, then the callbacks are called. Yes, it works this way. Aren't we speaking of the same things? By 'ev_loop iteration' I mean the time when our fibers perform some useful work and various ev callbacks are called. This means EV_CHECK is called in the beginning of 'ev_loop iteration' and EV_PREPARE is called in the end. > In our case it seems to > be not important whether we use prepare or check. But on the other hand, it > is probably not how it works, because in fiber.top check is called in the > beginning of an iteration, and prepare is called afterwards, and it fiber.top > returns sane results. This is so because all the fibers perform their work between ev_check and ev_prepare. > > Here I see that check is called after prepare. So check is at least 'closer' to > event loop iteration end, isn't it?: > > http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_prepare_code_and_code_ev_che > Prepare and check watchers are often (but not always) used in pairs: prepare watchers get invoked before the process blocks and check watchers afterwards. As I understand it, EV_CHECK is called in  the 'beginning' of an ev_loop iteration. Anyway, looks like may use any of the watchers. I don't think it makes any difference. > > I also couldn't get any info from libev source code, because it is unreadable > mostly. > > Is there a case, when ev_check won't work, and ev_prepare will? Is > ev_prepare actually closer to an event loop iteration end, and I missed something? > >>> + /** Configured election timeout in seconds. */ >>> double election_timeout; >>> }; >>> @@ -786,6 +803,8 @@ relay_raft_msg_push(struct cmsg *base) >>> struct xrow_header row; >>> xrow_encode_raft(&row, &fiber()->gc, &msg->req); >>> try { >>> + if (msg->req.state == RAFT_STATE_LEADER) >>> + relay_restart_recovery(msg->relay); >>> relay_send(msg->relay, &row); >> Looks like you should first send the message and then restart the recovery. >> Otherwise the resent data will be ignored once again. > Yes, true. > > ==================== > diff --git a/src/box/relay.cc b/src/box/relay.cc > index 4f9bbc0de..d63711600 100644 > --- a/src/box/relay.cc > +++ b/src/box/relay.cc > @@ -803,9 +803,14 @@ relay_raft_msg_push(struct cmsg *base) > struct xrow_header row; > xrow_encode_raft(&row, &fiber()->gc, &msg->req); > try { > + /* > + * Send the message before restarting the recovery. Otherwise > + * all the rows would be sent from under a non-leader role and > + * would be ignored again. > + */ > + relay_send(msg->relay, &row); > if (msg->req.state == RAFT_STATE_LEADER) > relay_restart_recovery(msg->relay); > - relay_send(msg->relay, &row); > } catch (Exception *e) { > relay_set_error(msg->relay, e); > fiber_cancel(fiber()); -- Serge Petrenko