[Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine
Serge Petrenko
sergepetrenko at tarantool.org
Wed Sep 9 11:04:54 MSK 2020
08.09.2020 01:57, Vladislav Shpilevoy wrote:
> Thanks for the review!
Hi! Thanks for your reply!
>
>>> src/box/applier.cc | 18 +-
>>> src/box/box.cc | 7 +-
>>> src/box/raft.c | 711 +++++++++++++++++++++++++++++++++++++++++++--
>>> src/box/raft.h | 118 ++++++++
>>> src/box/relay.cc | 19 ++
>>> 5 files changed, 847 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 5459a1dc1..c7c486ee4 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -916,8 +916,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
>>> * Return 0 for success or -1 in case of an error.
>>> */
>>> static int
>>> -applier_apply_tx(struct stailq *rows)
>>> +applier_apply_tx(struct applier *applier, struct stailq *rows)
>>> {
>>> + /*
>>> + * Rows received not directly from a leader are ignored. That is a
>>> + * protection against the case when an old leader keeps sending data
>>> + * around not knowing yet that it is not a leader anymore.
>>> + *
>>> + * XXX: it may be fine to apply leader transactions by looking at
>>> + * their replica_id field and checking that it equals the leader id.
>>> + * That can be investigated as an 'optimization'. Though it may not
>>> + * give anything, because it won't change the total number of rows
>>> + * sent over the network anyway.
>>> + */
>>> + if (!raft_is_source_allowed(applier->instance_id))
>>> + return 0;
>> Have we abandoned Rule 66?
> Yes, it was officially deprecated.
>
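By the way, as I read it, the new check in applier_apply_tx() boils down to
'accept rows from any source while Raft is disabled, and only from the current
leader otherwise'. A rough sketch of the idea, not the literal patch code (the
stub struct just stands in for the real struct raft):

#include <stdbool.h>
#include <stdint.h>

/* A stripped-down stand-in for the real struct raft from raft.h. */
struct raft_stub {
	bool is_enabled;
	uint32_t leader;
};

static struct raft_stub raft;

/* Sketch of the idea only; the real raft_is_source_allowed() may differ. */
static inline bool
raft_is_source_allowed(uint32_t source_id)
{
	/*
	 * With Raft disabled every source is fine. Otherwise only rows
	 * coming directly from the current leader are applied.
	 */
	return !raft.is_enabled || raft.leader == source_id;
}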
>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>> index db64cf933..111a9c16e 100644
>>> --- a/src/box/raft.h
>>> +++ b/src/box/raft.h
>>> @@ -31,34 +31,141 @@
>>> struct raft {
>>> + /** Instance ID of the leader of the current term. */
>>> uint32_t leader;
>>> + /** State of the instance. */
>>> enum raft_state state;
>>> + /**
>>> + * Volatile part of the Raft state, whose WAL write may still be
>>> + * in progress, and yet the state may already be in use. The volatile
>>> + * state is never sent anywhere, but the state machine makes decisions
>>> + * based on it. That is vital.
>>> + * As an example, the volatile vote is used to reject vote requests
>>> + * inside a term in which the instance has already voted (even if the
>>> + * vote WAL write is not finished yet). Otherwise the instance would
>>> + * try to write several votes inside one term.
>>> + */
>>> + uint64_t volatile_term;
>>> + uint32_t volatile_vote;
>>> + /**
>>> + * Flag whether Raft is enabled. When disabled, it still persists terms
>>> + * so as to quickly enroll in the cluster when (if) it is enabled. In
>>> + * all other respects disabled Raft does not affect the instance's work.
>>> + */
>>> bool is_enabled;
>>> + /**
>>> + * Flag whether the node can become a leader. It is a combination of the
>>> + * configuration options 'Raft enabled' and 'Raft candidate'. If at least
>>> + * one of them is false, the instance is not a candidate.
>>> + */
>>> bool is_candidate;
>>> + /** Flag whether the instance is allowed to be a leader. */
>>> + bool is_cfg_candidate;
>>> + /**
>>> + * Flag whether Raft is currently trying to write something into the
>>> + * WAL. The write happens asynchronously, not right after the Raft state
>>> + * is updated.
>>> + */
>>> + bool is_write_in_progress;
>>> + /**
>>> + * Persisted Raft state. These values are used when the current Raft
>>> + * state needs to be sent to other nodes.
>>> + */
>>> uint64_t term;
>>> uint32_t vote;
>>> + /**
>>> + * A set bit at position N means that a vote from the instance with
>>> + * ID = N was obtained.
>>> + */
>>> + uint32_t vote_mask;
>>> + /** Number of votes for this instance. Valid only in candidate state. */
>>> + int vote_count;
>>> + /** State machine timed event trigger. */
>>> + struct ev_timer timer;
>>> + /**
>>> + * Dump of the Raft state at the end of the event loop, when it changes.
>>> + */
>>> + struct ev_check io;
>> You probably need the ev_prepare watcher.
>>
>> Here's what the libev doc says:
>>
>> All ev_prepare watchers are invoked just before ev_run starts
>> to gather new events, and all ev_check watchers are queued (not
>> invoked) just after ev_run has gathered them, but before it
>> queues any callbacks for any received events. That means
>> ev_prepare watchers are the last watchers invoked before the
>> event loop sleeps or polls for new events, and ev_check watchers
>> will be invoked before any other watchers of the same or lower
>> priority within an event loop iteration.
> I am not so sure. The documentation is hard to read. From what I understand,
> this is how it works: ev_prepare is invoked, then events are collected, then
> ev_check is invoked, then the callbacks are called.
Yes, it works this way. Aren't we speaking of the same thing?
By 'ev_loop iteration' I mean the time when our fibers perform some useful
work and various ev callbacks are called. This means EV_CHECK is called at
the beginning of an 'ev_loop iteration' and EV_PREPARE is called at the end.
> In our case it seems to be unimportant whether we use prepare or check. But
> on the other hand, it is probably not how it works, because in fiber.top
> check is called in the beginning of an iteration, and prepare is called
> afterwards, and yet fiber.top returns sane results.
This is so because all the fibers perform their work between ev_check
and ev_prepare.
>
> Here I see that check is called after prepare. So check is at least 'closer'
> to the event loop iteration's end, isn't it?:
>
> http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_prepare_code_and_code_ev_che
> Prepare and check watchers are often (but not always) used in pairs: prepare watchers get invoked before the process blocks and check watchers afterwards.
As I understand it, EV_CHECK is called at the 'beginning' of an ev_loop
iteration.
Anyway, it looks like we may use either of the watchers. I don't think it
makes any difference.
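If anything, a tiny standalone libev program (nothing to do with the patch)
shows the order we are discussing. Assuming I wired it up correctly, it should
print 'prepare', then 'check', then 'timer': prepare is the last watcher
invoked before the loop polls, and check is the first one invoked after
polling, before the ordinary callbacks:

/* cc watchers.c -lev && ./watchers */
#include <stdio.h>
#include <ev.h>

static void
prepare_cb(struct ev_loop *loop, ev_prepare *w, int revents)
{
	(void)loop; (void)w; (void)revents;
	printf("prepare: last thing before the loop polls for events\n");
}

static void
check_cb(struct ev_loop *loop, ev_check *w, int revents)
{
	(void)loop; (void)w; (void)revents;
	printf("check: right after polling, before other callbacks\n");
}

static void
timer_cb(struct ev_loop *loop, ev_timer *w, int revents)
{
	(void)w; (void)revents;
	printf("timer: an ordinary watcher callback\n");
	/* One iteration is enough for the demonstration. */
	ev_break(loop, EVBREAK_ALL);
}

int
main(void)
{
	struct ev_loop *loop = EV_DEFAULT;
	ev_prepare prepare;
	ev_check check;
	ev_timer timer;

	ev_prepare_init(&prepare, prepare_cb);
	ev_check_init(&check, check_cb);
	ev_timer_init(&timer, timer_cb, 0.01, 0.);

	ev_prepare_start(loop, &prepare);
	ev_check_start(loop, &check);
	ev_timer_start(loop, &timer);

	ev_run(loop, 0);
	return 0;
}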
>
> I also couldn't get any info from libev source code, because it is unreadable
> mostly.
>
> Is there a case when ev_check won't work, and ev_prepare will? Is ev_prepare
> actually closer to the end of an event loop iteration, and did I miss
> something?
>
>>> + /** Configured election timeout in seconds. */
>>> double election_timeout;
>>> };
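Coming back to the struct above: just to make the intended use of the volatile
vote and the vote bitmap concrete, here is a small self-contained sketch. The
helper names are made up and the real patch code may differ. The volatile vote
rejects a second vote inside one term even while the first vote's WAL write is
still in progress, and vote_mask/vote_count accumulate votes until a quorum is
reached:

#include <stdbool.h>
#include <stdint.h>

/* Stripped-down stand-in for the relevant part of struct raft. */
struct raft_sketch {
	uint64_t volatile_term;
	uint32_t volatile_vote;
	uint32_t vote_mask;
	int vote_count;
};

/*
 * Whether this instance may vote for 'candidate_id' in 'term'. The volatile
 * vote is consulted so that a second, different vote is not written inside
 * the same term even while the first vote's WAL write is still in progress.
 */
static bool
raft_sketch_can_vote(const struct raft_sketch *r, uint64_t term,
		     uint32_t candidate_id)
{
	if (term < r->volatile_term)
		return false;
	if (term == r->volatile_term && r->volatile_vote != 0)
		return r->volatile_vote == candidate_id;
	return true;
}

/*
 * Register a vote received from 'source_id' and report whether 'quorum'
 * votes are collected. A set bit at position N in vote_mask means the
 * instance with ID = N has already voted, so duplicates are not counted
 * twice. Instance IDs are assumed to fit into the 32-bit mask.
 */
static bool
raft_sketch_add_vote(struct raft_sketch *r, uint32_t source_id, int quorum)
{
	uint32_t bit = 1u << source_id;
	if ((r->vote_mask & bit) == 0) {
		r->vote_mask |= bit;
		r->vote_count++;
	}
	return r->vote_count >= quorum;
}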
>>> @@ -786,6 +803,8 @@ relay_raft_msg_push(struct cmsg *base)
>>> struct xrow_header row;
>>> xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>>> try {
>>> + if (msg->req.state == RAFT_STATE_LEADER)
>>> + relay_restart_recovery(msg->relay);
>>> relay_send(msg->relay, &row);
>> Looks like you should first send the message and then restart the recovery.
>> Otherwise the resent data will be ignored once again.
> Yes, true.
>
> ====================
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 4f9bbc0de..d63711600 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -803,9 +803,14 @@ relay_raft_msg_push(struct cmsg *base)
> struct xrow_header row;
> xrow_encode_raft(&row, &fiber()->gc, &msg->req);
> try {
> + /*
> +	 * Send the message before restarting the recovery. Otherwise
> +	 * all the rows would be sent under a non-leader role and would
> +	 * be ignored again.
> + */
> + relay_send(msg->relay, &row);
> if (msg->req.state == RAFT_STATE_LEADER)
> relay_restart_recovery(msg->relay);
> - relay_send(msg->relay, &row);
> } catch (Exception *e) {
> relay_set_error(msg->relay, e);
> fiber_cancel(fiber());
--
Serge Petrenko