[Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine

Tue Sep 8 01:57:11 MSK 2020

Thanks for the review!

>>  src/box/applier.cc |  18 +-
>>  src/box/box.cc     |   7 +-
>>  src/box/raft.c     | 711 +++++++++++++++++++++++++++++++++++++++++++--
>>  src/box/raft.h     | 118 ++++++++
>>  src/box/relay.cc   |  19 ++
>>  5 files changed, 847 insertions(+), 26 deletions(-)
>>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 5459a1dc1..c7c486ee4 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -916,8 +916,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
>>   * Return 0 for success or -1 in case of an error.
>>   */
>>  static int
>> -applier_apply_tx(struct stailq *rows)
>> +applier_apply_tx(struct applier *applier, struct stailq *rows)
>>  {
>> +	/*
>> +	 * Rows received not directly from a leader are ignored. That is a
>> +	 * protection against the case when an old leader keeps sending data
>> +	 * around not knowing yet that it is not a leader anymore.
>> +	 *
>> +	 * XXX: it may be that this can be fine to apply leader transactions by
>> +	 * looking at their replica_id field if it is equal to leader id. That
>> +	 * can be investigated as an 'optimization'. Even though may not give
>> +	 * anything, because won't change total number of rows sent in the
>> +	 * network anyway.
>> +	 */
>> +	if (!raft_is_source_allowed(applier->instance_id))
>> +		return 0;
> Have we abandoned Rule 66?

Yes, it was officially deprecated.

>> diff --git a/src/box/raft.h b/src/box/raft.h
>> index db64cf933..111a9c16e 100644
>> --- a/src/box/raft.h
>> +++ b/src/box/raft.h
>> @@ -31,34 +31,141 @@
>>  struct raft {
>> +	/** Instance ID of leader of the current term. */
>>  	uint32_t leader;
>> +	/** State of the instance. */
>>  	enum raft_state state;
>> +	/**
>> +	 * Volatile part of the Raft state, whose WAL write may be still
>> +	 * in-progress, and yet the state may be already used. Volatile state is
>> +	 * never sent to anywhere, but the state machine makes decisions based
>> +	 * on it. That is vital.
>> +	 * As an example, volatile vote needs to be used to reject votes inside
>> +	 * a term, where the instance already voted (even if the vote WAL write
>> +	 * is not finished yet). Otherwise the instance would try to write
>> +	 * several votes inside one term.
>> +	 */
>> +	uint64_t volatile_term;
>> +	uint32_t volatile_vote;
>> +	/**
>> +	 * Flag whether Raft is enabled. When disabled, it still persists terms
>> +	 * so as to quickly enroll into the cluster when (if) it is enabled. In
>> +	 * everything else disabled Raft does not affect instance work.
>> +	 */
>>  	bool is_enabled;
>> +	/**
>> +	 * Flag whether the node can become a leader. It is an accumulated value
>> +	 * of configuration options Raft enabled and Raft candidate. If at least
>> +	 * one is false - the instance is not a candidate.
>> +	 */
>>  	bool is_candidate;
>> +	/** Flag whether the instance is allowed to be a leader. */
>> +	bool is_cfg_candidate;
>> +	/**
>> +	 * Flag whether Raft currently tries to write something into WAL. It
>> +	 * happens asynchronously, not right after Raft state is updated.
>> +	 */
>> +	bool is_write_in_progress;
>> +	/**
>> +	 * Persisted Raft state. These values are used when need to tell current
>> +	 * Raft state to other nodes.
>> +	 */
>>  	uint64_t term;
>>  	uint32_t vote;
>> +	/**
>> +	 * Bit 1 on position N means that a vote from instance with ID = N was
>> +	 * obtained.
>> +	 */
>> +	uint32_t vote_mask;
>> +	/** Number of votes for this instance. Valid only in candidate state. */
>> +	int vote_count;
>> +	/** State machine timed event trigger. */
>> +	struct ev_timer timer;
>> +	/**
>> +	 * Dump of Raft state in the end of event loop, when it is changed.
>> +	 */
>> +	struct ev_check io;
> 
> You probably need the ev_prepare watcher.
> 
> Here's what the libev doc  says:
> 
> All |ev_prepare| watchers are invoked just /before/ |ev_run| starts
> to gather new events, and all |ev_check| watchers are queued (not
> invoked) just after |ev_run| has gathered them, but before it
> queues any callbacks for any received events. That means
> |ev_prepare| watchers are the last watchers invoked before the
> event loop sleeps or polls for new events, and |ev_check| watchers
> will be invoked before any other watchers of the same or lower
> priority within an event loop iteration.

I am not so sure. The documentation is hard to read. From what I understand,
this is how it works: ev_prepare is invoked, then events are collected, then
ev_check is invoked, then the callbacks are called. In our case it seems to
be not important whether we use prepare or check. But on the other hand, it
is probably not how it works, because in fiber.top check is called in the
beginning of an iteration, and prepare is called afterwards, and it fiber.top
returns sane results.

Here I see that check is called after prepare. So check is at least 'closer' to
event loop iteration end, isn't it?:

	http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_prepare_code_and_code_ev_che
	Prepare and check watchers are often (but not always) used in pairs: prepare watchers get invoked before the process blocks and check watchers afterwards.

I also couldn't get any info from libev source code, because it is unreadable
mostly.

Is there a case, when ev_check won't work, and ev_prepare will? Is
ev_prepare actually closer to an event loop iteration end, and I missed something?

>> +	/** Configured election timeout in seconds. */
>>  	double election_timeout;
>>  };
>> @@ -786,6 +803,8 @@ relay_raft_msg_push(struct cmsg *base)
>>  	struct xrow_header row;
>>  	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>>  	try {
>> +		if (msg->req.state == RAFT_STATE_LEADER)
>> +			relay_restart_recovery(msg->relay);
>>  		relay_send(msg->relay, &row);
> 
> Looks like you should first send the message and then restart the recovery.
> Otherwise the resent data will be ignored once again.

Yes, true.

====================

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 4f9bbc0de..d63711600 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -803,9 +803,14 @@ relay_raft_msg_push(struct cmsg *base)
 	struct xrow_header row;
 	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
 	try {
+		/*
+		 * Send the message before restarting the recovery. Otherwise
+		 * all the rows would be sent from under a non-leader role and
+		 * would be ignored again.
+		 */
+		relay_send(msg->relay, &row);
 		if (msg->req.state == RAFT_STATE_LEADER)
 			relay_restart_recovery(msg->relay);
-		relay_send(msg->relay, &row);
 	} catch (Exception *e) {
 		relay_set_error(msg->relay, e);
 		fiber_cancel(fiber());