From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <sergepetrenko@tarantool.org>
Received: from smtp46.i.mail.ru (smtp46.i.mail.ru [94.100.177.106])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by dev.tarantool.org (Postfix) with ESMTPS id 7768E469719
 for <tarantool-patches@dev.tarantool.org>;
 Wed,  9 Sep 2020 11:04:55 +0300 (MSK)
References: <cover.1599173312.git.v.shpilevoy@tarantool.org>
 <bb762069db05eb500069ff456cfbd8ac72c29ccc.1599173312.git.v.shpilevoy@tarantool.org>
 <a08f1bd8-5ddd-2231-a231-f7ba97b428a4@tarantool.org>
 <90367ede-7d37-4ed4-82b7-86a4bcd507de@tarantool.org>
From: Serge Petrenko <sergepetrenko@tarantool.org>
Message-ID: <7e5fee9d-d032-de31-e1eb-970ffeb5ce37@tarantool.org>
Date: Wed, 9 Sep 2020 11:04:54 +0300
MIME-Version: 1.0
In-Reply-To: <90367ede-7d37-4ed4-82b7-86a4bcd507de@tarantool.org>
Content-Type: text/plain; charset="utf-8"; format="flowed"
Content-Transfer-Encoding: 8bit
Content-Language: en-GB
Subject: Re: [Tarantool-patches] [PATCH v2 09/10] raft: introduce state
	machine
List-Id: Tarantool development patches <tarantool-patches.dev.tarantool.org>
List-Unsubscribe: <https://lists.tarantool.org/mailman/options/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=unsubscribe>
List-Archive: <https://lists.tarantool.org/pipermail/tarantool-patches/>
List-Post: <mailto:tarantool-patches@dev.tarantool.org>
List-Help: <mailto:tarantool-patches-request@dev.tarantool.org?subject=help>
List-Subscribe: <https://lists.tarantool.org/mailman/listinfo/tarantool-patches>, 
 <mailto:tarantool-patches-request@dev.tarantool.org?subject=subscribe>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, tarantool-patches@dev.tarantool.org, gorcunov@gmail.com


08.09.2020 01:57, Vladislav Shpilevoy пишет:
> Thanks for the review!


Hi! Thanks for your reply!

>
>>>   src/box/applier.cc |  18 +-
>>>   src/box/box.cc     |   7 +-
>>>   src/box/raft.c     | 711 +++++++++++++++++++++++++++++++++++++++++++--
>>>   src/box/raft.h     | 118 ++++++++
>>>   src/box/relay.cc   |  19 ++
>>>   5 files changed, 847 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 5459a1dc1..c7c486ee4 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -916,8 +916,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
>>>    * Return 0 for success or -1 in case of an error.
>>>    */
>>>   static int
>>> -applier_apply_tx(struct stailq *rows)
>>> +applier_apply_tx(struct applier *applier, struct stailq *rows)
>>>   {
>>> +	/*
>>> +	 * Rows received not directly from a leader are ignored. That is a
>>> +	 * protection against the case when an old leader keeps sending data
>>> +	 * around not knowing yet that it is not a leader anymore.
>>> +	 *
>>> +	 * XXX: it may be that this can be fine to apply leader transactions by
>>> +	 * looking at their replica_id field if it is equal to leader id. That
>>> +	 * can be investigated as an 'optimization'. Even though may not give
>>> +	 * anything, because won't change total number of rows sent in the
>>> +	 * network anyway.
>>> +	 */
>>> +	if (!raft_is_source_allowed(applier->instance_id))
>>> +		return 0;
>> Have we abandoned Rule 66?
> Yes, it was officially deprecated.
>
>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>> index db64cf933..111a9c16e 100644
>>> --- a/src/box/raft.h
>>> +++ b/src/box/raft.h
>>> @@ -31,34 +31,141 @@
>>>   struct raft {
>>> +	/** Instance ID of leader of the current term. */
>>>   	uint32_t leader;
>>> +	/** State of the instance. */
>>>   	enum raft_state state;
>>> +	/**
>>> +	 * Volatile part of the Raft state, whose WAL write may be still
>>> +	 * in-progress, and yet the state may be already used. Volatile state is
>>> +	 * never sent to anywhere, but the state machine makes decisions based
>>> +	 * on it. That is vital.
>>> +	 * As an example, volatile vote needs to be used to reject votes inside
>>> +	 * a term, where the instance already voted (even if the vote WAL write
>>> +	 * is not finished yet). Otherwise the instance would try to write
>>> +	 * several votes inside one term.
>>> +	 */
>>> +	uint64_t volatile_term;
>>> +	uint32_t volatile_vote;
>>> +	/**
>>> +	 * Flag whether Raft is enabled. When disabled, it still persists terms
>>> +	 * so as to quickly enroll into the cluster when (if) it is enabled. In
>>> +	 * everything else disabled Raft does not affect instance work.
>>> +	 */
>>>   	bool is_enabled;
>>> +	/**
>>> +	 * Flag whether the node can become a leader. It is an accumulated value
>>> +	 * of configuration options Raft enabled and Raft candidate. If at least
>>> +	 * one is false - the instance is not a candidate.
>>> +	 */
>>>   	bool is_candidate;
>>> +	/** Flag whether the instance is allowed to be a leader. */
>>> +	bool is_cfg_candidate;
>>> +	/**
>>> +	 * Flag whether Raft currently tries to write something into WAL. It
>>> +	 * happens asynchronously, not right after Raft state is updated.
>>> +	 */
>>> +	bool is_write_in_progress;
>>> +	/**
>>> +	 * Persisted Raft state. These values are used when need to tell current
>>> +	 * Raft state to other nodes.
>>> +	 */
>>>   	uint64_t term;
>>>   	uint32_t vote;
>>> +	/**
>>> +	 * Bit 1 on position N means that a vote from instance with ID = N was
>>> +	 * obtained.
>>> +	 */
>>> +	uint32_t vote_mask;
>>> +	/** Number of votes for this instance. Valid only in candidate state. */
>>> +	int vote_count;
>>> +	/** State machine timed event trigger. */
>>> +	struct ev_timer timer;
>>> +	/**
>>> +	 * Dump of Raft state in the end of event loop, when it is changed.
>>> +	 */
>>> +	struct ev_check io;
>> You probably need the ev_prepare watcher.
>>
>> Here's what the libev doc  says:
>>
>> All |ev_prepare| watchers are invoked just /before/ |ev_run| starts
>> to gather new events, and all |ev_check| watchers are queued (not
>> invoked) just after |ev_run| has gathered them, but before it
>> queues any callbacks for any received events. That means
>> |ev_prepare| watchers are the last watchers invoked before the
>> event loop sleeps or polls for new events, and |ev_check| watchers
>> will be invoked before any other watchers of the same or lower
>> priority within an event loop iteration.
> I am not so sure. The documentation is hard to read. From what I understand,
> this is how it works: ev_prepare is invoked, then events are collected, then
> ev_check is invoked, then the callbacks are called.

Yes, it works this way. Aren't we speaking of the same things?
By 'ev_loop iteration' I mean the time when our fibers perform some 
useful work and
various ev callbacks are called. This means EV_CHECK is called in the 
beginning of
'ev_loop iteration' and EV_PREPARE is called in the end.

> In our case it seems to
> be not important whether we use prepare or check. But on the other hand, it
> is probably not how it works, because in fiber.top check is called in the
> beginning of an iteration, and prepare is called afterwards, and it fiber.top
> returns sane results.
This is so because all the fibers perform their work between ev_check 
and ev_prepare.
>
> Here I see that check is called after prepare. So check is at least 'closer' to
> event loop iteration end, isn't it?:
>
> 	http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_prepare_code_and_code_ev_che
> 	Prepare and check watchers are often (but not always) used in pairs: prepare watchers get invoked before the process blocks and check watchers afterwards.

As I understand it, EV_CHECK is called in  the 'beginning' of an ev_loop 
iteration.
Anyway, looks like may use any of the watchers. I don't think it makes 
any difference.

>
> I also couldn't get any info from libev source code, because it is unreadable
> mostly.
>
> Is there a case, when ev_check won't work, and ev_prepare will? Is
> ev_prepare actually closer to an event loop iteration end, and I missed something?
>
>>> +	/** Configured election timeout in seconds. */
>>>   	double election_timeout;
>>>   };
>>> @@ -786,6 +803,8 @@ relay_raft_msg_push(struct cmsg *base)
>>>   	struct xrow_header row;
>>>   	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>>>   	try {
>>> +		if (msg->req.state == RAFT_STATE_LEADER)
>>> +			relay_restart_recovery(msg->relay);
>>>   		relay_send(msg->relay, &row);
>> Looks like you should first send the message and then restart the recovery.
>> Otherwise the resent data will be ignored once again.
> Yes, true.
>
> ====================
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 4f9bbc0de..d63711600 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -803,9 +803,14 @@ relay_raft_msg_push(struct cmsg *base)
>   	struct xrow_header row;
>   	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>   	try {
> +		/*
> +		 * Send the message before restarting the recovery. Otherwise
> +		 * all the rows would be sent from under a non-leader role and
> +		 * would be ignored again.
> +		 */
> +		relay_send(msg->relay, &row);
>   		if (msg->req.state == RAFT_STATE_LEADER)
>   			relay_restart_recovery(msg->relay);
> -		relay_send(msg->relay, &row);
>   	} catch (Exception *e) {
>   		relay_set_error(msg->relay, e);
>   		fiber_cancel(fiber());

-- 
Serge Petrenko