Tarantool development patches archive
 help / color / mirror / Atom feed
From: Serge Petrenko <sergepetrenko@tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	tarantool-patches@dev.tarantool.org, gorcunov@gmail.com
Subject: Re: [Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine
Date: Wed, 9 Sep 2020 11:04:54 +0300	[thread overview]
Message-ID: <7e5fee9d-d032-de31-e1eb-970ffeb5ce37@tarantool.org> (raw)
In-Reply-To: <90367ede-7d37-4ed4-82b7-86a4bcd507de@tarantool.org>


08.09.2020 01:57, Vladislav Shpilevoy пишет:
> Thanks for the review!


Hi! Thanks for your reply!

>
>>>   src/box/applier.cc |  18 +-
>>>   src/box/box.cc     |   7 +-
>>>   src/box/raft.c     | 711 +++++++++++++++++++++++++++++++++++++++++++--
>>>   src/box/raft.h     | 118 ++++++++
>>>   src/box/relay.cc   |  19 ++
>>>   5 files changed, 847 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>>> index 5459a1dc1..c7c486ee4 100644
>>> --- a/src/box/applier.cc
>>> +++ b/src/box/applier.cc
>>> @@ -916,8 +916,21 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
>>>    * Return 0 for success or -1 in case of an error.
>>>    */
>>>   static int
>>> -applier_apply_tx(struct stailq *rows)
>>> +applier_apply_tx(struct applier *applier, struct stailq *rows)
>>>   {
>>> +	/*
>>> +	 * Rows received not directly from a leader are ignored. That is a
>>> +	 * protection against the case when an old leader keeps sending data
>>> +	 * around not knowing yet that it is not a leader anymore.
>>> +	 *
>>> +	 * XXX: it may be that this can be fine to apply leader transactions by
>>> +	 * looking at their replica_id field if it is equal to leader id. That
>>> +	 * can be investigated as an 'optimization'. Even though may not give
>>> +	 * anything, because won't change total number of rows sent in the
>>> +	 * network anyway.
>>> +	 */
>>> +	if (!raft_is_source_allowed(applier->instance_id))
>>> +		return 0;
>> Have we abandoned Rule 66?
> Yes, it was officially deprecated.
>
>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>> index db64cf933..111a9c16e 100644
>>> --- a/src/box/raft.h
>>> +++ b/src/box/raft.h
>>> @@ -31,34 +31,141 @@
>>>   struct raft {
>>> +	/** Instance ID of leader of the current term. */
>>>   	uint32_t leader;
>>> +	/** State of the instance. */
>>>   	enum raft_state state;
>>> +	/**
>>> +	 * Volatile part of the Raft state, whose WAL write may be still
>>> +	 * in-progress, and yet the state may be already used. Volatile state is
>>> +	 * never sent to anywhere, but the state machine makes decisions based
>>> +	 * on it. That is vital.
>>> +	 * As an example, volatile vote needs to be used to reject votes inside
>>> +	 * a term, where the instance already voted (even if the vote WAL write
>>> +	 * is not finished yet). Otherwise the instance would try to write
>>> +	 * several votes inside one term.
>>> +	 */
>>> +	uint64_t volatile_term;
>>> +	uint32_t volatile_vote;
>>> +	/**
>>> +	 * Flag whether Raft is enabled. When disabled, it still persists terms
>>> +	 * so as to quickly enroll into the cluster when (if) it is enabled. In
>>> +	 * everything else disabled Raft does not affect instance work.
>>> +	 */
>>>   	bool is_enabled;
>>> +	/**
>>> +	 * Flag whether the node can become a leader. It is an accumulated value
>>> +	 * of configuration options Raft enabled and Raft candidate. If at least
>>> +	 * one is false - the instance is not a candidate.
>>> +	 */
>>>   	bool is_candidate;
>>> +	/** Flag whether the instance is allowed to be a leader. */
>>> +	bool is_cfg_candidate;
>>> +	/**
>>> +	 * Flag whether Raft currently tries to write something into WAL. It
>>> +	 * happens asynchronously, not right after Raft state is updated.
>>> +	 */
>>> +	bool is_write_in_progress;
>>> +	/**
>>> +	 * Persisted Raft state. These values are used when need to tell current
>>> +	 * Raft state to other nodes.
>>> +	 */
>>>   	uint64_t term;
>>>   	uint32_t vote;
>>> +	/**
>>> +	 * Bit 1 on position N means that a vote from instance with ID = N was
>>> +	 * obtained.
>>> +	 */
>>> +	uint32_t vote_mask;
>>> +	/** Number of votes for this instance. Valid only in candidate state. */
>>> +	int vote_count;
>>> +	/** State machine timed event trigger. */
>>> +	struct ev_timer timer;
>>> +	/**
>>> +	 * Dump of Raft state in the end of event loop, when it is changed.
>>> +	 */
>>> +	struct ev_check io;
>> You probably need the ev_prepare watcher.
>>
>> Here's what the libev doc  says:
>>
>> All |ev_prepare| watchers are invoked just /before/ |ev_run| starts
>> to gather new events, and all |ev_check| watchers are queued (not
>> invoked) just after |ev_run| has gathered them, but before it
>> queues any callbacks for any received events. That means
>> |ev_prepare| watchers are the last watchers invoked before the
>> event loop sleeps or polls for new events, and |ev_check| watchers
>> will be invoked before any other watchers of the same or lower
>> priority within an event loop iteration.
> I am not so sure. The documentation is hard to read. From what I understand,
> this is how it works: ev_prepare is invoked, then events are collected, then
> ev_check is invoked, then the callbacks are called.

Yes, it works this way. Aren't we speaking of the same things?
By 'ev_loop iteration' I mean the time when our fibers perform some 
useful work and
various ev callbacks are called. This means EV_CHECK is called in the 
beginning of
'ev_loop iteration' and EV_PREPARE is called in the end.

> In our case it seems to
> be not important whether we use prepare or check. But on the other hand, it
> is probably not how it works, because in fiber.top check is called in the
> beginning of an iteration, and prepare is called afterwards, and it fiber.top
> returns sane results.
This is so because all the fibers perform their work between ev_check 
and ev_prepare.
>
> Here I see that check is called after prepare. So check is at least 'closer' to
> event loop iteration end, isn't it?:
>
> 	http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#code_ev_prepare_code_and_code_ev_che
> 	Prepare and check watchers are often (but not always) used in pairs: prepare watchers get invoked before the process blocks and check watchers afterwards.

As I understand it, EV_CHECK is called in  the 'beginning' of an ev_loop 
iteration.
Anyway, looks like may use any of the watchers. I don't think it makes 
any difference.

>
> I also couldn't get any info from libev source code, because it is unreadable
> mostly.
>
> Is there a case, when ev_check won't work, and ev_prepare will? Is
> ev_prepare actually closer to an event loop iteration end, and I missed something?
>
>>> +	/** Configured election timeout in seconds. */
>>>   	double election_timeout;
>>>   };
>>> @@ -786,6 +803,8 @@ relay_raft_msg_push(struct cmsg *base)
>>>   	struct xrow_header row;
>>>   	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>>>   	try {
>>> +		if (msg->req.state == RAFT_STATE_LEADER)
>>> +			relay_restart_recovery(msg->relay);
>>>   		relay_send(msg->relay, &row);
>> Looks like you should first send the message and then restart the recovery.
>> Otherwise the resent data will be ignored once again.
> Yes, true.
>
> ====================
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 4f9bbc0de..d63711600 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -803,9 +803,14 @@ relay_raft_msg_push(struct cmsg *base)
>   	struct xrow_header row;
>   	xrow_encode_raft(&row, &fiber()->gc, &msg->req);
>   	try {
> +		/*
> +		 * Send the message before restarting the recovery. Otherwise
> +		 * all the rows would be sent from under a non-leader role and
> +		 * would be ignored again.
> +		 */
> +		relay_send(msg->relay, &row);
>   		if (msg->req.state == RAFT_STATE_LEADER)
>   			relay_restart_recovery(msg->relay);
> -		relay_send(msg->relay, &row);
>   	} catch (Exception *e) {
>   		relay_set_error(msg->relay, e);
>   		fiber_cancel(fiber());

-- 
Serge Petrenko

      reply	other threads:[~2020-09-09  8:04 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-03 22:51 [Tarantool-patches] [PATCH v2 00/10] dRaft Vladislav Shpilevoy
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 01/10] applier: store instance_id in struct applier Vladislav Shpilevoy
2020-09-04  8:13   ` Serge Petrenko
2020-09-07 22:54     ` Vladislav Shpilevoy
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 10/10] raft: introduce box.info.raft Vladislav Shpilevoy
2020-09-04 11:38   ` Serge Petrenko
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 02/10] box: introduce summary RO flag Vladislav Shpilevoy
2020-09-04  8:17   ` Serge Petrenko
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 03/10] wal: don't touch box.cfg.wal_dir more than once Vladislav Shpilevoy
2020-09-04  8:20   ` Serge Petrenko
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 04/10] replication: track registered replica count Vladislav Shpilevoy
2020-09-04  8:24   ` Serge Petrenko
2020-09-07 22:54     ` Vladislav Shpilevoy
2020-09-07 15:45   ` Sergey Ostanevich
2020-09-07 22:54     ` Vladislav Shpilevoy
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 05/10] raft: introduce persistent raft state Vladislav Shpilevoy
2020-09-04  8:59   ` Serge Petrenko
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 06/10] raft: introduce box.cfg.raft_* options Vladislav Shpilevoy
2020-09-04  9:07   ` Serge Petrenko
2020-09-07 22:55     ` Vladislav Shpilevoy
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 07/10] raft: relay status updates to followers Vladislav Shpilevoy
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 08/10] [tosquash] raft: pass source instance_id to raft_process_msg() Vladislav Shpilevoy
2020-09-04  9:22   ` Serge Petrenko
2020-09-03 22:51 ` [Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine Vladislav Shpilevoy
2020-09-04 11:36   ` Serge Petrenko
2020-09-07 22:57     ` Vladislav Shpilevoy
2020-09-09  8:04       ` Serge Petrenko [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e5fee9d-d032-de31-e1eb-970ffeb5ce37@tarantool.org \
    --to=sergepetrenko@tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=tarantool-patches@dev.tarantool.org \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH v2 09/10] raft: introduce state machine' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox