[tarantool-patches] Re: [PATCH 6/6] swim: introduce suspicion

Wed Apr 24 23:28:55 MSK 2019

On 24/04/2019 20:01, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [19/04/24 18:50]:
>> +	/**
>> +	 * Number of attempts to reach out a member via another
>> +	 * members when it did not answer on a regular ping .
>> +	 */
>> +	INDIRECT_PING_COUNT = 2,
> 
> As far as I can see it's not the number of attempts, it's the
> number of cluster members involved as proxies for sending an
> indirect ping. The fact that you attempt an indirect ping only
> once before switching member state to dead is secondary.
> 
> This has to do with lack of a description in changeset or code
> comment of indirect pings strategy. 

Sorry, you are right. They are not 'attempts' strictly
speaking. New comment:

==================================================================
 	 */
 	NO_ACKS_TO_GC = 2,
 	/**
-	 * Number of attempts to reach out a member via another
-	 * members when it did not answer on a regular ping .
+	 * Number of pings sent indirectly to a member via other
+	 * members when it did not answer on a regular ping. The
+	 * messages are sent in parallel and via different
+	 * members.
 	 */
 	INDIRECT_PING_COUNT = 2,
 };
==================================================================

> 
>> +/**
>> + * Put the member into a list of ACK waiters. @a hop_count says
>> + * how many hops from one member to another the ACK is expected to
>> + * do.
>> + */
>>  static void
>> -swim_wait_ack(struct swim *swim, struct swim_member *member)
>> +swim_wait_ack(struct swim *swim, struct swim_member *member, int hop_count)
>>  {
>>  	if (heap_node_is_stray(&member->in_wait_ack_heap)) {
>> -		member->ping_deadline = swim_time() + swim->wait_ack_tick.at;
>> +		double timeout = swim->wait_ack_tick.at * hop_count;
> 
> Is hop_count ever greater than two? If not, I would not use int
> hop_count, but use bool is_indirect_ping, and add a comment
> explaining that in case of indirect pings the number of hops is
> exactly two, so we need to increase the timeout.

Ok, I do not mind.

==================================================================

diff --git a/src/lib/swim/swim.c b/src/lib/swim/swim.c
index af9b59cbb..24229bf0c 100644
--- a/src/lib/swim/swim.c
+++ b/src/lib/swim/swim.c
@@ -493,10 +493,18 @@ swim_cached_round_msg_invalidate(struct swim *swim)
  * do.
  */
 static void
-swim_wait_ack(struct swim *swim, struct swim_member *member, int hop_count)
+swim_wait_ack(struct swim *swim, struct swim_member *member,
+	      bool was_ping_indirect)
 {
 	if (heap_node_is_stray(&member->in_wait_ack_heap)) {
-		double timeout = swim->wait_ack_tick.at * hop_count;
+		double timeout = swim->wait_ack_tick.at;
+		/*
+		 * Direct ping is two hops: PING + ACK.
+		 * Indirect ping is four hops: PING, FORWARD PING,
+		 * ACK, FORWARD ACK. This is why x2 for indirects.
+		 */
+		if (was_ping_indirect)
+			timeout *= 2;
 		member->ping_deadline = swim_time() + timeout;
 		wait_ack_heap_insert(&swim->wait_ack_heap, member);
 		swim_ev_timer_start(loop(), &swim->wait_ack_tick);
@@ -627,7 +635,7 @@ swim_ping_task_complete(struct swim_task *task,
 	struct swim *swim = swim_by_scheduler(scheduler);
 	struct swim_member *m = container_of(task, struct swim_member,
 					     ping_task);
-	swim_wait_ack(swim, m, 1);
+	swim_wait_ack(swim, m, false);
 }
 
 /** Free member's resources. */
@@ -1082,7 +1090,7 @@ swim_complete_step(struct swim_task *task,
 			 * dissemination and failure detection
 			 * sections.
 			 */
-			swim_wait_ack(swim, m, 1);
+			swim_wait_ack(swim, m, false);
 			swim_decrease_event_ttd(swim);
 		}
 	}
@@ -1239,7 +1247,7 @@ swim_iping_task_complete(struct swim_task *base_task,
 		struct swim_member *m =
 			swim_find_member(swim, &b->dst_uuid);
 		if (m != NULL)
-			swim_wait_ack(swim, m, 2);
+			swim_wait_ack(swim, m, true);
 	}
 	swim_iping_task_delete(t);
 }
==================================================================

> 
>> +/**
>> + * When indirect pings are sent, each is represented by this
>> + * structure.
>> + */
>> +struct swim_iping_task {
>> +	/** Base task used by the scheduler. */
>> +	struct swim_task base;
>> +	/**
>> +	 * Reference to a block of other indirect ping tasks sent
>> +	 * at the same moment. Used to decide whether it is needed
>> +	 * to start waiting for an ACK, and on which member.
>> +	 */
>> +	struct swim_iping_block *block;
> 
> I don't understand this comment, why do you need a reference?
> 
> I would not use it at all, seems like we have a separate data
> structure when existing structures would do just fine.
> 

Talking of block - we have decided to get rid of it. To
understand why, I should explain why was it needed.

1) To wait for an ACK I need a struct swim_member to put it
into ack waiter queue. But in task_complete() callback I don't
know the member nor its UUID. struct swim_task is too encapsulated
inside the transport level. Iping block stores the UUID of sender.

2) It is possible, that a first indirect ACK was sent and we got
a result faster than managed to send a next indirect ACK. In such a
case we will wait for the second ACK, even having a response on
the first one.


We've decided that block can be removed if UUID is added to
struct swim_task, and if we mark ACK and ping with timestamps.

But I found another solution for ACK/ping timing problem.

When a next indirect ping is sent, lets watch at
swim_member status. If it is not suspected anymore, it
means, that an ACK had already been received while this ping
was waiting for EV_WRITE.

I will not paste here diff of this change since it is too big.
I would rather send V2 of this patchset.

>>  	switch (def.type) {
>>  	case SWIM_FD_MSG_PING:
>> -		if (! swim_task_is_scheduled(&member->ack_task))
>> +		if (proxy != NULL) {
>> +			if (swim_send_indirect_ack(swim, &member->addr,
>> +						   proxy) != 0)
>> +				diag_log();
>> +		} else if (! swim_task_is_scheduled(&member->ack_task)) {
>>  			swim_send_ack(swim, &member->ack_task, &member->addr);
>> +		}
>>  		break;
> 
> Why do you ever need to distinguish direct and indirect acks? Why not
> respond to the sender address, while preserving the routing table
> from the ping request, assuming the sender can figure out whether
> the response is addressed to him directly or it should proxy it
> further based on the received proxy section of the response?

Such a long sentence, do not know how to split it in parts. Ok.

Firstly, I do not distinguish indirect ACKs when they are received
by the pinger. This is thanks to the transport level full isolation.
I do not even have a special request type like 'PING_REQUEST' or
something, like classical implementation does. In theory, I could even
send usual anti-entropy and dissemination messages indirectly.

Secondly, to preserve routing table I need to distinguish direct
from indirect ACKs on sender side, because routing table is generated
only when proxy address is set. I can't just say
'swim_send_ack(swim, member)' and let the transport level decide
whether it should use proxy or not.

The transport level doesn't have an address-table to choose proxies,
as well as it does not know, that this message is a response to another
message, which in turn was received via proxy. Only SWIM core has this
information. So it should give the transport level a hint via
'swim_task_proxy()' method.

swim_send_ack and swim_send_indirect_ack are just wrappers around
swim_send_fd_msg(), passing different proxy values. The former sets
proxy NULL, the latter not NULL. That's all.

> 
> -- 
> Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
> http://tarantool.io - www.twitter.com/kostja_osipov
>