[tarantool-patches] Re: [PATCH 1/1] swim: introduce SWIM's anti-entropy component

Thu Mar 7 15:33:50 MSK 2019


On 07/03/2019 13:22, Konstantin Osipov wrote:
> * Vladislav Shpilevoy <v.shpilevoy at tarantool.org> [19/03/07 11:19]:
> 
>> diff --git a/cmake/profile.cmake b/cmake/profile.cmake
>> index 0ba31fa2c..70fa74694 100644
>> --- a/cmake/profile.cmake
>> +++ b/cmake/profile.cmake
>> @@ -1,5 +1,5 @@
>>   set(CMAKE_REQUIRED_FLAGS "-fprofile-arcs -ftest-coverage")
>> -check_library_exists("" __gcov_flush "" HAVE_GCOV)
>> +set(HAVE_GCOV 1)
>>   set(CMAKE_REQUIRED_FLAGS "")
> 
> Stray change, please remove.

It is removed. You ignored my first follow-up email. Again.

> 
>> +	/**
>> +	 * Swim instance is its own iterator. Such iterators are
>> +	 * obviously unstable and fragile, but it is good enough
>> +	 * for a fast non-yielding scan of members table. Each new
>> +	 * public API iterator iterates through member table using
>> +	 * this variable as a storage of the current position.
>> +	 */
> 
> This comment is unclear, Since you rejected my previous edition of
> this comment, please check with Vladimir Davydov and write a
> better comment than that.

Your previous edition was just wrong. It was not unclear. Talking
of Vova, I cite you: 'вова смотрит то, что важно вове'. Since
obviously Vova does not care about that subsystem, I do not see any
sense to ask him to review SWIM or any part of it again.

Nonetheless, I just a bit tired to argue with you about each
comment, so I replaced it with your version (wrong).

@@ -276,11 +276,13 @@ struct swim {
  	 */
  	struct swim_scheduler scheduler;
  	/**
-	 * Swim instance is its own iterator. Such iterators are
-	 * obviously unstable and fragile, but it is good enough
-	 * for a fast non-yielding scan of members table. Each new
-	 * public API iterator iterates through member table using
-	 * this variable as a storage of the current position.
+	 * An offset of this instance in the hash table. Is used
+	 * to iterate in the hash table starting from this swim
+	 * instance. Such iteration is unstable between yields
+	 * (i.e. member positions may change when the table is
+	 * resized after an incoming event), but still is useful
+	 * for a fast non-yielding scan of the member table
+	 * starting from this instance.
  	 */
  	mh_int_t iterator;

> 
>> +/**
>> + * Encode anti-entropy header and random members data as many as
>> + * possible to the end of the packet.
>> + * @retval 0 Not error, but nothing is encoded.
>> + * @retval 1 Something is encoded.
>> + */
> 
> This comment is confusing. You actually return the number of
> members added to messagepack map. You can add 0 or 1 member.
> Please rewrite.

I do not see anything confusing. You just said the same what is
written here. But as you wish.

@@ -464,8 +464,7 @@ swim_new_round(struct swim *swim)
  /**
   * Encode anti-entropy header and random members data as many as
   * possible to the end of the packet.
- * @retval 0 Not error, but nothing is encoded.
- * @retval 1 Something is encoded.
+ * @retval Number of key-values added to the packet's root map.
   */
  static int
  swim_encode_anti_entropy(struct swim *swim, struct swim_packet *packet)

> 
>> +static int
>> +swim_encode_anti_entropy(struct swim *swim, struct swim_packet *packet)
> 
> 
>> +	struct swim_member_bin member_bin;
>> +	int size = sizeof(ae_header_bin);
>> +	char *header = swim_packet_reserve(packet, size);
>> +	if (header == NULL)
>> +		return 0;
> 
> I don't understand why this error can ever be ignored.

Because it is not an error, I repeat it again, third time.
Any swim_packet_reserve(), swim_packet_allow() can not return
an error - they do not allocate anything, they just propagate
some pointers. Anti-entropy is not a mandatory component, and it
is ok, if it is not encoded because of too big dissemination
component.

> The result of
> this check does not depend on environment: each time you invoke
> this function your header data won't fit in the packet. This

And nevertheless it fits. I do not understand what do you mean.
Header data fits, and the protocol works. It is proven by tests.
I reserve space for header data before encoding all the components.

> violates the basic assumptions you make in the protocol, i.e. that
> you can fit at least one member in the packet. Please either

Where did you see such an assumption? Anti-entropy is an **optional**
component. If the dissemination component is too big, then it is ok
to omit anti-entropy.

> explain how you can ignore such a basic problem as when you first
> ever header assert/panic.
> 
>> +	char *pos = header;
>> +	swim_member_bin_create(&member_bin);
>> +	struct mh_swim_table_t *t = swim->members;
>> +	int i = 0, member_count = mh_size(t);
>> +	int rnd = swim_scaled_rand(0, member_count - 1);
>> +	for (mh_int_t rc = mh_swim_table_random(t, rnd), end = mh_end(t);
>> +	     i < member_count; ++i) {
>> +		struct swim_member *m = *mh_swim_table_node(t, rc);
>> +		int new_size = size + sizeof(member_bin);
>> +		if (swim_packet_reserve(packet, new_size) == NULL)
>> +			break;
> 
> Same here. You should assert/panic that you can fit at least one member
> in the datagram.  Otherwise the protocol is unusable, and this
> error condition does not depend on environment - it's a
> programming bug.

Read my answers above. Anti-entropy is not a linchpin of the protocol,
and it is ok to do not encode it. Talking of assertions, I added them
to swim_encode_round_msg() which checks that the packet now always has
at least two sections: SRC_UUID and ANTI_ENTROPY. It is not so in the
next patches though and changes later.

> 
>> +/**
>> + * Encode source UUID.
>> + * @retval 0 Not error, but nothing is encoded.
>> + * @retval 1 Something is encoded.
>> + */
>> +static inline int
>> +swim_encode_src_uuid(struct swim *swim, struct swim_packet *packet)
>> +{
> 
> This comment is also confusing. Take a look at the usage:
> Please update the comment.

As you wish.

@@ -511,8 +510,7 @@ swim_encode_anti_entropy(struct swim *swim, struct swim_packet *packet)
  
  /**
   * Encode source UUID.
- * @retval 0 Not error, but nothing is encoded.
- * @retval 1 Something is encoded.
+ * @retval Number of key-values added to the packet's root map.
   */
  static inline int
  swim_encode_src_uuid(struct swim *swim, struct swim_packet *packet)

> 
>> +	int map_size = 0;
>> +	map_size += swim_encode_src_uuid(swim, packet);
>> +	map_size += swim_encode_anti_entropy(swim, packet);
> 
>> +
>> +/**
>> + * Update or create a member by its definition, received from a
> 
> Update or create is "replace", or  upsert in Tarantool vocabulary.

Update or create are English words not reserved by Tarantool DML/DDL.

> Update does nothing if no member exists.

This is why I wrote 'or create'.

> The choice of the name is also confusing since you chose to return
> NULL on error.
> Please rename.

I presented that name to you earlier and it was ok for you. But
as you wish, I renamed it to swim_upsert_member, if we are talking
in "Tarantool Vocabulary".

@@ -657,7 +657,7 @@ swim_update_member_addr(struct swim *swim, struct swim_member *member,
   * @retval New member, or updated old member.
   */
  static struct swim_member *
-swim_update_member(struct swim *swim, const struct swim_member_def *def)
+swim_upsert_member(struct swim *swim, const struct swim_member_def *def)
  {
  	struct swim_member *member = swim_find_member(swim, &def->uuid);
  	if (member == NULL) {
@@ -683,7 +683,7 @@ swim_process_anti_entropy(struct swim *swim, const char **pos, const char *end)
  		struct swim_member_def def;
  		if (swim_member_def_decode(&def, pos, end, prefix) != 0)
  			return -1;
-		if (swim_update_member(swim, &def) == NULL) {
+		if (swim_upsert_member(swim, &def) == NULL) {
  			/*
  			 * Not a critical error. Other members
  			 * still can be updated.

> 
>> +		if (swim_update_member(swim, &def) == NULL) {
>> +			/*
>> +			 * Not a critical error. Other members
>> +			 * still can be updated.
>> +			 */
>> +			diag_log();
> 
> You're not using the return value.  I assume it will be used in a
> subsequent patch. Otherwise please return {0,-1}

It is not used in a subsequent patch, and here I do not return -1
because I want to process other members. Fail of one member update
should not reject the whole packet.

> 
>> +		}
>> +	}
>> +	return 0;
>> +	if (size == 0) {
>> +		diag_set(SwimError, "%s body can not be empty", prefix);
>> +		goto error;
>> +	}
> 
> Are you sure you want to use a single exception for both protocol
> and transport errors?

Yes, I am sure.

> Why would you ever have a  protocol error
> in production?

It is impossible if the code was tested enough. And it wonders
me even more why should I care about a special type of error
for SWIM protocol violations.

> It suggests that protocol errors, generally, should
> be treated differently from transport errors. Nothing to be done
> in this patch about it, to be addressed in a subsequent patch.
> 
>> +void
>> +swim_delete(struct swim *swim)
>> +{
>> +	swim_scheduler_destroy(&swim->scheduler);
>> +	swim_ev_timer_stop(loop(), &swim->round_tick);
>> +	swim_task_destroy(&swim->round_step_task);
>> +	mh_int_t node = mh_first(swim->members);
>> +	while (node != mh_end(swim->members)) {
>> +		struct swim_member *m =
>> +			*mh_swim_table_node(swim->members, node);
>> +		swim_delete_member(swim, m);
>> +		node = mh_first(swim->members);
>> +	}
> 
> This is O(n^2) from the number of members, since the hash table is
> not generally shrinking after you delete a member from it. Please
> make it linear: go over all mhash members, delete them with
> swim_member_delete(), then delete the mhash itself.

I still do not understand how do you select 'hot paths'. This place
is called maybe 1-2 times per instance lifetime. And you propose to
pollute it with broken encapsulation of removing the members directly
bypassing swim instance. I have some assertions in swim_member_delete
that it is already removed from the swim instance, and you force me
to crutch that assertions here. But as you wish, again.

@@ -932,12 +932,11 @@ swim_delete(struct swim *swim)
  	swim_scheduler_destroy(&swim->scheduler);
  	swim_ev_timer_stop(loop(), &swim->round_tick);
  	swim_task_destroy(&swim->round_step_task);
-	mh_int_t node = mh_first(swim->members);
-	while (node != mh_end(swim->members)) {\
+	mh_int_t node;
+	mh_foreach(swim->members, node) {
  		struct swim_member *m =
  			*mh_swim_table_node(swim->members, node);
-		swim_delete_member(swim, m);
-		node = mh_first(swim->members);
+		rlist_del_entry(m, in_round_queue);
+		swim_member_delete(m);
  	}
  	mh_swim_table_delete(swim->members);
  	free(swim->shuffled);

> 
>> +struct swim_iterator *
>> +swim_iterator_open(struct swim *swim)
>> +{
>> +	assert(swim_is_configured(swim));
>> +	swim->iterator = mh_first(swim->members);
>> +	return (struct swim_iterator *) swim;
>> +}
> 
> What are you trying to save on by fusing the iterator in? A

The same question about swim_delete() - what are you trying to
save in swim_delete() and why?

> separate data structure?

I am trying to save memory allocation for the iterator, because
it is likely to be called often. Much more often than swim_delete.

> What's wrong with a separate iterator
> object pointing copied by-value? This code is more complicated
> than necessary. To be addressed in a separate patch.

It is smaller than with additional special iterators instead of
one simple number mh_int_t. I do not think that it makes any
sense to create standalone and stable iterators until an explicit
request.

> 
>> +static inline int
>> +swim_decode_ip(struct sockaddr_in *address, const char **pos, const char *end,
>> +	       const char *prefix, const char *param_name)
>> +{
>> +	uint64_t ip;
>> +	if (swim_decode_uint(pos, end, &ip, prefix, param_name) != 0)
>> +		return -1;
>> +	if (ip > UINT32_MAX) {
>> +		diag_set(SwimError, "%s %s is an invalid IP address", prefix,
>> +			 param_name);
>> +		return -1;
>> +	}
>> +	address->sin_addr.s_addr = ip;
> 
> I believe you should convert host order to network order when
> assigning. Please add htons.

I send port and ip in network order and do not need htons here, but as
you wish.

@@ -92,7 +92,7 @@ swim_decode_ip(struct sockaddr_in *address, const char **pos, const char *end,
  			 param_name);
  		return -1;
  	}
-	address->sin_addr.s_addr = ip;
+	address->sin_addr.s_addr = htonl(ip);
  	return 0;
  }
  
@@ -108,7 +108,7 @@ swim_decode_port(struct sockaddr_in *address, const char **pos, const char *end,
  			 param_name);
  		return -1;
  	}
-	address->sin_port = port;
+	address->sin_port = htons(port);
  	return 0;
  }
  
@@ -244,8 +244,8 @@ swim_member_bin_fill(struct swim_member_bin *header,
  		     enum swim_member_status status)
  {
  	header->v_status = status;
-	header->v_addr = mp_bswap_u32(addr->sin_addr.s_addr);
-	header->v_port = mp_bswap_u16(addr->sin_port);
+	header->v_addr = mp_bswap_u32(ntohl(addr->sin_addr.s_addr));
+	header->v_port = mp_bswap_u16(ntohs(addr->sin_port));
  	memcpy(header->v_uuid, uuid, UUID_LEN);
  }
  
@@ -273,10 +273,10 @@ swim_meta_header_bin_create(struct swim_meta_header_bin *header,
  	header->v_version = mp_bswap_u32(tarantool_version_id());
  	header->k_addr = SWIM_META_SRC_ADDRESS;
  	header->m_addr = 0xce;
-	header->v_addr = mp_bswap_u32(src->sin_addr.s_addr);
+	header->v_addr = mp_bswap_u32(ntohl(src->sin_addr.s_addr));
  	header->k_port = SWIM_META_SRC_PORT;
  	header->m_port = 0xcd;
-	header->v_port = mp_bswap_u16(src->sin_port);
+	header->v_port = mp_bswap_u16(ntohs(src->sin_port));
  }

> 
>> +static int
>> +main_f(va_list ap)
>> +{
>> +	swim_start_test(5);
>> +
>> +	(void) ap;
>> +	struct ev_loop *loop = loop();
>> +	swim_test_ev_init();
>> +	swim_test_transport_init();
>> +
>> +	swim_test_one_link();
>> +	swim_test_sequence();
>> +	swim_test_uuid_update();
>> +	swim_test_cfg();
>> +	swim_test_add_remove();
>> +
>> +	swim_test_transport_free();
>> +	swim_test_ev_free();
>> +
>> +	test_result = check_plan();
>> +	footer();
>> +	return 0;
>> +}
> 
> Please also add to the test plan (in a separate patch):
> 
> - hitting the boundary of udp datagram (i.e. case when
>    swim_packet_reserve() returns error)
> - packet loss in the network - 5%, 10%, 20%, 50%, 90%
> - packet loss from a single swim member
> - one or few members "lagging" behind the rest.
> - speed at which a detected failure spreads through the network
> - a "flaky" member which is on and off
> 
> I understand anti-entropy itself does not exhibit too much
> testable properties, so this test plan is good enough for the
> first patch.

This test plan is partially covered now. But other examples
require a notion of 'dead' member. The fist commit introduces
only 'alive' members. In this commit, on anti-entropy level
you do not know what it is 'on', 'off', 'down', 'up', 'loss level'.
It appears in the failure detection component.

I added your plan to my backlog.

> 
> Please fix the above comments and push.

Since I did not fix some of the above comments, I am
waiting for your response before push.

> 
> -- 
> Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
> http://tarantool.io - www.twitter.com/kostja_osipov
>