Tarantool development patches archive
From: Serge Petrenko via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>, gorcunov@gmail.com
Cc: tarantool-patches@dev.tarantool.org
Subject: Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
Date: Tue, 5 Oct 2021 16:09:40 +0300	[thread overview]
Message-ID: <c49aa8ae-14cf-1d02-e53f-2d9798893839@tarantool.org> (raw)
In-Reply-To: <a7916487-3540-162f-8f1f-40de94869de3@tarantool.org>



On 05.10.2021 00:04, Vladislav Shpilevoy wrote:
> Hi! Thanks for working on this!
>
>>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>>> index 219ffa38d..89cda5599 100644
>>>> --- a/src/box/box.cc
>>>> +++ b/src/box/box.cc
>>>> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>>>>                applier_delete(appliers[i]); /* doesn't affect diag */
>>>>        });
>>>> -    replicaset_connect(appliers, count, connect_quorum);
>>>> +    bool connect_quorum = strict;
>>>> +    bool keep_connect = !strict;
>>>> +    replicaset_connect(appliers, count, connect_quorum, keep_connect);
>>> 1. How about passing both these parameters explicitly to box_sync_replication?
>>> I don't understand the link between them that lets them be merged into one.
>>>
>>> It seems the only case when you need to drop the old connections is when
>>> you turn anon to normal. Why should they be fully reset otherwise?
>> Yes, it's true. anon to normal is the only place where existing
>> connections should be reset.
>>
>> For both bootstrap and local recovery (the first-ever box.cfg),
>> keep_connect doesn't make sense at all, because there are no previous
>> connections to keep.
>>
>> So the only two (out of five) box_sync_replication() calls that need
>> keep_connect are replication reconfiguration (keep_connect = true) and
>> anon replica reconfiguration (keep_connect = false).
>>
>> Speaking of the relation between keep_connect and connect_quorum:
>> We don't care about keep_connect in 3 calls (bootstrap and recovery),
>> and when keep_connect is important, it's equal to !connect_quorum.
>> I thought it might be nice to replace them with a single parameter.
>>
>> I tried to pass both parameters to box_sync_replication() at first.
>> This looked rather ugly IMO:
>> box_sync_replication(true, false), box_sync_replication(false, true);
>> Two boolean parameters which are responsible for God knows what are
>> worse than one parameter.
>>
>> I'm not 100% happy with my solution, but it at least hides the second
>> parameter. And IMO box_sync_replication(strict) is rather easy to
>> understand: when strict = true, you want to connect to quorum, and
>> you want to reset the connections. And vice versa when strict = false.
> This can be resolved with a couple of wrappers, like in this diff:
>
> ====================
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 89cda5599..c1216172d 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
>    * don't start appliers.
>    */
>   static void
> -box_sync_replication(bool strict)
> +box_sync_replication(bool do_quorum, bool do_reuse)
>   {
>   	int count = 0;
>   	struct applier **appliers = cfg_get_replication(&count);
> @@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
>   		for (int i = 0; i < count; i++)
>   			applier_delete(appliers[i]); /* doesn't affect diag */
>   	});
> -
> -	bool connect_quorum = strict;
> -	bool keep_connect = !strict;
> -	replicaset_connect(appliers, count, connect_quorum, keep_connect);
> +	replicaset_connect(appliers, count, do_quorum, do_reuse);
>   
>   	guard.is_active = false;
>   }
>   
> +static inline void
> +box_reset_replication(void)
> +{
> +	const bool do_quorum = true;
> +	const bool do_reuse = false;
> +	box_sync_replication(do_quorum, do_reuse);
> +}
> +
> +static inline void
> +box_update_replication(void)
> +{
> +	const bool do_quorum = false;
> +	const bool do_reuse = true;
> +	box_sync_replication(do_quorum, do_reuse);
> +}
> +
>   void
>   box_set_replication(void)
>   {
> @@ -1286,7 +1299,7 @@ box_set_replication(void)
>   	 * Stay in orphan mode in case we fail to connect to at least
>   	 * 'replication_connect_quorum' remote instances.
>   	 */
> -	box_sync_replication(false);
> +	box_update_replication();
>   	/* Follow replica */
>   	replicaset_follow();
>   	/* Wait until appliers are in sync */
> @@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
>   		 * them can register and others resend a
>   		 * non-anonymous subscribe.
>   		 */
> -		box_sync_replication(true);
> +		box_reset_replication();
>   		/*
>   		 * Wait until the master has registered this
>   		 * instance.
> @@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
>   	 * with connecting to 'replication_connect_quorum' masters.
>   	 * If this also fails, throw an error.
>   	 */
> -	box_sync_replication(true);
> +	box_update_replication();
>   
>   	struct replica *master = replicaset_find_join_master();
>   	assert(master == NULL || master->applier != NULL);
> @@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
>   	if (wal_dir_lock >= 0) {
>   		if (box_listen() != 0)
>   			diag_raise();
> -		box_sync_replication(false);
> +		box_update_replication();
>   
>   		struct replica *master;
>   		if (replicaset_needs_rejoin(&master)) {
> @@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
>   		vclock_copy(&replicaset.vclock, &recovery->vclock);
>   		if (box_listen() != 0)
>   			diag_raise();
> -		box_sync_replication(false);
> +		box_update_replication();
>   	}
>   	stream_guard.is_active = false;
>   	recovery_finalize(recovery);
> ====================
>
> Feel free to discard it if you don't like it. I am fine with the current
> solution too.
>
> Now that I've sent this diff, I realize box_restart_replication()
> would be a better name than reset. Up to you as well.

Your version looks better, thanks!
Applied it, with box_reset_replication() renamed to box_restart_replication().
Also replaced box_update_replication() with box_restart_replication() in
bootstrap().

>
>> diff --git a/test/instance_files/base_instance.lua b/test/instance_files/base_instance.lua
>> index 45bdbc7e8..e579c3843 100755
>> --- a/test/instance_files/base_instance.lua
>> +++ b/test/instance_files/base_instance.lua
>> @@ -5,7 +5,8 @@ local listen = os.getenv('TARANTOOL_LISTEN')
>>   box.cfg({
>>       work_dir = workdir,
>>   --     listen = 'localhost:3310'
>> -    listen = listen
>> +    listen = listen,
>> +    log = workdir..'/tarantool.log',
> Do you really need it in this patch?

Yep, I need it for grep_log.
Looks like luatest doesn't set log to anything by default.
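
For illustration, here's roughly the kind of check setting the log file
enables (a minimal sketch, not the actual test: the server setup is
elided, and the group name and the grepped message are my assumptions):

    local t = require('luatest')
    local g = t.group('gh-4669-example')

    g.test_no_reconnect_on_reconfig = function()
        -- g.replica is assumed to be a luatest Server started from
        -- base_instance.lua. grep_log() searches the instance's log
        -- file, so box.cfg must set `log` for it to find anything.
        -- After a replication reconfiguration the applier should not
        -- have logged a reconnect attempt.
        t.assert_not(g.replica:grep_log('will retry every'))
    end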

>
> Other than that LGTM. You can send the next version to the next
> reviewer. I suppose it can be Yan now.

Here's the full diff:

==================================

diff --git a/src/box/box.cc b/src/box/box.cc
index 89cda5599..cc4ada47e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
   * don't start appliers.
   */
  static void
-box_sync_replication(bool strict)
+box_sync_replication(bool do_quorum, bool do_reuse)
  {
      int count = 0;
      struct applier **appliers = cfg_get_replication(&count);
@@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
          for (int i = 0; i < count; i++)
              applier_delete(appliers[i]); /* doesn't affect diag */
      });
-
-    bool connect_quorum = strict;
-    bool keep_connect = !strict;
-    replicaset_connect(appliers, count, connect_quorum, keep_connect);
+    replicaset_connect(appliers, count, do_quorum, do_reuse);

      guard.is_active = false;
  }

+static inline void
+box_restart_replication(void)
+{
+    const bool do_quorum = true;
+    const bool do_reuse = false;
+    box_sync_replication(do_quorum, do_reuse);
+}
+
+static inline void
+box_update_replication(void)
+{
+    const bool do_quorum = false;
+    const bool do_reuse = true;
+    box_sync_replication(do_quorum, do_reuse);
+}
+
  void
  box_set_replication(void)
  {
@@ -1286,7 +1299,7 @@ box_set_replication(void)
       * Stay in orphan mode in case we fail to connect to at least
       * 'replication_connect_quorum' remote instances.
       */
-    box_sync_replication(false);
+    box_update_replication();
      /* Follow replica */
      replicaset_follow();
      /* Wait until appliers are in sync */
@@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
           * them can register and others resend a
           * non-anonymous subscribe.
           */
-        box_sync_replication(true);
+        box_restart_replication();
          /*
           * Wait until the master has registered this
           * instance.
@@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
       * with connecting to 'replication_connect_quorum' masters.
       * If this also fails, throw an error.
       */
-    box_sync_replication(true);
+    box_restart_replication();

      struct replica *master = replicaset_find_join_master();
      assert(master == NULL || master->applier != NULL);
@@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
      if (wal_dir_lock >= 0) {
          if (box_listen() != 0)
              diag_raise();
-        box_sync_replication(false);
+        box_update_replication();

          struct replica *master;
          if (replicaset_needs_rejoin(&master)) {
@@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
          vclock_copy(&replicaset.vclock, &recovery->vclock);
          if (box_listen() != 0)
              diag_raise();
-        box_sync_replication(false);
+        box_update_replication();
      }
      stream_guard.is_active = false;
      recovery_finalize(recovery);
diff --git a/test/replication-luatest/gh-4669-applier-reconnect_test.lua b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
similarity index 100%
rename from test/replication-luatest/gh-4669-applier-reconnect_test.lua
rename to test/replication-luatest/gh_4669_applier_reconnect_test.lua


==================================

-- 
Serge Petrenko


Thread overview: 8+ messages
2021-09-30 19:17 [Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change Serge Petrenko via Tarantool-patches
2021-09-30 19:17 ` [Tarantool-patches] [PATCH 1/2] replication: make anon replica connect to quorum upon reconfiguration Serge Petrenko via Tarantool-patches
2021-09-30 19:17 ` [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect " Serge Petrenko via Tarantool-patches
2021-09-30 22:15   ` Vladislav Shpilevoy via Tarantool-patches
2021-10-01 11:31     ` Serge Petrenko via Tarantool-patches
2021-10-04 21:04       ` Vladislav Shpilevoy via Tarantool-patches
2021-10-05 13:09         ` Serge Petrenko via Tarantool-patches [this message]
2021-10-06 21:59           ` Vladislav Shpilevoy via Tarantool-patches
