[Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change

Tarantool development patches archive
 help / color / mirror / Atom feed

* [Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change
@ 2021-09-30 19:17 Serge Petrenko via Tarantool-patches
  2021-09-30 19:17 ` [Tarantool-patches] [PATCH 1/2] replicaiton: make anon replica connect to quorum upon reconfiguration Serge Petrenko via Tarantool-patches
  2021-09-30 19:17 ` [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect " Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 8+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-30 19:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

The patchset is based on branch VitaliyaIoffe/testrun-304-provide-first-luatest,
because I've written a luatest-based test.

https://github.com/tarantool/tarantool/issues/4669
https://github.com/tarantool/tarantool/tree/sp/gh-4669-replication-reconnect

Serge Petrenko (2):
  replicaiton: make anon replica connect to quorum upon reconfiguration
  replication: fix replica disconnect upon reconfiguration

 .../gh-4669-applier-reconfig-reconnect.md     |  5 ++
 src/box/box.cc                                |  8 ++-
 src/box/replication.cc                        | 41 ++++++++----
 src/box/replication.h                         |  4 +-
 test/instance_files/base_instance.lua         |  3 +-
 test/instance_files/replica.lua               | 17 +++++
 test/luatest_helpers/server.lua               | 64 +++++++++++++++++++
 .../gh_4669_applier_reconnect_test.lua        | 42 ++++++++++++
 8 files changed, 167 insertions(+), 17 deletions(-)
 create mode 100644 changelogs/unreleased/gh-4669-applier-reconfig-reconnect.md
 create mode 100755 test/instance_files/replica.lua
 create mode 100644 test/replication-luatest/gh_4669_applier_reconnect_test.lua

-- 
2.30.1 (Apple Git-130)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Tarantool-patches] [PATCH 1/2] replicaiton: make anon replica connect to quorum upon reconfiguration
  2021-09-30 19:17 [Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change Serge Petrenko via Tarantool-patches
@ 2021-09-30 19:17 ` Serge Petrenko via Tarantool-patches
  2021-09-30 19:17 ` [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect " Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 8+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-30 19:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Once an anonymous replica tries to register, it reconnects to every
other instance in its replication list in order to receive updated
ballots and choose someone to register on.

Make the instance wait until it connects to quorum before letting it
choose the node to register on.
---
 src/box/box.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 0b12b1328..219ffa38d 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1404,7 +1404,7 @@ box_set_replication_anon(void)
 		 * them can register and others resend a
 		 * non-anonymous subscribe.
 		 */
-		box_sync_replication(false);
+		box_sync_replication(true);
 		/*
 		 * Wait until the master has registered this
 		 * instance.
-- 
2.30.1 (Apple Git-130)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-09-30 19:17 [Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change Serge Petrenko via Tarantool-patches
  2021-09-30 19:17 ` [Tarantool-patches] [PATCH 1/2] replicaiton: make anon replica connect to quorum upon reconfiguration Serge Petrenko via Tarantool-patches
@ 2021-09-30 19:17 ` Serge Petrenko via Tarantool-patches
  2021-09-30 22:15   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 8+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-30 19:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Replication reconfiguration used to work as follows: upon receiving a
new config disconnect from all the existing masters and try to connect
to all the masters from the new config.

This lead to instances doing extra work when old config and new config
had the same nodes in them: instead of doing nothing, tarantool
reinitialized the connection.

There was another problem: when an existing connection is broken, master
takes some time to notice that. So, when replica resets the connection,
it may try to reconnect faster than the master is able to notice its
absence. In this case replica wouldn't be able to reconnect due to
`duplicate connection with the same replica UUID`. So replication would
stop for a replication_timeout, which may be quite large (seconds or
tens of seconds).

Let's prevent tarantool from reconnecting to the same instance, if there
already is a working connection.

Closes #4669
---
 .../gh-4669-applier-reconfig-reconnect.md     |  5 ++
 src/box/box.cc                                |  6 +-
 src/box/replication.cc                        | 41 ++++++++----
 src/box/replication.h                         |  4 +-
 test/instance_files/base_instance.lua         |  3 +-
 test/instance_files/replica.lua               | 17 +++++
 test/luatest_helpers/server.lua               | 64 +++++++++++++++++++
 .../gh_4669_applier_reconnect_test.lua        | 42 ++++++++++++
 8 files changed, 166 insertions(+), 16 deletions(-)
 create mode 100644 changelogs/unreleased/gh-4669-applier-reconfig-reconnect.md
 create mode 100755 test/instance_files/replica.lua
 create mode 100644 test/replication-luatest/gh_4669_applier_reconnect_test.lua

diff --git a/changelogs/unreleased/gh-4669-applier-reconfig-reconnect.md b/changelogs/unreleased/gh-4669-applier-reconfig-reconnect.md
new file mode 100644
index 000000000..2a776872b
--- /dev/null
+++ b/changelogs/unreleased/gh-4669-applier-reconfig-reconnect.md
@@ -0,0 +1,5 @@
+## bugfix/replication
+
+* Fixed replica reconnecting to a living master on any `box.cfg{replication=...}`
+  change. Such reconnects could lead to replica failing to restore connection
+  for `replication_timeout` seconds (gh-4669).
diff --git a/src/box/box.cc b/src/box/box.cc
index 219ffa38d..89cda5599 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
  * don't start appliers.
  */
 static void
-box_sync_replication(bool connect_quorum)
+box_sync_replication(bool strict)
 {
 	int count = 0;
 	struct applier **appliers = cfg_get_replication(&count);
@@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
 			applier_delete(appliers[i]); /* doesn't affect diag */
 	});
 
-	replicaset_connect(appliers, count, connect_quorum);
+	bool connect_quorum = strict;
+	bool keep_connect = !strict;
+	replicaset_connect(appliers, count, connect_quorum, keep_connect);
 
 	guard.is_active = false;
 }
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1288bc9b1..e5fce6c8c 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -507,7 +507,7 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
  * upon reconfiguration of box.cfg.replication.
  */
 static void
-replicaset_update(struct applier **appliers, int count)
+replicaset_update(struct applier **appliers, int count, bool keep_connect)
 {
 	replica_hash_t uniq;
 	memset(&uniq, 0, sizeof(uniq));
@@ -572,22 +572,39 @@ replicaset_update(struct applier **appliers, int count)
 		applier_stop(applier);
 		applier_delete(applier);
 	}
+
+	replicaset.applier.total = count;
+	replicaset.applier.connected = 0;
+	replicaset.applier.loading = 0;
+	replicaset.applier.synced = 0;
 	replicaset_foreach(replica) {
 		if (replica->applier == NULL)
 			continue;
-		applier = replica->applier;
-		replica_clear_applier(replica);
-		replica->applier_sync_state = APPLIER_DISCONNECTED;
+		struct replica *other = replica_hash_search(&uniq, replica);
+		if (keep_connect && other != NULL &&
+		    (replica->applier->state == APPLIER_FOLLOW ||
+		     replica->applier->state == APPLIER_SYNC)) {
+			/*
+			 * Try not to interrupt working appliers upon
+			 * reconfiguration.
+			 */
+			replicaset.applier.connected++;
+			replicaset.applier.synced++;
+			replica_hash_remove(&uniq, other);
+			applier = other->applier;
+			replica_clear_applier(other);
+			replica_delete(other);
+		} else {
+			applier = replica->applier;
+			replica_clear_applier(replica);
+			replica->applier_sync_state = APPLIER_DISCONNECTED;
+		}
 		applier_stop(applier);
 		applier_delete(applier);
 	}
 
-	/* Save new appliers */
-	replicaset.applier.total = count;
-	replicaset.applier.connected = 0;
-	replicaset.applier.loading = 0;
-	replicaset.applier.synced = 0;
 
+	/* Save new appliers */
 	replica_hash_foreach_safe(&uniq, replica, next) {
 		replica_hash_remove(&uniq, replica);
 
@@ -664,11 +681,11 @@ applier_on_connect_f(struct trigger *trigger, void *event)
 
 void
 replicaset_connect(struct applier **appliers, int count,
-		   bool connect_quorum)
+		   bool connect_quorum, bool keep_connect)
 {
 	if (count == 0) {
 		/* Cleanup the replica set. */
-		replicaset_update(appliers, count);
+		replicaset_update(appliers, count, keep_connect);
 		return;
 	}
 
@@ -772,7 +789,7 @@ replicaset_connect(struct applier **appliers, int count,
 
 	/* Now all the appliers are connected, update the replica set. */
 	try {
-		replicaset_update(appliers, count);
+		replicaset_update(appliers, count, keep_connect);
 	} catch (Exception *e) {
 		goto error;
 	}
diff --git a/src/box/replication.h b/src/box/replication.h
index 4c82cd839..a8fed45e8 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -439,10 +439,12 @@ replicaset_add_anon(const struct tt_uuid *replica_uuid);
  * \param connect_quorum if this flag is set, fail unless at
  *                       least replication_connect_quorum
  *                       appliers have successfully connected.
+ * \param keep_connect   if this flag is set do not force a reconnect if the
+ *                       old connection to the replica is fine.
  */
 void
 replicaset_connect(struct applier **appliers, int count,
-		   bool connect_quorum);
+		   bool connect_quorum, bool keep_connect);
 
 /**
  * Check if the current instance fell too much behind its
diff --git a/test/instance_files/base_instance.lua b/test/instance_files/base_instance.lua
index 45bdbc7e8..e579c3843 100755
--- a/test/instance_files/base_instance.lua
+++ b/test/instance_files/base_instance.lua
@@ -5,7 +5,8 @@ local listen = os.getenv('TARANTOOL_LISTEN')
 box.cfg({
     work_dir = workdir,
 --     listen = 'localhost:3310'
-    listen = listen
+    listen = listen,
+    log = workdir..'/tarantool.log',
 })
 
 box.schema.user.grant('guest', 'read,write,execute,create', 'universe')
diff --git a/test/instance_files/replica.lua b/test/instance_files/replica.lua
new file mode 100755
index 000000000..587345f24
--- /dev/null
+++ b/test/instance_files/replica.lua
@@ -0,0 +1,17 @@
+#!/usr/bin/env tarantool
+local workdir = os.getenv('TARANTOOL_WORKDIR')
+local listen = os.getenv('TARANTOOL_LISTEN')
+local SOCKET_DIR = os.getenv('VARDIR')
+local MASTER = arg[1] or "master"
+
+local function master_uri()
+    return SOCKET_DIR..'/'..MASTER..'.sock'
+end
+
+box.cfg({
+    work_dir = workdir,
+    listen = listen,
+    replication = master_uri(),
+})
+
+_G.ready = true
diff --git a/test/luatest_helpers/server.lua b/test/luatest_helpers/server.lua
index 018ea4f7d..f204dc794 100644
--- a/test/luatest_helpers/server.lua
+++ b/test/luatest_helpers/server.lua
@@ -59,6 +59,70 @@ function Server:initialize()
     getmetatable(getmetatable(self)).initialize(self)
 end
 
+-- A copy of test_run:grep_log.
+function Server:grep_log(what, bytes, opts)
+    local opts = opts or {}
+    local noreset = opts.noreset or false
+    -- if instance has crashed provide filename to use grep_log
+    local filename = opts.filename or self:eval('return box.cfg.log')
+    local file = fio.open(filename, {'O_RDONLY', 'O_NONBLOCK'})
+
+    local function fail(msg)
+        local err = errno.strerror()
+        if file ~= nil then
+            file:close()
+        end
+        error(string.format("%s: %s: %s", msg, filename, err))
+    end
+
+    if file == nil then
+        fail("Failed to open log file")
+    end
+    io.flush() -- attempt to flush stdout == log fd
+    local filesize = file:seek(0, 'SEEK_END')
+    if filesize == nil then
+        fail("Failed to get log file size")
+    end
+    local bytes = bytes or 65536 -- don't read whole log - it can be huge
+    bytes = bytes > filesize and filesize or bytes
+    if file:seek(-bytes, 'SEEK_END') == nil then
+        fail("Failed to seek log file")
+    end
+    local found, buf
+    repeat -- read file in chunks
+        local s = file:read(2048)
+        if s == nil then
+            fail("Failed to read log file")
+        end
+        local pos = 1
+        repeat -- split read string in lines
+            local endpos = string.find(s, '\n', pos)
+            endpos = endpos and endpos - 1 -- strip terminating \n
+            local line = string.sub(s, pos, endpos)
+            if endpos == nil and s ~= '' then
+                -- line doesn't end with \n or eof, append it to buffer
+                -- to be checked on next iteration
+                buf = buf or {}
+                table.insert(buf, line)
+            else
+                if buf ~= nil then -- prepend line with buffered data
+                    table.insert(buf, line)
+                    line = table.concat(buf)
+                    buf = nil
+                end
+                if string.match(line, "Starting instance") and not noreset then
+                    found = nil -- server was restarted, reset search
+                else
+                    found = string.match(line, what) or found
+                end
+            end
+            pos = endpos and endpos + 2 -- jump to char after \n
+        until pos == nil
+    until s == ''
+    file:close()
+    return found
+end
+
 --- Generates environment to run process with.
 -- The result is merged into os.environ().
 -- @return map
diff --git a/test/replication-luatest/gh_4669_applier_reconnect_test.lua b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
new file mode 100644
index 000000000..62adff716
--- /dev/null
+++ b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
@@ -0,0 +1,42 @@
+local t = require('luatest')
+local fio = require('fio')
+local Server = t.Server
+local Cluster = require('test.luatest_helpers.cluster')
+
+local g = t.group('gh-4669-applier-reconnect')
+
+local function check_follow_master(server)
+    return t.assert_equals(
+        server:eval('return box.info.replication[1].upstream.status'), 'follow')
+end
+
+g.before_each(function()
+    g.cluster = Cluster:new({})
+    g.master = g.cluster:build_server(
+        {}, {alias = 'master'}, 'base_instance.lua')
+    g.replica = g.cluster:build_server(
+        {args={'master'}}, {alias = 'replica'}, 'replica.lua')
+
+    g.cluster:join_server(g.master)
+    g.cluster:join_server(g.replica)
+    g.cluster:start()
+    check_follow_master(g.replica)
+end)
+
+g.after_each(function()
+    g.cluster:stop()
+end)
+
+-- Test that appliers aren't recreated upon replication reconfiguration.
+g.test_applier_connection_on_reconfig = function(g)
+    g.replica:eval(
+        'box.cfg{'..
+            'replication = {'..
+                'os.getenv("TARANTOOL_LISTEN"),'..
+                'box.cfg.replication[1],'..
+            '}'..
+        '}'
+    )
+    check_follow_master(g.replica)
+    t.assert_equals(g.master:grep_log("exiting the relay loop"), nil)
+end
-- 
2.30.1 (Apple Git-130)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-09-30 19:17 ` [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect " Serge Petrenko via Tarantool-patches
@ 2021-09-30 22:15   ` Vladislav Shpilevoy via Tarantool-patches
  2021-10-01 11:31     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-30 22:15 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Hi! Thanks for the patch!

See 6 comments below.

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 219ffa38d..89cda5599 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>  			applier_delete(appliers[i]); /* doesn't affect diag */
>  	});
>  
> -	replicaset_connect(appliers, count, connect_quorum);
> +	bool connect_quorum = strict;
> +	bool keep_connect = !strict;
> +	replicaset_connect(appliers, count, connect_quorum, keep_connect);

1. How about passing both these parameters explicitly to box_sync_replication?
I don't understand the link between them so that they could be one.

It seems the only case when you need to drop the old connections is when
you turn anon to normal. Why should they be fully reset otherwise?

> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 1288bc9b1..e5fce6c8c 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -664,11 +681,11 @@ applier_on_connect_f(struct trigger *trigger, void *event)
>  
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   bool connect_quorum)
> +		   bool connect_quorum, bool keep_connect)
>  {
>  	if (count == 0) {
>  		/* Cleanup the replica set. */
> -		replicaset_update(appliers, count);
> +		replicaset_update(appliers, count, keep_connect);

2. In case of count 0 it means all the appliers must be terminated,
mustn't they? So you could pass always false here. Up to you.

>  		return;
>  	}

3. A few lines below I see that all the appliers are started via
applier_start and gather connections. Even if you have them already.
Wasn't the point of this patchset not to create connections when you
already have them? You could find matches by URI even before you
try to create a connection.

Otherwise when I do this:

	box.cfg{replication = {3313, 3314}}
	box.cfg{replication = {3313, 3314, 3315}}

you create 3 connections in the second box.cfg. While you
could create just 1.

> diff --git a/src/box/replication.h b/src/box/replication.h
> index 4c82cd839..a8fed45e8 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -439,10 +439,12 @@ replicaset_add_anon(const struct tt_uuid *replica_uuid);
>   * \param connect_quorum if this flag is set, fail unless at
>   *                       least replication_connect_quorum
>   *                       appliers have successfully connected.
> + * \param keep_connect   if this flag is set do not force a reconnect if the
> + *                       old connection to the replica is fine.

4. Why do you need to touch it even if the replica is not fine?
Shouldn't it reconnect automatically anyway? You could maybe force
it to reconnect right now if you don't want to wait until its
reconnect timeout expires.

>   */
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   bool connect_quorum);
> +		   bool connect_quorum, bool keep_connect);
>  
> diff --git a/test/replication-luatest/gh_4669_applier_reconnect_test.lua b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
> new file mode 100644
> index 000000000..62adff716
> --- /dev/null
> +++ b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
> @@ -0,0 +1,42 @@
> +local t = require('luatest')
> +local fio = require('fio')
> +local Server = t.Server
> +local Cluster = require('test.luatest_helpers.cluster')

5. Are we using first capital letters for variable names now? Maybe
stick to the guidelines and use lower case letters?

> +
> +local g = t.group('gh-4669-applier-reconnect')
> +
> +local function check_follow_master(server)
> +    return t.assert_equals(
> +        server:eval('return box.info.replication[1].upstream.status'), 'follow')
> +end
> +
> +g.before_each(function()
> +    g.cluster = Cluster:new({})
> +    g.master = g.cluster:build_server(
> +        {}, {alias = 'master'}, 'base_instance.lua')
> +    g.replica = g.cluster:build_server(
> +        {args={'master'}}, {alias = 'replica'}, 'replica.lua')
> +
> +    g.cluster:join_server(g.master)
> +    g.cluster:join_server(g.replica)
> +    g.cluster:start()
> +    check_follow_master(g.replica)
> +end)
> +
> +g.after_each(function()
> +    g.cluster:stop()
> +end)
> +
> +-- Test that appliers aren't recreated upon replication reconfiguration.
> +g.test_applier_connection_on_reconfig = function(g)
> +    g.replica:eval(
> +        'box.cfg{'..
> +            'replication = {'..
> +                'os.getenv("TARANTOOL_LISTEN"),'..
> +                'box.cfg.replication[1],'..
> +            '}'..
> +        '}'
> +    )

6. Are we really supposed to write Lua code as Lua strings now?
You could use here [[ ... ]] btw, but the question still remains.

Not a comment for your patch. But it looks unusable. What if the
test would be a bit more complicated? Look at qsync tests for instance.
Imagine writing all of them as Lua strings with quotes and all.

Especially if you want to pass into the strings some parameters not
known in advance.

Could you maybe make it on top of master as a normal .test.lua test?

> +    check_follow_master(g.replica)
> +    t.assert_equals(g.master:grep_log("exiting the relay loop"), nil)
> +end
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-09-30 22:15   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-10-01 11:31     ` Serge Petrenko via Tarantool-patches
  2021-10-04 21:04       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-01 11:31 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



01.10.2021 01:15, Vladislav Shpilevoy пишет:
> Hi! Thanks for the patch!
>
> See 6 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 219ffa38d..89cda5599 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>>   			applier_delete(appliers[i]); /* doesn't affect diag */
>>   	});
>>   
>> -	replicaset_connect(appliers, count, connect_quorum);
>> +	bool connect_quorum = strict;
>> +	bool keep_connect = !strict;
>> +	replicaset_connect(appliers, count, connect_quorum, keep_connect);
> 1. How about passing both these parameters explicitly to box_sync_replication?
> I don't understand the link between them so that they could be one.
>
> It seems the only case when you need to drop the old connections is when
> you turn anon to normal. Why should they be fully reset otherwise?

Yes, it's true. anon to normal is the only place where existing
connections should be reset.

For both bootstrap and local recovery (first ever box.cfg) keep_connect
doesn't make sense at all, because there are no previous connections to
keep.

So the only two (out of 5) box_sync_replication calls, that need
keep_connect are replication reconfiguration (keep_connect = true) and
anon replica reconfiguration (keep_connect = false).

Speaking of the relation between keep_connect and connect_quorum:
We don't care about keep_connect in 3 calls (bootstrap and recovery),
and when keep_connect is important, it's equal to !connect_quorum.
I thought it might be nice to replace them with a single parameter.

I tried to pass both parameters to box_sync_repication() at first.
This looked rather ugly IMO:
box_sync_replication(true, false), box_sync_replication(false, true);
Two boolean parameters which are responsible for God knows what are
worse than one parameter.

I'm not 100% happy with my solution, but it at least hides the second
parameter. And IMO box_sync_replication(strict) is rather easy to
understand: when strict = true, you want to connect to quorum, and
you want to reset the connections. And vice versa when strict = false.

>
>> diff --git a/src/box/replication.cc b/src/box/replication.cc
>> index 1288bc9b1..e5fce6c8c 100644
>> --- a/src/box/replication.cc
>> +++ b/src/box/replication.cc
>> @@ -664,11 +681,11 @@ applier_on_connect_f(struct trigger *trigger, void *event)
>>   
>>   void
>>   replicaset_connect(struct applier **appliers, int count,
>> -		   bool connect_quorum)
>> +		   bool connect_quorum, bool keep_connect)
>>   {
>>   	if (count == 0) {
>>   		/* Cleanup the replica set. */
>> -		replicaset_update(appliers, count);
>> +		replicaset_update(appliers, count, keep_connect);
> 2. In case of count 0 it means all the appliers must be terminated,
> mustn't they? So you could pass always false here. Up to you.

Ok, let's change that. I'll change "count" to 0 as well.
It has always bothered me here.


=================================================

diff --git a/src/box/replication.cc b/src/box/replication.cc
index e5fce6c8c..10b4ac915 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -685,7 +685,7 @@ replicaset_connect(struct applier **appliers, int count,
  {
         if (count == 0) {
                 /* Cleanup the replica set. */
-               replicaset_update(appliers, count, keep_connect);
+               replicaset_update(appliers, 0, false);
                 return;
         }

=================================================
>
>>   		return;
>>   	}
> 3. A few lines below I see that all the appliers are started via
> applier_start and gather connections. Even if you have them already.
> Wasn't the point of this patchset not to create connections when you
> already have them? You could find matches by URI even before you
> try to create a connection.

It's true, but I decided it'd be simpler to find matches by replica uuid.
It gives us a bonus: you won't drop an existing connection to a replica
even when uri has changed. Say, "localhost:3303" and "127.0.0.1:3303".

The point of the patchset is "don't restart existing connections if they 
are ok".
Because if you restart them old relay doesn't exit in time and replica 
receives a
"duplicate connection with same replica UUID".

Here's how the patch works:
1. Get a list of appliers (one per each box.cfg.replication entry)
2. each applier is connected
3. new appliers receive master uuids
4. Find matches between new and old appliers, and remove duplicate
    new appliers. (only when the old applier is functional)

Here's how I thought I'd implement it at first:
1. you get a list of appliers
2. You check the new applier list against existing appliers
3. When you find matches, new appliers are removed.
4. Everything else works as before.

The problem with the second approach is replicaset_update().
It always replaces all of the existing appliers with the new ones.

How do we keep the old (matching) appliers then? Add them to new applier
list?

TBH I just decided that my approach would be simpler than this one.
So there might be no problem at all.

>
> Otherwise when I do this:
>
> 	box.cfg{replication = {3313, 3314}}
> 	box.cfg{replication = {3313, 3314, 3315}}
>
> you create 3 connections in the second box.cfg. While you
> could create just 1.



>
>> diff --git a/src/box/replication.h b/src/box/replication.h
>> index 4c82cd839..a8fed45e8 100644
>> --- a/src/box/replication.h
>> +++ b/src/box/replication.h
>> @@ -439,10 +439,12 @@ replicaset_add_anon(const struct tt_uuid *replica_uuid);
>>    * \param connect_quorum if this flag is set, fail unless at
>>    *                       least replication_connect_quorum
>>    *                       appliers have successfully connected.
>> + * \param keep_connect   if this flag is set do not force a reconnect if the
>> + *                       old connection to the replica is fine.
> 4. Why do you need to touch it even if the replica is not fine?
> Shouldn't it reconnect automatically anyway? You could maybe force
> it to reconnect right now if you don't want to wait until its
> reconnect timeout expires.

I don't understand this comment. Is it clear now after my other answers?

P.S. I think I understand now. What if replication with the
replica is permanently broken (by, say, ER_INVALID_MSGPACK)?
I guess another `box.cfg{replication=...}` call should revive it.

And yes, when replication_timeout is huge, and applier is waiting to
retry for some reason, you might want to speed things up by another
box.cfg{replication=...} call.

>
>>    */
>>   void
>>   replicaset_connect(struct applier **appliers, int count,
>> -		   bool connect_quorum);
>> +		   bool connect_quorum, bool keep_connect);
>>   
>> diff --git a/test/replication-luatest/gh_4669_applier_reconnect_test.lua b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
>> new file mode 100644
>> index 000000000..62adff716
>> --- /dev/null
>> +++ b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
>> @@ -0,0 +1,42 @@
>> +local t = require('luatest')
>> +local fio = require('fio')
>> +local Server = t.Server
>> +local Cluster = require('test.luatest_helpers.cluster')
> 5. Are we using first capital letters for variable names now? Maybe
> stick to the guidelines and use lower case letters?

I tried to stick to other luatest tests' style
(replication-luatest on Vitaliya's branch).

Sure, let's change that.

>
>> +
>> +local g = t.group('gh-4669-applier-reconnect')
>> +
>> +local function check_follow_master(server)
>> +    return t.assert_equals(
>> +        server:eval('return box.info.replication[1].upstream.status'), 'follow')
>> +end
>> +
>> +g.before_each(function()
>> +    g.cluster = Cluster:new({})
>> +    g.master = g.cluster:build_server(
>> +        {}, {alias = 'master'}, 'base_instance.lua')
>> +    g.replica = g.cluster:build_server(
>> +        {args={'master'}}, {alias = 'replica'}, 'replica.lua')
>> +
>> +    g.cluster:join_server(g.master)
>> +    g.cluster:join_server(g.replica)
>> +    g.cluster:start()
>> +    check_follow_master(g.replica)
>> +end)
>> +
>> +g.after_each(function()
>> +    g.cluster:stop()
>> +end)
>> +
>> +-- Test that appliers aren't recreated upon replication reconfiguration.
>> +g.test_applier_connection_on_reconfig = function(g)
>> +    g.replica:eval(
>> +        'box.cfg{'..
>> +            'replication = {'..
>> +                'os.getenv("TARANTOOL_LISTEN"),'..
>> +                'box.cfg.replication[1],'..
>> +            '}'..
>> +        '}'
>> +    )
> 6. Are we really supposed to write Lua code as Lua strings now?
> You could use here [[ ... ]] btw, but the question still remains.
>
> Not a comment for your patch. But it looks unusable. What if the
> test would be a bit more complicated? Look at qsync tests for instance.
> Imagine writing all of them as Lua strings with quotes and all.

>
> Especially if you want to pass into the strings some parameters not
> known in advance.

I agree we need some way to access server's console directly.
I don't know a good solution for this with luatest though.

>
> Could you maybe make it on top of master as a normal .test.lua test?

I think not. Both Sergos and KirillY asked me to make this a luatest-test.

Besides, this particular test looks ok in luatest. There's a single eval 
there.

>
>> +    check_follow_master(g.replica)
>> +    t.assert_equals(g.master:grep_log("exiting the relay loop"), nil)
>> +end
>>

Please, check out the diff.
(I've renamed the test to `gh-4669-applier-reconnect_test.lua`.
I tried `gh-4669-applier-reconnect.test.lua` first, but couldn't
make luatest understand *.test.lua test names. I've asked Vitaliya
to fix this)


=======================================

diff --git a/src/box/replication.cc b/src/box/replication.cc
index e5fce6c8c..10b4ac915 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -685,7 +685,7 @@ replicaset_connect(struct applier **appliers, int count,
  {
         if (count == 0) {
                 /* Cleanup the replica set. */
-               replicaset_update(appliers, count, keep_connect);
+               replicaset_update(appliers, 0, false);
                 return;
         }

diff --git a/test/replication-luatest/gh_4669_applier_reconnect_test.lua 
b/test/replication-luatest/gh-4669-applier-reconnect_test.lua
similarity index 74%
rename from test/replication-luatest/gh_4669_applier_reconnect_test.lua
rename to test/replication-luatest/gh-4669-applier-reconnect_test.lua
index 62adff716..a4a138714 100644
--- a/test/replication-luatest/gh_4669_applier_reconnect_test.lua
+++ b/test/replication-luatest/gh-4669-applier-reconnect_test.lua
@@ -1,7 +1,7 @@
  local t = require('luatest')
  local fio = require('fio')
-local Server = t.Server
-local Cluster = require('test.luatest_helpers.cluster')
+local server = t.Server
+local cluster = require('test.luatest_helpers.cluster')

  local g = t.group('gh-4669-applier-reconnect')

@@ -11,7 +11,7 @@ local function check_follow_master(server)
  end

  g.before_each(function()
-    g.cluster = Cluster:new({})
+    g.cluster = cluster:new({})
      g.master = g.cluster:build_server(
          {}, {alias = 'master'}, 'base_instance.lua')
      g.replica = g.cluster:build_server(
@@ -29,14 +29,14 @@ end)

  -- Test that appliers aren't recreated upon replication reconfiguration.
  g.test_applier_connection_on_reconfig = function(g)
-    g.replica:eval(
-        'box.cfg{'..
-            'replication = {'..
-                'os.getenv("TARANTOOL_LISTEN"),'..
-                'box.cfg.replication[1],'..
-            '}'..
-        '}'
-    )
+    g.replica:eval([[
+        box.cfg{
+            replication = {
+                os.getenv("TARANTOOL_LISTEN"),
+                box.cfg.replication[1],
+            }
+        }
+    ]])
      check_follow_master(g.replica)
      t.assert_equals(g.master:grep_log("exiting the relay loop"), nil)
  end


=======================================

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-10-01 11:31     ` Serge Petrenko via Tarantool-patches
@ 2021-10-04 21:04       ` Vladislav Shpilevoy via Tarantool-patches
  2021-10-05 13:09         ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-10-04 21:04 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Hi! Thanks for working on this!

>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>> index 219ffa38d..89cda5599 100644
>>> --- a/src/box/box.cc
>>> +++ b/src/box/box.cc> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>>>               applier_delete(appliers[i]); /* doesn't affect diag */
>>>       });
>>>   -    replicaset_connect(appliers, count, connect_quorum);
>>> +    bool connect_quorum = strict;
>>> +    bool keep_connect = !strict;
>>> +    replicaset_connect(appliers, count, connect_quorum, keep_connect);
>> 1. How about passing both these parameters explicitly to box_sync_replication?
>> I don't understand the link between them so that they could be one.
>>
>> It seems the only case when you need to drop the old connections is when
>> you turn anon to normal. Why should they be fully reset otherwise?
> 
> Yes, it's true. anon to normal is the only place where existing
> connections should be reset.
> 
> For both bootstrap and local recovery (first ever box.cfg) keep_connect
> doesn't make sense at all, because there are no previous connections to
> keep.
> 
> So the only two (out of 5) box_sync_replication calls, that need
> keep_connect are replication reconfiguration (keep_connect = true) and
> anon replica reconfiguration (keep_connect = false).
> 
> Speaking of the relation between keep_connect and connect_quorum:
> We don't care about keep_connect in 3 calls (bootstrap and recovery),
> and when keep_connect is important, it's equal to !connect_quorum.
> I thought it might be nice to replace them with a single parameter.
> 
> I tried to pass both parameters to box_sync_repication() at first.
> This looked rather ugly IMO:
> box_sync_replication(true, false), box_sync_replication(false, true);
> Two boolean parameters which are responsible for God knows what are
> worse than one parameter.
> 
> I'm not 100% happy with my solution, but it at least hides the second
> parameter. And IMO box_sync_replication(strict) is rather easy to
> understand: when strict = true, you want to connect to quorum, and
> you want to reset the connections. And vice versa when strict = false.

This can be resolved with a couple of wrappers, like in this diff:

====================
diff --git a/src/box/box.cc b/src/box/box.cc
index 89cda5599..c1216172d 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
  * don't start appliers.
  */
 static void
-box_sync_replication(bool strict)
+box_sync_replication(bool do_quorum, bool do_reuse)
 {
 	int count = 0;
 	struct applier **appliers = cfg_get_replication(&count);
@@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
 		for (int i = 0; i < count; i++)
 			applier_delete(appliers[i]); /* doesn't affect diag */
 	});
-
-	bool connect_quorum = strict;
-	bool keep_connect = !strict;
-	replicaset_connect(appliers, count, connect_quorum, keep_connect);
+	replicaset_connect(appliers, count, do_quorum, do_reuse);
 
 	guard.is_active = false;
 }
 
+static inline void
+box_reset_replication(void)
+{
+	const bool do_quorum = true;
+	const bool do_reuse = false;
+	box_sync_replication(do_quorum, do_reuse);
+}
+
+static inline void
+box_update_replication(void)
+{
+	const bool do_quorum = false;
+	const bool do_reuse = true;
+	box_sync_replication(do_quorum, do_reuse);
+}
+
 void
 box_set_replication(void)
 {
@@ -1286,7 +1299,7 @@ box_set_replication(void)
 	 * Stay in orphan mode in case we fail to connect to at least
 	 * 'replication_connect_quorum' remote instances.
 	 */
-	box_sync_replication(false);
+	box_update_replication();
 	/* Follow replica */
 	replicaset_follow();
 	/* Wait until appliers are in sync */
@@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
 		 * them can register and others resend a
 		 * non-anonymous subscribe.
 		 */
-		box_sync_replication(true);
+		box_reset_replication();
 		/*
 		 * Wait until the master has registered this
 		 * instance.
@@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
 	 * with connecting to 'replication_connect_quorum' masters.
 	 * If this also fails, throw an error.
 	 */
-	box_sync_replication(true);
+	box_update_replication();
 
 	struct replica *master = replicaset_find_join_master();
 	assert(master == NULL || master->applier != NULL);
@@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	if (wal_dir_lock >= 0) {
 		if (box_listen() != 0)
 			diag_raise();
-		box_sync_replication(false);
+		box_update_replication();
 
 		struct replica *master;
 		if (replicaset_needs_rejoin(&master)) {
@@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
 		vclock_copy(&replicaset.vclock, &recovery->vclock);
 		if (box_listen() != 0)
 			diag_raise();
-		box_sync_replication(false);
+		box_update_replication();
 	}
 	stream_guard.is_active = false;
 	recovery_finalize(recovery);
====================

Feel free to discard it if don't like. I am fine with the current
solution too.

Now when I sent this diff, I realized box_restart_replication()
would be a better name than reset. Up to you as well.

> diff --git a/test/instance_files/base_instance.lua b/test/instance_files/base_instance.lua
> index 45bdbc7e8..e579c3843 100755
> --- a/test/instance_files/base_instance.lua
> +++ b/test/instance_files/base_instance.lua
> @@ -5,7 +5,8 @@ local listen = os.getenv('TARANTOOL_LISTEN')
>  box.cfg({
>      work_dir = workdir,
>  --     listen = 'localhost:3310'
> -    listen = listen
> +    listen = listen,
> +    log = workdir..'/tarantool.log',

Do you really need it in this patch?

Other than that LGTM. You can send the next version to a next
reviewer. I suppose it can be Yan now.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-10-04 21:04       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-10-05 13:09         ` Serge Petrenko via Tarantool-patches
  2021-10-06 21:59           ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 8+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-10-05 13:09 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



05.10.2021 00:04, Vladislav Shpilevoy пишет:
> Hi! Thanks for working on this!
>
>>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>>> index 219ffa38d..89cda5599 100644
>>>> --- a/src/box/box.cc
>>>> +++ b/src/box/box.cc> @@ -1261,7 +1261,9 @@ box_sync_replication(bool connect_quorum)
>>>>                applier_delete(appliers[i]); /* doesn't affect diag */
>>>>        });
>>>>    -    replicaset_connect(appliers, count, connect_quorum);
>>>> +    bool connect_quorum = strict;
>>>> +    bool keep_connect = !strict;
>>>> +    replicaset_connect(appliers, count, connect_quorum, keep_connect);
>>> 1. How about passing both these parameters explicitly to box_sync_replication?
>>> I don't understand the link between them so that they could be one.
>>>
>>> It seems the only case when you need to drop the old connections is when
>>> you turn anon to normal. Why should they be fully reset otherwise?
>> Yes, it's true. anon to normal is the only place where existing
>> connections should be reset.
>>
>> For both bootstrap and local recovery (first ever box.cfg) keep_connect
>> doesn't make sense at all, because there are no previous connections to
>> keep.
>>
>> So the only two (out of 5) box_sync_replication calls, that need
>> keep_connect are replication reconfiguration (keep_connect = true) and
>> anon replica reconfiguration (keep_connect = false).
>>
>> Speaking of the relation between keep_connect and connect_quorum:
>> We don't care about keep_connect in 3 calls (bootstrap and recovery),
>> and when keep_connect is important, it's equal to !connect_quorum.
>> I thought it might be nice to replace them with a single parameter.
>>
>> I tried to pass both parameters to box_sync_repication() at first.
>> This looked rather ugly IMO:
>> box_sync_replication(true, false), box_sync_replication(false, true);
>> Two boolean parameters which are responsible for God knows what are
>> worse than one parameter.
>>
>> I'm not 100% happy with my solution, but it at least hides the second
>> parameter. And IMO box_sync_replication(strict) is rather easy to
>> understand: when strict = true, you want to connect to quorum, and
>> you want to reset the connections. And vice versa when strict = false.
> This can be resolved with a couple of wrappers, like in this diff:
>
> ====================
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 89cda5599..c1216172d 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
>    * don't start appliers.
>    */
>   static void
> -box_sync_replication(bool strict)
> +box_sync_replication(bool do_quorum, bool do_reuse)
>   {
>   	int count = 0;
>   	struct applier **appliers = cfg_get_replication(&count);
> @@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
>   		for (int i = 0; i < count; i++)
>   			applier_delete(appliers[i]); /* doesn't affect diag */
>   	});
> -
> -	bool connect_quorum = strict;
> -	bool keep_connect = !strict;
> -	replicaset_connect(appliers, count, connect_quorum, keep_connect);
> +	replicaset_connect(appliers, count, do_quorum, do_reuse);
>   
>   	guard.is_active = false;
>   }
>   
> +static inline void
> +box_reset_replication(void)
> +{
> +	const bool do_quorum = true;
> +	const bool do_reuse = false;
> +	box_sync_replication(do_quorum, do_reuse);
> +}
> +
> +static inline void
> +box_update_replication(void)
> +{
> +	const bool do_quorum = false;
> +	const bool do_reuse = true;
> +	box_sync_replication(do_quorum, do_reuse);
> +}
> +
>   void
>   box_set_replication(void)
>   {
> @@ -1286,7 +1299,7 @@ box_set_replication(void)
>   	 * Stay in orphan mode in case we fail to connect to at least
>   	 * 'replication_connect_quorum' remote instances.
>   	 */
> -	box_sync_replication(false);
> +	box_update_replication();
>   	/* Follow replica */
>   	replicaset_follow();
>   	/* Wait until appliers are in sync */
> @@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
>   		 * them can register and others resend a
>   		 * non-anonymous subscribe.
>   		 */
> -		box_sync_replication(true);
> +		box_reset_replication();
>   		/*
>   		 * Wait until the master has registered this
>   		 * instance.
> @@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
>   	 * with connecting to 'replication_connect_quorum' masters.
>   	 * If this also fails, throw an error.
>   	 */
> -	box_sync_replication(true);
> +	box_update_replication();
>   
>   	struct replica *master = replicaset_find_join_master();
>   	assert(master == NULL || master->applier != NULL);
> @@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
>   	if (wal_dir_lock >= 0) {
>   		if (box_listen() != 0)
>   			diag_raise();
> -		box_sync_replication(false);
> +		box_update_replication();
>   
>   		struct replica *master;
>   		if (replicaset_needs_rejoin(&master)) {
> @@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
>   		vclock_copy(&replicaset.vclock, &recovery->vclock);
>   		if (box_listen() != 0)
>   			diag_raise();
> -		box_sync_replication(false);
> +		box_update_replication();
>   	}
>   	stream_guard.is_active = false;
>   	recovery_finalize(recovery);
> ====================
>
> Feel free to discard it if don't like. I am fine with the current
> solution too.
>
> Now when I sent this diff, I realized box_restart_replication()
> would be a better name than reset. Up to you as well.

Your version looks better, thanks!
Applied with renaming box_reset_replication() to box_restart_replication()
Also replaced box_update_replication() with box_restart_replication() in 
bootstrap().

>
>> diff --git a/test/instance_files/base_instance.lua b/test/instance_files/base_instance.lua
>> index 45bdbc7e8..e579c3843 100755
>> --- a/test/instance_files/base_instance.lua
>> +++ b/test/instance_files/base_instance.lua
>> @@ -5,7 +5,8 @@ local listen = os.getenv('TARANTOOL_LISTEN')
>>   box.cfg({
>>       work_dir = workdir,
>>   --     listen = 'localhost:3310'
>> -    listen = listen
>> +    listen = listen,
>> +    log = workdir..'/tarantool.log',
> Do you really need it in this patch?

Yep, I need it for grep_log.
Looks like luatest doesn't set log to anything by default.

>
> Other than that LGTM. You can send the next version to a next
> reviewer. I suppose it can be Yan now.

Here's the full diff:

==================================

diff --git a/src/box/box.cc b/src/box/box.cc
index 89cda5599..cc4ada47e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1249,7 +1249,7 @@ cfg_get_replication(int *p_count)
   * don't start appliers.
   */
  static void
-box_sync_replication(bool strict)
+box_sync_replication(bool do_quorum, bool do_reuse)
  {
      int count = 0;
      struct applier **appliers = cfg_get_replication(&count);
@@ -1260,14 +1260,27 @@ box_sync_replication(bool strict)
          for (int i = 0; i < count; i++)
              applier_delete(appliers[i]); /* doesn't affect diag */
      });
-
-    bool connect_quorum = strict;
-    bool keep_connect = !strict;
-    replicaset_connect(appliers, count, connect_quorum, keep_connect);
+    replicaset_connect(appliers, count, do_quorum, do_reuse);

      guard.is_active = false;
  }

+static inline void
+box_restart_replication(void)
+{
+    const bool do_quorum = true;
+    const bool do_reuse = false;
+    box_sync_replication(do_quorum, do_reuse);
+}
+
+static inline void
+box_update_replication(void)
+{
+    const bool do_quorum = false;
+    const bool do_reuse = true;
+    box_sync_replication(do_quorum, do_reuse);
+}
+
  void
  box_set_replication(void)
  {
@@ -1286,7 +1299,7 @@ box_set_replication(void)
       * Stay in orphan mode in case we fail to connect to at least
       * 'replication_connect_quorum' remote instances.
       */
-    box_sync_replication(false);
+    box_update_replication();
      /* Follow replica */
      replicaset_follow();
      /* Wait until appliers are in sync */
@@ -1406,7 +1419,7 @@ box_set_replication_anon(void)
           * them can register and others resend a
           * non-anonymous subscribe.
           */
-        box_sync_replication(true);
+        box_restart_replication();
          /*
           * Wait until the master has registered this
           * instance.
@@ -3260,7 +3273,7 @@ bootstrap(const struct tt_uuid *instance_uuid,
       * with connecting to 'replication_connect_quorum' masters.
       * If this also fails, throw an error.
       */
-    box_sync_replication(true);
+    box_restart_replication();

      struct replica *master = replicaset_find_join_master();
      assert(master == NULL || master->applier != NULL);
@@ -3337,7 +3350,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
      if (wal_dir_lock >= 0) {
          if (box_listen() != 0)
              diag_raise();
-        box_sync_replication(false);
+        box_update_replication();

          struct replica *master;
          if (replicaset_needs_rejoin(&master)) {
@@ -3416,7 +3429,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
          vclock_copy(&replicaset.vclock, &recovery->vclock);
          if (box_listen() != 0)
              diag_raise();
-        box_sync_replication(false);
+        box_update_replication();
      }
      stream_guard.is_active = false;
      recovery_finalize(recovery);
diff --git a/test/replication-luatest/gh-4669-applier-reconnect_test.lua 
b/test/replication-luatest/gh_4669_applier_reconnect_test.lua
similarity index 100%
rename from test/replication-luatest/gh-4669-applier-reconnect_test.lua
rename to test/replication-luatest/gh_4669_applier_reconnect_test.lua


==================================

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect upon reconfiguration
  2021-10-05 13:09         ` Serge Petrenko via Tarantool-patches
@ 2021-10-06 21:59           ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 8+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-10-06 21:59 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Hi! Thanks for the patch!

LGTM.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2021-10-06 21:59 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-30 19:17 [Tarantool-patches] [PATCH 0/2] replication: fix reconnect on box.cfg.replication change Serge Petrenko via Tarantool-patches
2021-09-30 19:17 ` [Tarantool-patches] [PATCH 1/2] replicaiton: make anon replica connect to quorum upon reconfiguration Serge Petrenko via Tarantool-patches
2021-09-30 19:17 ` [Tarantool-patches] [PATCH 2/2] replication: fix replica disconnect " Serge Petrenko via Tarantool-patches
2021-09-30 22:15   ` Vladislav Shpilevoy via Tarantool-patches
2021-10-01 11:31     ` Serge Petrenko via Tarantool-patches
2021-10-04 21:04       ` Vladislav Shpilevoy via Tarantool-patches
2021-10-05 13:09         ` Serge Petrenko via Tarantool-patches
2021-10-06 21:59           ` Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox