[tarantool-patches] [PATCH v2] replication: do not ignore replication_connect

Tarantool development patches archive
 help / color / mirror / Atom feed

* [tarantool-patches] [PATCH v2] replication: do not ignore replication_connect_quorum.
@ 2018-08-07 12:44 Serge Petrenko
  2018-08-07 14:18 ` [tarantool-patches] " Georgy Kirichenko
  2018-08-07 17:28 ` [tarantool-patches] " Vladimir Davydov
  0 siblings, 2 replies; 9+ messages in thread
From: Serge Petrenko @ 2018-08-07 12:44 UTC (permalink / raw)
  To: tarantool-patches; +Cc: georgy, Serge Petrenko

On bootstrap and after replication reconfiguration
replication_connect_quorum was ignored. The instance tried to connect to
every replica listed in replication parameter, and errored if it wasn't
possible.
The patch alters this behaviour. The instance still tries to connect to
every node listed in replication, but does not raise an error if it was
able to connect to at least replication_connect_quorum instances.

Closes #3428
---
https://github.com/tarantool/tarantool/issues/3428
https://github.com/tarantool/tarantool/tree/sergepetrenko/gh-3428-replication-connect-quorum

Changes in v2:
  - change test/replication/ddl.lua instance file to fix
    test failure on Travis.

 src/box/box.cc                           | 6 +++---
 src/box/replication.cc                   | 8 +++-----
 src/box/replication.h                    | 7 ++++---
 test/replication-py/init_storage.test.py | 2 +-
 test/replication-py/master.lua           | 2 ++
 test/replication-py/replica.lua          | 2 ++
 test/replication/autobootstrap.lua       | 3 ++-
 test/replication/autobootstrap_guest.lua | 2 +-
 test/replication/ddl.lua                 | 3 ++-
 test/replication/errinj.result           | 6 +++---
 test/replication/errinj.test.lua         | 6 +++---
 test/replication/master.lua              | 1 +
 test/replication/master_quorum.lua       | 3 ++-
 test/replication/on_replace.lua          | 3 ++-
 test/replication/quorum.lua              | 4 ++--
 test/replication/rebootstrap.lua         | 2 +-
 test/replication/replica_no_quorum.lua   | 3 ++-
 test/replication/replica_timeout.lua     | 3 ++-
 test/replication/replica_uuid_ro.lua     | 2 +-
 19 files changed, 39 insertions(+), 29 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index e3eb2738f..f8731f464 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
  * don't start appliers.
  */
 static void
-box_sync_replication(double timeout, bool connect_all)
+box_sync_replication(double timeout, bool reach_quorum)
 {
 	int count = 0;
 	struct applier **appliers = cfg_get_replication(&count);
@@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
 			applier_delete(appliers[i]); /* doesn't affect diag */
 	});
 
-	replicaset_connect(appliers, count, timeout, connect_all);
+	replicaset_connect(appliers, count, timeout, reach_quorum);
 
 	guard.is_active = false;
 }
@@ -1888,7 +1888,7 @@ box_cfg_xc(void)
 		 * receive the same replica set UUID when a new cluster
 		 * is deployed.
 		 */
-		box_sync_replication(TIMEOUT_INFINITY, true);
+		box_sync_replication(replication_connect_timeout, true);
 		/* Bootstrap a new master */
 		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
 	}
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 528fe4459..a6c60220f 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
 struct tt_uuid REPLICASET_UUID;
 
 double replication_timeout = 1.0; /* seconds */
-double replication_connect_timeout = 4.0; /* seconds */
+double replication_connect_timeout = 10.0; /* seconds */
 int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
 double replication_sync_lag = 10.0; /* seconds */
 
@@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void *event)
 
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all)
+		   double timeout, bool reach_quorum)
 {
 	if (count == 0) {
 		/* Cleanup the replica set. */
@@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int count,
 		double wait_start = ev_monotonic_now(loop());
 		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
 			break;
-		if (state.failed > 0 && connect_all)
-			break;
 		timeout -= ev_monotonic_now(loop()) - wait_start;
 	}
 	if (state.connected < count) {
 		say_crit("failed to connect to %d out of %d replicas",
 			 count - state.connected, count);
 		/* Timeout or connection failure. */
-		if (connect_all)
+		if (reach_quorum && state.connected < replication_connect_quorum)
 			goto error;
 	} else {
 		say_verbose("connected to %d replicas", state.connected);
diff --git a/src/box/replication.h b/src/box/replication.h
index 95122eb45..c16c8b56c 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -357,12 +357,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
  * \param appliers the array of appliers
  * \param count size of appliers array
  * \param timeout connection timeout
- * \param connect_all if this flag is set, fail unless all
- *                    appliers have successfully connected
+ * \param reach_quorum if this flag is set, fail unless at
+ *		       least replication_connect_quorum
+ *		       appliers have successfully connected.
  */
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all);
+		   double timeout, bool reach_quorum);
 
 /**
  * Resume all appliers registered with the replica set.
diff --git a/test/replication-py/init_storage.test.py b/test/replication-py/init_storage.test.py
index 0911a02c0..32b4639f1 100644
--- a/test/replication-py/init_storage.test.py
+++ b/test/replication-py/init_storage.test.py
@@ -57,7 +57,7 @@ print '-------------------------------------------------------------'
 
 server.stop()
 replica = TarantoolServer(server.ini)
-replica.script = 'replication/replica.lua'
+replica.script = 'replication-py/replica.lua'
 replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
 replica.rpl_master = master
 replica.deploy(wait=False)
diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
index 0f9f7a6f0..51283efdf 100644
--- a/test/replication-py/master.lua
+++ b/test/replication-py/master.lua
@@ -3,6 +3,8 @@ os = require('os')
 box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
index 278291bba..b9d193b70 100644
--- a/test/replication-py/replica.lua
+++ b/test/replication-py/replica.lua
@@ -7,6 +7,8 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 box_cfg_done = true
diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
index 4f55417ae..8fc6809de 100644
--- a/test/replication/autobootstrap.lua
+++ b/test/replication/autobootstrap.lua
@@ -21,7 +21,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
         USER..':'..PASSWORD..'@'..instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 3.0;
+    replication_timeout = 0.5;
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
index 40fef2c7a..7cd921e3c 100644
--- a/test/replication/autobootstrap_guest.lua
+++ b/test/replication/autobootstrap_guest.lua
@@ -20,7 +20,7 @@ box.cfg({
         instance_uri(2);
         instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 5,
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/ddl.lua b/test/replication/ddl.lua
index 694f40eac..85403e35b 100644
--- a/test/replication/ddl.lua
+++ b/test/replication/ddl.lua
@@ -22,7 +22,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(3);
         USER..':'..PASSWORD..'@'..instance_uri(4);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = 0.1,
+    replication_connect_timeout = 2.0,
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index ca8af2988..19d7d9a05 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -418,7 +418,7 @@ test_run:cmd("create server replica_timeout with rpl_master=default, script='rep
 ---
 - true
 ...
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
 ---
 - true
 ...
@@ -474,7 +474,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ...
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
 ---
 - true
 ...
@@ -522,7 +522,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
 ---
 - true
 ...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 463d89a8f..f00b98eed 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 box.cfg{replication_timeout = 0.01}
 
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -199,7 +199,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
 
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -219,7 +219,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
 test_run:cmd("switch replica_timeout")
 fiber = require('fiber')
 while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
diff --git a/test/replication/master.lua b/test/replication/master.lua
index 6d431aaeb..9b96b7891 100644
--- a/test/replication/master.lua
+++ b/test/replication/master.lua
@@ -4,6 +4,7 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
     replication_connect_timeout = 0.5,
+    replication_timeout = 0.1
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/master_quorum.lua b/test/replication/master_quorum.lua
index fb5f7ec2b..6e0429f65 100644
--- a/test/replication/master_quorum.lua
+++ b/test/replication/master_quorum.lua
@@ -20,7 +20,8 @@ box.cfg({
         instance_uri(2);
     };
     replication_connect_quorum = 0;
-    replication_connect_timeout = 0.1;
+    replication_timeout = 0.5;
+    replication_connect_timeout = 2.0;
 })
 
 test_run = require('test_run').new()
diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
index 03f15d94c..bafead48d 100644
--- a/test/replication/on_replace.lua
+++ b/test/replication/on_replace.lua
@@ -20,7 +20,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(1);
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = 0.5,
+    replication_connect_timeout = 1.0,
 })
 
 env = require('test_run')
diff --git a/test/replication/quorum.lua b/test/replication/quorum.lua
index 9c7bf5c93..7f85d7b13 100644
--- a/test/replication/quorum.lua
+++ b/test/replication/quorum.lua
@@ -15,8 +15,8 @@ require('console').listen(os.getenv('ADMIN'))
 box.cfg({
     listen = instance_uri(INSTANCE_ID);
     replication_timeout = 0.05;
-    replication_sync_lag = 0.01;
-    replication_connect_timeout = 0.1;
+    replication_sync_lag = 0.1;
+    replication_connect_timeout = 3.0;
     replication_connect_quorum = 3;
     replication = {
         instance_uri(1);
diff --git a/test/replication/rebootstrap.lua b/test/replication/rebootstrap.lua
index e743577e4..f1e8d69e9 100644
--- a/test/replication/rebootstrap.lua
+++ b/test/replication/rebootstrap.lua
@@ -15,7 +15,7 @@ box.cfg({
     listen = instance_uri(INSTANCE_ID),
     instance_uuid = '12345678-abcd-1234-abcd-123456789ef' .. INSTANCE_ID,
     replication_timeout = 0.1,
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 2.0,
     replication = {
         instance_uri(1);
         instance_uri(2);
diff --git a/test/replication/replica_no_quorum.lua b/test/replication/replica_no_quorum.lua
index b9edeea94..c30c043cc 100644
--- a/test/replication/replica_no_quorum.lua
+++ b/test/replication/replica_no_quorum.lua
@@ -5,7 +5,8 @@ box.cfg({
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_connect_quorum = 0,
-    replication_connect_timeout = 0.1,
+    replication_timeout = 0.1,
+    replication_connect_timeout = 0.5,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_timeout.lua b/test/replication/replica_timeout.lua
index 64f119763..51c718360 100644
--- a/test/replication/replica_timeout.lua
+++ b/test/replication/replica_timeout.lua
@@ -1,13 +1,14 @@
 #!/usr/bin/env tarantool
 
 local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
 
 box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_timeout = TIMEOUT,
-    replication_connect_timeout = TIMEOUT * 3,
+    replication_connect_timeout = CON_TIMEOUT,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_uuid_ro.lua b/test/replication/replica_uuid_ro.lua
index 8e1c6cc47..ff70da144 100644
--- a/test/replication/replica_uuid_ro.lua
+++ b/test/replication/replica_uuid_ro.lua
@@ -22,7 +22,7 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
     read_only = (INSTANCE_ID ~= '1' and true or false);
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 5,
 })
 
 box.once("bootstrap", function()
-- 
2.15.2 (Apple Git-101.1)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-07 12:44 [tarantool-patches] [PATCH v2] replication: do not ignore replication_connect_quorum Serge Petrenko
@ 2018-08-07 14:18 ` Georgy Kirichenko
  2018-08-07 17:28 ` [tarantool-patches] " Vladimir Davydov
  1 sibling, 0 replies; 9+ messages in thread
From: Georgy Kirichenko @ 2018-08-07 14:18 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Serge Petrenko

[-- Attachment #1: Type: text/plain, Size: 15463 bytes --]

LGTM

On Tuesday, August 7, 2018 3:44:13 PM MSK Serge Petrenko wrote:
> On bootstrap and after replication reconfiguration
> replication_connect_quorum was ignored. The instance tried to connect to
> every replica listed in replication parameter, and errored if it wasn't
> possible.
> The patch alters this behaviour. The instance still tries to connect to
> every node listed in replication, but does not raise an error if it was
> able to connect to at least replication_connect_quorum instances.
> 
> Closes #3428
> ---
> https://github.com/tarantool/tarantool/issues/3428
> https://github.com/tarantool/tarantool/tree/sergepetrenko/gh-3428-replicatio
> n-connect-quorum
> 
> Changes in v2:
>   - change test/replication/ddl.lua instance file to fix
>     test failure on Travis.
> 
>  src/box/box.cc                           | 6 +++---
>  src/box/replication.cc                   | 8 +++-----
>  src/box/replication.h                    | 7 ++++---
>  test/replication-py/init_storage.test.py | 2 +-
>  test/replication-py/master.lua           | 2 ++
>  test/replication-py/replica.lua          | 2 ++
>  test/replication/autobootstrap.lua       | 3 ++-
>  test/replication/autobootstrap_guest.lua | 2 +-
>  test/replication/ddl.lua                 | 3 ++-
>  test/replication/errinj.result           | 6 +++---
>  test/replication/errinj.test.lua         | 6 +++---
>  test/replication/master.lua              | 1 +
>  test/replication/master_quorum.lua       | 3 ++-
>  test/replication/on_replace.lua          | 3 ++-
>  test/replication/quorum.lua              | 4 ++--
>  test/replication/rebootstrap.lua         | 2 +-
>  test/replication/replica_no_quorum.lua   | 3 ++-
>  test/replication/replica_timeout.lua     | 3 ++-
>  test/replication/replica_uuid_ro.lua     | 2 +-
>  19 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index e3eb2738f..f8731f464 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
>   * don't start appliers.
>   */
>  static void
> -box_sync_replication(double timeout, bool connect_all)
> +box_sync_replication(double timeout, bool reach_quorum)
>  {
>  	int count = 0;
>  	struct applier **appliers = cfg_get_replication(&count);
> @@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
>  			applier_delete(appliers[i]); /* doesn't affect diag */
>  	});
> 
> -	replicaset_connect(appliers, count, timeout, connect_all);
> +	replicaset_connect(appliers, count, timeout, reach_quorum);
> 
>  	guard.is_active = false;
>  }
> @@ -1888,7 +1888,7 @@ box_cfg_xc(void)
>  		 * receive the same replica set UUID when a new cluster
>  		 * is deployed.
>  		 */
> -		box_sync_replication(TIMEOUT_INFINITY, true);
> +		box_sync_replication(replication_connect_timeout, true);
>  		/* Bootstrap a new master */
>  		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
>  	}
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 528fe4459..a6c60220f 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
>  struct tt_uuid REPLICASET_UUID;
> 
>  double replication_timeout = 1.0; /* seconds */
> -double replication_connect_timeout = 4.0; /* seconds */
> +double replication_connect_timeout = 10.0; /* seconds */
>  int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>  double replication_sync_lag = 10.0; /* seconds */
> 
> @@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void
> *event)
> 
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   double timeout, bool connect_all)
> +		   double timeout, bool reach_quorum)
>  {
>  	if (count == 0) {
>  		/* Cleanup the replica set. */
> @@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int
> count, double wait_start = ev_monotonic_now(loop());
>  		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
>  			break;
> -		if (state.failed > 0 && connect_all)
> -			break;
>  		timeout -= ev_monotonic_now(loop()) - wait_start;
>  	}
>  	if (state.connected < count) {
>  		say_crit("failed to connect to %d out of %d replicas",
>  			 count - state.connected, count);
>  		/* Timeout or connection failure. */
> -		if (connect_all)
> +		if (reach_quorum && state.connected < replication_connect_quorum)
>  			goto error;
>  	} else {
>  		say_verbose("connected to %d replicas", state.connected);
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 95122eb45..c16c8b56c 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -357,12 +357,13 @@ replicaset_add(uint32_t replica_id, const struct
> tt_uuid *instance_uuid); * \param appliers the array of appliers
>   * \param count size of appliers array
>   * \param timeout connection timeout
> - * \param connect_all if this flag is set, fail unless all
> - *                    appliers have successfully connected
> + * \param reach_quorum if this flag is set, fail unless at
> + *		       least replication_connect_quorum
> + *		       appliers have successfully connected.
>   */
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   double timeout, bool connect_all);
> +		   double timeout, bool reach_quorum);
> 
>  /**
>   * Resume all appliers registered with the replica set.
> diff --git a/test/replication-py/init_storage.test.py
> b/test/replication-py/init_storage.test.py index 0911a02c0..32b4639f1
> 100644
> --- a/test/replication-py/init_storage.test.py
> +++ b/test/replication-py/init_storage.test.py
> @@ -57,7 +57,7 @@ print
> '-------------------------------------------------------------'
> 
>  server.stop()
>  replica = TarantoolServer(server.ini)
> -replica.script = 'replication/replica.lua'
> +replica.script = 'replication-py/replica.lua'
>  replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
>  replica.rpl_master = master
>  replica.deploy(wait=False)
> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> index 0f9f7a6f0..51283efdf 100644
> --- a/test/replication-py/master.lua
> +++ b/test/replication-py/master.lua
> @@ -3,6 +3,8 @@ os = require('os')
>  box.cfg({
>      listen              = os.getenv("LISTEN"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3
>  })
> 
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication-py/replica.lua
> b/test/replication-py/replica.lua index 278291bba..b9d193b70 100644
> --- a/test/replication-py/replica.lua
> +++ b/test/replication-py/replica.lua
> @@ -7,6 +7,8 @@ box.cfg({
>      listen              = os.getenv("LISTEN"),
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3
>  })
> 
>  box_cfg_done = true
> diff --git a/test/replication/autobootstrap.lua
> b/test/replication/autobootstrap.lua index 4f55417ae..8fc6809de 100644
> --- a/test/replication/autobootstrap.lua
> +++ b/test/replication/autobootstrap.lua
> @@ -21,7 +21,8 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>          USER..':'..PASSWORD..'@'..instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 3.0;
> +    replication_timeout = 0.5;
>  })
> 
>  box.once("bootstrap", function()
> diff --git a/test/replication/autobootstrap_guest.lua
> b/test/replication/autobootstrap_guest.lua index 40fef2c7a..7cd921e3c
> 100644
> --- a/test/replication/autobootstrap_guest.lua
> +++ b/test/replication/autobootstrap_guest.lua
> @@ -20,7 +20,7 @@ box.cfg({
>          instance_uri(2);
>          instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 5,
>  })
> 
>  box.once("bootstrap", function()
> diff --git a/test/replication/ddl.lua b/test/replication/ddl.lua
> index 694f40eac..85403e35b 100644
> --- a/test/replication/ddl.lua
> +++ b/test/replication/ddl.lua
> @@ -22,7 +22,8 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(3);
>          USER..':'..PASSWORD..'@'..instance_uri(4);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_timeout = 0.1,
> +    replication_connect_timeout = 2.0,
>  })
> 
>  box.once("bootstrap", function()
> diff --git a/test/replication/errinj.result b/test/replication/errinj.result
> index ca8af2988..19d7d9a05 100644
> --- a/test/replication/errinj.result
> +++ b/test/replication/errinj.result
> @@ -418,7 +418,7 @@ test_run:cmd("create server replica_timeout with
> rpl_master=default, script='rep ---
>  - true
>  ...
> -test_run:cmd("start server replica_timeout with args='0.01'")
> +test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
>  ---
>  - true
>  ...
> @@ -474,7 +474,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
>  ...
>  -- Check replica's ACKs don't prevent the master from sending
>  -- heartbeat messages (gh-3160).
> -test_run:cmd("start server replica_timeout with args='0.009'")
> +test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
>  ---
>  - true
>  ...
> @@ -522,7 +522,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5,
> 'test'}) end -- during the join stage, i.e. a replica with a minuscule
>  -- timeout successfully bootstraps and breaks connection only
>  -- after subscribe.
> -test_run:cmd("start server replica_timeout with args='0.00001'")
> +test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
>  ---
>  - true
>  ...
> diff --git a/test/replication/errinj.test.lua
> b/test/replication/errinj.test.lua index 463d89a8f..f00b98eed 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
>  box.cfg{replication_timeout = 0.01}
> 
>  test_run:cmd("create server replica_timeout with rpl_master=default,
> script='replication/replica_timeout.lua'") -test_run:cmd("start server
> replica_timeout with args='0.01'")
> +test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
>  test_run:cmd("switch replica_timeout")
> 
>  fiber = require('fiber')
> @@ -199,7 +199,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
>  -- Check replica's ACKs don't prevent the master from sending
>  -- heartbeat messages (gh-3160).
> 
> -test_run:cmd("start server replica_timeout with args='0.009'")
> +test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
>  test_run:cmd("switch replica_timeout")
> 
>  fiber = require('fiber')
> @@ -219,7 +219,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5,
> 'test'}) end -- during the join stage, i.e. a replica with a minuscule
>  -- timeout successfully bootstraps and breaks connection only
>  -- after subscribe.
> -test_run:cmd("start server replica_timeout with args='0.00001'")
> +test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
>  test_run:cmd("switch replica_timeout")
>  fiber = require('fiber')
>  while box.info.replication[1].upstream.message ~= 'timed out' do
> fiber.sleep(0.0001) end diff --git a/test/replication/master.lua
> b/test/replication/master.lua index 6d431aaeb..9b96b7891 100644
> --- a/test/replication/master.lua
> +++ b/test/replication/master.lua
> @@ -4,6 +4,7 @@ box.cfg({
>      listen              = os.getenv("LISTEN"),
>      memtx_memory        = 107374182,
>      replication_connect_timeout = 0.5,
> +    replication_timeout = 0.1
>  })
> 
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/master_quorum.lua
> b/test/replication/master_quorum.lua index fb5f7ec2b..6e0429f65 100644
> --- a/test/replication/master_quorum.lua
> +++ b/test/replication/master_quorum.lua
> @@ -20,7 +20,8 @@ box.cfg({
>          instance_uri(2);
>      };
>      replication_connect_quorum = 0;
> -    replication_connect_timeout = 0.1;
> +    replication_timeout = 0.5;
> +    replication_connect_timeout = 2.0;
>  })
> 
>  test_run = require('test_run').new()
> diff --git a/test/replication/on_replace.lua
> b/test/replication/on_replace.lua index 03f15d94c..bafead48d 100644
> --- a/test/replication/on_replace.lua
> +++ b/test/replication/on_replace.lua
> @@ -20,7 +20,8 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(1);
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_timeout = 0.5,
> +    replication_connect_timeout = 1.0,
>  })
> 
>  env = require('test_run')
> diff --git a/test/replication/quorum.lua b/test/replication/quorum.lua
> index 9c7bf5c93..7f85d7b13 100644
> --- a/test/replication/quorum.lua
> +++ b/test/replication/quorum.lua
> @@ -15,8 +15,8 @@ require('console').listen(os.getenv('ADMIN'))
>  box.cfg({
>      listen = instance_uri(INSTANCE_ID);
>      replication_timeout = 0.05;
> -    replication_sync_lag = 0.01;
> -    replication_connect_timeout = 0.1;
> +    replication_sync_lag = 0.1;
> +    replication_connect_timeout = 3.0;
>      replication_connect_quorum = 3;
>      replication = {
>          instance_uri(1);
> diff --git a/test/replication/rebootstrap.lua
> b/test/replication/rebootstrap.lua index e743577e4..f1e8d69e9 100644
> --- a/test/replication/rebootstrap.lua
> +++ b/test/replication/rebootstrap.lua
> @@ -15,7 +15,7 @@ box.cfg({
>      listen = instance_uri(INSTANCE_ID),
>      instance_uuid = '12345678-abcd-1234-abcd-123456789ef' .. INSTANCE_ID,
>      replication_timeout = 0.1,
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 2.0,
>      replication = {
>          instance_uri(1);
>          instance_uri(2);
> diff --git a/test/replication/replica_no_quorum.lua
> b/test/replication/replica_no_quorum.lua index b9edeea94..c30c043cc 100644
> --- a/test/replication/replica_no_quorum.lua
> +++ b/test/replication/replica_no_quorum.lua
> @@ -5,7 +5,8 @@ box.cfg({
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
>      replication_connect_quorum = 0,
> -    replication_connect_timeout = 0.1,
> +    replication_timeout = 0.1,
> +    replication_connect_timeout = 0.5,
>  })
> 
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/replica_timeout.lua
> b/test/replication/replica_timeout.lua index 64f119763..51c718360 100644
> --- a/test/replication/replica_timeout.lua
> +++ b/test/replication/replica_timeout.lua
> @@ -1,13 +1,14 @@
>  #!/usr/bin/env tarantool
> 
>  local TIMEOUT = tonumber(arg[1])
> +local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
> 
>  box.cfg({
>      listen              = os.getenv("LISTEN"),
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
>      replication_timeout = TIMEOUT,
> -    replication_connect_timeout = TIMEOUT * 3,
> +    replication_connect_timeout = CON_TIMEOUT,
>  })
> 
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication/replica_uuid_ro.lua
> b/test/replication/replica_uuid_ro.lua index 8e1c6cc47..ff70da144 100644
> --- a/test/replication/replica_uuid_ro.lua
> +++ b/test/replication/replica_uuid_ro.lua
> @@ -22,7 +22,7 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>      };
>      read_only = (INSTANCE_ID ~= '1' and true or false);
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 5,
>  })
> 
>  box.once("bootstrap", function()


[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-07 12:44 [tarantool-patches] [PATCH v2] replication: do not ignore replication_connect_quorum Serge Petrenko
  2018-08-07 14:18 ` [tarantool-patches] " Georgy Kirichenko
@ 2018-08-07 17:28 ` Vladimir Davydov
  2018-08-09  7:50   ` [tarantool-patches] " Sergey Petrenko
  1 sibling, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-08-07 17:28 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches, georgy

On Tue, Aug 07, 2018 at 03:44:13PM +0300, Serge Petrenko wrote:
> On bootstrap and after replication reconfiguration
> replication_connect_quorum was ignored. The instance tried to connect to
> every replica listed in replication parameter, and errored if it wasn't
> possible.
> The patch alters this behaviour. The instance still tries to connect to
> every node listed in replication, but does not raise an error if it was
> able to connect to at least replication_connect_quorum instances.

Please append a documentation request (@TarantoolBot) to the commit
message.

> 
> Closes #3428
> ---
> https://github.com/tarantool/tarantool/issues/3428
> https://github.com/tarantool/tarantool/tree/sergepetrenko/gh-3428-replication-connect-quorum
> 
> Changes in v2:
>   - change test/replication/ddl.lua instance file to fix
>     test failure on Travis.
> 
>  src/box/box.cc                           | 6 +++---
>  src/box/replication.cc                   | 8 +++-----
>  src/box/replication.h                    | 7 ++++---
>  test/replication-py/init_storage.test.py | 2 +-
>  test/replication-py/master.lua           | 2 ++
>  test/replication-py/replica.lua          | 2 ++
>  test/replication/autobootstrap.lua       | 3 ++-
>  test/replication/autobootstrap_guest.lua | 2 +-
>  test/replication/ddl.lua                 | 3 ++-
>  test/replication/errinj.result           | 6 +++---
>  test/replication/errinj.test.lua         | 6 +++---
>  test/replication/master.lua              | 1 +
>  test/replication/master_quorum.lua       | 3 ++-
>  test/replication/on_replace.lua          | 3 ++-
>  test/replication/quorum.lua              | 4 ++--
>  test/replication/rebootstrap.lua         | 2 +-
>  test/replication/replica_no_quorum.lua   | 3 ++-
>  test/replication/replica_timeout.lua     | 3 ++-
>  test/replication/replica_uuid_ro.lua     | 2 +-
>  19 files changed, 39 insertions(+), 29 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index e3eb2738f..f8731f464 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
>   * don't start appliers.
>   */
>  static void
> -box_sync_replication(double timeout, bool connect_all)
> +box_sync_replication(double timeout, bool reach_quorum)

After this patch, 'timeout' always equals replication_connect_timeout
so you don't need to pass it explicitly anymore. Please remove it from
this function and from replicaset_connect.

Also, I don't like 'reach_quorum' name. Would 'connect_quorum' sound
better? Or may be we should pass the minimal number of masters to
connect instead?

>  {
>  	int count = 0;
>  	struct applier **appliers = cfg_get_replication(&count);
> @@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
>  			applier_delete(appliers[i]); /* doesn't affect diag */
>  	});
>  
> -	replicaset_connect(appliers, count, timeout, connect_all);
> +	replicaset_connect(appliers, count, timeout, reach_quorum);
>  
>  	guard.is_active = false;
>  }
> @@ -1888,7 +1888,7 @@ box_cfg_xc(void)
>  		 * receive the same replica set UUID when a new cluster
>  		 * is deployed.
>  		 */
> -		box_sync_replication(TIMEOUT_INFINITY, true);
> +		box_sync_replication(replication_connect_timeout, true);
>  		/* Bootstrap a new master */
>  		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
>  	}
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 528fe4459..a6c60220f 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
>  struct tt_uuid REPLICASET_UUID;
>  
>  double replication_timeout = 1.0; /* seconds */
> -double replication_connect_timeout = 4.0; /* seconds */
> +double replication_connect_timeout = 10.0; /* seconds */

Why? BTW, this isn't enough - replication_connect_timeout is actually
set in load_cfg.lua (I've no idea why they differ).

>  int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>  double replication_sync_lag = 10.0; /* seconds */
>  
> @@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void *event)
>  
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   double timeout, bool connect_all)
> +		   double timeout, bool reach_quorum)
>  {
>  	if (count == 0) {
>  		/* Cleanup the replica set. */
> @@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int count,
>  		double wait_start = ev_monotonic_now(loop());
>  		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
>  			break;
> -		if (state.failed > 0 && connect_all)
> -			break;

I guess you should break the loop if

  state.failed > count - min_count_to_connect

>  		timeout -= ev_monotonic_now(loop()) - wait_start;
>  	}
>  	if (state.connected < count) {
>  		say_crit("failed to connect to %d out of %d replicas",
>  			 count - state.connected, count);
>  		/* Timeout or connection failure. */
> -		if (connect_all)
> +		if (reach_quorum && state.connected < replication_connect_quorum)

replication_connect_quorum can be greater than the number of configured
replicas. I think you should use MIN(count, replication_connect_quorum).

>  			goto error;
>  	} else {
>  		say_verbose("connected to %d replicas", state.connected);
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 95122eb45..c16c8b56c 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -357,12 +357,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
>   * \param appliers the array of appliers
>   * \param count size of appliers array
>   * \param timeout connection timeout
> - * \param connect_all if this flag is set, fail unless all
> - *                    appliers have successfully connected
> + * \param reach_quorum if this flag is set, fail unless at
> + *		       least replication_connect_quorum
> + *		       appliers have successfully connected.
>   */
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   double timeout, bool connect_all);
> +		   double timeout, bool reach_quorum);
>  
>  /**
>   * Resume all appliers registered with the replica set.
> diff --git a/test/replication-py/init_storage.test.py b/test/replication-py/init_storage.test.py
> index 0911a02c0..32b4639f1 100644
> --- a/test/replication-py/init_storage.test.py
> +++ b/test/replication-py/init_storage.test.py
> @@ -57,7 +57,7 @@ print '-------------------------------------------------------------'
>  
>  server.stop()
>  replica = TarantoolServer(server.ini)
> -replica.script = 'replication/replica.lua'
> +replica.script = 'replication-py/replica.lua'
>  replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
>  replica.rpl_master = master
>  replica.deploy(wait=False)
> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> index 0f9f7a6f0..51283efdf 100644
> --- a/test/replication-py/master.lua
> +++ b/test/replication-py/master.lua
> @@ -3,6 +3,8 @@ os = require('os')
>  box.cfg({
>      listen              = os.getenv("LISTEN"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3

Why do you need to adjust the timeouts?

>  })
>  
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
> index 278291bba..b9d193b70 100644
> --- a/test/replication-py/replica.lua
> +++ b/test/replication-py/replica.lua
> @@ -7,6 +7,8 @@ box.cfg({
>      listen              = os.getenv("LISTEN"),
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3
>  })
>  
>  box_cfg_done = true
> diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
> index 4f55417ae..8fc6809de 100644
> --- a/test/replication/autobootstrap.lua
> +++ b/test/replication/autobootstrap.lua
> @@ -21,7 +21,8 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>          USER..':'..PASSWORD..'@'..instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 3.0;
> +    replication_timeout = 0.5;
>  })
>  
>  box.once("bootstrap", function()
> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
> index 40fef2c7a..7cd921e3c 100644
> --- a/test/replication/autobootstrap_guest.lua
> +++ b/test/replication/autobootstrap_guest.lua
> @@ -20,7 +20,7 @@ box.cfg({
>          instance_uri(2);
>          instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_connect_timeout = 5,

Why do you use different timeouts in different tests?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-07 17:28 ` [tarantool-patches] " Vladimir Davydov
@ 2018-08-09  7:50   ` Sergey Petrenko
  2018-08-09  9:11     ` Vladimir Davydov
  0 siblings, 1 reply; 9+ messages in thread
From: Sergey Petrenko @ 2018-08-09  7:50 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Georgy Kirichenko, tarantool-patches

[-- Attachment #1: Type: text/plain, Size: 26275 bytes --]

Hi! Thank you for the review!

I fixed your remarks. The new diff is below. Also please see my comments.

> 7 авг. 2018 г., в 20:28, Vladimir Davydov <vdavydov.dev@gmail.com> написал(а):
> 
> On Tue, Aug 07, 2018 at 03:44:13PM +0300, Serge Petrenko wrote:
>> On bootstrap and after replication reconfiguration
>> replication_connect_quorum was ignored. The instance tried to connect to
>> every replica listed in replication parameter, and errored if it wasn't
>> possible.
>> The patch alters this behaviour. The instance still tries to connect to
>> every node listed in replication, but does not raise an error if it was
>> able to connect to at least replication_connect_quorum instances.
> 
> Please append a documentation request (@TarantoolBot) to the commit
> message.
> 
Done.

>> 
>> Closes #3428
>> ---
>> https://github.com/tarantool/tarantool/issues/3428
>> https://github.com/tarantool/tarantool/tree/sergepetrenko/gh-3428-replication-connect-quorum
>> 
>> Changes in v2:
>>  - change test/replication/ddl.lua instance file to fix
>>    test failure on Travis.
>> 
>> src/box/box.cc                           | 6 +++---
>> src/box/replication.cc                   | 8 +++-----
>> src/box/replication.h                    | 7 ++++---
>> test/replication-py/init_storage.test.py | 2 +-
>> test/replication-py/master.lua           | 2 ++
>> test/replication-py/replica.lua          | 2 ++
>> test/replication/autobootstrap.lua       | 3 ++-
>> test/replication/autobootstrap_guest.lua | 2 +-
>> test/replication/ddl.lua                 | 3 ++-
>> test/replication/errinj.result           | 6 +++---
>> test/replication/errinj.test.lua         | 6 +++---
>> test/replication/master.lua              | 1 +
>> test/replication/master_quorum.lua       | 3 ++-
>> test/replication/on_replace.lua          | 3 ++-
>> test/replication/quorum.lua              | 4 ++--
>> test/replication/rebootstrap.lua         | 2 +-
>> test/replication/replica_no_quorum.lua   | 3 ++-
>> test/replication/replica_timeout.lua     | 3 ++-
>> test/replication/replica_uuid_ro.lua     | 2 +-
>> 19 files changed, 39 insertions(+), 29 deletions(-)
>> 
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index e3eb2738f..f8731f464 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
>>  * don't start appliers.
>>  */
>> static void
>> -box_sync_replication(double timeout, bool connect_all)
>> +box_sync_replication(double timeout, bool reach_quorum)
> 
> After this patch, 'timeout' always equals replication_connect_timeout
> so you don't need to pass it explicitly anymore. Please remove it from
> this function and from replicaset_connect.

Done.

> 
> Also, I don't like 'reach_quorum' name. Would 'connect_quorum' sound
> better? Or may be we should pass the minimal number of masters to
> connect instead?

Let it be connect_quorum. I don’t like the idea to pass a minimal number of
replicas to connect. Looks like we would always pass either
replication_connect_quorum or count.

>> {
>> 	int count = 0;
>> 	struct applier **appliers = cfg_get_replication(&count);
>> @@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
>> 			applier_delete(appliers[i]); /* doesn't affect diag */
>> 	});
>> 
>> -	replicaset_connect(appliers, count, timeout, connect_all);
>> +	replicaset_connect(appliers, count, timeout, reach_quorum);
>> 
>> 	guard.is_active = false;
>> }
>> @@ -1888,7 +1888,7 @@ box_cfg_xc(void)
>> 		 * receive the same replica set UUID when a new cluster
>> 		 * is deployed.
>> 		 */
>> -		box_sync_replication(TIMEOUT_INFINITY, true);
>> +		box_sync_replication(replication_connect_timeout, true);
>> 		/* Bootstrap a new master */
>> 		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
>> 	}
>> diff --git a/src/box/replication.cc b/src/box/replication.cc
>> index 528fe4459..a6c60220f 100644
>> --- a/src/box/replication.cc
>> +++ b/src/box/replication.cc
>> @@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
>> struct tt_uuid REPLICASET_UUID;
>> 
>> double replication_timeout = 1.0; /* seconds */
>> -double replication_connect_timeout = 4.0; /* seconds */
>> +double replication_connect_timeout = 10.0; /* seconds */
> 
> Why? BTW, this isn't enough - replication_connect_timeout is actually
> set in load_cfg.lua (I've no idea why they differ).


Changed the default value to 30 seconds, to match the one in load_cfg.lua
Don’t see a reason for them to differ either,
> 
>> int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
>> double replication_sync_lag = 10.0; /* seconds */
>> 
>> @@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void *event)
>> 
>> void
>> replicaset_connect(struct applier **appliers, int count,
>> -		   double timeout, bool connect_all)
>> +		   double timeout, bool reach_quorum)
>> {
>> 	if (count == 0) {
>> 		/* Cleanup the replica set. */
>> @@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int count,
>> 		double wait_start = ev_monotonic_now(loop());
>> 		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
>> 			break;
>> -		if (state.failed > 0 && connect_all)
>> -			break;
> 
> I guess you should break the loop if
> 
>  state.failed > count - min_count_to_connect

AFAIU, if replication_timeout is less than replication_connect_timeout, the appliers, which have
failed, will have time to try and reconnect during replicaset_connect(). So failing here is essentially ignoring
replication_connect_timeout.

> 
>> 		timeout -= ev_monotonic_now(loop()) - wait_start;
>> 	}
>> 	if (state.connected < count) {
>> 		say_crit("failed to connect to %d out of %d replicas",
>> 			 count - state.connected, count);
>> 		/* Timeout or connection failure. */
>> -		if (connect_all)
>> +		if (reach_quorum && state.connected < replication_connect_quorum)
> 
> replication_connect_quorum can be greater than the number of configured
> replicas. I think you should use MIN(count, replication_connect_quorum).

Fixed.

> 
>> 			goto error;
>> 	} else {
>> 		say_verbose("connected to %d replicas", state.connected);
>> diff --git a/src/box/replication.h b/src/box/replication.h
>> index 95122eb45..c16c8b56c 100644
>> --- a/src/box/replication.h
>> +++ b/src/box/replication.h
>> @@ -357,12 +357,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
>>  * \param appliers the array of appliers
>>  * \param count size of appliers array
>>  * \param timeout connection timeout
>> - * \param connect_all if this flag is set, fail unless all
>> - *                    appliers have successfully connected
>> + * \param reach_quorum if this flag is set, fail unless at
>> + *		       least replication_connect_quorum
>> + *		       appliers have successfully connected.
>>  */
>> void
>> replicaset_connect(struct applier **appliers, int count,
>> -		   double timeout, bool connect_all);
>> +		   double timeout, bool reach_quorum);
>> 
>> /**
>>  * Resume all appliers registered with the replica set.
>> diff --git a/test/replication-py/init_storage.test.py b/test/replication-py/init_storage.test.py
>> index 0911a02c0..32b4639f1 100644
>> --- a/test/replication-py/init_storage.test.py
>> +++ b/test/replication-py/init_storage.test.py
>> @@ -57,7 +57,7 @@ print '-------------------------------------------------------------'
>> 
>> server.stop()
>> replica = TarantoolServer(server.ini)
>> -replica.script = 'replication/replica.lua'
>> +replica.script = 'replication-py/replica.lua'
>> replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
>> replica.rpl_master = master
>> replica.deploy(wait=False)
>> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
>> index 0f9f7a6f0..51283efdf 100644
>> --- a/test/replication-py/master.lua
>> +++ b/test/replication-py/master.lua
>> @@ -3,6 +3,8 @@ os = require('os')
>> box.cfg({
>>     listen              = os.getenv("LISTEN"),
>>     memtx_memory        = 107374182,
>> +    replication_connect_timeout = 1.0,
>> +    replication_timeout = 0.3
> 
> Why do you need to adjust the timeouts?

The timeouts set in all cfg files in all the tests had no effect, just like the
replication_connect_quorum option, cos both of the options were ignored
During bootstrap. We wanted every replica to connect, and the timeout was
Set to TIMEOUT_INFINITY. Now when we actually start passing
replication_connect_timeout, all these timeouts become too small.

> 
>> })
>> 
>> require('console').listen(os.getenv('ADMIN'))
>> diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
>> index 278291bba..b9d193b70 100644
>> --- a/test/replication-py/replica.lua
>> +++ b/test/replication-py/replica.lua
>> @@ -7,6 +7,8 @@ box.cfg({
>>     listen              = os.getenv("LISTEN"),
>>     replication         = os.getenv("MASTER"),
>>     memtx_memory        = 107374182,
>> +    replication_connect_timeout = 1.0,
>> +    replication_timeout = 0.3
>> })
>> 
>> box_cfg_done = true
>> diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
>> index 4f55417ae..8fc6809de 100644
>> --- a/test/replication/autobootstrap.lua
>> +++ b/test/replication/autobootstrap.lua
>> @@ -21,7 +21,8 @@ box.cfg({
>>         USER..':'..PASSWORD..'@'..instance_uri(2);
>>         USER..':'..PASSWORD..'@'..instance_uri(3);
>>     };
>> -    replication_connect_timeout = 0.5,
>> +    replication_connect_timeout = 3.0;
>> +    replication_timeout = 0.5;
>> })
>> 
>> box.once("bootstrap", function()
>> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
>> index 40fef2c7a..7cd921e3c 100644
>> --- a/test/replication/autobootstrap_guest.lua
>> +++ b/test/replication/autobootstrap_guest.lua
>> @@ -20,7 +20,7 @@ box.cfg({
>>         instance_uri(2);
>>         instance_uri(3);
>>     };
>> -    replication_connect_timeout = 0.5,
>> +    replication_connect_timeout = 5,
> 
> Why do you use different timeouts in different tests?

I tried to find the lowest possible boundary in every test. It happens so that they differ.
I believe, any finite timeout is better that the infinite one.

Here’s the new diff:

 src/box/box.cc                           | 10 +++++-----
 src/box/replication.cc                   | 13 +++++++------
 src/box/replication.h                    |  8 ++++----
 test/replication-py/init_storage.test.py |  2 +-
 test/replication-py/master.lua           |  2 ++
 test/replication-py/replica.lua          |  2 ++
 test/replication/autobootstrap.lua       |  3 ++-
 test/replication/autobootstrap_guest.lua |  2 +-
 test/replication/ddl.lua                 |  3 ++-
 test/replication/errinj.result           |  6 +++---
 test/replication/errinj.test.lua         |  6 +++---
 test/replication/master.lua              |  1 +
 test/replication/master_quorum.lua       |  3 ++-
 test/replication/on_replace.lua          |  3 ++-
 test/replication/quorum.lua              |  4 ++--
 test/replication/rebootstrap.lua         |  2 +-
 test/replication/replica_no_quorum.lua   |  3 ++-
 test/replication/replica_timeout.lua     |  3 ++-
 test/replication/replica_uuid_ro.lua     |  2 +-
 19 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index e3eb2738f..8cf43a6ad 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
  * don't start appliers.
  */
 static void
-box_sync_replication(double timeout, bool connect_all)
+box_sync_replication(bool connect_quorum)
 {
 	int count = 0;
 	struct applier **appliers = cfg_get_replication(&count);
@@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
 			applier_delete(appliers[i]); /* doesn't affect diag */
 	});
 
-	replicaset_connect(appliers, count, timeout, connect_all);
+	replicaset_connect(appliers, count, connect_quorum);
 
 	guard.is_active = false;
 }
@@ -626,7 +626,7 @@ box_set_replication(void)
 
 	box_check_replication();
 	/* Try to connect to all replicas within the timeout period */
-	box_sync_replication(replication_connect_timeout, true);
+	box_sync_replication(true);
 	/* Follow replica */
 	replicaset_follow();
 }
@@ -1866,7 +1866,7 @@ box_cfg_xc(void)
 		title("orphan");
 
 		/* Wait for the cluster to start up */
-		box_sync_replication(replication_connect_timeout, false);
+		box_sync_replication(false);
 	} else {
 		if (!tt_uuid_is_nil(&instance_uuid))
 			INSTANCE_UUID = instance_uuid;
@@ -1888,7 +1888,7 @@ box_cfg_xc(void)
 		 * receive the same replica set UUID when a new cluster
 		 * is deployed.
 		 */
-		box_sync_replication(TIMEOUT_INFINITY, true);
+		box_sync_replication(true);
 		/* Bootstrap a new master */
 		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
 	}
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 528fe4459..fa3b6afb4 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
 struct tt_uuid REPLICASET_UUID;
 
 double replication_timeout = 1.0; /* seconds */
-double replication_connect_timeout = 4.0; /* seconds */
+double replication_connect_timeout = 30.0; /* seconds */
 int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
 double replication_sync_lag = 10.0; /* seconds */
 
@@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void *event)
 
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all)
+		   bool connect_quorum)
 {
 	if (count == 0) {
 		/* Cleanup the replica set. */
@@ -557,7 +557,7 @@ replicaset_connect(struct applier **appliers, int count,
 	 * - register a trigger in each applier to wake up our
 	 *   fiber via this channel when the remote peer becomes
 	 *   connected and a UUID is received;
-	 * - wait up to CONNECT_TIMEOUT seconds for `count` messages;
+	 * - wait up to REPLICATION_CONNECT_TIMEOUT seconds for `count` messages;
 	 * - on timeout, raise a CFG error, cancel and destroy
 	 *   the freshly created appliers (done in a guard);
 	 * - an success, unregister the trigger, check the UUID set
@@ -571,6 +571,8 @@ replicaset_connect(struct applier **appliers, int count,
 	state.connected = state.failed = 0;
 	fiber_cond_create(&state.wakeup);
 
+	double timeout = replication_connect_timeout;
+
 	/* Add triggers and start simulations connection to remote peers */
 	for (int i = 0; i < count; i++) {
 		struct applier *applier = appliers[i];
@@ -587,15 +589,14 @@ replicaset_connect(struct applier **appliers, int count,
 		double wait_start = ev_monotonic_now(loop());
 		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
 			break;
-		if (state.failed > 0 && connect_all)
-			break;
 		timeout -= ev_monotonic_now(loop()) - wait_start;
 	}
 	if (state.connected < count) {
 		say_crit("failed to connect to %d out of %d replicas",
 			 count - state.connected, count);
 		/* Timeout or connection failure. */
-		if (connect_all)
+		if (connect_quorum && state.connected <
+		    MIN(count, replication_connect_quorum))
 			goto error;
 	} else {
 		say_verbose("connected to %d replicas", state.connected);
diff --git a/src/box/replication.h b/src/box/replication.h
index 95122eb45..9ce9910f8 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -356,13 +356,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
  *
  * \param appliers the array of appliers
  * \param count size of appliers array
- * \param timeout connection timeout
- * \param connect_all if this flag is set, fail unless all
- *                    appliers have successfully connected
+ * \param connect_quorum if this flag is set, fail unless at
+ *		       least replication_connect_quorum
+ *		       appliers have successfully connected.
  */
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all);
+		   bool connect_quorum);
 
 /**
  * Resume all appliers registered with the replica set.
diff --git a/test/replication-py/init_storage.test.py b/test/replication-py/init_storage.test.py
index 0911a02c0..32b4639f1 100644
--- a/test/replication-py/init_storage.test.py
+++ b/test/replication-py/init_storage.test.py
@@ -57,7 +57,7 @@ print '-------------------------------------------------------------'
 
 server.stop()
 replica = TarantoolServer(server.ini)
-replica.script = 'replication/replica.lua'
+replica.script = 'replication-py/replica.lua'
 replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
 replica.rpl_master = master
 replica.deploy(wait=False)
diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
index 0f9f7a6f0..51283efdf 100644
--- a/test/replication-py/master.lua
+++ b/test/replication-py/master.lua
@@ -3,6 +3,8 @@ os = require('os')
 box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
index 278291bba..b9d193b70 100644
--- a/test/replication-py/replica.lua
+++ b/test/replication-py/replica.lua
@@ -7,6 +7,8 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 box_cfg_done = true
diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
index 4f55417ae..8fc6809de 100644
--- a/test/replication/autobootstrap.lua
+++ b/test/replication/autobootstrap.lua
@@ -21,7 +21,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
         USER..':'..PASSWORD..'@'..instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 3.0;
+    replication_timeout = 0.5;
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
index 40fef2c7a..7cd921e3c 100644
--- a/test/replication/autobootstrap_guest.lua
+++ b/test/replication/autobootstrap_guest.lua
@@ -20,7 +20,7 @@ box.cfg({
         instance_uri(2);
         instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 5,
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/ddl.lua b/test/replication/ddl.lua
index 694f40eac..85403e35b 100644
--- a/test/replication/ddl.lua
+++ b/test/replication/ddl.lua
@@ -22,7 +22,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(3);
         USER..':'..PASSWORD..'@'..instance_uri(4);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = 0.1,
+    replication_connect_timeout = 2.0,
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index ca8af2988..19d7d9a05 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -418,7 +418,7 @@ test_run:cmd("create server replica_timeout with rpl_master=default, script='rep
 ---
 - true
 ...
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
 ---
 - true
 ...
@@ -474,7 +474,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ...
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
 ---
 - true
 ...
@@ -522,7 +522,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
 ---
 - true
 ...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 463d89a8f..f00b98eed 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 box.cfg{replication_timeout = 0.01}
 
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.1, 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -199,7 +199,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
 
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009, 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -219,7 +219,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001, 0.5'")
 test_run:cmd("switch replica_timeout")
 fiber = require('fiber')
 while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
diff --git a/test/replication/master.lua b/test/replication/master.lua
index 6d431aaeb..9b96b7891 100644
--- a/test/replication/master.lua
+++ b/test/replication/master.lua
@@ -4,6 +4,7 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
     replication_connect_timeout = 0.5,
+    replication_timeout = 0.1
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/master_quorum.lua b/test/replication/master_quorum.lua
index fb5f7ec2b..6e0429f65 100644
--- a/test/replication/master_quorum.lua
+++ b/test/replication/master_quorum.lua
@@ -20,7 +20,8 @@ box.cfg({
         instance_uri(2);
     };
     replication_connect_quorum = 0;
-    replication_connect_timeout = 0.1;
+    replication_timeout = 0.5;
+    replication_connect_timeout = 2.0;
 })
 
 test_run = require('test_run').new()
diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
index 03f15d94c..bafead48d 100644
--- a/test/replication/on_replace.lua
+++ b/test/replication/on_replace.lua
@@ -20,7 +20,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(1);
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = 0.5,
+    replication_connect_timeout = 1.0,
 })
 
 env = require('test_run')
diff --git a/test/replication/quorum.lua b/test/replication/quorum.lua
index 9c7bf5c93..7f85d7b13 100644
--- a/test/replication/quorum.lua
+++ b/test/replication/quorum.lua
@@ -15,8 +15,8 @@ require('console').listen(os.getenv('ADMIN'))
 box.cfg({
     listen = instance_uri(INSTANCE_ID);
     replication_timeout = 0.05;
-    replication_sync_lag = 0.01;
-    replication_connect_timeout = 0.1;
+    replication_sync_lag = 0.1;
+    replication_connect_timeout = 3.0;
     replication_connect_quorum = 3;
     replication = {
         instance_uri(1);
diff --git a/test/replication/rebootstrap.lua b/test/replication/rebootstrap.lua
index e743577e4..f1e8d69e9 100644
--- a/test/replication/rebootstrap.lua
+++ b/test/replication/rebootstrap.lua
@@ -15,7 +15,7 @@ box.cfg({
     listen = instance_uri(INSTANCE_ID),
     instance_uuid = '12345678-abcd-1234-abcd-123456789ef' .. INSTANCE_ID,
     replication_timeout = 0.1,
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 2.0,
     replication = {
         instance_uri(1);
         instance_uri(2);
diff --git a/test/replication/replica_no_quorum.lua b/test/replication/replica_no_quorum.lua
index b9edeea94..c30c043cc 100644
--- a/test/replication/replica_no_quorum.lua
+++ b/test/replication/replica_no_quorum.lua
@@ -5,7 +5,8 @@ box.cfg({
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_connect_quorum = 0,
-    replication_connect_timeout = 0.1,
+    replication_timeout = 0.1,
+    replication_connect_timeout = 0.5,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_timeout.lua b/test/replication/replica_timeout.lua
index 64f119763..51c718360 100644
--- a/test/replication/replica_timeout.lua
+++ b/test/replication/replica_timeout.lua
@@ -1,13 +1,14 @@
 #!/usr/bin/env tarantool
 
 local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
 
 box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_timeout = TIMEOUT,
-    replication_connect_timeout = TIMEOUT * 3,
+    replication_connect_timeout = CON_TIMEOUT,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_uuid_ro.lua b/test/replication/replica_uuid_ro.lua
index 8e1c6cc47..ff70da144 100644
--- a/test/replication/replica_uuid_ro.lua
+++ b/test/replication/replica_uuid_ro.lua
@@ -22,7 +22,7 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
     read_only = (INSTANCE_ID ~= '1' and true or false);
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 5,
 })
 
 box.once("bootstrap", function()
-- 
2.15.2 (Apple Git-101.1)


[-- Attachment #2: Type: text/html, Size: 50191 bytes --]

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-09  7:50   ` [tarantool-patches] " Sergey Petrenko
@ 2018-08-09  9:11     ` Vladimir Davydov
  2018-08-09 12:20       ` Vladimir Davydov
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-08-09  9:11 UTC (permalink / raw)
  To: Sergey Petrenko; +Cc: Georgy Kirichenko, tarantool-patches

On Thu, Aug 09, 2018 at 10:50:46AM +0300, Sergey Petrenko wrote:
> >> @@ -587,15 +587,13 @@ replicaset_connect(struct applier **appliers, int count,
> >> 		double wait_start = ev_monotonic_now(loop());
> >> 		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
> >> 			break;
> >> -		if (state.failed > 0 && connect_all)
> >> -			break;
> > 
> > I guess you should break the loop if
> > 
> >  state.failed > count - min_count_to_connect
> 
> AFAIU, if replication_timeout is less than replication_connect_timeout, the appliers, which have
> failed, will have time to try and reconnect during replicaset_connect(). So failing here is essentially ignoring
> replication_connect_timeout.

No, replicaset_connect_state.failed is only incremented on unrecoverable
errors, when an applier stops - see applier_on_connect_f, applier_f.

> >> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> >> index 0f9f7a6f0..51283efdf 100644
> >> --- a/test/replication-py/master.lua
> >> +++ b/test/replication-py/master.lua
> >> @@ -3,6 +3,8 @@ os = require('os')
> >> box.cfg({
> >>     listen              = os.getenv("LISTEN"),
> >>     memtx_memory        = 107374182,
> >> +    replication_connect_timeout = 1.0,
> >> +    replication_timeout = 0.3
> > 
> > Why do you need to adjust the timeouts?
> 
> The timeouts set in all cfg files in all the tests had no effect, just like the
> replication_connect_quorum option, cos both of the options were ignored
> During bootstrap. We wanted every replica to connect, and the timeout was
> Set to TIMEOUT_INFINITY. Now when we actually start passing
> replication_connect_timeout, all these timeouts become too small.

Some tests now take too long because of this. E.g. replication/quorum
runs for 15 seconds for memtx+vinyl without this patch, which is already
long enough, but with this patch it takes more than 30 seconds. We need
to do something about that.

> >> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
> >> index 40fef2c7a..7cd921e3c 100644
> >> --- a/test/replication/autobootstrap_guest.lua
> >> +++ b/test/replication/autobootstrap_guest.lua
> >> @@ -20,7 +20,7 @@ box.cfg({
> >>         instance_uri(2);
> >>         instance_uri(3);
> >>     };
> >> -    replication_connect_timeout = 0.5,
> >> +    replication_connect_timeout = 5,
> > 
> > Why do you use different timeouts in different tests?
> 
> I tried to find the lowest possible boundary in every test. It happens so that they differ.
> I believe, any finite timeout is better that the infinite one.

I'd like to see some system behind choosing timeouts.
Currently, they look random :-(

> diff --git a/src/box/replication.h b/src/box/replication.h
> index 95122eb45..9ce9910f8 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -356,13 +356,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
>   *
>   * \param appliers the array of appliers
>   * \param count size of appliers array
> - * \param timeout connection timeout
> - * \param connect_all if this flag is set, fail unless all
> - *                    appliers have successfully connected
> + * \param connect_quorum if this flag is set, fail unless at
> + *		       least replication_connect_quorum

Malformed indentation: this line should start exactly below the
parameter description ('if this flag is set ...') - it simply looks
better that way. Please fix.

> + *		       appliers have successfully connected.
>   */
>  void
>  replicaset_connect(struct applier **appliers, int count,
> -		   double timeout, bool connect_all);
> +		   bool connect_quorum);
>  
>  /**
>   * Resume all appliers registered with the replica set.

> diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
> index 463d89a8f..f00b98eed 100644
> --- a/test/replication/errinj.test.lua
> +++ b/test/replication/errinj.test.lua
> @@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
>  box.cfg{replication_timeout = 0.01}
>  
>  test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
> -test_run:cmd("start server replica_timeout with args='0.01'")
> +test_run:cmd("start server replica_timeout with args='0.1, 0.5'")

This won't work as expected - arguments are supposed to be delimited by
space, not by comma. That is, TIMEOUT will turn into null in the script.

Also, I think you should write some tests that check that the quorum is
not ignored on bootstrap and on replication reconfiguration anymore.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-09  9:11     ` Vladimir Davydov
@ 2018-08-09 12:20       ` Vladimir Davydov
  2018-08-09 12:39         ` Serge Petrenko
  0 siblings, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-08-09 12:20 UTC (permalink / raw)
  To: Sergey Petrenko; +Cc: Georgy Kirichenko, tarantool-patches

On Thu, Aug 09, 2018 at 12:11:01PM +0300, Vladimir Davydov wrote:
> > >> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> > >> index 0f9f7a6f0..51283efdf 100644
> > >> --- a/test/replication-py/master.lua
> > >> +++ b/test/replication-py/master.lua
> > >> @@ -3,6 +3,8 @@ os = require('os')
> > >> box.cfg({
> > >>     listen              = os.getenv("LISTEN"),
> > >>     memtx_memory        = 107374182,
> > >> +    replication_connect_timeout = 1.0,
> > >> +    replication_timeout = 0.3
> > > 
> > > Why do you need to adjust the timeouts?
> > 
> > The timeouts set in all cfg files in all the tests had no effect, just like the
> > replication_connect_quorum option, cos both of the options were ignored
> > During bootstrap. We wanted every replica to connect, and the timeout was
> > Set to TIMEOUT_INFINITY. Now when we actually start passing
> > replication_connect_timeout, all these timeouts become too small.
> 
> Some tests now take too long because of this. E.g. replication/quorum
> runs for 15 seconds for memtx+vinyl without this patch, which is already
> long enough, but with this patch it takes more than 30 seconds. We need
> to do something about that.

May be, we could pass replication_connect_timeout to these scripts
in arguments and use different timeouts for bootstrap and subsequent
restarts?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2] replication: do not ignore replication_connect_quorum.
  2018-08-09 12:20       ` Vladimir Davydov
@ 2018-08-09 12:39         ` Serge Petrenko
  2018-08-10  9:14           ` [PATCH v3] " Serge Petrenko
  0 siblings, 1 reply; 9+ messages in thread
From: Serge Petrenko @ 2018-08-09 12:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Georgy Kirichenko, tarantool-patches



> 9 авг. 2018 г., в 15:20, Vladimir Davydov <vdavydov.dev@gmail.com> написал(а):
> 
> On Thu, Aug 09, 2018 at 12:11:01PM +0300, Vladimir Davydov wrote:
>>>>> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
>>>>> index 0f9f7a6f0..51283efdf 100644
>>>>> --- a/test/replication-py/master.lua
>>>>> +++ b/test/replication-py/master.lua
>>>>> @@ -3,6 +3,8 @@ os = require('os')
>>>>> box.cfg({
>>>>>    listen              = os.getenv("LISTEN"),
>>>>>    memtx_memory        = 107374182,
>>>>> +    replication_connect_timeout = 1.0,
>>>>> +    replication_timeout = 0.3
>>>> 
>>>> Why do you need to adjust the timeouts?
>>> 
>>> The timeouts set in all cfg files in all the tests had no effect, just like the
>>> replication_connect_quorum option, cos both of the options were ignored
>>> During bootstrap. We wanted every replica to connect, and the timeout was
>>> Set to TIMEOUT_INFINITY. Now when we actually start passing
>>> replication_connect_timeout, all these timeouts become too small.
>> 
>> Some tests now take too long because of this. E.g. replication/quorum
>> runs for 15 seconds for memtx+vinyl without this patch, which is already
>> long enough, but with this patch it takes more than 30 seconds. We need
>> to do something about that.
> 
> May be, we could pass replication_connect_timeout to these scripts
> in arguments and use different timeouts for bootstrap and subsequent
> restarts?
> 

Yeah, this makes sense. Will try.
I tried to fix replication/quorum, but could only
decrease the time to 22 seconds on memtx + vinyl, and the test wasn’t stable.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v3] replication: do not ignore replication_connect_quorum.
  2018-08-09 12:39         ` Serge Petrenko
@ 2018-08-10  9:14           ` Serge Petrenko
  2018-08-10 10:48             ` Vladimir Davydov
  0 siblings, 1 reply; 9+ messages in thread
From: Serge Petrenko @ 2018-08-10  9:14 UTC (permalink / raw)
  To: tarantool-patches; +Cc: vdavydov.dev, georgy, Serge Petrenko

On bootstrap and after replication reconfiguration
replication_connect_quorum was ignored. The instance tried to connect to
every replica listed in replication parameter, and errored if it wasn't
possible.
The patch alters this behaviour. The instance still tries to connect to
every node listed in replication, but does not raise an error if it was
able to connect to at least replication_connect_quorum instances.

Closes #3428

@TarantoolBot document
Title: replication_connect_quorum is not ignored.
Now on replica set bootstrap instance tries to connect
not to every replica, listed in box.cfg.replication, but
only to replication_connect_quorum replicas. If after
replication_connect_timeout seconds instance is not connected to
at least replication_connect_quorum other instances, we
throw an error.
---
https://github.com/tarantool/tarantool/issues/3428
https://github.com/tarantool/tarantool/tree/sergepetrenko/gh-3428-replication-connect-quorum


Changes in v3:
  - Added a documentation request to TarantoolBot to
    commit message.
  - removed timeout parameter from box_sync_replication()
    and replicaset_connect()
  - rewritten replication tests to start instances with
    different replication_connect_timeout and
    replication_timeout parameters. This made tests run
    faster and more stable.
  - added a new test case to replication/quorum.test.lua
    to check that replication_connect_quorum isn't ignored
    anymore during bootstrap and reconfiguration.

Changes in v2:
  - change test/replication/ddl.lua instance file to fix
    test failure on Travis.

 src/box/box.cc                                 | 10 ++--
 src/box/replication.cc                         | 14 ++++--
 src/box/replication.h                          |  8 ++--
 test/replication-py/init_storage.test.py       |  2 +-
 test/replication-py/master.lua                 |  2 +
 test/replication-py/replica.lua                |  2 +
 test/replication/autobootstrap.lua             |  8 +++-
 test/replication/autobootstrap.result          |  7 ++-
 test/replication/autobootstrap.test.lua        |  5 +-
 test/replication/autobootstrap_guest.lua       |  7 ++-
 test/replication/autobootstrap_guest.result    | 10 +++-
 test/replication/autobootstrap_guest.test.lua  |  9 +++-
 test/replication/before_replace.result         | 11 +++--
 test/replication/before_replace.test.lua       | 10 ++--
 test/replication/catch.result                  |  2 +-
 test/replication/catch.test.lua                |  2 +-
 test/replication/ddl.lua                       |  7 ++-
 test/replication/ddl.result                    |  5 +-
 test/replication/ddl.test.lua                  |  4 +-
 test/replication/errinj.result                 |  6 +--
 test/replication/errinj.test.lua               |  6 +--
 test/replication/lua/cluster_cmd.lua           | 14 ++++++
 test/replication/master.lua                    |  1 +
 test/replication/master_quorum.lua             |  7 ++-
 test/replication/misc.result                   |  5 +-
 test/replication/misc.test.lua                 |  4 +-
 test/replication/on_replace.lua                |  7 ++-
 test/replication/on_replace.result             |  5 +-
 test/replication/on_replace.test.lua           |  4 +-
 test/replication/quorum.lua                    | 10 ++--
 test/replication/quorum.result                 | 64 ++++++++++++++++++++++----
 test/replication/quorum.test.lua               | 37 +++++++++++----
 test/replication/rebootstrap.lua               |  6 ++-
 test/replication/rebootstrap.result            |  9 ++--
 test/replication/rebootstrap.test.lua          |  8 ++--
 test/replication/recover_missing_xlog.result   |  7 ++-
 test/replication/recover_missing_xlog.test.lua |  6 ++-
 test/replication/replica_no_quorum.lua         |  3 +-
 test/replication/replica_quorum.lua            | 24 ++++++++++
 test/replication/replica_timeout.lua           |  3 +-
 test/replication/replica_uuid_ro.lua           |  7 ++-
 test/replication/replicaset_ro_mostly.result   |  8 ++--
 test/replication/replicaset_ro_mostly.test.lua |  8 ++--
 test/replication/suite.ini                     |  2 +-
 44 files changed, 298 insertions(+), 88 deletions(-)
 create mode 100644 test/replication/lua/cluster_cmd.lua
 create mode 100644 test/replication/replica_quorum.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index e3eb2738f..8cf43a6ad 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -595,7 +595,7 @@ cfg_get_replication(int *p_count)
  * don't start appliers.
  */
 static void
-box_sync_replication(double timeout, bool connect_all)
+box_sync_replication(bool connect_quorum)
 {
 	int count = 0;
 	struct applier **appliers = cfg_get_replication(&count);
@@ -607,7 +607,7 @@ box_sync_replication(double timeout, bool connect_all)
 			applier_delete(appliers[i]); /* doesn't affect diag */
 	});
 
-	replicaset_connect(appliers, count, timeout, connect_all);
+	replicaset_connect(appliers, count, connect_quorum);
 
 	guard.is_active = false;
 }
@@ -626,7 +626,7 @@ box_set_replication(void)
 
 	box_check_replication();
 	/* Try to connect to all replicas within the timeout period */
-	box_sync_replication(replication_connect_timeout, true);
+	box_sync_replication(true);
 	/* Follow replica */
 	replicaset_follow();
 }
@@ -1866,7 +1866,7 @@ box_cfg_xc(void)
 		title("orphan");
 
 		/* Wait for the cluster to start up */
-		box_sync_replication(replication_connect_timeout, false);
+		box_sync_replication(false);
 	} else {
 		if (!tt_uuid_is_nil(&instance_uuid))
 			INSTANCE_UUID = instance_uuid;
@@ -1888,7 +1888,7 @@ box_cfg_xc(void)
 		 * receive the same replica set UUID when a new cluster
 		 * is deployed.
 		 */
-		box_sync_replication(TIMEOUT_INFINITY, true);
+		box_sync_replication(true);
 		/* Bootstrap a new master */
 		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
 	}
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 528fe4459..dc1b9d55a 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -46,7 +46,7 @@ struct tt_uuid INSTANCE_UUID;
 struct tt_uuid REPLICASET_UUID;
 
 double replication_timeout = 1.0; /* seconds */
-double replication_connect_timeout = 4.0; /* seconds */
+double replication_connect_timeout = 30.0; /* seconds */
 int replication_connect_quorum = REPLICATION_CONNECT_QUORUM_ALL;
 double replication_sync_lag = 10.0; /* seconds */
 
@@ -540,7 +540,7 @@ applier_on_connect_f(struct trigger *trigger, void *event)
 
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all)
+		   bool connect_quorum)
 {
 	if (count == 0) {
 		/* Cleanup the replica set. */
@@ -557,7 +557,7 @@ replicaset_connect(struct applier **appliers, int count,
 	 * - register a trigger in each applier to wake up our
 	 *   fiber via this channel when the remote peer becomes
 	 *   connected and a UUID is received;
-	 * - wait up to CONNECT_TIMEOUT seconds for `count` messages;
+	 * - wait up to REPLICATION_CONNECT_TIMEOUT seconds for `count` messages;
 	 * - on timeout, raise a CFG error, cancel and destroy
 	 *   the freshly created appliers (done in a guard);
 	 * - an success, unregister the trigger, check the UUID set
@@ -571,6 +571,8 @@ replicaset_connect(struct applier **appliers, int count,
 	state.connected = state.failed = 0;
 	fiber_cond_create(&state.wakeup);
 
+	double timeout = replication_connect_timeout;
+
 	/* Add triggers and start simulations connection to remote peers */
 	for (int i = 0; i < count; i++) {
 		struct applier *applier = appliers[i];
@@ -587,7 +589,8 @@ replicaset_connect(struct applier **appliers, int count,
 		double wait_start = ev_monotonic_now(loop());
 		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
 			break;
-		if (state.failed > 0 && connect_all)
+		if (count - state.failed <
+		    MIN(count, replication_connect_quorum))
 			break;
 		timeout -= ev_monotonic_now(loop()) - wait_start;
 	}
@@ -595,7 +598,8 @@ replicaset_connect(struct applier **appliers, int count,
 		say_crit("failed to connect to %d out of %d replicas",
 			 count - state.connected, count);
 		/* Timeout or connection failure. */
-		if (connect_all)
+		if (connect_quorum && state.connected <
+		    MIN(count, replication_connect_quorum))
 			goto error;
 	} else {
 		say_verbose("connected to %d replicas", state.connected);
diff --git a/src/box/replication.h b/src/box/replication.h
index 95122eb45..39b7b1820 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -356,13 +356,13 @@ replicaset_add(uint32_t replica_id, const struct tt_uuid *instance_uuid);
  *
  * \param appliers the array of appliers
  * \param count size of appliers array
- * \param timeout connection timeout
- * \param connect_all if this flag is set, fail unless all
- *                    appliers have successfully connected
+ * \param connect_quorum if this flag is set, fail unless at
+ *			 least replication_connect_quorum
+ *			 appliers have successfully connected.
  */
 void
 replicaset_connect(struct applier **appliers, int count,
-		   double timeout, bool connect_all);
+		   bool connect_quorum);
 
 /**
  * Resume all appliers registered with the replica set.
diff --git a/test/replication-py/init_storage.test.py b/test/replication-py/init_storage.test.py
index 0911a02c0..32b4639f1 100644
--- a/test/replication-py/init_storage.test.py
+++ b/test/replication-py/init_storage.test.py
@@ -57,7 +57,7 @@ print '-------------------------------------------------------------'
 
 server.stop()
 replica = TarantoolServer(server.ini)
-replica.script = 'replication/replica.lua'
+replica.script = 'replication-py/replica.lua'
 replica.vardir = server.vardir #os.path.join(server.vardir, 'replica')
 replica.rpl_master = master
 replica.deploy(wait=False)
diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
index 0f9f7a6f0..51283efdf 100644
--- a/test/replication-py/master.lua
+++ b/test/replication-py/master.lua
@@ -3,6 +3,8 @@ os = require('os')
 box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
index 278291bba..b9d193b70 100644
--- a/test/replication-py/replica.lua
+++ b/test/replication-py/replica.lua
@@ -7,6 +7,8 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
+    replication_connect_timeout = 1.0,
+    replication_timeout = 0.3
 })
 
 box_cfg_done = true
diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
index 4f55417ae..c98f3c559 100644
--- a/test/replication/autobootstrap.lua
+++ b/test/replication/autobootstrap.lua
@@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
 local USER = 'cluster'
 local PASSWORD = 'somepassword'
 local SOCKET_DIR = require('fio').cwd()
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/autobootstrap'..instance_id..'.sock';
@@ -21,7 +25,9 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
         USER..':'..PASSWORD..'@'..instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = TIMEOUT;
+    replication_connect_timeout = CON_TIMEOUT;
+    replication_timeout = 0.1;
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/autobootstrap.result b/test/replication/autobootstrap.result
index 04aeb4315..ae860c94d 100644
--- a/test/replication/autobootstrap.result
+++ b/test/replication/autobootstrap.result
@@ -7,13 +7,16 @@ vclock_diff = require('fast_replica').vclock_diff
 test_run = env.new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 --
 -- Start servers
 --
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 ---
 ...
 --
@@ -161,7 +164,7 @@ box.space.test_u:select()
 _ = test_run:cmd("switch autobootstrap1")
 ---
 ...
-_ = test_run:cmd("restart server autobootstrap1 with cleanup=1")
+_ = test_run:cmd("restart server autobootstrap1 with cleanup=1, args ='0.1 0.5'")
 _ = box.space.test_u:replace({5, 6, 7, 8})
 ---
 ...
diff --git a/test/replication/autobootstrap.test.lua b/test/replication/autobootstrap.test.lua
index f1e2a9991..3c6076236 100644
--- a/test/replication/autobootstrap.test.lua
+++ b/test/replication/autobootstrap.test.lua
@@ -2,13 +2,14 @@ env = require('test_run')
 vclock_diff = require('fast_replica').vclock_diff
 test_run = env.new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
 
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 
 --
 -- Start servers
 --
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 
 --
 -- Wait for full mesh
@@ -76,7 +77,7 @@ box.space.test_u:select()
 -- Rebootstrap one node and check that others follow.
 --
 _ = test_run:cmd("switch autobootstrap1")
-_ = test_run:cmd("restart server autobootstrap1 with cleanup=1")
+_ = test_run:cmd("restart server autobootstrap1 with cleanup=1, args ='0.1 0.5'")
 
 _ = box.space.test_u:replace({5, 6, 7, 8})
 box.space.test_u:select()
diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
index 40fef2c7a..2f6e527df 100644
--- a/test/replication/autobootstrap_guest.lua
+++ b/test/replication/autobootstrap_guest.lua
@@ -4,6 +4,10 @@
 local INSTANCE_ID = string.match(arg[0], "%d")
 
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/autobootstrap_guest'..instance_id..'.sock';
@@ -20,7 +24,8 @@ box.cfg({
         instance_uri(2);
         instance_uri(3);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = TIMEOUT;
+    replication_connect_timeout = CON_TIMEOUT;
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/autobootstrap_guest.result b/test/replication/autobootstrap_guest.result
index 49f9bee01..f0b4cff9b 100644
--- a/test/replication/autobootstrap_guest.result
+++ b/test/replication/autobootstrap_guest.result
@@ -7,13 +7,21 @@ vclock_diff = require('fast_replica').vclock_diff
 test_run = env.new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = { 'autobootstrap_guest1', 'autobootstrap_guest2', 'autobootstrap_guest3' }
 ---
 ...
 --
 -- Start servers
 --
-test_run:create_cluster(SERVERS)
+-- for some reason create_cluster_with_args, just like the
+-- usual test_run:create_cluster, can hang for arbitrary
+-- time, so we have to pass a rather large value for
+-- replication_connect_timeout(4 seconds).
+-- In most cases, 1 second is enough, though.
+create_cluster_with_args(test_run, SERVERS, "0.3 4.0")
 ---
 ...
 --
diff --git a/test/replication/autobootstrap_guest.test.lua b/test/replication/autobootstrap_guest.test.lua
index 70250cff9..5a5422d07 100644
--- a/test/replication/autobootstrap_guest.test.lua
+++ b/test/replication/autobootstrap_guest.test.lua
@@ -2,12 +2,19 @@ env = require('test_run')
 vclock_diff = require('fast_replica').vclock_diff
 test_run = env.new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = { 'autobootstrap_guest1', 'autobootstrap_guest2', 'autobootstrap_guest3' }
 
 --
 -- Start servers
 --
-test_run:create_cluster(SERVERS)
+-- for some reason create_cluster_with_args, just like the
+-- usual test_run:create_cluster, can hang for arbitrary
+-- time, so we have to pass a rather large value for
+-- replication_connect_timeout(4 seconds).
+-- In most cases, 1 second is enough, though.
+create_cluster_with_args(test_run, SERVERS, "0.3 4.0")
 
 --
 -- Wait for full mesh
diff --git a/test/replication/before_replace.result b/test/replication/before_replace.result
index d561b4813..6f207ae01 100644
--- a/test/replication/before_replace.result
+++ b/test/replication/before_replace.result
@@ -7,11 +7,14 @@ env = require('test_run')
 test_run = env.new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.5 4.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
@@ -125,7 +128,7 @@ box.space.test:select()
   - [9, 90]
   - [10, 100]
 ...
-test_run:cmd('restart server autobootstrap1')
+test_run:cmd('restart server autobootstrap1 with args="0.3 1.0"')
 box.space.test:select()
 ---
 - - [1, 10]
@@ -156,7 +159,7 @@ box.space.test:select()
   - [9, 90]
   - [10, 100]
 ...
-test_run:cmd('restart server autobootstrap2')
+test_run:cmd('restart server autobootstrap2 with args="0.3 1.0"')
 box.space.test:select()
 ---
 - - [1, 10]
@@ -187,7 +190,7 @@ box.space.test:select()
   - [9, 90]
   - [10, 100]
 ...
-test_run:cmd('restart server autobootstrap3')
+test_run:cmd('restart server autobootstrap3 with args="0.3 1.0"')
 box.space.test:select()
 ---
 - - [1, 10]
diff --git a/test/replication/before_replace.test.lua b/test/replication/before_replace.test.lua
index 2c6912d06..1f1877b09 100644
--- a/test/replication/before_replace.test.lua
+++ b/test/replication/before_replace.test.lua
@@ -4,10 +4,12 @@
 env = require('test_run')
 test_run = env.new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.5 4.0")
 test_run:wait_fullmesh(SERVERS)
 
 -- Setup space:before_replace trigger on all replicas.
@@ -54,15 +56,15 @@ vclock2 = test_run:wait_cluster_vclock(SERVERS, vclock)
 -- and the state persists after restart.
 test_run:cmd("switch autobootstrap1")
 box.space.test:select()
-test_run:cmd('restart server autobootstrap1')
+test_run:cmd('restart server autobootstrap1 with args="0.3 1.0"')
 box.space.test:select()
 test_run:cmd("switch autobootstrap2")
 box.space.test:select()
-test_run:cmd('restart server autobootstrap2')
+test_run:cmd('restart server autobootstrap2 with args="0.3 1.0"')
 box.space.test:select()
 test_run:cmd("switch autobootstrap3")
 box.space.test:select()
-test_run:cmd('restart server autobootstrap3')
+test_run:cmd('restart server autobootstrap3 with args="0.3 1.0"')
 box.space.test:select()
 
 
diff --git a/test/replication/catch.result b/test/replication/catch.result
index 91be32725..f048afd74 100644
--- a/test/replication/catch.result
+++ b/test/replication/catch.result
@@ -23,7 +23,7 @@ test_run:cmd("create server replica with rpl_master=default, script='replication
 ---
 - true
 ...
-test_run:cmd("start server replica with args='1'")
+test_run:cmd("start server replica with args='0.3'")
 ---
 - true
 ...
diff --git a/test/replication/catch.test.lua b/test/replication/catch.test.lua
index 2e2e97bc4..5623b922f 100644
--- a/test/replication/catch.test.lua
+++ b/test/replication/catch.test.lua
@@ -9,7 +9,7 @@ errinj = box.error.injection
 
 box.schema.user.grant('guest', 'replication')
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica with args='1'")
+test_run:cmd("start server replica with args='0.3'")
 test_run:cmd("switch replica")
 
 test_run:cmd("switch default")
diff --git a/test/replication/ddl.lua b/test/replication/ddl.lua
index 694f40eac..91ca4dee4 100644
--- a/test/replication/ddl.lua
+++ b/test/replication/ddl.lua
@@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
 local USER = 'cluster'
 local PASSWORD = 'somepassword'
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT  = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/autobootstrap'..instance_id..'.sock';
@@ -22,7 +26,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(3);
         USER..':'..PASSWORD..'@'..instance_uri(4);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = TIMEOUT,
+    replication_connect_timeout = CON_TIMEOUT,
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/ddl.result b/test/replication/ddl.result
index cc61fd4ce..2fa6c0f73 100644
--- a/test/replication/ddl.result
+++ b/test/replication/ddl.result
@@ -1,11 +1,14 @@
 test_run = require('test_run').new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = { 'ddl1', 'ddl2', 'ddl3', 'ddl4' }
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 1.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
diff --git a/test/replication/ddl.test.lua b/test/replication/ddl.test.lua
index 4cf4ffa38..55ffecc8b 100644
--- a/test/replication/ddl.test.lua
+++ b/test/replication/ddl.test.lua
@@ -1,9 +1,11 @@
 test_run = require('test_run').new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = { 'ddl1', 'ddl2', 'ddl3', 'ddl4' }
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 1.0")
 test_run:wait_fullmesh(SERVERS)
 test_run:cmd("switch ddl1")
 test_run = require('test_run').new()
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index ca8af2988..3fc432010 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -418,7 +418,7 @@ test_run:cmd("create server replica_timeout with rpl_master=default, script='rep
 ---
 - true
 ...
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.01 0.5'")
 ---
 - true
 ...
@@ -474,7 +474,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 ...
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009 0.5'")
 ---
 - true
 ...
@@ -522,7 +522,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001 0.5'")
 ---
 - true
 ...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 463d89a8f..37375f45e 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -173,7 +173,7 @@ errinj.set("ERRINJ_RELAY_EXIT_DELAY", 0)
 box.cfg{replication_timeout = 0.01}
 
 test_run:cmd("create server replica_timeout with rpl_master=default, script='replication/replica_timeout.lua'")
-test_run:cmd("start server replica_timeout with args='0.01'")
+test_run:cmd("start server replica_timeout with args='0.01 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -199,7 +199,7 @@ errinj.set("ERRINJ_RELAY_REPORT_INTERVAL", 0)
 -- Check replica's ACKs don't prevent the master from sending
 -- heartbeat messages (gh-3160).
 
-test_run:cmd("start server replica_timeout with args='0.009'")
+test_run:cmd("start server replica_timeout with args='0.009 0.5'")
 test_run:cmd("switch replica_timeout")
 
 fiber = require('fiber')
@@ -219,7 +219,7 @@ for i = 0, 9999 do box.space.test:replace({i, 4, 5, 'test'}) end
 -- during the join stage, i.e. a replica with a minuscule
 -- timeout successfully bootstraps and breaks connection only
 -- after subscribe.
-test_run:cmd("start server replica_timeout with args='0.00001'")
+test_run:cmd("start server replica_timeout with args='0.00001 0.5'")
 test_run:cmd("switch replica_timeout")
 fiber = require('fiber')
 while box.info.replication[1].upstream.message ~= 'timed out' do fiber.sleep(0.0001) end
diff --git a/test/replication/lua/cluster_cmd.lua b/test/replication/lua/cluster_cmd.lua
new file mode 100644
index 000000000..e9e0ead65
--- /dev/null
+++ b/test/replication/lua/cluster_cmd.lua
@@ -0,0 +1,14 @@
+function create_cluster_with_args(test_run, servers, args)
+    local cmd1 = 'create server %s with script="replication/%s.lua"'
+    local cmd2 = 'start server %s with wait_load=False, wait=False, args="%s"'
+    for _, srv in ipairs(servers) do
+        test_run:cmd(cmd1:format(srv, srv))
+    end
+    for _, srv in ipairs(servers) do
+        test_run:cmd(cmd2:format(srv, args))
+    end
+end
+
+return {
+    create_cluster_with_args = create_cluster_with_args;
+}
diff --git a/test/replication/master.lua b/test/replication/master.lua
index 6d431aaeb..9b96b7891 100644
--- a/test/replication/master.lua
+++ b/test/replication/master.lua
@@ -4,6 +4,7 @@ box.cfg({
     listen              = os.getenv("LISTEN"),
     memtx_memory        = 107374182,
     replication_connect_timeout = 0.5,
+    replication_timeout = 0.1
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/master_quorum.lua b/test/replication/master_quorum.lua
index fb5f7ec2b..f8c37a252 100644
--- a/test/replication/master_quorum.lua
+++ b/test/replication/master_quorum.lua
@@ -4,6 +4,10 @@
 local INSTANCE_ID = string.match(arg[0], "%d")
 
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/master_quorum'..instance_id..'.sock';
@@ -20,7 +24,8 @@ box.cfg({
         instance_uri(2);
     };
     replication_connect_quorum = 0;
-    replication_connect_timeout = 0.1;
+    replication_timeout = TIMEOUT;
+    replication_connect_timeout = CON_TIMEOUT;
 })
 
 test_run = require('test_run').new()
diff --git a/test/replication/misc.result b/test/replication/misc.result
index ff0dbf549..cd3a0dba5 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -4,6 +4,9 @@ uuid = require('uuid')
 test_run = require('test_run').new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 box.schema.user.grant('guest', 'replication')
 ---
 ...
@@ -69,7 +72,7 @@ SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index c05e52165..04f054837 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -1,6 +1,8 @@
 uuid = require('uuid')
 test_run = require('test_run').new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 box.schema.user.grant('guest', 'replication')
 
 -- gh-2991 - Tarantool asserts on box.cfg.replication update if one of
@@ -27,7 +29,7 @@ box.cfg{read_only = false}
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 test_run:wait_fullmesh(SERVERS)
 test_run:cmd("switch autobootstrap1")
 test_run = require('test_run').new()
diff --git a/test/replication/on_replace.lua b/test/replication/on_replace.lua
index 03f15d94c..a9e63d806 100644
--- a/test/replication/on_replace.lua
+++ b/test/replication/on_replace.lua
@@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
 local USER = 'cluster'
 local PASSWORD = 'somepassword'
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/on_replace'..instance_id..'.sock';
@@ -20,7 +24,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(1);
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
-    replication_connect_timeout = 0.5,
+    replication_timeout = TIMEOUT,
+    replication_connect_timeout = CON_TIMEOUT,
 })
 
 env = require('test_run')
diff --git a/test/replication/on_replace.result b/test/replication/on_replace.result
index 1736c53b7..7010fe136 100644
--- a/test/replication/on_replace.result
+++ b/test/replication/on_replace.result
@@ -10,6 +10,9 @@ test_run = env.new()
 fiber = require('fiber')
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 _ = box.schema.space.create('test')
 ---
 ...
@@ -98,7 +101,7 @@ box.schema.user.revoke('guest', 'replication')
 SERVERS = { 'on_replace1', 'on_replace2' }
 ---
 ...
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
diff --git a/test/replication/on_replace.test.lua b/test/replication/on_replace.test.lua
index 10bba2ddf..5ac406767 100644
--- a/test/replication/on_replace.test.lua
+++ b/test/replication/on_replace.test.lua
@@ -6,6 +6,8 @@ env = require('test_run')
 test_run = env.new()
 fiber = require('fiber')
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 _ = box.schema.space.create('test')
 _ = box.space.test:create_index('primary')
 box.schema.user.grant('guest', 'replication')
@@ -44,7 +46,7 @@ box.schema.user.revoke('guest', 'replication')
 -- gh-2682 on_replace on slave server with data change
 
 SERVERS = { 'on_replace1', 'on_replace2' }
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 test_run:wait_fullmesh(SERVERS)
 
 test_run:cmd('switch on_replace1')
diff --git a/test/replication/quorum.lua b/test/replication/quorum.lua
index 9c7bf5c93..2767af974 100644
--- a/test/replication/quorum.lua
+++ b/test/replication/quorum.lua
@@ -4,6 +4,10 @@
 local INSTANCE_ID = string.match(arg[0], "%d")
 
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/quorum'..instance_id..'.sock';
@@ -14,9 +18,9 @@ require('console').listen(os.getenv('ADMIN'))
 
 box.cfg({
     listen = instance_uri(INSTANCE_ID);
-    replication_timeout = 0.05;
-    replication_sync_lag = 0.01;
-    replication_connect_timeout = 0.1;
+    replication_sync_lag = 0.1;
+    replication_timeout = TIMEOUT;
+    replication_connect_timeout = CON_TIMEOUT;
     replication_connect_quorum = 3;
     replication = {
         instance_uri(1);
diff --git a/test/replication/quorum.result b/test/replication/quorum.result
index 8f6e7a070..48f7c9503 100644
--- a/test/replication/quorum.result
+++ b/test/replication/quorum.result
@@ -1,11 +1,14 @@
 test_run = require('test_run').new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = {'quorum1', 'quorum2', 'quorum3'}
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 4.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
@@ -27,7 +30,7 @@ test_run:cmd('switch quorum2')
 ---
 - true
 ...
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 ---
 - orphan
@@ -51,7 +54,7 @@ box.info.status -- running
 ---
 - running
 ...
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 ---
 - orphan
@@ -82,7 +85,7 @@ box.info.status -- running
 ---
 - running
 ...
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 ---
 - orphan
@@ -99,7 +102,7 @@ box.space.test:replace{100} -- error
 ---
 - error: Can't modify data because this instance is in read-only mode.
 ...
-test_run:cmd('start server quorum1')
+test_run:cmd('start server quorum1 with args="0.1 0.5"')
 ---
 - true
 ...
@@ -158,7 +161,7 @@ fiber = require('fiber')
 fiber.sleep(0.1)
 ---
 ...
-test_run:cmd('start server quorum1')
+test_run:cmd('start server quorum1 with args="0.1  0.5"')
 ---
 - true
 ...
@@ -196,7 +199,7 @@ test_run:cmd('switch quorum1')
 ---
 - true
 ...
-test_run:cmd('restart server quorum1 with cleanup=1')
+test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
 box.space.test:count() -- 100
 ---
 - 100
@@ -356,7 +359,7 @@ SERVERS = {'master_quorum1', 'master_quorum2'}
 ---
 ...
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
@@ -401,3 +404,48 @@ test_run:cmd("switch default")
 test_run:drop_cluster(SERVERS)
 ---
 ...
+-- Test that quorum is not ignored neither during bootstrap
+-- nor during reconfiguration.
+box.schema.user.grant('guest', 'replication')
+---
+...
+test_run:cmd("create server replica_quorum with script='replication/replica_quorum.lua'")
+---
+- true
+...
+-- Arguments are: replication_connect_quorum, replication_timeout,
+-- replication_connect_timeout.
+-- If replication_connect_quorum was ignored here, the instance would
+-- exit with an error.
+test_run:cmd("start server replica_quorum with wait=True, wait_load=True, args='1 0.05 0.1'")
+---
+- true
+...
+test_run:cmd("switch replica_quorum")
+---
+- true
+...
+-- If replication_connect_quorum was ignored here, the instance would
+-- exit with an error.
+box.cfg{replication={INSTANCE_URI, non_existent_uri(1)}}
+---
+...
+box.info.id
+---
+- 1
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica_quorum")
+---
+- true
+...
+test_run:cmd("delete server replica_quorum")
+---
+- true
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/replication/quorum.test.lua b/test/replication/quorum.test.lua
index 1df0ae1e7..dcd2ff3bd 100644
--- a/test/replication/quorum.test.lua
+++ b/test/replication/quorum.test.lua
@@ -1,9 +1,11 @@
 test_run = require('test_run').new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = {'quorum1', 'quorum2', 'quorum3'}
 
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 4.0")
 test_run:wait_fullmesh(SERVERS)
 
 -- Stop one replica and try to restart another one.
@@ -18,7 +20,7 @@ test_run:cmd('stop server quorum1')
 
 test_run:cmd('switch quorum2')
 
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 box.ctl.wait_rw(0.001) -- timeout
 box.info.ro -- true
@@ -26,7 +28,7 @@ box.space.test:replace{100} -- error
 box.cfg{replication={}}
 box.info.status -- running
 
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 box.ctl.wait_rw(0.001) -- timeout
 box.info.ro -- true
@@ -36,12 +38,12 @@ box.ctl.wait_rw()
 box.info.ro -- false
 box.info.status -- running
 
-test_run:cmd('restart server quorum2')
+test_run:cmd('restart server quorum2 with args="0.1 0.5"')
 box.info.status -- orphan
 box.ctl.wait_rw(0.001) -- timeout
 box.info.ro -- true
 box.space.test:replace{100} -- error
-test_run:cmd('start server quorum1')
+test_run:cmd('start server quorum1 with args="0.1 0.5"')
 box.ctl.wait_rw()
 box.info.ro -- false
 box.info.status -- running
@@ -63,7 +65,7 @@ for i = 1, 100 do box.space.test:insert{i} end
 fiber = require('fiber')
 fiber.sleep(0.1)
 
-test_run:cmd('start server quorum1')
+test_run:cmd('start server quorum1 with args="0.1  0.5"')
 test_run:cmd('switch quorum1')
 box.space.test:count() -- 100
 
@@ -79,7 +81,7 @@ test_run:cmd('switch quorum2')
 box.snapshot()
 
 test_run:cmd('switch quorum1')
-test_run:cmd('restart server quorum1 with cleanup=1')
+test_run:cmd('restart server quorum1 with cleanup=1, args="0.1 0.5"')
 
 box.space.test:count() -- 100
 
@@ -136,7 +138,7 @@ box.schema.user.revoke('guest', 'replication')
 -- Second case, check that master-master works.
 SERVERS = {'master_quorum1', 'master_quorum2'}
 -- Deploy a cluster.
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 test_run:wait_fullmesh(SERVERS)
 test_run:cmd("switch master_quorum1")
 repl = box.cfg.replication
@@ -150,3 +152,22 @@ box.space.test:select()
 test_run:cmd("switch default")
 -- Cleanup.
 test_run:drop_cluster(SERVERS)
+
+-- Test that quorum is not ignored neither during bootstrap
+-- nor during reconfiguration.
+box.schema.user.grant('guest', 'replication')
+test_run:cmd("create server replica_quorum with script='replication/replica_quorum.lua'")
+-- Arguments are: replication_connect_quorum, replication_timeout,
+-- replication_connect_timeout.
+-- If replication_connect_quorum was ignored here, the instance would
+-- exit with an error.
+test_run:cmd("start server replica_quorum with wait=True, wait_load=True, args='1 0.05 0.1'")
+test_run:cmd("switch replica_quorum")
+-- If replication_connect_quorum was ignored here, the instance would
+-- exit with an error.
+box.cfg{replication={INSTANCE_URI, non_existent_uri(1)}}
+box.info.id
+test_run:cmd("switch default")
+test_run:cmd("stop server replica_quorum")
+test_run:cmd("delete server replica_quorum")
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/rebootstrap.lua b/test/replication/rebootstrap.lua
index e743577e4..e77270028 100644
--- a/test/replication/rebootstrap.lua
+++ b/test/replication/rebootstrap.lua
@@ -4,6 +4,10 @@
 local INSTANCE_ID = string.match(arg[0], "%d")
 
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     return SOCKET_DIR..'/rebootstrap'..instance_id..'.sock';
 end
@@ -15,7 +19,7 @@ box.cfg({
     listen = instance_uri(INSTANCE_ID),
     instance_uuid = '12345678-abcd-1234-abcd-123456789ef' .. INSTANCE_ID,
     replication_timeout = 0.1,
-    replication_connect_timeout = 0.5,
+    replication_connect_timeout = 2.0,
     replication = {
         instance_uri(1);
         instance_uri(2);
diff --git a/test/replication/rebootstrap.result b/test/replication/rebootstrap.result
index afbfc8e65..90e8786e3 100644
--- a/test/replication/rebootstrap.result
+++ b/test/replication/rebootstrap.result
@@ -1,10 +1,13 @@
 test_run = require('test_run').new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = {'rebootstrap1', 'rebootstrap2'}
 ---
 ...
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 ---
 ...
 test_run:wait_fullmesh(SERVERS)
@@ -20,11 +23,11 @@ test_run:cmd('stop server rebootstrap1')
 ---
 - true
 ...
-test_run:cmd('restart server rebootstrap2 with cleanup=True, wait=False, wait_load=False')
+test_run:cmd('restart server rebootstrap2 with cleanup=True, wait=False, wait_load=False, args="0.3 2.0"')
 ---
 - true
 ...
-test_run:cmd('start server rebootstrap1')
+test_run:cmd('start server rebootstrap1 with args="0.1 0.5"')
 ---
 - true
 ...
diff --git a/test/replication/rebootstrap.test.lua b/test/replication/rebootstrap.test.lua
index 954726ddb..2a85a3a25 100644
--- a/test/replication/rebootstrap.test.lua
+++ b/test/replication/rebootstrap.test.lua
@@ -1,8 +1,10 @@
 test_run = require('test_run').new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = {'rebootstrap1', 'rebootstrap2'}
 
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.3 1.0")
 test_run:wait_fullmesh(SERVERS)
 
 --
@@ -12,8 +14,8 @@ test_run:wait_fullmesh(SERVERS)
 -- in 'orphan' mode.
 --
 test_run:cmd('stop server rebootstrap1')
-test_run:cmd('restart server rebootstrap2 with cleanup=True, wait=False, wait_load=False')
-test_run:cmd('start server rebootstrap1')
+test_run:cmd('restart server rebootstrap2 with cleanup=True, wait=False, wait_load=False, args="0.3 2.0"')
+test_run:cmd('start server rebootstrap1 with args="0.1 0.5"')
 test_run:cmd('switch rebootstrap1')
 box.info.status -- running
 
diff --git a/test/replication/recover_missing_xlog.result b/test/replication/recover_missing_xlog.result
index 027f8761e..7aac274d8 100644
--- a/test/replication/recover_missing_xlog.result
+++ b/test/replication/recover_missing_xlog.result
@@ -4,11 +4,14 @@ env = require('test_run')
 test_run = env.new()
 ---
 ...
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+---
+...
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 ---
 ...
 -- Start servers
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 ---
 ...
 -- Wait for full mesh
@@ -66,7 +69,7 @@ fio.unlink(fio.pathjoin(fio.abspath("."), string.format('autobootstrap1/%020d.xl
 ---
 - true
 ...
-test_run:cmd("start server autobootstrap1")
+test_run:cmd('start server autobootstrap1 with args="0.1 0.5"')
 ---
 - true
 ...
diff --git a/test/replication/recover_missing_xlog.test.lua b/test/replication/recover_missing_xlog.test.lua
index 57bc7d31f..e22b96da1 100644
--- a/test/replication/recover_missing_xlog.test.lua
+++ b/test/replication/recover_missing_xlog.test.lua
@@ -1,9 +1,11 @@
 env = require('test_run')
 test_run = env.new()
 
+create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
+
 SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
 -- Start servers
-test_run:create_cluster(SERVERS)
+create_cluster_with_args(test_run, SERVERS, "0.1 0.5")
 -- Wait for full mesh
 test_run:wait_fullmesh(SERVERS)
 
@@ -28,7 +30,7 @@ fio = require('fio')
 -- Also check that there is no concurrency, i.e. master is
 -- in 'read-only' mode unless it receives all data.
 fio.unlink(fio.pathjoin(fio.abspath("."), string.format('autobootstrap1/%020d.xlog', 8)))
-test_run:cmd("start server autobootstrap1")
+test_run:cmd('start server autobootstrap1 with args="0.1 0.5"')
 
 test_run:cmd("switch autobootstrap1")
 for i = 10, 19 do box.space.test:insert{i, 'test' .. i} end
diff --git a/test/replication/replica_no_quorum.lua b/test/replication/replica_no_quorum.lua
index b9edeea94..c30c043cc 100644
--- a/test/replication/replica_no_quorum.lua
+++ b/test/replication/replica_no_quorum.lua
@@ -5,7 +5,8 @@ box.cfg({
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_connect_quorum = 0,
-    replication_connect_timeout = 0.1,
+    replication_timeout = 0.1,
+    replication_connect_timeout = 0.5,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_quorum.lua b/test/replication/replica_quorum.lua
new file mode 100644
index 000000000..19c9e717f
--- /dev/null
+++ b/test/replication/replica_quorum.lua
@@ -0,0 +1,24 @@
+#!/usr/bin/env tarantool
+
+local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = arg[2] and tonumber(arg[2]) or 0.1
+local CON_TIMEOUT = arg[3] and tonumber(arg[3]) or 0.5
+local QUORUM = tonumber(arg[1])
+INSTANCE_URI = SOCKET_DIR..'/replica_quorum.sock'
+
+function non_existent_uri(id)
+    return SOCKET_DIR..'/replica_quorum'..(1000+id)..'.sock'
+end
+
+require('console').listen(os.getenv('ADMIN'))
+
+box.cfg{
+    listen = INSTANCE_URI,
+    replication_timeout = TIMEOUT,
+    replication_connect_timeout = CON_TIMEOUT,
+    replication_connect_quorum = QUORUM,
+    replication = {INSTANCE_URI,
+                   non_existent_uri(1),
+                   non_existent_uri(2)}
+}
diff --git a/test/replication/replica_timeout.lua b/test/replication/replica_timeout.lua
index 64f119763..51c718360 100644
--- a/test/replication/replica_timeout.lua
+++ b/test/replication/replica_timeout.lua
@@ -1,13 +1,14 @@
 #!/usr/bin/env tarantool
 
 local TIMEOUT = tonumber(arg[1])
+local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
 
 box.cfg({
     listen              = os.getenv("LISTEN"),
     replication         = os.getenv("MASTER"),
     memtx_memory        = 107374182,
     replication_timeout = TIMEOUT,
-    replication_connect_timeout = TIMEOUT * 3,
+    replication_connect_timeout = CON_TIMEOUT,
 })
 
 require('console').listen(os.getenv('ADMIN'))
diff --git a/test/replication/replica_uuid_ro.lua b/test/replication/replica_uuid_ro.lua
index 8e1c6cc47..5e06edd11 100644
--- a/test/replication/replica_uuid_ro.lua
+++ b/test/replication/replica_uuid_ro.lua
@@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
 local USER = 'cluster'
 local PASSWORD = 'somepassword'
 local SOCKET_DIR = require('fio').cwd()
+
+local TIMEOUT = tonumber(arg[2])
+local CON_TIMEOUT = arg[3] and tonumber(arg[3]) or TIMEOUT * 3
+
 local function instance_uri(instance_id)
     --return 'localhost:'..(3310 + instance_id)
     return SOCKET_DIR..'/replica_uuid_ro'..instance_id..'.sock';
@@ -22,7 +26,8 @@ box.cfg({
         USER..':'..PASSWORD..'@'..instance_uri(2);
     };
     read_only = (INSTANCE_ID ~= '1' and true or false);
-    replication_connect_timeout = 0.5,
+    replication_timeout = TIMEOUT;
+    replication_connect_timeout = CON_TIMEOUT;
 })
 
 box.once("bootstrap", function()
diff --git a/test/replication/replicaset_ro_mostly.result b/test/replication/replicaset_ro_mostly.result
index b9e8f1fe8..074b608d8 100644
--- a/test/replication/replicaset_ro_mostly.result
+++ b/test/replication/replicaset_ro_mostly.result
@@ -27,7 +27,7 @@ UUID = sort({uuid1, uuid2}, sort_cmp)
 create_cluster_cmd1 = 'create server %s with script="replication/%s.lua"'
 ---
 ...
-create_cluster_cmd2 = 'start server %s with args="%s", wait_load=False, wait=False'
+create_cluster_cmd2 = 'start server %s with args="%s %s", wait_load=False, wait=False'
 ---
 ...
 test_run:cmd("setopt delimiter ';'")
@@ -37,7 +37,9 @@ test_run:cmd("setopt delimiter ';'")
 function create_cluster_uuid(servers, uuids)
     for i, name in ipairs(servers) do
         test_run:cmd(create_cluster_cmd1:format(name, name))
-        test_run:cmd(create_cluster_cmd2:format(name, uuids[i]))
+    end
+    for i, name in ipairs(servers) do
+        test_run:cmd(create_cluster_cmd2:format(name, uuids[i], "0.3 1.0"))
     end
 end;
 ---
@@ -61,7 +63,7 @@ test_run:cmd(create_cluster_cmd1:format(name, name))
 ---
 - true
 ...
-test_run:cmd(create_cluster_cmd2:format(name, uuid.new()))
+test_run:cmd(create_cluster_cmd2:format(name, uuid.new(), "0.3 1.0"))
 ---
 - true
 ...
diff --git a/test/replication/replicaset_ro_mostly.test.lua b/test/replication/replicaset_ro_mostly.test.lua
index f2c2d0d11..f11e72681 100644
--- a/test/replication/replicaset_ro_mostly.test.lua
+++ b/test/replication/replicaset_ro_mostly.test.lua
@@ -12,13 +12,15 @@ function sort(t) table.sort(t, sort_cmp) return t end
 UUID = sort({uuid1, uuid2}, sort_cmp)
 
 create_cluster_cmd1 = 'create server %s with script="replication/%s.lua"'
-create_cluster_cmd2 = 'start server %s with args="%s", wait_load=False, wait=False'
+create_cluster_cmd2 = 'start server %s with args="%s %s", wait_load=False, wait=False'
 
 test_run:cmd("setopt delimiter ';'")
 function create_cluster_uuid(servers, uuids)
     for i, name in ipairs(servers) do
         test_run:cmd(create_cluster_cmd1:format(name, name))
-        test_run:cmd(create_cluster_cmd2:format(name, uuids[i]))
+    end
+    for i, name in ipairs(servers) do
+        test_run:cmd(create_cluster_cmd2:format(name, uuids[i], "0.3 1.0"))
     end
 end;
 test_run:cmd("setopt delimiter ''");
@@ -30,7 +32,7 @@ test_run:wait_fullmesh(SERVERS)
 -- Add third replica
 name = 'replica_uuid_ro3'
 test_run:cmd(create_cluster_cmd1:format(name, name))
-test_run:cmd(create_cluster_cmd2:format(name, uuid.new()))
+test_run:cmd(create_cluster_cmd2:format(name, uuid.new(), "0.3 1.0"))
 test_run:cmd('switch replica_uuid_ro3')
 test_run:cmd('switch default')
 
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index b489add58..9d34e897e 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -5,6 +5,6 @@ description = tarantool/box, replication
 disabled = consistent.test.lua
 release_disabled = catch.test.lua errinj.test.lua gc.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua
 config = suite.cfg
-lua_libs = lua/fast_replica.lua
+lua_libs = lua/fast_replica.lua lua/cluster_cmd.lua
 long_run = prune.test.lua
 is_parallel = False
-- 
2.15.2 (Apple Git-101.1)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3] replication: do not ignore replication_connect_quorum.
  2018-08-10  9:14           ` [PATCH v3] " Serge Petrenko
@ 2018-08-10 10:48             ` Vladimir Davydov
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2018-08-10 10:48 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: tarantool-patches, georgy

On Fri, Aug 10, 2018 at 12:14:53PM +0300, Serge Petrenko wrote:
> On bootstrap and after replication reconfiguration
> replication_connect_quorum was ignored. The instance tried to connect to
> every replica listed in replication parameter, and errored if it wasn't
> possible.
> The patch alters this behaviour. The instance still tries to connect to
> every node listed in replication, but does not raise an error if it was
> able to connect to at least replication_connect_quorum instances.
> 
> Closes #3428
> 
> @TarantoolBot document
> Title: replication_connect_quorum is not ignored.
> Now on replica set bootstrap instance tries to connect
> not to every replica, listed in box.cfg.replication, but
> only to replication_connect_quorum replicas. If after
> replication_connect_timeout seconds instance is not connected to
> at least replication_connect_quorum other instances, we
> throw an error.

Please also mention that quorum is not ignored in case of replication
reconfiguration (i.e. on subsequent calls to box.cfg{replication}).

> diff --git a/src/box/box.cc b/src/box/box.cc
> index e3eb2738f..8cf43a6ad 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -626,7 +626,7 @@ box_set_replication(void)
>  
>  	box_check_replication();
>  	/* Try to connect to all replicas within the timeout period */

What happens if we fail to connect to all masters? The configuration
will succeed as long as we've managed to connect to at least 'quorum'
masters. Please update the comment.

> -	box_sync_replication(replication_connect_timeout, true);
> +	box_sync_replication(true);
>  	/* Follow replica */
>  	replicaset_follow();
>  }
> @@ -1866,7 +1866,7 @@ box_cfg_xc(void)
>  		title("orphan");
>  
>  		/* Wait for the cluster to start up */
> -		box_sync_replication(replication_connect_timeout, false);
> +		box_sync_replication(false);

This code cries for an explanation why we don't need to connect to
'quorum' masters here. Please update the comment.

>  	} else {
>  		if (!tt_uuid_is_nil(&instance_uuid))
>  			INSTANCE_UUID = instance_uuid;
> @@ -1888,7 +1888,7 @@ box_cfg_xc(void)
>  		 * receive the same replica set UUID when a new cluster
>  		 * is deployed.
>  		 */
> -		box_sync_replication(TIMEOUT_INFINITY, true);
> +		box_sync_replication(true);

Here's the comment to this code (it got trimmed by diff):

		/*
		 * Wait for the cluster to start up.
		 *
		 * Note, when bootstrapping a new instance, we have to
		 * connect to all masters to make sure all replicas
		 * receive the same replica set UUID when a new cluster
		 * is deployed.
		 */

It's obviously outdated by this patch. Please update.

>  		/* Bootstrap a new master */
>  		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
>  	}
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 528fe4459..dc1b9d55a 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -557,7 +557,7 @@ replicaset_connect(struct applier **appliers, int count,
>  	 * - register a trigger in each applier to wake up our
>  	 *   fiber via this channel when the remote peer becomes
>  	 *   connected and a UUID is received;
> -	 * - wait up to CONNECT_TIMEOUT seconds for `count` messages;
> +	 * - wait up to REPLICATION_CONNECT_TIMEOUT seconds for `count` messages;

The line is too long now and conspicuously stands out in the comment.
Please carry it to the next line. Or not update it at all - as a matter
of fact, I don't think we really need to update it.

>  	 * - on timeout, raise a CFG error, cancel and destroy
>  	 *   the freshly created appliers (done in a guard);
>  	 * - an success, unregister the trigger, check the UUID set
> @@ -587,7 +589,8 @@ replicaset_connect(struct applier **appliers, int count,
>  		double wait_start = ev_monotonic_now(loop());
>  		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
>  			break;
> -		if (state.failed > 0 && connect_all)
> +		if (count - state.failed <
> +		    MIN(count, replication_connect_quorum))
>  			break;
>  		timeout -= ev_monotonic_now(loop()) - wait_start;
>  	}
> @@ -595,7 +598,8 @@ replicaset_connect(struct applier **appliers, int count,
>  		say_crit("failed to connect to %d out of %d replicas",
>  			 count - state.connected, count);
>  		/* Timeout or connection failure. */
> -		if (connect_all)
> +		if (connect_quorum && state.connected <
> +		    MIN(count, replication_connect_quorum))

Please eliminate the code duplication by introducing a separate
variable:

	int quorum = MIN(count, replication_connect_quorum);

> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> index 0f9f7a6f0..51283efdf 100644
> --- a/test/replication-py/master.lua
> +++ b/test/replication-py/master.lua
> @@ -3,6 +3,8 @@ os = require('os')
>  box.cfg({
>      listen              = os.getenv("LISTEN"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3
>  })
>  
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
> index 278291bba..b9d193b70 100644
> --- a/test/replication-py/replica.lua
> +++ b/test/replication-py/replica.lua
> @@ -7,6 +7,8 @@ box.cfg({
>      listen              = os.getenv("LISTEN"),
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3

Why don't you pass timeouts to these scripts, as you do for other tests?

>  })
>  
>  box_cfg_done = true
> diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
> index 4f55417ae..c98f3c559 100644
> --- a/test/replication/autobootstrap.lua
> +++ b/test/replication/autobootstrap.lua
> @@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
>  local USER = 'cluster'
>  local PASSWORD = 'somepassword'
>  local SOCKET_DIR = require('fio').cwd()
> +local TIMEOUT = tonumber(arg[1])
> +local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
> +

As far as I can see, you always pass both timeouts to this script so
I don't think you need the fallback.

> +

Extra new line. Please remove.

>  local function instance_uri(instance_id)
>      --return 'localhost:'..(3310 + instance_id)
>      return SOCKET_DIR..'/autobootstrap'..instance_id..'.sock';
> @@ -21,7 +25,9 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>          USER..':'..PASSWORD..'@'..instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_timeout = TIMEOUT;
> +    replication_connect_timeout = CON_TIMEOUT;
> +    replication_timeout = 0.1;

'replication_timeout' is set twice.

> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
> index 40fef2c7a..2f6e527df 100644
> --- a/test/replication/autobootstrap_guest.lua
> +++ b/test/replication/autobootstrap_guest.lua
> @@ -4,6 +4,10 @@
>  local INSTANCE_ID = string.match(arg[0], "%d")
>  
>  local SOCKET_DIR = require('fio').cwd()
> +
> +local TIMEOUT = tonumber(arg[1])
> +local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
> +

Again, I don't think you need to fall back on TIMEOUT * 3 in case
CON_TIMEOUT is unset - you pass both values anyways.

> diff --git a/test/replication/autobootstrap_guest.result b/test/replication/autobootstrap_guest.result
> index 49f9bee01..f0b4cff9b 100644
> --- a/test/replication/autobootstrap_guest.result
> +++ b/test/replication/autobootstrap_guest.result
> @@ -7,13 +7,21 @@ vclock_diff = require('fast_replica').vclock_diff
>  test_run = env.new()
>  ---
>  ...
> +create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
> +---
> +...
>  SERVERS = { 'autobootstrap_guest1', 'autobootstrap_guest2', 'autobootstrap_guest3' }
>  ---
>  ...
>  --
>  -- Start servers
>  --
> -test_run:create_cluster(SERVERS)
> +-- for some reason create_cluster_with_args, just like the
> +-- usual test_run:create_cluster, can hang for arbitrary
> +-- time, so we have to pass a rather large value for
> +-- replication_connect_timeout(4 seconds).
> +-- In most cases, 1 second is enough, though.

Don't think we need this comment - it raises some unsettling questions.
Let's just always pass a big timeout (say 10 seconds) to create_cluster,
just in case? In other tests too, for consistency.

> +create_cluster_with_args(test_run, SERVERS, "0.3 4.0")

Why do you pass 0.3 for replication_timeout? 0.1 should work just fine, no?

> diff --git a/test/replication/before_replace.result b/test/replication/before_replace.result
> index d561b4813..6f207ae01 100644
> --- a/test/replication/before_replace.result
> +++ b/test/replication/before_replace.result
> @@ -7,11 +7,14 @@ env = require('test_run')
>  test_run = env.new()
>  ---
>  ...
> +create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
> +---
> +...
>  SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
>  ---
>  ...
>  -- Deploy a cluster.
> -test_run:create_cluster(SERVERS)
> +create_cluster_with_args(test_run, SERVERS, "0.5 4.0")

Why 0.5 for replication_timeout? Let's leave it 0.1 as it used to be.
You don't change 'replication_timeout' logic in this patch so it should
work just fine. And please always pass a substantially big timeout (say,
10 seconds) to create_cluster(), the same in all old tests.

Alternatively, you can just use the default value in those scripts
(autobootstrap.lua, autobootstrap_guest.lua) if connect_timeout is
omitted (it's 30 seconds, right?). If you do, then scratch my comment
regarding the uselessness of the fallback.

>  ---
>  ...
>  test_run:wait_fullmesh(SERVERS)
> @@ -125,7 +128,7 @@ box.space.test:select()
>    - [9, 90]
>    - [10, 100]
>  ...
> -test_run:cmd('restart server autobootstrap1')
> +test_run:cmd('restart server autobootstrap1 with args="0.3 1.0"')

I don't think you need to change the timeouts here. That is, use 0.1 +
0.5 as before.

Also, this patch is getting rather difficult to review. I think you
should split it:

 - First, allow to pass args to create_cluster
 - Second, pass timeouts explicitly. Use a big timeout when creating a
   cluster, as a preparation for the final patch. Or omit it and use the
   default (30 seconds).
 - Finally, the patch changing behavior of replication_connect_timeout.
   It shouldn't touch existing tests then, only add new ones.

Please do. I stop looking at the patch at this point.

Just one more comment. Please don't introduce a new function for
starting a cluster with custom arguments as you do below. Instead,
patch tarantool/test-run so that create_cluster can optionally take
arguments it is supposed to pass to the script (I think they should
go to 'opts').

> diff --git a/test/replication/lua/cluster_cmd.lua b/test/replication/lua/cluster_cmd.lua
> new file mode 100644
> index 000000000..e9e0ead65
> --- /dev/null
> +++ b/test/replication/lua/cluster_cmd.lua
> @@ -0,0 +1,14 @@
> +function create_cluster_with_args(test_run, servers, args)
> +    local cmd1 = 'create server %s with script="replication/%s.lua"'
> +    local cmd2 = 'start server %s with wait_load=False, wait=False, args="%s"'
> +    for _, srv in ipairs(servers) do
> +        test_run:cmd(cmd1:format(srv, srv))
> +    end
> +    for _, srv in ipairs(servers) do
> +        test_run:cmd(cmd2:format(srv, args))
> +    end
> +end
> +
> +return {
> +    create_cluster_with_args = create_cluster_with_args;
> +}

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-08-10 10:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-07 12:44 [tarantool-patches] [PATCH v2] replication: do not ignore replication_connect_quorum Serge Petrenko
2018-08-07 14:18 ` [tarantool-patches] " Georgy Kirichenko
2018-08-07 17:28 ` [tarantool-patches] " Vladimir Davydov
2018-08-09  7:50   ` [tarantool-patches] " Sergey Petrenko
2018-08-09  9:11     ` Vladimir Davydov
2018-08-09 12:20       ` Vladimir Davydov
2018-08-09 12:39         ` Serge Petrenko
2018-08-10  9:14           ` [PATCH v3] " Serge Petrenko
2018-08-10 10:48             ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox