[PATCH v3] replication: do not ignore replication_connect_quorum.

Fri Aug 10 13:48:02 MSK 2018

On Fri, Aug 10, 2018 at 12:14:53PM +0300, Serge Petrenko wrote:
> On bootstrap and after replication reconfiguration
> replication_connect_quorum was ignored. The instance tried to connect to
> every replica listed in replication parameter, and errored if it wasn't
> possible.
> The patch alters this behaviour. The instance still tries to connect to
> every node listed in replication, but does not raise an error if it was
> able to connect to at least replication_connect_quorum instances.
> 
> Closes #3428
> 
> @TarantoolBot document
> Title: replication_connect_quorum is not ignored.
> Now on replica set bootstrap instance tries to connect
> not to every replica, listed in box.cfg.replication, but
> only to replication_connect_quorum replicas. If after
> replication_connect_timeout seconds instance is not connected to
> at least replication_connect_quorum other instances, we
> throw an error.

Please also mention that quorum is not ignored in case of replication
reconfiguration (i.e. on subsequent calls to box.cfg{replication}).

> diff --git a/src/box/box.cc b/src/box/box.cc
> index e3eb2738f..8cf43a6ad 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -626,7 +626,7 @@ box_set_replication(void)
>  
>  	box_check_replication();
>  	/* Try to connect to all replicas within the timeout period */

What happens if we fail to connect to all masters? The configuration
will succeed as long as we've managed to connect to at least 'quorum'
masters. Please update the comment.

> -	box_sync_replication(replication_connect_timeout, true);
> +	box_sync_replication(true);
>  	/* Follow replica */
>  	replicaset_follow();
>  }
> @@ -1866,7 +1866,7 @@ box_cfg_xc(void)
>  		title("orphan");
>  
>  		/* Wait for the cluster to start up */
> -		box_sync_replication(replication_connect_timeout, false);
> +		box_sync_replication(false);

This code cries for an explanation why we don't need to connect to
'quorum' masters here. Please update the comment.

>  	} else {
>  		if (!tt_uuid_is_nil(&instance_uuid))
>  			INSTANCE_UUID = instance_uuid;
> @@ -1888,7 +1888,7 @@ box_cfg_xc(void)
>  		 * receive the same replica set UUID when a new cluster
>  		 * is deployed.
>  		 */
> -		box_sync_replication(TIMEOUT_INFINITY, true);
> +		box_sync_replication(true);

Here's the comment to this code (it got trimmed by diff):

		/*
		 * Wait for the cluster to start up.
		 *
		 * Note, when bootstrapping a new instance, we have to
		 * connect to all masters to make sure all replicas
		 * receive the same replica set UUID when a new cluster
		 * is deployed.
		 */

It's obviously outdated by this patch. Please update.

>  		/* Bootstrap a new master */
>  		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
>  	}
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 528fe4459..dc1b9d55a 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -557,7 +557,7 @@ replicaset_connect(struct applier **appliers, int count,
>  	 * - register a trigger in each applier to wake up our
>  	 *   fiber via this channel when the remote peer becomes
>  	 *   connected and a UUID is received;
> -	 * - wait up to CONNECT_TIMEOUT seconds for `count` messages;
> +	 * - wait up to REPLICATION_CONNECT_TIMEOUT seconds for `count` messages;

The line is too long now and conspicuously stands out in the comment.
Please carry it to the next line. Or not update it at all - as a matter
of fact, I don't think we really need to update it.

>  	 * - on timeout, raise a CFG error, cancel and destroy
>  	 *   the freshly created appliers (done in a guard);
>  	 * - an success, unregister the trigger, check the UUID set
> @@ -587,7 +589,8 @@ replicaset_connect(struct applier **appliers, int count,
>  		double wait_start = ev_monotonic_now(loop());
>  		if (fiber_cond_wait_timeout(&state.wakeup, timeout) != 0)
>  			break;
> -		if (state.failed > 0 && connect_all)
> +		if (count - state.failed <
> +		    MIN(count, replication_connect_quorum))
>  			break;
>  		timeout -= ev_monotonic_now(loop()) - wait_start;
>  	}
> @@ -595,7 +598,8 @@ replicaset_connect(struct applier **appliers, int count,
>  		say_crit("failed to connect to %d out of %d replicas",
>  			 count - state.connected, count);
>  		/* Timeout or connection failure. */
> -		if (connect_all)
> +		if (connect_quorum && state.connected <
> +		    MIN(count, replication_connect_quorum))

Please eliminate the code duplication by introducing a separate
variable:

	int quorum = MIN(count, replication_connect_quorum);

> diff --git a/test/replication-py/master.lua b/test/replication-py/master.lua
> index 0f9f7a6f0..51283efdf 100644
> --- a/test/replication-py/master.lua
> +++ b/test/replication-py/master.lua
> @@ -3,6 +3,8 @@ os = require('os')
>  box.cfg({
>      listen              = os.getenv("LISTEN"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3
>  })
>  
>  require('console').listen(os.getenv('ADMIN'))
> diff --git a/test/replication-py/replica.lua b/test/replication-py/replica.lua
> index 278291bba..b9d193b70 100644
> --- a/test/replication-py/replica.lua
> +++ b/test/replication-py/replica.lua
> @@ -7,6 +7,8 @@ box.cfg({
>      listen              = os.getenv("LISTEN"),
>      replication         = os.getenv("MASTER"),
>      memtx_memory        = 107374182,
> +    replication_connect_timeout = 1.0,
> +    replication_timeout = 0.3

Why don't you pass timeouts to these scripts, as you do for other tests?

>  })
>  
>  box_cfg_done = true
> diff --git a/test/replication/autobootstrap.lua b/test/replication/autobootstrap.lua
> index 4f55417ae..c98f3c559 100644
> --- a/test/replication/autobootstrap.lua
> +++ b/test/replication/autobootstrap.lua
> @@ -5,6 +5,10 @@ local INSTANCE_ID = string.match(arg[0], "%d")
>  local USER = 'cluster'
>  local PASSWORD = 'somepassword'
>  local SOCKET_DIR = require('fio').cwd()
> +local TIMEOUT = tonumber(arg[1])
> +local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
> +

As far as I can see, you always pass both timeouts to this script so
I don't think you need the fallback.

> +

Extra new line. Please remove.

>  local function instance_uri(instance_id)
>      --return 'localhost:'..(3310 + instance_id)
>      return SOCKET_DIR..'/autobootstrap'..instance_id..'.sock';
> @@ -21,7 +25,9 @@ box.cfg({
>          USER..':'..PASSWORD..'@'..instance_uri(2);
>          USER..':'..PASSWORD..'@'..instance_uri(3);
>      };
> -    replication_connect_timeout = 0.5,
> +    replication_timeout = TIMEOUT;
> +    replication_connect_timeout = CON_TIMEOUT;
> +    replication_timeout = 0.1;

'replication_timeout' is set twice.

> diff --git a/test/replication/autobootstrap_guest.lua b/test/replication/autobootstrap_guest.lua
> index 40fef2c7a..2f6e527df 100644
> --- a/test/replication/autobootstrap_guest.lua
> +++ b/test/replication/autobootstrap_guest.lua
> @@ -4,6 +4,10 @@
>  local INSTANCE_ID = string.match(arg[0], "%d")
>  
>  local SOCKET_DIR = require('fio').cwd()
> +
> +local TIMEOUT = tonumber(arg[1])
> +local CON_TIMEOUT = arg[2] and tonumber(arg[2]) or TIMEOUT * 3
> +

Again, I don't think you need to fall back on TIMEOUT * 3 in case
CON_TIMEOUT is unset - you pass both values anyways.

> diff --git a/test/replication/autobootstrap_guest.result b/test/replication/autobootstrap_guest.result
> index 49f9bee01..f0b4cff9b 100644
> --- a/test/replication/autobootstrap_guest.result
> +++ b/test/replication/autobootstrap_guest.result
> @@ -7,13 +7,21 @@ vclock_diff = require('fast_replica').vclock_diff
>  test_run = env.new()
>  ---
>  ...
> +create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
> +---
> +...
>  SERVERS = { 'autobootstrap_guest1', 'autobootstrap_guest2', 'autobootstrap_guest3' }
>  ---
>  ...
>  --
>  -- Start servers
>  --
> -test_run:create_cluster(SERVERS)
> +-- for some reason create_cluster_with_args, just like the
> +-- usual test_run:create_cluster, can hang for arbitrary
> +-- time, so we have to pass a rather large value for
> +-- replication_connect_timeout(4 seconds).
> +-- In most cases, 1 second is enough, though.

Don't think we need this comment - it raises some unsettling questions.
Let's just always pass a big timeout (say 10 seconds) to create_cluster,
just in case? In other tests too, for consistency.

> +create_cluster_with_args(test_run, SERVERS, "0.3 4.0")

Why do you pass 0.3 for replication_timeout? 0.1 should work just fine, no?

> diff --git a/test/replication/before_replace.result b/test/replication/before_replace.result
> index d561b4813..6f207ae01 100644
> --- a/test/replication/before_replace.result
> +++ b/test/replication/before_replace.result
> @@ -7,11 +7,14 @@ env = require('test_run')
>  test_run = env.new()
>  ---
>  ...
> +create_cluster_with_args = require('cluster_cmd').create_cluster_with_args
> +---
> +...
>  SERVERS = { 'autobootstrap1', 'autobootstrap2', 'autobootstrap3' }
>  ---
>  ...
>  -- Deploy a cluster.
> -test_run:create_cluster(SERVERS)
> +create_cluster_with_args(test_run, SERVERS, "0.5 4.0")

Why 0.5 for replication_timeout? Let's leave it 0.1 as it used to be.
You don't change 'replication_timeout' logic in this patch so it should
work just fine. And please always pass a substantially big timeout (say,
10 seconds) to create_cluster(), the same in all old tests.

Alternatively, you can just use the default value in those scripts
(autobootstrap.lua, autobootstrap_guest.lua) if connect_timeout is
omitted (it's 30 seconds, right?). If you do, then scratch my comment
regarding the uselessness of the fallback.

>  ---
>  ...
>  test_run:wait_fullmesh(SERVERS)
> @@ -125,7 +128,7 @@ box.space.test:select()
>    - [9, 90]
>    - [10, 100]
>  ...
> -test_run:cmd('restart server autobootstrap1')
> +test_run:cmd('restart server autobootstrap1 with args="0.3 1.0"')

I don't think you need to change the timeouts here. That is, use 0.1 +
0.5 as before.

Also, this patch is getting rather difficult to review. I think you
should split it:

 - First, allow to pass args to create_cluster
 - Second, pass timeouts explicitly. Use a big timeout when creating a
   cluster, as a preparation for the final patch. Or omit it and use the
   default (30 seconds).
 - Finally, the patch changing behavior of replication_connect_timeout.
   It shouldn't touch existing tests then, only add new ones.

Please do. I stop looking at the patch at this point.

Just one more comment. Please don't introduce a new function for
starting a cluster with custom arguments as you do below. Instead,
patch tarantool/test-run so that create_cluster can optionally take
arguments it is supposed to pass to the script (I think they should
go to 'opts').

> diff --git a/test/replication/lua/cluster_cmd.lua b/test/replication/lua/cluster_cmd.lua
> new file mode 100644
> index 000000000..e9e0ead65
> --- /dev/null
> +++ b/test/replication/lua/cluster_cmd.lua
> @@ -0,0 +1,14 @@
> +function create_cluster_with_args(test_run, servers, args)
> +    local cmd1 = 'create server %s with script="replication/%s.lua"'
> +    local cmd2 = 'start server %s with wait_load=False, wait=False, args="%s"'
> +    for _, srv in ipairs(servers) do
> +        test_run:cmd(cmd1:format(srv, srv))
> +    end
> +    for _, srv in ipairs(servers) do
> +        test_run:cmd(cmd2:format(srv, args))
> +    end
> +end
> +
> +return {
> +    create_cluster_with_args = create_cluster_with_args;
> +}