[tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
Vladimir Davydov
vdavydov.dev at gmail.com
Mon Sep 17 18:05:12 MSK 2018
On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of duplicated connection to the same master we are not able
> to check it at the very beginning due to quorum option. So we allow such
> configuration to proceed till it is obvious. If it the initial
> configuration - replica won't start and in case of configuration change
> we will be notified about duplicated replication source.
This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.
Nit: please use 'duplicate' instead of 'duplicated'.
> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
>
> v1:
>
> Changes in v2:
> - changed completely, now we let duplicated params to proceed
> - stop the applier at the moment when replica has the same hash that other one
> - changed test
>
> src/box/box.cc | 4 ++-
> src/box/replication.cc | 21 +++++++++++++--
> test/replication/duplicate_connection.result | 37 ++++++++++++++++++++++++++
> test/replication/duplicate_connection.test.lua | 16 +++++++++++
> 4 files changed, 75 insertions(+), 3 deletions(-)
> create mode 100644 test/replication/duplicate_connection.result
> create mode 100644 test/replication/duplicate_connection.test.lua
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
> diag_raise();
>
> auto guard = make_scoped_guard([=]{
> - for (int i = 0; i < count; i++)
> + for (int i = 0; i < count; i++) {
> + applier_stop(appliers[i]);
Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect.
> applier_delete(appliers[i]); /* doesn't affect diag */
> + }
> });
>
> replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
> replicaset_check_quorum();
> }
>
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> + switch (replica->applier_sync_state) {
> + case APPLIER_CONNECTED:
> + replica_on_applier_sync(replica);
> + break;
> + case APPLIER_DISCONNECTED:
> + break;
> + default:
> + unreachable();
> + }
> +
> +}
> +
> static void
> replica_on_applier_connect(struct replica *replica)
> {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
> /*
> * Connection to self, duplicate connection
> * to the same master, or the applier fiber
> - * has been cancelled. Assume synced.
> + * has been cancelled. In case of duplicated connection
> + * will be left in this state, otherwise assume synced.
Nit: when updating a comment, please make sure it stays aligned.
> */
> - replica_on_applier_sync(replica);
> + replica_on_applier_off(replica);
I don't understand replica_on_applier_off(): how can the applier be in
the disconnected state at this point? Replacing this call with the old
replica_on_applier_sync() results in an assertion failure:
void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.
I guess replica_on_applier_off() is supposed to plug that hole.
The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that time
and the trigger should've been cleared and hence never called. Looks
like instead of plugging this hole with replica_on_applier_off(), you
should fix a replica struct leak in replicaset_update():
433 static void
434 replicaset_update(struct applier **appliers, int count)
435 {
...
473 if (replica_hash_search(&uniq, replica) != NULL) {
474 tnt_raise(ClientError, ER_CFG, "replication",
475 "duplicate connection to the same replica");
^^^ replica leaks here and stays attached to the applier
476 }
477 replica_hash_insert(&uniq, replica);
> break;
> case APPLIER_STOPPED:
> /* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
> auto uniq_guard = make_scoped_guard([&]{
> replica_hash_foreach_safe(&uniq, replica, next) {
> replica_hash_remove(&uniq, replica);
> + replica_clear_applier(replica);
> replica_delete(replica);
> }
> });
> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')
engine isn't used in this test
Anyway, I think that the test case is small enough to be moved to
misc.test.lua
Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.
> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')