Tarantool development patches archive
* [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
@ 2018-09-11  7:11 Olga Arkhangelskaia
  2018-09-17 15:05 ` Vladimir Davydov
  0 siblings, 1 reply; 3+ messages in thread
From: Olga Arkhangelskaia @ 2018-09-11  7:11 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Olga Arkhangelskaia

In case of a duplicated connection to the same master we are not able
to check it at the very beginning due to the quorum option, so we allow
such a configuration to proceed until the duplication becomes obvious.
If it is the initial configuration, the replica won't start; in case of
a configuration change we will be notified about the duplicated
replication source.
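
As a toy illustration of the deferred check described above (hypothetical
C++ reduction, not the actual Tarantool code): each configured source is
resolved to an identity — in Tarantool, the instance UUID learned only
after connecting — and uniqueness can therefore be enforced only once the
identities are known, not at box.cfg time.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Deferred duplicate detection sketch: the source URIs cannot be
// validated up front (two different URIs may point at the same master),
// so we check the resolved identities for uniqueness instead.
// Names are illustrative, not the real Tarantool API.
bool has_duplicate_source(const std::vector<std::string> &resolved_ids)
{
	std::set<std::string> uniq;
	for (const std::string &id : resolved_ids) {
		// insert().second is false when the identity is already
		// present, i.e. two sources resolved to the same master.
		if (!uniq.insert(id).second)
			return true;
	}
	return false;
}
```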
---
https://github.com/tarantool/tarantool/issues/3610
https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10

v1:

Changes in v2:
- changed completely, now we let duplicated params to proceed
- stop the applier at the moment when replica has the same hash that other one
- changed test

 src/box/box.cc                                 |  4 ++-
 src/box/replication.cc                         | 21 +++++++++++++--
 test/replication/duplicate_connection.result   | 37 ++++++++++++++++++++++++++
 test/replication/duplicate_connection.test.lua | 16 +++++++++++
 4 files changed, 75 insertions(+), 3 deletions(-)
 create mode 100644 test/replication/duplicate_connection.result
 create mode 100644 test/replication/duplicate_connection.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index f25146d01..d3aeb5de0 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
 		diag_raise();
 
 	auto guard = make_scoped_guard([=]{
-		for (int i = 0; i < count; i++)
+		for (int i = 0; i < count; i++) {
+			applier_stop(appliers[i]);
 			applier_delete(appliers[i]); /* doesn't affect diag */
+		}
 	});
 
 	replicaset_connect(appliers, count, connect_quorum);
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 5755ad45e..d6b65e0a1 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
 	replicaset_check_quorum();
 }
 
+static void
+replica_on_applier_off(struct replica *replica)
+{
+	switch (replica->applier_sync_state) {
+	case APPLIER_CONNECTED:
+		replica_on_applier_sync(replica);
+		break;
+	case APPLIER_DISCONNECTED:
+		break;
+	default:
+		unreachable();
+	}
+
+}
+
 static void
 replica_on_applier_connect(struct replica *replica)
 {
@@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
 		/*
 		 * Connection to self, duplicate connection
 		 * to the same master, or the applier fiber
-		 * has been cancelled. Assume synced.
+		 * has been cancelled. In case of duplicated connection
+		 * will be left in this state, otherwise assume synced.
 		 */
-		replica_on_applier_sync(replica);
+		replica_on_applier_off(replica);
 		break;
 	case APPLIER_STOPPED:
 		/* Unrecoverable error. */
@@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
 	auto uniq_guard = make_scoped_guard([&]{
 		replica_hash_foreach_safe(&uniq, replica, next) {
 			replica_hash_remove(&uniq, replica);
+			replica_clear_applier(replica);
 			replica_delete(replica);
 		}
 	});
diff --git a/test/replication/duplicate_connection.result b/test/replication/duplicate_connection.result
new file mode 100644
index 000000000..5edcd4254
--- /dev/null
+++ b/test/replication/duplicate_connection.result
@@ -0,0 +1,37 @@
+test_run = require('test_run').new()
+---
+...
+engine = test_run:get_cfg('engine')
+---
+...
+box.schema.user.grant('guest', 'replication')
+---
+...
+-- Deploy a replica.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {replication, replication}}
+---
+- error: 'Incorrect value for option ''replication'': duplicate connection to the
+    same replica'
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.schema.user.revoke('guest', 'replication') 
+---
+...
diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
new file mode 100644
index 000000000..b9bc573ca
--- /dev/null
+++ b/test/replication/duplicate_connection.test.lua
@@ -0,0 +1,16 @@
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+box.schema.user.grant('guest', 'replication')
+
+-- Deploy a replica.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+
+replication = box.cfg.replication
+box.cfg{replication = {replication, replication}}
+
+
+test_run:cmd("switch default")
+box.schema.user.revoke('guest', 'replication') 
-- 
2.14.3 (Apple Git-98)


* Re: [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
  2018-09-11  7:11 [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master Olga Arkhangelskaia
@ 2018-09-17 15:05 ` Vladimir Davydov
       [not found]   ` <85617ed6-0cc4-c354-dd62-7a84b9feae70@tarantool.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Vladimir Davydov @ 2018-09-17 15:05 UTC (permalink / raw)
  To: Olga Arkhangelskaia; +Cc: tarantool-patches

On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of a duplicated connection to the same master we are not able
> to check it at the very beginning due to the quorum option, so we allow
> such a configuration to proceed until the duplication becomes obvious.
> If it is the initial configuration, the replica won't start; in case of
> a configuration change we will be notified about the duplicated
> replication source.

This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.

Nit: please use 'duplicate' instead of 'duplicated'.

> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
> 
> v1:
> 
> Changes in v2:
> - changed completely, now we let duplicated params to proceed
> - stop the applier at the moment when replica has the same hash that other one
> - changed test
> 
>  src/box/box.cc                                 |  4 ++-
>  src/box/replication.cc                         | 21 +++++++++++++--
>  test/replication/duplicate_connection.result   | 37 ++++++++++++++++++++++++++
>  test/replication/duplicate_connection.test.lua | 16 +++++++++++
>  4 files changed, 75 insertions(+), 3 deletions(-)
>  create mode 100644 test/replication/duplicate_connection.result
>  create mode 100644 test/replication/duplicate_connection.test.lua
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
>  		diag_raise();
>  
>  	auto guard = make_scoped_guard([=]{
> -		for (int i = 0; i < count; i++)
> +		for (int i = 0; i < count; i++) {
> +			applier_stop(appliers[i]);

Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect.
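
The ownership rule suggested here — whoever starts the appliers also
stops them — can be sketched with a toy model (hypothetical names and a
stand-in type, not the real struct applier): on failure the connect
routine undoes its own starts before propagating the error, so the
caller's guard only ever has to delete.

```cpp
#include <cassert>
#include <vector>

// Stand-in for struct applier; only the lifecycle flag matters here.
struct Applier {
	bool started = false;
};

// Hypothetical connect routine following the review suggestion: it
// starts every applier and, on error, stops them again itself, leaving
// the caller a uniform "never started" state to clean up.
bool replicaset_connect_sketch(std::vector<Applier> &appliers, bool fail)
{
	for (Applier &a : appliers)
		a.started = true;
	if (fail) {
		/* Stop in the same place we started. */
		for (Applier &a : appliers)
			a.started = false;
		return false;
	}
	return true;
}
```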

>  			applier_delete(appliers[i]); /* doesn't affect diag */
> +		}
>  	});
>  
>  	replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
>  	replicaset_check_quorum();
>  }
>  
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> +	switch (replica->applier_sync_state) {
> +	case APPLIER_CONNECTED:
> +		replica_on_applier_sync(replica);
> +		break;
> +	case APPLIER_DISCONNECTED:
> +		break;
> +	default:
> +		unreachable();
> +	}
> +
> +}
> +
>  static void
>  replica_on_applier_connect(struct replica *replica)
>  {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
>  		/*
>  		 * Connection to self, duplicate connection
>  		 * to the same master, or the applier fiber
> -		 * has been cancelled. Assume synced.
> +		 * has been cancelled. In case of duplicated connection
> +		 * will be left in this state, otherwise assume synced.

Nit: when updating a comment, please make sure it stays aligned.

>  		 */
> -		replica_on_applier_sync(replica);
> +		replica_on_applier_off(replica);

I don't understand replica_on_applier_off(). How's it possible?

Replacing this function with replica_on_applier_sync() results in an
assertion failure:

  void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.

I guess replica_on_applier_off() is supposed to plug it.

The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that time
and the trigger should've been cleared and hence never called. Looks
like instead of plugging this hole with replica_on_applier_off(), you
should fix a replica struct leak in replicaset_update():

433 static void
434 replicaset_update(struct applier **appliers, int count)
435 {
...
473                 if (replica_hash_search(&uniq, replica) != NULL) {
474                         tnt_raise(ClientError, ER_CFG, "replication",
475                                   "duplicate connection to the same replica");

^^^ replica leaks here and stays attached to the applier

476                 }
477                 replica_hash_insert(&uniq, replica);
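
The fix hinted at here can be reduced to a stand-alone sketch
(illustrative stand-in types, not the real struct replica / struct
applier): on the "duplicate connection" error path, detach the replica
from its applier and free it before raising, so no leaked replica is
left with a live state trigger.

```cpp
#include <cassert>

struct Replica;

// Minimal stand-ins; only the cross-references matter for the sketch.
struct Applier {
	Replica *owner = nullptr;
};

struct Replica {
	Applier *applier = nullptr;
};

// Hypothetical fixed error path: clear both sides of the link
// (replica_clear_applier()), free the replica (replica_delete()),
// and only then let the caller raise the configuration error.
bool reject_duplicate(Replica *replica, Applier *applier)
{
	replica->applier = nullptr;
	applier->owner = nullptr;
	delete replica;
	return false; /* caller raises ER_CFG "duplicate connection" */
}
```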

>  		break;
>  	case APPLIER_STOPPED:
>  		/* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
>  	auto uniq_guard = make_scoped_guard([&]{
>  		replica_hash_foreach_safe(&uniq, replica, next) {
>  			replica_hash_remove(&uniq, replica);
> +			replica_clear_applier(replica);
>  			replica_delete(replica);
>  		}
>  	});

> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')

engine isn't used in this test

Anyway, I think that the test case is small enough to be moved to
misc.test.lua

Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.

> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication') 


* Re: [tarantool-patches] Re: [PATCH v2] replication: fix assertion with duplicated connect to same master
       [not found]   ` <85617ed6-0cc4-c354-dd62-7a84b9feae70@tarantool.org>
@ 2018-09-22 16:21     ` Vladimir Davydov
  0 siblings, 0 replies; 3+ messages in thread
From: Vladimir Davydov @ 2018-09-22 16:21 UTC (permalink / raw)
  To: Olga Arkhangelskaia; +Cc: tarantool-patches

[Please don't drop the mailing list from Cc]

On Sat, Sep 22, 2018 at 10:04:57AM +0300, Olga Arkhangelskaia wrote:
> 
> Hi! Thanks for the review. I will fix most of the comments and observations.
> 
> However, I guess I need to know that we are on the same page.
> 
> When we do box.cfg{replication = {44441, 44441}},
> we have two replicas, each one with its own applier, etc. In
> replicaset_update() we are able to identify that the replicas are the
> same. At this point we raise an exception.
> 
> Problem occurs when we try to delete the second one. For proper
> deletion we need to stop the applier, clear it and then delete the
> replica. As I understand it, we need to:
> 
>   applier_stop(replica->applier);
>   replica_clear_applier(replica);
>   replica_delete(replica);

If you change the call sequence to

  struct applier *applier = replica->applier;
  replica_clear_applier(replica);
  replica_delete(replica);
  applier_stop(applier);

you won't be facing this problem.

> 
> The reason I added replica_on_applier_off(replica) is that when an
> applier enters the stopped state, its state is marked as APPLIER_OFF.
> The trigger on state change reacts to this with
> replica_on_applier_state_f, which leads us to replica_on_applier_sync.
> Instead we should react to APPLIER_DISCONNECTED, and the only way we
> react to that state is to try to restart the applier.
> 
> So replica_on_applier_off is used when we want to stop the applier for
> good, before replica deletion. I think we do need this function, maybe
> in some other form than the one I wrote.

