* [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
@ 2018-09-11 7:11 Olga Arkhangelskaia
2018-09-17 15:05 ` Vladimir Davydov
0 siblings, 1 reply; 3+ messages in thread
From: Olga Arkhangelskaia @ 2018-09-11 7:11 UTC (permalink / raw)
To: tarantool-patches; +Cc: Olga Arkhangelskaia
In case of a duplicated connection to the same master we are not able
to check it at the very beginning due to the quorum option, so we allow
such a configuration to proceed until the duplicate becomes obvious. If
it is the initial configuration, the replica won't start; in case of a
configuration change we will be notified about the duplicated
replication source.
---
https://github.com/tarantool/tarantool/issues/3610
https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
v1:
Changes in v2:
- changed completely, now we let duplicated params proceed
- stop the applier at the moment when a replica has the same hash as another one
- changed test
src/box/box.cc | 4 ++-
src/box/replication.cc | 21 +++++++++++++--
test/replication/duplicate_connection.result | 37 ++++++++++++++++++++++++++
test/replication/duplicate_connection.test.lua | 16 +++++++++++
4 files changed, 75 insertions(+), 3 deletions(-)
create mode 100644 test/replication/duplicate_connection.result
create mode 100644 test/replication/duplicate_connection.test.lua
diff --git a/src/box/box.cc b/src/box/box.cc
index f25146d01..d3aeb5de0 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
diag_raise();
auto guard = make_scoped_guard([=]{
- for (int i = 0; i < count; i++)
+ for (int i = 0; i < count; i++) {
+ applier_stop(appliers[i]);
applier_delete(appliers[i]); /* doesn't affect diag */
+ }
});
replicaset_connect(appliers, count, connect_quorum);
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 5755ad45e..d6b65e0a1 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
replicaset_check_quorum();
}
+static void
+replica_on_applier_off(struct replica *replica)
+{
+ switch (replica->applier_sync_state) {
+ case APPLIER_CONNECTED:
+ replica_on_applier_sync(replica);
+ break;
+ case APPLIER_DISCONNECTED:
+ break;
+ default:
+ unreachable();
+ }
+
+}
+
static void
replica_on_applier_connect(struct replica *replica)
{
@@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
/*
* Connection to self, duplicate connection
* to the same master, or the applier fiber
- * has been cancelled. Assume synced.
+ * has been cancelled. In case of duplicated connection
+ * will be left in this state, otherwise assume synced.
*/
- replica_on_applier_sync(replica);
+ replica_on_applier_off(replica);
break;
case APPLIER_STOPPED:
/* Unrecoverable error. */
@@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
auto uniq_guard = make_scoped_guard([&]{
replica_hash_foreach_safe(&uniq, replica, next) {
replica_hash_remove(&uniq, replica);
+ replica_clear_applier(replica);
replica_delete(replica);
}
});
diff --git a/test/replication/duplicate_connection.result b/test/replication/duplicate_connection.result
new file mode 100644
index 000000000..5edcd4254
--- /dev/null
+++ b/test/replication/duplicate_connection.result
@@ -0,0 +1,37 @@
+test_run = require('test_run').new()
+---
+...
+engine = test_run:get_cfg('engine')
+---
+...
+box.schema.user.grant('guest', 'replication')
+---
+...
+-- Deploy a replica.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {replication, replication}}
+---
+- error: 'Incorrect value for option ''replication'': duplicate connection to the
+ same replica'
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
new file mode 100644
index 000000000..b9bc573ca
--- /dev/null
+++ b/test/replication/duplicate_connection.test.lua
@@ -0,0 +1,16 @@
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+box.schema.user.grant('guest', 'replication')
+
+-- Deploy a replica.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+
+replication = box.cfg.replication
+box.cfg{replication = {replication, replication}}
+
+
+test_run:cmd("switch default")
+box.schema.user.revoke('guest', 'replication')
--
2.14.3 (Apple Git-98)
* Re: [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
2018-09-11 7:11 [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master Olga Arkhangelskaia
@ 2018-09-17 15:05 ` Vladimir Davydov
[not found] ` <85617ed6-0cc4-c354-dd62-7a84b9feae70@tarantool.org>
0 siblings, 1 reply; 3+ messages in thread
From: Vladimir Davydov @ 2018-09-17 15:05 UTC (permalink / raw)
To: Olga Arkhangelskaia; +Cc: tarantool-patches
On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of a duplicated connection to the same master we are not able
> to check it at the very beginning due to the quorum option, so we allow
> such a configuration to proceed until the duplicate becomes obvious. If
> it is the initial configuration, the replica won't start; in case of a
> configuration change we will be notified about the duplicated
> replication source.
This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.
Nit: please use 'duplicate' instead of 'duplicated'.
> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
>
> v1:
>
> Changes in v2:
> - changed completely, now we let duplicated params proceed
> - stop the applier at the moment when a replica has the same hash as another one
> - changed test
>
> src/box/box.cc | 4 ++-
> src/box/replication.cc | 21 +++++++++++++--
> test/replication/duplicate_connection.result | 37 ++++++++++++++++++++++++++
> test/replication/duplicate_connection.test.lua | 16 +++++++++++
> 4 files changed, 75 insertions(+), 3 deletions(-)
> create mode 100644 test/replication/duplicate_connection.result
> create mode 100644 test/replication/duplicate_connection.test.lua
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
> diag_raise();
>
> auto guard = make_scoped_guard([=]{
> - for (int i = 0; i < count; i++)
> + for (int i = 0; i < count; i++) {
> + applier_stop(appliers[i]);
Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect.
> applier_delete(appliers[i]); /* doesn't affect diag */
> + }
> });
>
> replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
> replicaset_check_quorum();
> }
>
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> + switch (replica->applier_sync_state) {
> + case APPLIER_CONNECTED:
> + replica_on_applier_sync(replica);
> + break;
> + case APPLIER_DISCONNECTED:
> + break;
> + default:
> + unreachable();
> + }
> +
> +}
> +
> static void
> replica_on_applier_connect(struct replica *replica)
> {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
> /*
> * Connection to self, duplicate connection
> * to the same master, or the applier fiber
> - * has been cancelled. Assume synced.
> + * has been cancelled. In case of duplicated connection
> + * will be left in this state, otherwise assume synced.
Nit: when updating a comment, please make sure it stays aligned.
> */
> - replica_on_applier_sync(replica);
> + replica_on_applier_off(replica);
I don't understand replica_on_applier_off(). How's it possible?
Replacing this function with replica_on_applier_sync() results in an
assertion failure:
void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.
I guess replica_on_applier_off() is supposed to plug it.
The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that time
and the trigger should've been cleared and hence never called. Looks
like instead of plugging this hole with replica_on_applier_off(), you
should fix a replica struct leak in replicaset_update():
433 static void
434 replicaset_update(struct applier **appliers, int count)
435 {
...
473 if (replica_hash_search(&uniq, replica) != NULL) {
474 tnt_raise(ClientError, ER_CFG, "replication",
475 "duplicate connection to the same replica");
^^^ replica leaks here and stays attached to the applier
476 }
477 replica_hash_insert(&uniq, replica);
> break;
> case APPLIER_STOPPED:
> /* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
> auto uniq_guard = make_scoped_guard([&]{
> replica_hash_foreach_safe(&uniq, replica, next) {
> replica_hash_remove(&uniq, replica);
> + replica_clear_applier(replica);
> replica_delete(replica);
> }
> });
> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')
engine isn't used in this test
Anyway, I think that the test case is small enough to be moved to
misc.test.lua
Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.
> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')
end of thread, other threads:[~2018-09-22 16:21 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-11 7:11 [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master Olga Arkhangelskaia
2018-09-17 15:05 ` Vladimir Davydov
[not found] ` <85617ed6-0cc4-c354-dd62-7a84b9feae70@tarantool.org>
2018-09-22 16:21 ` [tarantool-patches] " Vladimir Davydov