* [tarantool-patches] [v4 PATCH] replication: fix assertion with duplicate connection
@ 2018-09-30 13:27 Olga Arkhangelskaia
2018-10-01 12:12 ` Vladimir Davydov
0 siblings, 1 reply; 2+ messages in thread
From: Olga Arkhangelskaia @ 2018-09-30 13:27 UTC (permalink / raw)
To: tarantool-patches; +Cc: Olga Arkhangelskaia
This patch fixes the behavior when a replica tries to connect to the same
master more than once. In the case of the initial configuration we raise
an exception. If it is not the initial configuration, we log the error
and disconnect the applier.
Closes #3610
---
https://github.com/tarantool/tarantool/issues/3610
https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
v1:
https://www.freelists.org/post/tarantool-patches/PATCH-box-fix-assertion-with-duplication-in-repl-source
v2:
https://www.freelists.org/post/tarantool-patches/PATCH-v2-replication-fix-assertion-with-duplicated-connect-to-same-master
v3:
https://www.freelists.org/post/tarantool-patches/PATCH-v3-replication-fix-assertion-with-duplicate-connection
Changes in v2:
- changed completely, now duplicated parameters are allowed to proceed
- stop the applier at the moment a replica has the same hash as another one
- changed the test
Changes in v3:
- changed the way and order in which we clean up the applier replica on duplication
- the test now lives in misc.test.lua
- need to think over the second test case
Changes in v4:
- added test case
- no changes in applier states, etc.
- appliers are left in the stopped state in case of a duplicate configuration
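The core of the fix is the duplicate check in replicaset_update(): each applier resolves to a peer UUID after connecting, and a second applier mapping to the same UUID is a configuration error, with the detecting entry cleaned up before the error is raised. A minimal standalone sketch of that idea (all names here are illustrative, not the actual Tarantool API):

```cpp
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <vector>

/* Illustrative stand-in for a connected applier. */
struct Applier {
	std::string peer_uuid; /* filled in after connect */
};

/*
 * Returns normally if all appliers point to distinct peers; throws an
 * ER_CFG-style error (modeled here as std::runtime_error) on a
 * duplicate. In the real patch the detecting entry is released with
 * replica_clear_applier()/replica_delete() before tnt_raise().
 */
void
check_duplicates(const std::vector<Applier> &appliers)
{
	std::unordered_set<std::string> uniq;
	for (const Applier &a : appliers) {
		/* insert().second is false when the UUID is already seen. */
		if (!uniq.insert(a.peer_uuid).second)
			throw std::runtime_error(
				"duplicate connection to the same replica");
	}
}
```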
src/box/box.cc | 2 +-
src/box/replication.cc | 9 ++-
test/replication/misc.result | 115 ++++++++++++++++++++++++++++++++++++++
test/replication/misc.test.lua | 44 ++++++++++++++-
test/replication/replica_dupl.lua | 10 ++++
5 files changed, 176 insertions(+), 4 deletions(-)
create mode 100644 test/replication/replica_dupl.lua
diff --git a/src/box/box.cc b/src/box/box.cc
index 804fc00e5..137b7ecd9 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -668,7 +668,7 @@ box_sync_replication(bool connect_quorum)
auto guard = make_scoped_guard([=]{
for (int i = 0; i < count; i++)
- applier_delete(appliers[i]); /* doesn't affect diag */
+ applier_stop(appliers[i]);
});
replicaset_connect(appliers, count, connect_quorum);
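The make_scoped_guard() helper used in the hunk above is Tarantool's RAII idiom for cleanup-on-error: the guard's callable runs when the guard leaves scope unless it is dismissed first. A minimal sketch of the pattern (the real helper in the Tarantool source differs in details):

```cpp
#include <utility>

/*
 * Minimal scoped guard: runs the stored callable in the destructor
 * unless dismiss() was called. Sketch only; Tarantool's actual
 * make_scoped_guard has a different implementation.
 */
template <typename F>
struct ScopedGuard {
	F fn;
	bool active = true;
	explicit ScopedGuard(F f) : fn(std::move(f)) {}
	~ScopedGuard() { if (active) fn(); }
	void dismiss() { active = false; }
};

template <typename F>
ScopedGuard<F>
make_scoped_guard(F f)
{
	return ScopedGuard<F>(std::move(f));
}
```

In box_sync_replication() the guard ensures the appliers are cleaned up on any early exit; the patch changes what that cleanup does, which is exactly what the review below objects to.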
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 5755ad45e..462ab01fd 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -271,6 +271,8 @@ replica_on_applier_connect(struct replica *replica)
assert(replica->applier_sync_state == APPLIER_DISCONNECTED);
replica->uuid = applier->uuid;
+ replica->applier_sync_state = APPLIER_CONNECTED;
+ replicaset.applier.connected++;
struct replica *orig = replica_hash_search(&replicaset.hash, replica);
if (orig != NULL && orig->applier != NULL) {
@@ -290,6 +292,8 @@ replica_on_applier_connect(struct replica *replica)
if (orig != NULL) {
/* Use existing struct replica */
+ assert(orig->applier_sync_state == APPLIER_DISCONNECTED);
+ orig->applier_sync_state = replica->applier_sync_state;
replica_set_applier(orig, applier);
replica_clear_applier(replica);
replica_delete(replica);
@@ -299,8 +303,6 @@ replica_on_applier_connect(struct replica *replica)
replica_hash_insert(&replicaset.hash, replica);
}
- replica->applier_sync_state = APPLIER_CONNECTED;
- replicaset.applier.connected++;
}
static void
@@ -427,6 +429,7 @@ replicaset_update(struct applier **appliers, int count)
auto uniq_guard = make_scoped_guard([&]{
replica_hash_foreach_safe(&uniq, replica, next) {
replica_hash_remove(&uniq, replica);
+ replica_clear_applier(replica);
replica_delete(replica);
}
});
@@ -454,6 +457,8 @@ replicaset_update(struct applier **appliers, int count)
replica->uuid = applier->uuid;
if (replica_hash_search(&uniq, replica) != NULL) {
+ replica_clear_applier(replica);
+ replica_delete(replica);
tnt_raise(ClientError, ER_CFG, "replication",
"duplicate connection to the same replica");
}
diff --git a/test/replication/misc.result b/test/replication/misc.result
index 0ac48ba34..c14883284 100644
--- a/test/replication/misc.result
+++ b/test/replication/misc.result
@@ -394,3 +394,118 @@ test_run:cmd("delete server replica_auth")
box.schema.user.drop('cluster')
---
...
+--
+-- Test case for gh-3610. Before the fix replica would fail with the assertion
+-- when trying to connect to the same master twice.
+--
+box.schema.user.grant('guest', 'replication')
+---
+...
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {replication, replication}}
+---
+- error: 'Incorrect value for option ''replication'': duplicate connection to the
+ same replica'
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd('cleanup server replica')
+---
+- true
+...
+test_run:cmd("delete server replica")
+---
+- true
+...
+-- case when replica reconnects master with duplication in new configuration
+listen = box.cfg.listen
+---
+...
+box.schema.user.grant('guest', 'replication')
+---
+...
+test_run:cmd("create server replica_dupl with rpl_master=default, script='replication/replica_dupl.lua'")
+---
+- true
+...
+test_run:cmd("start server replica_dupl")
+---
+- true
+...
+test_run:cmd("switch replica_dupl")
+---
+- true
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.cfg{listen = ''}
+---
+...
+test_run:cmd("switch replica_dupl")
+---
+- true
+...
+replication = box.cfg.replication
+---
+...
+box.cfg{replication = {replication, replication}}
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.cfg{listen = listen}
+---
+...
+fiber.sleep(1)
+---
+...
+test_run:cmd("switch replica_dupl")
+---
+- true
+...
+test_run:cmd("switch default")
+---
+- true
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
+test_run:cmd("stop server replica_dupl")
+---
+- true
+...
+test_run:cmd('cleanup server replica_dupl')
+---
+- true
+...
+test_run:cmd("delete server replica_dupl")
+---
+- true
+...
diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
index 56e1bab69..e070ebe57 100644
--- a/test/replication/misc.test.lua
+++ b/test/replication/misc.test.lua
@@ -161,5 +161,47 @@ _ = test_run:wait_vclock('replica_auth', vclock)
test_run:cmd("stop server replica_auth")
test_run:cmd("cleanup server replica_auth")
test_run:cmd("delete server replica_auth")
-
box.schema.user.drop('cluster')
+
+--
+-- Test case for gh-3610. Before the fix replica would fail with the assertion
+-- when trying to connect to the same master twice.
+--
+box.schema.user.grant('guest', 'replication')
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+replication = box.cfg.replication
+box.cfg{replication = {replication, replication}}
+
+test_run:cmd("switch default")
+box.schema.user.revoke('guest', 'replication')
+test_run:cmd("stop server replica")
+test_run:cmd('cleanup server replica')
+test_run:cmd("delete server replica")
+
+-- case when replica reconnects master with duplication in new configuration
+
+listen = box.cfg.listen
+box.schema.user.grant('guest', 'replication')
+test_run:cmd("create server replica_dupl with rpl_master=default, script='replication/replica_dupl.lua'")
+test_run:cmd("start server replica_dupl")
+test_run:cmd("switch replica_dupl")
+
+test_run:cmd("switch default")
+box.cfg{listen = ''}
+
+test_run:cmd("switch replica_dupl")
+replication = box.cfg.replication
+box.cfg{replication = {replication, replication}}
+
+test_run:cmd("switch default")
+box.cfg{listen = listen}
+fiber.sleep(1)
+test_run:cmd("switch replica_dupl")
+
+test_run:cmd("switch default")
+box.schema.user.revoke('guest', 'replication')
+test_run:cmd("stop server replica_dupl")
+test_run:cmd('cleanup server replica_dupl')
+test_run:cmd("delete server replica_dupl")
diff --git a/test/replication/replica_dupl.lua b/test/replication/replica_dupl.lua
new file mode 100644
index 000000000..29e94551d
--- /dev/null
+++ b/test/replication/replica_dupl.lua
@@ -0,0 +1,10 @@
+#!/usr/bin/env tarantool
+
+box.cfg({
+ replication = os.getenv("MASTER"),
+ memtx_memory = 107374182,
+ replication_connect_quorum=0,
+ replication_connect_timeout = 0.1,
+})
+
+require('console').listen(os.getenv('ADMIN'))
--
2.14.3 (Apple Git-98)
* Re: [tarantool-patches] [v4 PATCH] replication: fix assertion with duplicate connection
2018-09-30 13:27 [tarantool-patches] [v4 PATCH] replication: fix assertion with duplicate connection Olga Arkhangelskaia
@ 2018-10-01 12:12 ` Vladimir Davydov
0 siblings, 0 replies; 2+ messages in thread
From: Vladimir Davydov @ 2018-10-01 12:12 UTC (permalink / raw)
To: Olga Arkhangelskaia; +Cc: tarantool-patches
On Sun, Sep 30, 2018 at 04:27:36PM +0300, Olga Arkhangelskaia wrote:
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 804fc00e5..137b7ecd9 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -668,7 +668,7 @@ box_sync_replication(bool connect_quorum)
>
> auto guard = make_scoped_guard([=]{
> for (int i = 0; i < count; i++)
> - applier_delete(appliers[i]); /* doesn't affect diag */
> + applier_stop(appliers[i]);
Why? Now an applier leaks if an error occurs...
Is it an attempt to address my previous comments to this place?
https://www.freelists.org/post/tarantool-patches/PATCH-v2-replication-fix-assertion-with-duplicated-connect-to-same-master,1
https://www.freelists.org/post/tarantool-patches/PATCH-v3-replication-fix-assertion-with-duplicate-connection,3
What I meant is that the protocol of replicaset_connect() should be
definite: since it is passed an array of stopped appliers, it should
leave all appliers in the stopped state in case of error so that all
the caller has to do is delete them.
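The calling convention described here can be sketched as follows: the connect routine receives appliers in the stopped state and must return every applier to the stopped state if it fails, so the caller's only cleanup duty is deletion. All names below are illustrative, not the real Tarantool API:

```cpp
#include <stdexcept>
#include <vector>

enum class State { STOPPED, CONNECTED };

/* Illustrative applier; "broken" is a test knob making connect fail. */
struct Applier {
	State state = State::STOPPED;
	bool broken = false;
};

void
applier_start(Applier &a)
{
	if (a.broken)
		throw std::runtime_error("connect failed");
	a.state = State::CONNECTED;
}

void
applier_stop(Applier &a)
{
	a.state = State::STOPPED;
}

/*
 * The definite protocol: on any error, every applier is returned to
 * STOPPED before the error propagates, so the caller only deletes.
 */
void
replicaset_connect(std::vector<Applier> &appliers)
{
	try {
		for (Applier &a : appliers)
			applier_start(a);
	} catch (...) {
		for (Applier &a : appliers)
			applier_stop(a);
		throw;
	}
}
```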
> });
>
> replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..462ab01fd 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -271,6 +271,8 @@ replica_on_applier_connect(struct replica *replica)
> assert(replica->applier_sync_state == APPLIER_DISCONNECTED);
>
> replica->uuid = applier->uuid;
> + replica->applier_sync_state = APPLIER_CONNECTED;
> + replicaset.applier.connected++;
>
> struct replica *orig = replica_hash_search(&replicaset.hash, replica);
> if (orig != NULL && orig->applier != NULL) {
> @@ -290,6 +292,8 @@ replica_on_applier_connect(struct replica *replica)
>
> if (orig != NULL) {
> /* Use existing struct replica */
> + assert(orig->applier_sync_state == APPLIER_DISCONNECTED);
> + orig->applier_sync_state = replica->applier_sync_state;
> replica_set_applier(orig, applier);
> replica_clear_applier(replica);
> replica_delete(replica);
> @@ -299,8 +303,6 @@ replica_on_applier_connect(struct replica *replica)
> replica_hash_insert(&replicaset.hash, replica);
> }
>
An extra blank line is left here.
> - replica->applier_sync_state = APPLIER_CONNECTED;
> - replicaset.applier.connected++;
> }
>
> static void
> @@ -427,6 +429,7 @@ replicaset_update(struct applier **appliers, int count)
> auto uniq_guard = make_scoped_guard([&]{
> replica_hash_foreach_safe(&uniq, replica, next) {
> replica_hash_remove(&uniq, replica);
> + replica_clear_applier(replica);
> replica_delete(replica);
> }
> });
> @@ -454,6 +457,8 @@ replicaset_update(struct applier **appliers, int count)
> replica->uuid = applier->uuid;
>
> if (replica_hash_search(&uniq, replica) != NULL) {
> + replica_clear_applier(replica);
> + replica_delete(replica);
> tnt_raise(ClientError, ER_CFG, "replication",
> "duplicate connection to the same replica");
> }
> diff --git a/test/replication/misc.test.lua b/test/replication/misc.test.lua
> index 56e1bab69..e070ebe57 100644
> --- a/test/replication/misc.test.lua
> +++ b/test/replication/misc.test.lua
> @@ -161,5 +161,47 @@ _ = test_run:wait_vclock('replica_auth', vclock)
> test_run:cmd("stop server replica_auth")
> test_run:cmd("cleanup server replica_auth")
> test_run:cmd("delete server replica_auth")
> -
> box.schema.user.drop('cluster')
> +
> +--
> +-- Test case for gh-3610. Before the fix replica would fail with the assertion
> +-- when trying to connect to the same master twice.
> +--
> +box.schema.user.grant('guest', 'replication')
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')
> +test_run:cmd("stop server replica")
> +test_run:cmd('cleanup server replica')
> +test_run:cmd("delete server replica")
> +
> +-- case when replica reconnects master with duplication in new configuration
> +
> +listen = box.cfg.listen
> +box.schema.user.grant('guest', 'replication')
> +test_run:cmd("create server replica_dupl with rpl_master=default, script='replication/replica_dupl.lua'")
> +test_run:cmd("start server replica_dupl")
> +test_run:cmd("switch replica_dupl")
> +
> +test_run:cmd("switch default")
> +box.cfg{listen = ''}
> +
> +test_run:cmd("switch replica_dupl")
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +test_run:cmd("switch default")
> +box.cfg{listen = listen}
> +fiber.sleep(1)
> +test_run:cmd("switch replica_dupl")
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')
> +test_run:cmd("stop server replica_dupl")
> +test_run:cmd('cleanup server replica_dupl')
> +test_run:cmd("delete server replica_dupl")
> diff --git a/test/replication/replica_dupl.lua b/test/replication/replica_dupl.lua
> new file mode 100644
> index 000000000..29e94551d
> --- /dev/null
> +++ b/test/replication/replica_dupl.lua
> @@ -0,0 +1,10 @@
> +#!/usr/bin/env tarantool
> +
> +box.cfg({
> + replication = os.getenv("MASTER"),
> + memtx_memory = 107374182,
> + replication_connect_quorum=0,
> + replication_connect_timeout = 0.1,
> +})
> +
> +require('console').listen(os.getenv('ADMIN'))
You don't need to start another replica to test the case when a
duplicate connection is detected asynchronously: both configuration
options may be changed online.