[tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
Vladimir Davydov
vdavydov.dev at gmail.com
Mon Sep 17 18:05:12 MSK 2018
On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of duplicated connection to the same master we are not able
> to check it at the very beginning due to quorum option. So we allow such
> configuration to proceed till it is obvious. If it the initial
> configuration - replica won't start and in case of configuration change
> we will be notified about duplicated replication source.
This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.
Nit: please use 'duplicate' instead of 'duplicated'.
> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
>
> v1:
>
> Changes in v2:
> - changed completely, now we let duplicated params to proceed
> - stop the applier at the moment when replica has the same hash that other one
> - changed test
>
> src/box/box.cc | 4 ++-
> src/box/replication.cc | 21 +++++++++++++--
> test/replication/duplicate_connection.result | 37 ++++++++++++++++++++++++++
> test/replication/duplicate_connection.test.lua | 16 +++++++++++
> 4 files changed, 75 insertions(+), 3 deletions(-)
> create mode 100644 test/replication/duplicate_connection.result
> create mode 100644 test/replication/duplicate_connection.test.lua
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
> diag_raise();
>
> auto guard = make_scoped_guard([=]{
> - for (int i = 0; i < count; i++)
> + for (int i = 0; i < count; i++) {
> + applier_stop(appliers[i]);
Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect.
> applier_delete(appliers[i]); /* doesn't affect diag */
> + }
> });
>
> replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
> replicaset_check_quorum();
> }
>
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> + switch (replica->applier_sync_state) {
> + case APPLIER_CONNECTED:
> + replica_on_applier_sync(replica);
> + break;
> + case APPLIER_DISCONNECTED:
> + break;
> + default:
> + unreachable();
> + }
> +
> +}
> +
> static void
> replica_on_applier_connect(struct replica *replica)
> {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
> /*
> * Connection to self, duplicate connection
> * to the same master, or the applier fiber
> - * has been cancelled. Assume synced.
> + * has been cancelled. In case of duplicated connection
> + * will be left in this state, otherwise assume synced.
Nit: when updating a comment, please make sure it stays aligned.
> */
> - replica_on_applier_sync(replica);
> + replica_on_applier_off(replica);
I don't understand replica_on_applier_off(): how can the applier be in
the disconnected state at this point? Replacing this call with the old
replica_on_applier_sync() results in an assertion failure:
void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.
I guess replica_on_applier_off() is supposed to plug that hole.
The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that time
and the trigger should've been cleared and hence never called. Looks
like instead of plugging this hole with replica_on_applier_off(), you
should fix a replica struct leak in replicaset_update():
433 static void
434 replicaset_update(struct applier **appliers, int count)
435 {
...
473 if (replica_hash_search(&uniq, replica) != NULL) {
474 tnt_raise(ClientError, ER_CFG, "replication",
475 "duplicate connection to the same replica");
^^^ replica leaks here and stays attached to the applier
476 }
477 replica_hash_insert(&uniq, replica);
> break;
> case APPLIER_STOPPED:
> /* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
> auto uniq_guard = make_scoped_guard([&]{
> replica_hash_foreach_safe(&uniq, replica, next) {
> replica_hash_remove(&uniq, replica);
> + replica_clear_applier(replica);
> replica_delete(replica);
> }
> });
> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')
engine isn't used in this test
Anyway, I think that the test case is small enough to be moved to
misc.test.lua
Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.
> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')