From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 17 Sep 2018 18:05:12 +0300
From: Vladimir Davydov
Subject: Re: [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
Message-ID: <20180917150512.fizr47fac4zks26l@esperanza>
References: <20180911071105.88001-1-arkholga@tarantool.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20180911071105.88001-1-arkholga@tarantool.org>
To: Olga Arkhangelskaia
Cc: tarantool-patches@freelists.org
List-ID:

On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of duplicated connection to the same master we are not able
> to check it at the very beginning due to quorum option. So we allow such
> configuration to proceed till it is obvious. If it the initial
> configuration - replica won't start and in case of configuration change
> we will be notified about duplicated replication source.

This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.

Nit: please use 'duplicate' instead of 'duplicated'.

> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
>
> v1:
>
> Changes in v2:
> - changed completely, now we let duplicated params to proceed
> - stop the applier at the moment when replica has the same hash that other one
> - changed test
>
>  src/box/box.cc                                 |  4 ++-
>  src/box/replication.cc                         | 21 +++++++++++++--
>  test/replication/duplicate_connection.result   | 37 ++++++++++++++++++++++++++
>  test/replication/duplicate_connection.test.lua | 16 +++++++++++
>  4 files changed, 75 insertions(+), 3 deletions(-)
>  create mode 100644 test/replication/duplicate_connection.result
>  create mode 100644 test/replication/duplicate_connection.test.lua
>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
>                  diag_raise();
>
>          auto guard = make_scoped_guard([=]{
> -                for (int i = 0; i < count; i++)
> +                for (int i = 0; i < count; i++) {
> +                        applier_stop(appliers[i]);

Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect (see the rough sketch at the bottom of this mail).

>                          applier_delete(appliers[i]); /* doesn't affect diag */
> +                }
>          });
>
>          replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
>          replicaset_check_quorum();
>  }
>
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> +        switch (replica->applier_sync_state) {
> +        case APPLIER_CONNECTED:
> +                replica_on_applier_sync(replica);
> +                break;
> +        case APPLIER_DISCONNECTED:
> +                break;
> +        default:
> +                unreachable();
> +        }
> +
> +}
> +
>  static void
>  replica_on_applier_connect(struct replica *replica)
>  {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
>                  /*
>                   * Connection to self, duplicate connection
>                   * to the same master, or the applier fiber
> -                 * has been cancelled. Assume synced.
> +                 * has been cancelled. In case of duplicated connection
> +                 * will be left in this state, otherwise assume synced.

Nit: when updating a comment, please make sure it stays aligned.

>                   */
> -                replica_on_applier_sync(replica);
> +                replica_on_applier_off(replica);

I don't understand replica_on_applier_off(). How's it possible?
Replacing this function with replica_on_applier_sync() results in an
assertion failure:

  void replica_on_applier_sync(replica*): Assertion
  `replica->applier_sync_state == APPLIER_CONNECTED' failed.

I guess replica_on_applier_off() is supposed to plug it.

The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that
time and the trigger should've been cleared and hence never called.
Looks like instead of plugging this hole with replica_on_applier_off(),
you should fix a replica struct leak in replicaset_update():

 433 static void
 434 replicaset_update(struct applier **appliers, int count)
 435 {
 ...
 473         if (replica_hash_search(&uniq, replica) != NULL) {
 474                 tnt_raise(ClientError, ER_CFG, "replication",
 475                           "duplicate connection to the same replica");
                     ^^^ replica leaks here and stays attached to the applier
 476         }
 477         replica_hash_insert(&uniq, replica);
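
Something along these lines should do the trick. This is a rough,
untested sketch that only reuses the helpers already present in this
patch (replica_clear_applier, replica_delete); treat the exact
placement as an assumption, I haven't tried to compile it:

        if (replica_hash_search(&uniq, replica) != NULL) {
                /*
                 * Detach the new replica from the applier and free it
                 * before raising, so that it doesn't leak and its
                 * applier trigger doesn't fire after the error is
                 * reported.
                 */
                replica_clear_applier(replica);
                replica_delete(replica);
                tnt_raise(ClientError, ER_CFG, "replication",
                          "duplicate connection to the same replica");
        }
        replica_hash_insert(&uniq, replica);

If I'm right about this, then with the replica cleaned up here the
stale trigger is never called when box_sync_replication() stops the
applier, and replica_on_applier_off() together with the comment change
above shouldn't be needed at all.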
>                  break;
>          case APPLIER_STOPPED:
>                  /* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
>          auto uniq_guard = make_scoped_guard([&]{
>                  replica_hash_foreach_safe(&uniq, replica, next) {
>                          replica_hash_remove(&uniq, replica);
> +                        replica_clear_applier(replica);
>                          replica_delete(replica);
>                  }
>          });
> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')

engine isn't used in this test.

Anyway, I think that the test case is small enough to be moved to
misc.test.lua.

Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.

> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication')
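
Coming back to my comment on the box_sync_replication() hunk above:
here is roughly what I mean by stopping appliers in the same place
where they are started. Again, this is only an untested sketch built
from the names that already appear in the diff; the real error paths
in replicaset_connect() may look different:

        /*
         * replicaset_connect(): whenever connecting fails, stop the
         * appliers that were started by this very function before
         * raising the error.
         */
        for (int i = 0; i < count; i++)
                applier_stop(appliers[i]);
        diag_raise();

        /*
         * box_sync_replication(): the guard then only needs to delete
         * the appliers, exactly as it did before this patch.
         */
        auto guard = make_scoped_guard([=]{
                for (int i = 0; i < count; i++)
                        applier_delete(appliers[i]); /* doesn't affect diag */
        });

That way the applier lifecycle is handled in one place and the caller
doesn't have to know whether the appliers are still running.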