Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Olga Arkhangelskaia <arkholga@tarantool.org>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master
Date: Mon, 17 Sep 2018 18:05:12 +0300	[thread overview]
Message-ID: <20180917150512.fizr47fac4zks26l@esperanza> (raw)
In-Reply-To: <20180911071105.88001-1-arkholga@tarantool.org>

On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
> In case of duplicated connection to the same master we are not able
> to check it at the very beginning due to quorum option. So we allow such
> configuration to proceed till it is obvious. If it the initial
> configuration - replica won't start and in case of configuration change
> we will be notified about duplicated replication source.

This comment isn't exactly helpful: this is how duplicate connections
are supposed to be handled. It'd be nice to read what exactly you're
fixing in this patch.

Nit: please use 'duplicate' instead of 'duplicated'.

> ---
> https://github.com/tarantool/tarantool/issues/3610
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
> 
> v1:
> 
> Changes in v2:
> - changed completely, now we let duplicated params to proceed
> - stop the applier at the moment when replica has the same hash that other one
> - changed test
> 
>  src/box/box.cc                                 |  4 ++-
>  src/box/replication.cc                         | 21 +++++++++++++--
>  test/replication/duplicate_connection.result   | 37 ++++++++++++++++++++++++++
>  test/replication/duplicate_connection.test.lua | 16 +++++++++++
>  4 files changed, 75 insertions(+), 3 deletions(-)
>  create mode 100644 test/replication/duplicate_connection.result
>  create mode 100644 test/replication/duplicate_connection.test.lua
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index f25146d01..d3aeb5de0 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
>  		diag_raise();
>  
>  	auto guard = make_scoped_guard([=]{
> -		for (int i = 0; i < count; i++)
> +		for (int i = 0; i < count; i++) {
> +			applier_stop(appliers[i]);

Appliers are started by replicaset_connect. On some errors they are
stopped by replicaset_connect, but on other errors they are stopped by
this guard here, in box_sync_replication. This looks confusing. Please
always stop appliers in the same place where they are started, i.e. in
replicaset_connect.

>  			applier_delete(appliers[i]); /* doesn't affect diag */
> +		}
>  	});
>  
>  	replicaset_connect(appliers, count, connect_quorum);
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 5755ad45e..d6b65e0a1 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
>  	replicaset_check_quorum();
>  }
>  
> +static void
> +replica_on_applier_off(struct replica *replica)
> +{
> +	switch (replica->applier_sync_state) {
> +	case APPLIER_CONNECTED:
> +		replica_on_applier_sync(replica);
> +		break;
> +	case APPLIER_DISCONNECTED:
> +		break;
> +	default:
> +		unreachable();
> +	}
> +
> +}
> +
>  static void
>  replica_on_applier_connect(struct replica *replica)
>  {
> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
>  		/*
>  		 * Connection to self, duplicate connection
>  		 * to the same master, or the applier fiber
> -		 * has been cancelled. Assume synced.
> +		 * has been cancelled. In case of duplicated connection
> +		 * will be left in this state, otherwise assume synced.

Nit: when updating a comment, please make sure it stays aligned.

>  		 */
> -		replica_on_applier_sync(replica);
> +		replica_on_applier_off(replica);

I don't understand replica_on_applier_off(). How's it possible?

Replacing this function with replica_on_applier_sync() results in an
assertion failure:

  void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.

I guess replica_on_applier_off() is supposed to plug it.

The assertion is triggered by box_sync_replication() calling
applier_stop(). However, the replica should've been deleted by that time
and the trigger should've been cleared and hence never called. Looks
like instead of plugging this hole with replica_on_applier_off(), you
should fix a replica struct leak in replicaset_update():

433 static void
434 replicaset_update(struct applier **appliers, int count)
435 {
...
473                 if (replica_hash_search(&uniq, replica) != NULL) {
474                         tnt_raise(ClientError, ER_CFG, "replication",
475                                   "duplicate connection to the same replica");

^^^ replica leaks here and stays attached to the applier

476                 }
477                 replica_hash_insert(&uniq, replica);

>  		break;
>  	case APPLIER_STOPPED:
>  		/* Unrecoverable error. */
> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
>  	auto uniq_guard = make_scoped_guard([&]{
>  		replica_hash_foreach_safe(&uniq, replica, next) {
>  			replica_hash_remove(&uniq, replica);
> +			replica_clear_applier(replica);
>  			replica_delete(replica);
>  		}
>  	});

> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
> new file mode 100644
> index 000000000..b9bc573ca
> --- /dev/null
> +++ b/test/replication/duplicate_connection.test.lua
> @@ -0,0 +1,16 @@
> +test_run = require('test_run').new()
> +engine = test_run:get_cfg('engine')

engine isn't used in this test

Anyway, I think that the test case is small enough to be moved to
misc.test.lua

Also, you only test one case described in #3610. The other one (async
duplicate connection detection) remains untested. Please test it as
well.

> +
> +box.schema.user.grant('guest', 'replication')
> +
> +-- Deploy a replica.
> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
> +test_run:cmd("start server replica")
> +test_run:cmd("switch replica")
> +
> +replication = box.cfg.replication
> +box.cfg{replication = {replication, replication}}
> +
> +
> +test_run:cmd("switch default")
> +box.schema.user.revoke('guest', 'replication') 

  reply	other threads:[~2018-09-17 15:05 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-11  7:11 Olga Arkhangelskaia
2018-09-17 15:05 ` Vladimir Davydov [this message]
     [not found]   ` <85617ed6-0cc4-c354-dd62-7a84b9feae70@tarantool.org>
2018-09-22 16:21     ` [tarantool-patches] " Vladimir Davydov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180917150512.fizr47fac4zks26l@esperanza \
    --to=vdavydov.dev@gmail.com \
    --cc=arkholga@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='Re: [tarantool-patches] [PATCH v2] replication: fix assertion with duplicated connect to same master' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox