[tarantool-patches] Re: [PATCH v2] replication: fix assertion with duplicated connect to same master

Olga Arkhangelskaia arkholga at tarantool.org
Sat Sep 22 10:04:57 MSK 2018


*

Hi! Thanks for the review. I will fix most of the comments and observations.

However, I guess I need to know that we are on the same page.

When we do box.cfg{replication={44441, 44441}}

we have two replicas, each one has its onw applier, etc. In 
replicaset_update()we are able to

identify  that the replicas are same. At this point we raise exceptions. 
Problem occurs when we try to delete the second one. For proper deletion 
we need to stop applier, clear it and than delete replica. As I 
understand we need to:

applier_stop(replica->applier);

replica_clear_applier(replica);replica_delete(replica);

The reason I added replica_on_applier_off(replica) is because when an 
applier enters stopped state, it state marks as APPLIER_OFF. Trigger on 
change states reacts on this change with

replica_on_applier_state_f. That leads us to on_applier_sync. Instead we 
should react on APPLIER_DISCONNECTED. And the only way we react on this 
state  - is to try to load applier again.

So replica_on_applier_off is used in case when we want to stop applier 
forever, before replica deletion. I think we do need this function. May 
be in some other form that I did.

Other comments, about places, comments, test I will definitely fix. But 
I need to know that the way I see the problem and the way of fix is 
correct one. So let’s discuss it.

*
17/09/2018 18:05, Vladimir Davydov пишет:
> On Tue, Sep 11, 2018 at 10:11:05AM +0300, Olga Arkhangelskaia wrote:
>> In case of duplicated connection to the same master we are not able
>> to check it at the very beginning due to quorum option. So we allow such
>> configuration to proceed till it is obvious. If it the initial
>> configuration - replica won't start and in case of configuration change
>> we will be notified about duplicated replication source.
> This comment isn't exactly helpful: this is how duplicate connections
> are supposed to be handled. It'd be nice to read what exactly you're
> fixing in this patch.
>
> Nit: please use 'duplicate' instead of 'duplicated'.
>
>> ---
>> https://github.com/tarantool/tarantool/issues/3610
>> https://github.com/tarantool/tarantool/tree/OKriw/gh-3610-assert_fail_when_connect_master_twice-1.10
>>
>> v1:
>>
>> Changes in v2:
>> - changed completely, now we let duplicated params to proceed
>> - stop the applier at the moment when replica has the same hash that other one
>> - changed test
>>
>>   src/box/box.cc                                 |  4 ++-
>>   src/box/replication.cc                         | 21 +++++++++++++--
>>   test/replication/duplicate_connection.result   | 37 ++++++++++++++++++++++++++
>>   test/replication/duplicate_connection.test.lua | 16 +++++++++++
>>   4 files changed, 75 insertions(+), 3 deletions(-)
>>   create mode 100644 test/replication/duplicate_connection.result
>>   create mode 100644 test/replication/duplicate_connection.test.lua
>>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index f25146d01..d3aeb5de0 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -667,8 +667,10 @@ box_sync_replication(bool connect_quorum)
>>   		diag_raise();
>>   
>>   	auto guard = make_scoped_guard([=]{
>> -		for (int i = 0; i < count; i++)
>> +		for (int i = 0; i < count; i++) {
>> +			applier_stop(appliers[i]);
> Appliers are started by replicaset_connect. On some errors they are
> stopped by replicaset_connect, but on other errors they are stopped by
> this guard here, in box_sync_replication. This looks confusing. Please
> always stop appliers in the same place where they are started, i.e. in
> replicaset_connect.
>
>>   			applier_delete(appliers[i]); /* doesn't affect diag */
>> +		}
>>   	});
>>   
>>   	replicaset_connect(appliers, count, connect_quorum);
>> diff --git a/src/box/replication.cc b/src/box/replication.cc
>> index 5755ad45e..d6b65e0a1 100644
>> --- a/src/box/replication.cc
>> +++ b/src/box/replication.cc
>> @@ -261,6 +261,21 @@ replica_on_applier_sync(struct replica *replica)
>>   	replicaset_check_quorum();
>>   }
>>   
>> +static void
>> +replica_on_applier_off(struct replica *replica)
>> +{
>> +	switch (replica->applier_sync_state) {
>> +	case APPLIER_CONNECTED:
>> +		replica_on_applier_sync(replica);
>> +		break;
>> +	case APPLIER_DISCONNECTED:
>> +		break;
>> +	default:
>> +		unreachable();
>> +	}
>> +
>> +}
>> +
>>   static void
>>   replica_on_applier_connect(struct replica *replica)
>>   {
>> @@ -396,9 +411,10 @@ replica_on_applier_state_f(struct trigger *trigger, void *event)
>>   		/*
>>   		 * Connection to self, duplicate connection
>>   		 * to the same master, or the applier fiber
>> -		 * has been cancelled. Assume synced.
>> +		 * has been cancelled. In case of duplicated connection
>> +		 * will be left in this state, otherwise assume synced.
> Nit: when updating a comment, please make sure it stays aligned.
>
>>   		 */
>> -		replica_on_applier_sync(replica);
>> +		replica_on_applier_off(replica);
> I don't understand replica_on_applier_off(). How's it possible?
>
> Replacing this function with replica_on_applier_sync() results in an
> assertion failure:
>
>    void replica_on_applier_sync(replica*): Assertion `replica->applier_sync_state == APPLIER_CONNECTED' failed.
>
> I guess replica_on_applier_off() is supposed to plug it.
>
> The assertion is triggered by box_sync_replication() calling
> applier_stop(). However, the replica should've been deleted by that time
> and the trigger should've been cleared and hence never called. Looks
> like instead of plugging this hole with replica_on_applier_off(), you
> should fix a replica struct leak in replicaset_update():
>
> 433 static void
> 434 replicaset_update(struct applier **appliers, int count)
> 435 {
> ...
> 473                 if (replica_hash_search(&uniq, replica) != NULL) {
> 474                         tnt_raise(ClientError, ER_CFG, "replication",
> 475                                   "duplicate connection to the same replica");
>
> ^^^ replica leaks here and stays attached to the applier
>
> 476                 }
> 477                 replica_hash_insert(&uniq, replica);
>
>>   		break;
>>   	case APPLIER_STOPPED:
>>   		/* Unrecoverable error. */
>> @@ -427,6 +443,7 @@ replicaset_update(struct applier **appliers, int count)
>>   	auto uniq_guard = make_scoped_guard([&]{
>>   		replica_hash_foreach_safe(&uniq, replica, next) {
>>   			replica_hash_remove(&uniq, replica);
>> +			replica_clear_applier(replica);
>>   			replica_delete(replica);
>>   		}
>>   	});
>> diff --git a/test/replication/duplicate_connection.test.lua b/test/replication/duplicate_connection.test.lua
>> new file mode 100644
>> index 000000000..b9bc573ca
>> --- /dev/null
>> +++ b/test/replication/duplicate_connection.test.lua
>> @@ -0,0 +1,16 @@
>> +test_run = require('test_run').new()
>> +engine = test_run:get_cfg('engine')
> engine isn't used in this test
>
> Anyway, I think that the test case is small enough to be moved to
> misc.test.lua
>
> Also, you only test one case described in #3610. The other one (async
> duplicate connection detection) remains untested. Please test it as
> well.
>
>> +
>> +box.schema.user.grant('guest', 'replication')
>> +
>> +-- Deploy a replica.
>> +test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
>> +test_run:cmd("start server replica")
>> +test_run:cmd("switch replica")
>> +
>> +replication = box.cfg.replication
>> +box.cfg{replication = {replication, replication}}
>> +
>> +
>> +test_run:cmd("switch default")
>> +box.schema.user.revoke('guest', 'replication')

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.tarantool.org/pipermail/tarantool-patches/attachments/20180922/409cea02/attachment.html>


More information about the Tarantool-patches mailing list