[tarantool-patches] Re: [PATCH rfc] schema: add possibility to find and throw away dead replicas

Olga Arkhangelskaia arkholga at tarantool.org
Thu Sep 27 10:37:57 MSK 2018



26/09/2018 17:46, Vladimir Davydov wrote:
> On Fri, Sep 21, 2018 at 09:25:03PM +0300, Olga Arkhangelskaia wrote:
>> Adds the ability to list the alive replicas in a replicaset,
>> prune from box.space._cluster those that are not considered alive,
>> and, if in doubt, inspect the state of the replicaset.
>>
>> A replica is considered alive if it has just been added or if its status
>> after the timeout period is not stopped or disconnected. However, if it has
>> both roles (master and replica), we consider the instance dead only if both
>> its upstream and downstream statuses are stopped or disconnected.
>>
>> If a replica is considered dead, we can prune its UUID from the _cluster space.
>> If one is not sure whether a replica is dead or whether there is any activity
>> on it, it is possible to list replicas with their role, status and LSN
>> statistics.
>>
>> If you have ideas on how else we can/should decide whether a replica is dead,
>> please share.
>>
>> Closes #3110
>> ---
>>
>> https://github.com/tarantool/tarantool/issues/3110
>> https://github.com/tarantool/tarantool/tree/OKriw/gh-3110-prune-dead-replica-from-replicaset-1.10
> A documentation request with the new API description is missing.
> Tests don't pass on Travis CI.
>
> Regarding the code:
>
>   1. Why do you add a function that lists *alive* replicas? The issue
>      author didn't ask for that. He asked for a script that would delete
>      dead replicas from the _cluster system space. We might want to add a
>      function that would list *dead* replicas so that he/she could check
>      what replicas would be deleted (aka "dry run"), but it doesn't make
>      sense to list alive replicas.
That is easy to change, but as I understood it, we need to throw away the replica.
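
To make the "dry run" idea concrete, here is a minimal sketch of the rule from the commit message, modelled in Python rather than in Tarantool Lua. The dict layout, the key names ("uuid", "upstream", "downstream") and the functions is_dead() and list_dead() are assumptions made for illustration only and are not part of the patch.

```python
# Statuses that the commit message treats as "not alive".
DEAD_STATUSES = {"stopped", "disconnected"}


def is_dead(replica):
    """Return True if every known replication link of the replica is down.

    `replica` is a hypothetical dict with optional "upstream" and
    "downstream" status strings; a missing key means the role is absent.
    """
    statuses = [replica[k] for k in ("upstream", "downstream") if k in replica]
    # A replica with no links yet (for example, just added) counts as alive.
    if not statuses:
        return False
    # With both roles present, both links must be down to call it dead.
    return all(s in DEAD_STATUSES for s in statuses)


def list_dead(replicas):
    """Dry run: return the UUIDs that would be pruned, deleting nothing."""
    return [r["uuid"] for r in replicas if is_dead(r)]


replicas = [
    {"uuid": "a", "upstream": "follow", "downstream": "follow"},
    {"uuid": "b", "upstream": "stopped", "downstream": "disconnected"},
    {"uuid": "c", "upstream": "disconnected", "downstream": "follow"},
    {"uuid": "d", "upstream": "stopped"},
    {"uuid": "e"},
]
print(list_dead(replicas))  # ['b', 'd']
```

Listing the dead replicas first and pruning only that list keeps the check and the deletion on the same predicate, so the user sees exactly what would be removed.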
>
>   2. Dead replica detection is utterly ridiculous: the function sleeps
>      for the given amount of time and then deletes inactive replicas.
>      As a user, I'd want to have an ability to delete replicas that have
>      been inactive for, say, a day. Does this mean that I have to wait
>      for a whole day before this function completes? Obviously, no.
>      I guess tarantool core should keep track of the time each replica
>      was active

So we need changes in the core code? What do you mean by the last time of
activity? An LSN change, vclock, status?
If a replica has been dead for a long period of time, we can see its status. And as
I understand it, we have a heartbeat to monitor the connection, so if there are
problems with it, we see that in the status.
>      last time so that the function would work instantly.
>      The time should probably be persisted so that restarts wouldn't
>      affect the way the function works.



