[tarantool-patches] [PATCH rfc] schema: add possibility to find and throw away dead replicas

Vladimir Davydov vdavydov.dev at gmail.com
Wed Sep 26 17:46:58 MSK 2018


On Fri, Sep 21, 2018 at 09:25:03PM +0300, Olga Arkhangelskaia wrote:
> Adds the ability to get a list of alive replicas in a replicaset,
> prune from box.space._cluster those that are not considered alive,
> and, if in doubt, inspect the state of the replicaset.
> 
> A replica is considered alive if it has just been added or if its status
> after the timeout period is neither stopped nor disconnected. However, if it
> has both roles (master and replica), we consider such an instance dead only
> if both its upstream and downstream statuses are stopped or disconnected.
> 
> If a replica is considered dead, we can prune its uuid from the _cluster space.
> If one is not sure whether a replica is dead or whether there is any activity
> on it, it is possible to list replicas with their role, status and lsn
> statistics.
> 
> If you have ideas on how else we can/should decide whether a replica is dead,
> please share.
> 
> Closes #3110
> ---
> 
> https://github.com/tarantool/tarantool/issues/3110
> https://github.com/tarantool/tarantool/tree/OKriw/gh-3110-prune-dead-replica-from-replicaset-1.10

A documentation request with the new API description is missing.
Tests don't pass on Travis CI.

Regarding the code:

 1. Why do you add a function that lists *alive* replicas? The issue
    author didn't ask for that. He/she asked for a script that would delete
    dead replicas from the _cluster system space. We might want to add a
    function that would list *dead* replicas so that he/she could check
    which replicas would be deleted (aka "dry run"), but it doesn't make
    sense to list alive replicas.

 2. Dead replica detection is utterly ridiculous: the function sleeps
    for the given amount of time and then deletes inactive replicas.
    As a user, I'd want to be able to delete replicas that have
    been inactive for, say, a day. Does this mean that I have to wait
    for a whole day before this function completes? Obviously not.
    I guess tarantool core should keep track of the time each replica
    was last active so that the function would work instantly.
    The time should probably be persisted so that restarts wouldn't
    affect the way the function works.
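
To make points 1 and 2 concrete, here is a minimal sketch of the idea in
Python. It is not Tarantool code and does not use any Tarantool API: the
ReplicaActivity class, its methods and the "uuid-a"/"uuid-b" names are
all hypothetical, and JSON stands in for whatever persistence the core
would actually use. It only shows that once the last-activity time is
recorded per replica, listing dead replicas (the "dry run") and pruning
them are instant lookups rather than a sleep.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ReplicaActivity:
    """Hypothetical registry: replica UUID -> time it was last seen active."""
    last_active: Dict[str, float] = field(default_factory=dict)

    def touch(self, uuid: str, now: Optional[float] = None) -> None:
        # Record activity; the caller may pass an explicit timestamp.
        self.last_active[uuid] = time.time() if now is None else now

    def list_dead(self, max_idle: float, now: Optional[float] = None) -> List[str]:
        # Dry run: report replicas idle for longer than max_idle seconds.
        now = time.time() if now is None else now
        return sorted(u for u, ts in self.last_active.items()
                      if now - ts > max_idle)

    def prune_dead(self, max_idle: float, now: Optional[float] = None) -> List[str]:
        # Remove the replicas that list_dead() would report; return them.
        dead = self.list_dead(max_idle, now)
        for uuid in dead:
            del self.last_active[uuid]
        return dead

    def dump(self) -> str:
        # Serialize so that the timestamps can survive a restart.
        return json.dumps(self.last_active)

    @classmethod
    def load(cls, data: str) -> "ReplicaActivity":
        return cls(last_active=json.loads(data))


DAY = 86_400.0  # seconds

registry = ReplicaActivity()
registry.touch("uuid-a", now=0.0)       # last seen long ago
registry.touch("uuid-b", now=90_000.0)  # seen recently

# Simulate a restart: the timestamps are restored from the persisted form.
registry = ReplicaActivity.load(registry.dump())

print(registry.list_dead(DAY, now=100_000.0))   # ['uuid-a']
print(registry.prune_dead(DAY, now=100_000.0))  # ['uuid-a']
print(sorted(registry.last_active))             # ['uuid-b']
```

With something like this, "delete replicas inactive for a day" returns
immediately, and the dry run and the actual prune share the same check.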
