Tarantool development patches archive
 help / color / mirror / Atom feed
From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: Olga Arkhangelskaia <arkholga@tarantool.org>
Cc: tarantool-patches@freelists.org
Subject: Re: [tarantool-patches] [PATCH v2 2/2] ctl: added functionality to detect and prune dead replicas
Date: Mon, 15 Oct 2018 15:43:33 +0300	[thread overview]
Message-ID: <20181015124333.ghnyqohsg7q2w7gi@esperanza> (raw)
In-Reply-To: <20181012194557.7445-3-arkholga@tarantool.org>

On Fri, Oct 12, 2018 at 10:45:57PM +0300, Olga Arkhangelskaia wrote:
> Added replicaset_list_wasted(), replica_displace(uuid),
> replicaset_trim(uuid_table) functions.
> 
> replicaset_list_wasted() - returns table of dead replicas.
> We maintain two values: last active write time  - law, and last active
> read - lar. lar/law is time that passed from now till last read/write (activity
> of the applier and relay). Tome is measured in hours. Values can be
> found in box.info.replication. We do not maintain lar/law for current
> replica. If replicaset_list_wasted() is called it compares lar/law/abs(lar-law)
> with replication_dead_gap/replication_rw_gap. If this values exceeds
> gaps replica is supposed to be dead. And is stored in the resulting
> table.
> 
> The resulting table cam be passed to replicaset_trim(uuid_table). In
> this case all replicas from the table will be thrown away from the system
> space.
> 
> If one knows that some replica is dead it's uuid can be passed to
> replica_displace(uuid). In such case it will be thrown away from suystem
> space.

Documentation request is missing. Besides, it'd be good to see an
example of using these new functions in the commit message so that
one could see the picture before diving in the code.

> 
> Closes #3110
> ---
>  src/box/CMakeLists.txt         |   1 +
>  src/box/lua/ctl.lua            |  58 ++++++++++
>  src/box/lua/info.c             |  10 ++
>  src/box/lua/init.c             |   2 +
>  src/box/relay.cc               |   6 ++
>  src/box/relay.h                |   4 +
>  test/replication/trim.lua      |  66 ++++++++++++
>  test/replication/trim.result   | 237 +++++++++++++++++++++++++++++++++++++++++
>  test/replication/trim.test.lua |  93 ++++++++++++++++
>  test/replication/trim1.lua     |   1 +
>  test/replication/trim2.lua     |   1 +
>  test/replication/trim3.lua     |   1 +
>  test/replication/trim4.lua     |   1 +

The test takes way too much time (almost 3 minutes on my laptop).
Please fix.

>  13 files changed, 481 insertions(+)
>  create mode 100644 src/box/lua/ctl.lua
>  create mode 100644 test/replication/trim.lua
>  create mode 100644 test/replication/trim.result
>  create mode 100644 test/replication/trim.test.lua
>  create mode 120000 test/replication/trim1.lua
>  create mode 120000 test/replication/trim2.lua
>  create mode 120000 test/replication/trim3.lua
>  create mode 120000 test/replication/trim4.lua
> 
> diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
> index 52413d3cf..1daa1798e 100644
> --- a/src/box/CMakeLists.txt
> +++ b/src/box/CMakeLists.txt
> @@ -11,6 +11,7 @@ lua_source(lua_sources lua/net_box.lua)
>  lua_source(lua_sources lua/upgrade.lua)
>  lua_source(lua_sources lua/console.lua)
>  lua_source(lua_sources lua/xlog.lua)
> +lua_source(lua_sources lua/ctl.lua)
>  set(bin_sources)
>  bin_source(bin_sources bootstrap.snap bootstrap.h)
>  
> diff --git a/src/box/lua/ctl.lua b/src/box/lua/ctl.lua
> new file mode 100644
> index 000000000..852de91d1
> --- /dev/null
> +++ b/src/box/lua/ctl.lua
> @@ -0,0 +1,58 @@
> +-- ctl.lua (internal file)
> +
> +dead_gap = 0
> +rw_gap = 0
> +
> +local function is_dead(replica)

A comment is missing. What's dead? box.ctl may be used for other
subsystems too.

> +    -- no information about applier and relay
> +    if replica.lar == nil and replica.law == nil then return true end
> +
> +    -- time between last active read and now exceeds dead_gap
> +    if replica.lar > dead_gap then return true end
> +
> +    -- time between last active write and now exceeds dead_gap
> +    if replica.law > dead_gap then return true end

'lar' and 'law' are horrible names. Think of something better please.

May be, replica.{downstream,upstream}.inactive? Or 'disconnected'?
Would show the time that has passed since the applier was disconnected.
And I guess we shouldn't show 'idle' and 'lag' if the relay is
disconnected. Anyway, these new statistics should be introduced in a
separate patch and covered with tests.

Also, I don't really like the idea of reusing last_row_tm for this,
because it's going to be difficult to persist as it can be updated
rather often. May be, remember the time when an applier/relay
disconnects? Please think about it.

> +
> +    -- something happened to relay or applier
> +    if math.abs(replica.lar - replica.law) > rw_gap then return true end

Why do we need the two 'gaps'? Why is a single timeout not enough?

> +
> +    return false
> +end
> +
> +-- return list of replicas suspected to be dead

We typically start comments with a capital letter and end with a dot.
Please fix here and everywhere. Also, comments must be more thorough.
What replicas are suspected to be dead.

> +function replicaset_list_wasted()

Don't like 'wasted'. 'Inactive', may be?

> +    dead_gap = box.cfg.replication_dead_gap
> +    rw_gap = box.cfg.replication_rw_gap
> +    if dead_gap == 0 or rw_gap == 0 then
> +         error("replication_dead_gap and replication_rw_gap must be set")
> +    end
> +    local wasted_list = {}
> +    local replicaset = box.info.replication
> +    for i, replica in pairs(replicaset) do
> +        -- current replica is alive
> +        if replica.uuid ~=  box.info.uuid and is_dead(replica) then

The uuid check should be in is_dead or whatever it will be called.

> +            table.insert(wasted_list, replica.uuid)
> +        end
> +    end
> +    return wasted_list
> +end
> +
> +-- throw away any replica from system space
> +function replica_displace(uuid)
> +    if uuid == nil then
> +        error("Usage: replica_displace([uuid])")
> +    end
> +    box.space._cluster.index.uuid:delete{uuid}
> +end
> +
> +-- delete table of dead replica obtained from replicaset_list_wasted() or
> +-- formed by admin
> +function replicaset_trim(uuid_table)

'trim', 'displace', this looks inconsistent. Please think of better
names.

> +    if uuid_table == nil then
> +        error("Usage: replicaset_trim([uuid_table])")
> +    end
> +   for i in pairs(uuid_table) do

Bad indentation.

for _, uuid in pairs(uuids) do

> +        print("Deleting replica with uuid ".. i.. " "..uuid_table[i])

print() doesn't work for remote connections. Should probably print to
the log instead.

> +        replica_displace(uuid_table[i])
> +    end
> +end

All these functions should be a part of box.ctl namespace. May be, it's
worth putting them in box.ctl.replication namespace.

> diff --git a/src/box/lua/info.c b/src/box/lua/info.c
> index 655768ec4..c35e72aa9 100644
> --- a/src/box/lua/info.c
> +++ b/src/box/lua/info.c
> @@ -145,6 +145,16 @@ lbox_pushreplica(lua_State *L, struct replica *replica)
>  	luaL_pushuint64(L, vclock_get(&replicaset.vclock, replica->id));
>  	lua_settable(L, -3);
>  
> +	if (applier != NULL && applier->state != APPLIER_OFF) {
> +		lua_pushstring(L, "lar");
> +		lua_pushnumber(L, (ev_monotonic_now(loop()) - applier->last_row_time)/3600.0);
> +		lua_settable(L, -3);
> +	}

This should be inside lua_pushapplier().

> +	if (relay != NULL && relay_get_state(relay) != RELAY_OFF) {
> +		lua_pushstring(L, "law");
> +		lua_pushnumber(L, (ev_monotonic_now(loop()) - relay_get_lrt(relay))/3600.0);
> +		lua_settable(L, -3);
> +	}

This should ib inside lua_pushrelay().

>  	if (applier != NULL && applier->state != APPLIER_OFF) {
>  		lua_pushstring(L, "upstream");
>  		lbox_pushapplier(L, applier);

      reply	other threads:[~2018-10-15 12:43 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-10-12 19:45 [tarantool-patches] [PATCH v2 0/2] detect and throw away " Olga Arkhangelskaia
2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 1/2] box: added replication_dead/rw_gap options Olga Arkhangelskaia
2018-10-15 10:22   ` Vladimir Davydov
2018-10-23  7:10   ` [tarantool-patches] " Konstantin Osipov
2018-10-12 19:45 ` [tarantool-patches] [PATCH v2 2/2] ctl: added functionality to detect and prune dead replicas Olga Arkhangelskaia
2018-10-15 12:43   ` Vladimir Davydov [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20181015124333.ghnyqohsg7q2w7gi@esperanza \
    --to=vdavydov.dev@gmail.com \
    --cc=arkholga@tarantool.org \
    --cc=tarantool-patches@freelists.org \
    --subject='Re: [tarantool-patches] [PATCH v2 2/2] ctl: added functionality to detect and prune dead replicas' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox