[Tarantool-patches] [PATCH v4 2/2] relay: provide information about downstream lag

Vladislav Shpilevoy v.shpilevoy at tarantool.org
Wed May 12 23:10:47 MSK 2021


I appreciate the work you did here!

See 6 comments below.

On 06.05.2021 23:41, Cyrill Gorcunov wrote:
> We already have `box.replication.upstream.lag` entry for monitoring
> sake. Same time in synchronous replication timeouts are key properties
> for quorum gathering procedure. Thus we would like to know how long
> it took of a transaction to traverse `initiator WAL -> network ->
> remote applier -> ACK` path.
> 
> Typical ouput is

1. ouput -> output.

>  .../unreleased/gh-5447-downstream-lag.md      |  6 ++
>  src/box/lua/info.c                            |  3 +
>  src/box/relay.cc                              | 18 ++++
>  src/box/relay.h                               |  6 ++
>  .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
>  .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
>  6 files changed, 167 insertions(+)
>  create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
>  create mode 100644 test/replication/gh-5447-downstream-lag.result
>  create mode 100644 test/replication/gh-5447-downstream-lag.test.lua
> 
> diff --git a/changelogs/unreleased/gh-5447-downstream-lag.md b/changelogs/unreleased/gh-5447-downstream-lag.md
> new file mode 100644
> index 000000000..3b11c50ee
> --- /dev/null
> +++ b/changelogs/unreleased/gh-5447-downstream-lag.md
> @@ -0,0 +1,6 @@
> +#feature/replication
> +
> + * Introduce `box.info.replication[n].downstream.lag` to monitor state

2. If was recently decided we must use Past Simple in changelogs.

> +   of replication which is especially important for synchronous spaces
> +   where malfunctioning replicas may prevent quorum from gathering votes
> +   to commit a transaction.

3. Please, add a reference to the ticket in the form `(gh-####)`.

> diff --git a/src/box/lua/info.c b/src/box/lua/info.c
> index 0eb48b823..5b7c25f28 100644
> --- a/src/box/lua/info.c
> +++ b/src/box/lua/info.c
> @@ -143,6 +143,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
>  		lua_pushnumber(L, ev_monotonic_now(loop()) -
>  			       relay_last_row_time(relay));
>  		lua_settable(L, -3);
> +		lua_pushstring(L, "lag");
> +		lua_pushnumber(L, relay_lag(relay));
> +		lua_settable(L, -3);
>  		break;
>  	case RELAY_STOPPED:
>  	{
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index ff43c2fc7..1501e53c7 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -158,6 +158,12 @@ struct relay {
>  	struct stailq pending_gc;
>  	/** Time when last row was sent to peer. */
>  	double last_row_time;
> +	/**
> +	 * A time difference between moment when we
> +	 * wrote a row in the local WAL and received
> +	 * an ACK that peer has it replicated.
> +	 */
> +	double lag;

4. Lag is updated in the relay thread, therefore you can't
simply read it in TX thread like you do in the diff block
above.

Try to deliver it to TX the same way as acked vclock is
delivered.

>  	/** Relay sync state. */
>  	enum relay_state state;
>  
> @@ -621,6 +633,12 @@ relay_reader_f(va_list ap)
>  			/* vclock is followed while decoding, zeroing it. */
>  			vclock_create(&relay->recv_vclock);
>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> +			if (xrow.tm != 0) {
> +				double delta = ev_now(loop()) - xrow.tm;

5. What if the acked row didn't come from the remote peer? But from a third
node. Node1 -> Node2 -> Node3. Does Node2, before sending the row to node3,
update its timestamp? Otherwise you are going to send an ACK to Node2 with
timestamp of Node1, which won't make much sense. I don't know. I see
xlog_write_entry() updates the timestamp, but please, double-check it. In
the debugger or via printfs.

> +				relay->lag = delta;
> +			} else {
> +				relay->lag = 0;

6. Hm. This means that if I sent a synchronous transaction, and its
quorum collection takes longer than the replication timeout, I would
see the lag 0. Moverover, it will constantly blink if I have many
synchronous transactions working slow. Between 0 and seconds. I am not
100% it looks right. Especially it won't be applicable for any monitoring,
the results won't be readable. I would rather expect the lag grow until
I receive an ACK, which will bump it to an actual value. And drop to 0
only when there is no rows to wait ACK for. Or maybe never drop to 0 like
applier does?

Drop to 0 when no rows means we will have it blinking between 0 and
seconds again if the synchronous transactions are long, but not so
frequent. So there is often no rows to ACK.

Did you try to ask Vlad G., who is going to actually use it probably?

I tried to list the options we have
https://github.com/tarantool/tarantool/issues/5447#issuecomment-840047693
and asked Vlad what he thinks. He also promised to ask more people who
use Tarantool actively.


More information about the Tarantool-patches mailing list