Tarantool development patches archive
* [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information
@ 2021-06-07 15:55 Cyrill Gorcunov via Tarantool-patches
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-07 15:55 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Guys, take a look once time permits. The previous version is here:

https://lists.tarantool.org/tarantool-patches/20210506214132.533913-1-gorcunov@gmail.com/

I still think we might need some general statistics engine, because this
lag is rather common data and we could move other "last" things like
last_row_time there, but since I didn't get approval for the idea I keep
the timestamp in the replica structure.

v4 (by Vlad):
 - add a test case
 - add a docbot request
 - dropped xrow_encode_vclock_timed; we use an open-coded assignment
   for the tm value when sending an ACK
 - struct awstat renamed to applier_wal_stat. Vlad, I think this is
   a better name than "applier_lag" because this is statistics on WAL:
   we simply track remote WAL propagation here, so a more general name
   is better for grep's sake and for future extensions
 - instead of passing the applier structure we pass replica_id
 - the real keeper of these statistics is now the "replica" structure,
   thus unbound from the applier itself
 - for synchro entries we pass a pointer to the applier_wal_stat instead
   of using replica_id = 0 as a sign that we don't need to update statistics
   for the initial and final join cases
 - to write and read statistics we provide the wal_stat_update and wal_stat_ack
   helpers to cover the case where a single ACK spans several transactions

v8:
 - make one branch less in apply_synchro_row()
 - keep applier_txn_start_tm inside the replica structure
 - rename wal_stat to replica_cb_data since this is more
   logical for the case where we have no general stat engine
 - make the applier send the timestamp so that the relay computes the
   delta upon read; the lag is kept permanently until a new write happens
 - extend the doc and changelog a bit
 - keep reading the relay's lag from the TX thread without any modifications,
   because the relay gets deleted from the TX thread and set to a
   non-RELAY_FOLLOW state, thus any attempt to read it won't succeed.
   To be honest there is a small race window present: writes of doubles
   are not atomic operations, thus we might read a partially updated
   timestamp, similarly to what we already have with the @idle field.
   I think this should be addressed separately, and better without the
   heavy cmsg engine involved but with an rw lock or plain atomics instead
   (a standalone sketch of the atomics option follows this list).
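
For illustration only, here is a standalone sketch of that plain-atomics
option. It is not Tarantool code and all names are made up: the idea is to
publish the double through an _Atomic uint64_t holding its bit pattern, so a
concurrent reader can never observe a half-written value.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bit");

/* Shared slot: the double's bits are published atomically. */
static _Atomic uint64_t txn_lag_bits;

/* Writer side, e.g. the thread which observes a new ACK. */
static void
lag_store(double lag)
{
	uint64_t bits;
	memcpy(&bits, &lag, sizeof(bits));
	atomic_store_explicit(&txn_lag_bits, bits, memory_order_relaxed);
}

/* Reader side, e.g. the thread which serves box.info. */
static double
lag_load(void)
{
	uint64_t bits = atomic_load_explicit(&txn_lag_bits,
					     memory_order_relaxed);
	double lag;
	memcpy(&lag, &bits, sizeof(lag));
	return lag;
}

int
main(void)
{
	lag_store(0.0011014938354492);
	printf("lag = %.16g\n", lag_load());
	return 0;
}

memory_order_relaxed should be enough in such a scheme because the value is
purely informational and does not order any other memory accesses.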

branch gorcunov/gh-5447-relay-lag-8
issue https://github.com/tarantool/tarantool/issues/5447

Cyrill Gorcunov (2):
  applier: send transaction's first row WAL time in the applier_writer_f
  relay: provide information about downstream lag

 .../unreleased/gh-5447-downstream-lag.md      |  6 ++
 src/box/applier.cc                            | 90 +++++++++++++++---
 src/box/lua/info.c                            |  3 +
 src/box/relay.cc                              | 50 ++++++++++
 src/box/relay.h                               |  6 ++
 src/box/replication.cc                        |  1 +
 src/box/replication.h                         |  5 +
 .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
 .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
 9 files changed, 280 insertions(+), 15 deletions(-)
 create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
 create mode 100644 test/replication/gh-5447-downstream-lag.result
 create mode 100644 test/replication/gh-5447-downstream-lag.test.lua


base-commit: 2b25ca6c70908b57c011ee24d2eeb9f6155d2750
-- 
2.31.1



* [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f
  2021-06-07 15:55 [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
@ 2021-06-07 15:55 ` Cyrill Gorcunov via Tarantool-patches
  2021-06-07 19:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15  9:36   ` Serge Petrenko via Tarantool-patches
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
  2021-06-07 19:20 ` [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 2 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-07 15:55 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

The applier fiber sends the node's current vclock to the remote relay reader,
pointing to the current state of the fetched WAL data, so the relay knows
which new data should be sent. The packet the applier sends carries the
xrow_header::tm field as zero, but we can reuse it to provide information
about the first timestamp of a transaction we wrote to our WAL. Since old
instances of Tarantool simply ignore this field, such an extension won't
cause any problems.

The timestamp will be needed to account for the lag of downstream replicas,
suitable for informational purposes and cluster health monitoring.

We update applier statistics in WAL callbacks, but since both
apply_synchro_row and apply_plain_tx are used not only for real data
application but also at the final join stage (where we're not writing the
data yet), apply_synchro_row is extended with a replica_id argument which
is non-zero when the applier is subscribed.

The calculation of the downstream lag itself will be addressed in the
next patch, because sending the timestamp and observing it are
independent actions.

Part-of #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc     | 90 +++++++++++++++++++++++++++++++++++-------
 src/box/replication.cc |  1 +
 src/box/replication.h  |  5 +++
 3 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 33181fdbf..38695a54f 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -163,6 +163,9 @@ applier_writer_f(va_list ap)
 	struct ev_io io;
 	coio_create(&io, applier->io.fd);
 
+	/* ID is permanent while applier is alive */
+	uint32_t replica_id = applier->instance_id;
+
 	while (!fiber_is_cancelled()) {
 		/*
 		 * Tarantool >= 1.7.7 sends periodic heartbeat
@@ -193,6 +196,16 @@ applier_writer_f(va_list ap)
 			applier->has_acks_to_send = false;
 			struct xrow_header xrow;
 			xrow_encode_vclock(&xrow, &replicaset.vclock);
+			/*
+			 * For relay lag statistics we report last
+			 * written transaction timestamp in tm field.
+			 *
+			 * Replica might be dead already so we have to
+			 * test on each iteration.
+			 */
+			struct replica *r = replica_by_id(replica_id);
+			if (likely(r != NULL))
+				xrow.tm = r->applier_txn_start_tm;
 			coio_write_xrow(&io, &xrow);
 			ERROR_INJECT(ERRINJ_APPLIER_SLOW_ACK, {
 				fiber_sleep(0.01);
@@ -490,7 +503,7 @@ static uint64_t
 applier_read_tx(struct applier *applier, struct stailq *rows, double timeout);
 
 static int
-apply_final_join_tx(struct stailq *rows);
+apply_final_join_tx(uint32_t replica_id, struct stailq *rows);
 
 /**
  * A helper struct to link xrow objects in a list.
@@ -535,7 +548,7 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
 						  next)->row);
 			break;
 		}
-		if (apply_final_join_tx(&rows) != 0)
+		if (apply_final_join_tx(applier->instance_id, &rows) != 0)
 			diag_raise();
 	}
 
@@ -751,11 +764,35 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
 	return 0;
 }
 
+struct replica_cb_data {
+	/** Replica ID the data belongs to. */
+	uint32_t replica_id;
+	/**
+	 * Timestamp of a transaction to be accounted
+	 * for relay lag. Usually it is a first row in
+	 * a transaction.
+	 */
+	double txn_start_tm;
+};
+
+/** Update replica associated data once write is complete. */
+static void
+replica_txn_wal_write_cb(struct replica_cb_data *rcb)
+{
+	struct replica *r = replica_by_id(rcb->replica_id);
+	if (likely(r != NULL))
+		r->applier_txn_start_tm = rcb->txn_start_tm;
+}
+
 static int
 applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 {
-	(void) trigger;
 	(void) event;
+
+	struct replica_cb_data *rcb =
+		(struct replica_cb_data *)trigger->data;
+	replica_txn_wal_write_cb(rcb);
+
 	/* Broadcast the WAL write across all appliers. */
 	trigger_run(&replicaset.applier.on_wal_write, NULL);
 	return 0;
@@ -766,6 +803,8 @@ struct synchro_entry {
 	struct synchro_request *req;
 	/** Fiber created the entry. To wakeup when WAL write is done. */
 	struct fiber *owner;
+	/** Replica associated data. */
+	struct replica_cb_data *rcb;
 	/**
 	 * The base journal entry. It has unsized array and then must be the
 	 * last entry in the structure. But can workaround it via a union
@@ -789,6 +828,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
 	} else {
+		replica_txn_wal_write_cb(synchro_entry->rcb);
 		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
@@ -797,7 +837,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 
 /** Process a synchro request. */
 static int
-apply_synchro_row(struct xrow_header *row)
+apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 {
 	assert(iproto_type_is_synchro_request(row->type));
 
@@ -805,6 +845,7 @@ apply_synchro_row(struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
+	struct replica_cb_data rcb_data;
 	struct synchro_entry entry;
 	/*
 	 * Rows array is cast from *[] to **, because otherwise g++ complains
@@ -817,6 +858,11 @@ apply_synchro_row(struct xrow_header *row)
 			     apply_synchro_row_cb, &entry);
 	entry.req = &req;
 	entry.owner = fiber();
+
+	rcb_data.replica_id = replica_id;
+	rcb_data.txn_start_tm = row->tm;
+	entry.rcb = &rcb_data;
+
 	/*
 	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
 	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
@@ -862,8 +908,9 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
 	return box_raft_process(&req, applier->instance_id);
 }
 
-static inline int
-apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
+static int
+apply_plain_tx(uint32_t replica_id, struct stailq *rows,
+	       bool skip_conflict, bool use_triggers)
 {
 	/*
 	 * Explicitly begin the transaction so that we can
@@ -931,10 +978,21 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 			goto fail;
 		}
 
+		struct replica_cb_data *rcb;
+		rcb = region_alloc_object(&txn->region, typeof(*rcb), &size);
+		if (rcb == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc_object", "rcb");
+			goto fail;
+		}
+
 		trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 		txn_on_rollback(txn, on_rollback);
 
-		trigger_create(on_wal_write, applier_txn_wal_write_cb, NULL, NULL);
+		item = stailq_first_entry(rows, struct applier_tx_row, next);
+		rcb->replica_id = replica_id;
+		rcb->txn_start_tm = item->row.tm;
+
+		trigger_create(on_wal_write, applier_txn_wal_write_cb, rcb, NULL);
 		txn_on_wal_write(txn, on_wal_write);
 	}
 
@@ -946,7 +1004,7 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 
 /** A simpler version of applier_apply_tx() for final join stage. */
 static int
-apply_final_join_tx(struct stailq *rows)
+apply_final_join_tx(uint32_t replica_id, struct stailq *rows)
 {
 	struct xrow_header *first_row =
 		&stailq_first_entry(rows, struct applier_tx_row, next)->row;
@@ -957,9 +1015,9 @@ apply_final_join_tx(struct stailq *rows)
 	vclock_follow_xrow(&replicaset.vclock, last_row);
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		assert(first_row == last_row);
-		rc = apply_synchro_row(first_row);
+		rc = apply_synchro_row(replica_id, first_row);
 	} else {
-		rc = apply_plain_tx(rows, false, false);
+		rc = apply_plain_tx(replica_id, rows, false, false);
 	}
 	fiber_gc();
 	return rc;
@@ -1088,12 +1146,14 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		 * each other.
 		 */
 		assert(first_row == last_row);
-		if ((rc = apply_synchro_row(first_row)) != 0)
-			goto finish;
-	} else if ((rc = apply_plain_tx(rows, replication_skip_conflict,
-					true)) != 0) {
-		goto finish;
+		rc = apply_synchro_row(applier->instance_id, first_row);
+	} else {
+		rc = apply_plain_tx(applier->instance_id, rows,
+				    replication_skip_conflict, true);
 	}
+	if (rc != 0)
+		goto finish;
+
 	vclock_follow(&replicaset.applier.vclock, last_row->replica_id,
 		      last_row->lsn);
 finish:
diff --git a/src/box/replication.cc b/src/box/replication.cc
index aefb812b3..c97c1fc04 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -184,6 +184,7 @@ replica_new(void)
 	trigger_create(&replica->on_applier_state,
 		       replica_on_applier_state_f, NULL, NULL);
 	replica->applier_sync_state = APPLIER_DISCONNECTED;
+	replica->applier_txn_start_tm = 0;
 	latch_create(&replica->order_latch);
 	return replica;
 }
diff --git a/src/box/replication.h b/src/box/replication.h
index 2ad1cbf66..d9817d4ff 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -331,6 +331,11 @@ struct replica {
 	 * separate from applier.
 	 */
 	enum applier_state applier_sync_state;
+	/**
+	 * Applier's last written to WAL transaction timestamp.
+	 * Needed for relay lagging statistics.
+	 */
+	double applier_txn_start_tm;
 	/* The latch is used to order replication requests. */
 	struct latch order_latch;
 };
-- 
2.31.1



* [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-07 15:55 [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
@ 2021-06-07 15:55 ` Cyrill Gorcunov via Tarantool-patches
  2021-06-07 19:21   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15 10:03   ` Serge Petrenko via Tarantool-patches
  2021-06-07 19:20 ` [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
  2 siblings, 2 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-07 15:55 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

We already have the `box.replication.upstream.lag` entry for monitoring
purposes. At the same time, in synchronous replication timeouts are key
properties of the quorum gathering procedure. Thus we would like to know
how long it takes a transaction to traverse the `initiator WAL -> network ->
remote applier -> ACK` path (a standalone sketch of the measurement idea
follows the sample output below).

Typical output is

 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.61753897101153
 |   vclock: {1: 147}
 |   lag: 0
 | ...
 | tarantool> box.space.sync:insert{69}
 | ---
 | - [69]
 | ...
 |
 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.75324084801832
 |   vclock: {1: 151}
 |   lag: 0.0011014938354492
 | ...
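
The measurement itself is a clock echo: the ACK carries back the initiator's
own WAL-write timestamp, so the lag is computed entirely on the initiator's
clock and the replica's clock setting does not matter (see the relay_reader_f
hunk below). A minimal standalone sketch of that idea, with made-up names and
the round trip simulated by a short sleep:

#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock seconds as a double, similar to ev_now(). */
static double
wall_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_REALTIME, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int
main(void)
{
	/* Initiator: remember when the transaction hit the local WAL. */
	double txn_start_tm = wall_now();

	/*
	 * The row travels to the replica, is written to its WAL, and the
	 * replica echoes txn_start_tm back in the ACK (xrow.tm). The
	 * network and the remote WAL write are simulated with a sleep.
	 */
	usleep(2000);
	double acked_tm = txn_start_tm;

	/* Initiator again: both values come from the initiator's clock. */
	printf("downstream lag ~ %f seconds\n", wall_now() - acked_tm);
	return 0;
}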

Closes #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add `box.info.replication[n].downstream.lag` entry

`replication[n].downstream.lag` is the time difference between the last
transaction being written to the WAL journal of the transaction initiator
and that transaction being written to the WAL on replica `n`.

In other words, this is the lag in seconds between the main node writing
data to its own WAL and replica `n` getting this data replicated to its
own WAL journal.

If a transaction fails to replicate, the lag value won't be modified,
because only successfully applied transactions are accounted. At the same
time, if the main node or a replica gets restarted, the lag value will be
zero until the next successful transaction.
---
 .../unreleased/gh-5447-downstream-lag.md      |  6 ++
 src/box/lua/info.c                            |  3 +
 src/box/relay.cc                              | 50 ++++++++++
 src/box/relay.h                               |  6 ++
 .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
 .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
 6 files changed, 199 insertions(+)
 create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
 create mode 100644 test/replication/gh-5447-downstream-lag.result
 create mode 100644 test/replication/gh-5447-downstream-lag.test.lua

diff --git a/changelogs/unreleased/gh-5447-downstream-lag.md b/changelogs/unreleased/gh-5447-downstream-lag.md
new file mode 100644
index 000000000..726175c6c
--- /dev/null
+++ b/changelogs/unreleased/gh-5447-downstream-lag.md
@@ -0,0 +1,6 @@
+#feature/replication
+
+ * Introduced `box.info.replication[n].downstream.lag` field to monitor
+   state of replication. This member represents time spent between
+   transaction been written to initiator's WAL file and reached WAL
+   file of a replica (gh-5447).
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 0eb48b823..f201b25e3 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -143,6 +143,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
 		lua_pushnumber(L, ev_monotonic_now(loop()) -
 			       relay_last_row_time(relay));
 		lua_settable(L, -3);
+		lua_pushstring(L, "lag");
+		lua_pushnumber(L, relay_txn_lag(relay));
+		lua_settable(L, -3);
 		break;
 	case RELAY_STOPPED:
 	{
diff --git a/src/box/relay.cc b/src/box/relay.cc
index b1571b361..cdd1383e8 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -158,6 +158,18 @@ struct relay {
 	struct stailq pending_gc;
 	/** Time when last row was sent to peer. */
 	double last_row_time;
+	/**
+	 * A time difference between the moment when we
+	 * wrote a transaction to the local WAL and when
+	 * this transaction has been replicated to remote
+	 * node (ie written to node's WAL).
+	 */
+	double txn_lag;
+	/**
+	 * Last timestamp observed from remote node to
+	 * persist @a txn_lag value.
+	 */
+	double txn_acked_tm;
 	/** Relay sync state. */
 	enum relay_state state;
 
@@ -217,6 +229,12 @@ relay_last_row_time(const struct relay *relay)
 	return relay->last_row_time;
 }
 
+double
+relay_txn_lag(const struct relay *relay)
+{
+	return relay->txn_lag;
+}
+
 static void
 relay_send(struct relay *relay, struct xrow_header *packet);
 static void
@@ -284,6 +302,16 @@ relay_start(struct relay *relay, int fd, uint64_t sync,
 	relay->state = RELAY_FOLLOW;
 	relay->row_count = 0;
 	relay->last_row_time = ev_monotonic_now(loop());
+	relay->txn_lag = 0;
+	/*
+	 * We assume that previously written rows in WAL
+	 * are older than current node real time which allows
+	 * to simplify @a txn_lag calculation. In worst
+	 * scenario when runtime has been adjusted backwards
+	 * between restart we simply get some big value in
+	 * @a txn_lag until next transaction get replicated.
+	 */
+	relay->txn_acked_tm = ev_now(loop());
 }
 
 void
@@ -336,6 +364,8 @@ relay_stop(struct relay *relay)
 	 * upon cord_create().
 	 */
 	relay->cord.id = 0;
+	relay->txn_lag = 0;
+	relay->txn_acked_tm = ev_now(loop());
 }
 
 void
@@ -629,6 +659,26 @@ relay_reader_f(va_list ap)
 			/* vclock is followed while decoding, zeroing it. */
 			vclock_create(&relay->recv_vclock);
 			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
+			/*
+			 * Replica send us last replicated transaction
+			 * timestamp which is needed for relay lag
+			 * monitoring. Note that this transaction has
+			 * been written to WAL with our current realtime
+			 * clock value, thus when it get reported back we
+			 * can compute time spent regardless of the clock
+			 * value on remote replica.
+			 *
+			 * An interesting moment is replica restart - it will
+			 * send us value 0 after that but we can preserve
+			 * old reported value here since we *assume* that
+			 * timestamp is not going backwards on properly
+			 * set up nodes, otherwise the lag get raised.
+			 * After all this is a not tamper-proof value.
+			 */
+			if (relay->txn_acked_tm < xrow.tm) {
+				relay->txn_acked_tm = xrow.tm;
+				relay->txn_lag = ev_now(loop()) - xrow.tm;
+			}
 			fiber_cond_signal(&relay->reader_cond);
 		}
 	} catch (Exception *e) {
diff --git a/src/box/relay.h b/src/box/relay.h
index b32e2ea2a..615ffb75d 100644
--- a/src/box/relay.h
+++ b/src/box/relay.h
@@ -93,6 +93,12 @@ relay_vclock(const struct relay *relay);
 double
 relay_last_row_time(const struct relay *relay);
 
+/**
+ * Returns relay's transaction's lag.
+ */
+double
+relay_txn_lag(const struct relay *relay);
+
 /**
  * Send a Raft update request to the relay channel. It is not
  * guaranteed that it will be delivered. The connection may break.
diff --git a/test/replication/gh-5447-downstream-lag.result b/test/replication/gh-5447-downstream-lag.result
new file mode 100644
index 000000000..8586d0ed3
--- /dev/null
+++ b/test/replication/gh-5447-downstream-lag.result
@@ -0,0 +1,93 @@
+-- test-run result file version 2
+--
+-- gh-5447: Test for box.info.replication[n].downstream.lag.
+-- We need to be sure that if replica start been back of
+-- master node reports own lagging and cluster admin would
+-- be able to detect such situation.
+--
+
+fiber = require('fiber')
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, \
+             script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+--
+-- The replica should wait some time (wal delay is 1 second
+-- by default) so we would be able to detect the lag, since
+-- on local instances the lag is minimal and usually transactions
+-- are handled instantly.
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.space.test:insert({1})
+ | ---
+ | - [1]
+ | ...
+test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+ | ---
+ | - ok
+ | ...
+--
+-- Cleanup everything.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5447-downstream-lag.test.lua b/test/replication/gh-5447-downstream-lag.test.lua
new file mode 100644
index 000000000..650e8e215
--- /dev/null
+++ b/test/replication/gh-5447-downstream-lag.test.lua
@@ -0,0 +1,41 @@
+--
+-- gh-5447: Test for box.info.replication[n].downstream.lag.
+-- We need to be sure that if replica start been back of
+-- master node reports own lagging and cluster admin would
+-- be able to detect such situation.
+--
+
+fiber = require('fiber')
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+box.schema.user.grant('guest', 'replication')
+
+test_run:cmd('create server replica with rpl_master=default, \
+             script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+--
+-- The replica should wait some time (wal delay is 1 second
+-- by default) so we would be able to detect the lag, since
+-- on local instances the lag is minimal and usually transactions
+-- are handled instantly.
+test_run:switch('replica')
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+
+test_run:switch('default')
+box.space.test:insert({1})
+test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)
+
+test_run:switch('replica')
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+--
+-- Cleanup everything.
+test_run:switch('default')
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
-- 
2.31.1



* Re: [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information
  2021-06-07 15:55 [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
@ 2021-06-07 19:20 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-07 20:00   ` Cyrill Gorcunov via Tarantool-patches
  2 siblings, 1 reply; 17+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-07 19:20 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the patchset!

Please, provide branch and issue links next time.
https://github.com/tarantool/tarantool/wiki/Code-review-procedure#sending-the-patch

I assume your branch is gorcunov/gh-5447-relay-lag-8.

It seems after your changes some tests start failing. Looks
like you didn't clear artifacts of your new test.
https://github.com/tarantool/tarantool/runs/2765786321

On 07.06.2021 17:55, Cyrill Gorcunov wrote:
> Guys, take a look once time permit. Previous version is here
> 
> https://lists.tarantool.org/tarantool-patches/20210506214132.533913-1-gorcunov@gmail.com/
> 
> I still think we might need some general statistics engine because this
> lag rather common data and we could move other "last" things like last_row_time
> there but since I didn't get approve on the idea I keep the timestamp
> in replica structure.

If you are talking about your perf metrics engine which we tried to
design for replication, it was approved. The problem is that there was
no really good design of how it should look and work.


* Re: [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
@ 2021-06-07 19:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15  9:36   ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 17+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-07 19:20 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 33181fdbf..38695a54f 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc> @@ -751,11 +764,35 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
>  	return 0;
>  }
>  
> +struct replica_cb_data {
> +	/** Replica ID the data belongs to. */
> +	uint32_t replica_id;
> +	/**
> +	 * Timestamp of a transaction to be accounted
> +	 * for relay lag. Usually it is a first row in
> +	 * a transaction.
> +	 */
> +	double txn_start_tm;
> +};
> +
> +/** Update replica associated data once write is complete. */
> +static void
> +replica_txn_wal_write_cb(struct replica_cb_data *rcb)
> +{
> +	struct replica *r = replica_by_id(rcb->replica_id);
> +	if (likely(r != NULL))
> +		r->applier_txn_start_tm = rcb->txn_start_tm;

I spent some time trying to understand why you deleted the
check that we should not override the timestamp until its ACK is
sent.

But then I realized that indeed the correct way is to send the
latest timestamp of a tx batch, not the first one, because the
replica confirms receipt of the transactions up to the latest one.
I am 90% sure it is correct. And you need to add a comment here
explaining why you use the latest timestamp.

Also, that means you need to use the last row's timestamp, not the
first row's, for the same reason. I doubt they are different,
but still it would be more correct. In apply_plain_tx() below you
should use the last stailq entry, not the first one (see the sketch
right after this paragraph).
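
To make the suggestion concrete, here is a tiny standalone illustration with
a hand-rolled row list and made-up names (the real code uses the stailq
helpers from the patch): report the timestamp of the last row of the batch,
since the ACK confirms everything up to it.

#include <stdio.h>
#include <stddef.h>

/* A simplified stand-in for applier_tx_row; tm mimics xrow_header::tm. */
struct tx_row {
	double tm;
	struct tx_row *next;
};

/* Walk the row list and return the last row's timestamp. */
static double
last_row_tm(const struct tx_row *head)
{
	double tm = 0;
	for (const struct tx_row *r = head; r != NULL; r = r->next)
		tm = r->tm;
	return tm;
}

int
main(void)
{
	struct tx_row r3 = { .tm = 100.3, .next = NULL };
	struct tx_row r2 = { .tm = 100.2, .next = &r3 };
	struct tx_row r1 = { .tm = 100.1, .next = &r2 };
	/* One ACK covers the whole batch, so the last timestamp matters. */
	printf("first = %.1f, last = %.1f\n", r1.tm, last_row_tm(&r1));
	return 0;
}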

> @@ -931,10 +978,21 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
>  			goto fail;
>  		}
>  
> +		struct replica_cb_data *rcb;
> +		rcb = region_alloc_object(&txn->region, typeof(*rcb), &size);
> +		if (rcb == NULL) {
> +			diag_set(OutOfMemory, size, "region_alloc_object", "rcb");
> +			goto fail;
> +		}
> +
>  		trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
>  		txn_on_rollback(txn, on_rollback);
>  
> -		trigger_create(on_wal_write, applier_txn_wal_write_cb, NULL, NULL);
> +		item = stailq_first_entry(rows, struct applier_tx_row, next);
> +		rcb->replica_id = replica_id;
> +		rcb->txn_start_tm = item->row.tm;
> +
> +		trigger_create(on_wal_write, applier_txn_wal_write_cb, rcb, NULL);
>  		txn_on_wal_write(txn, on_wal_write);
>  	}


* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
@ 2021-06-07 19:21   ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-08  8:40     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-15 10:03   ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 17+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-07 19:21 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

See 6 comments below.

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index b1571b361..cdd1383e8 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -158,6 +158,18 @@ struct relay {
>  	struct stailq pending_gc;
>  	/** Time when last row was sent to peer. */
>  	double last_row_time;
> +	/**
> +	 * A time difference between the moment when we
> +	 * wrote a transaction to the local WAL and when
> +	 * this transaction has been replicated to remote
> +	 * node (ie written to node's WAL).
> +	 */
> +	double txn_lag;
> +	/**
> +	 * Last timestamp observed from remote node to
> +	 * persist @a txn_lag value.
> +	 */
> +	double txn_acked_tm;
>  	/** Relay sync state. */
>  	enum relay_state state;
>  
> @@ -217,6 +229,12 @@ relay_last_row_time(const struct relay *relay)
>  	return relay->last_row_time;
>  }
>  
> +double
> +relay_txn_lag(const struct relay *relay)
> +{
> +	return relay->txn_lag;

1. As I said in the previous review, you can't read a variable from another
thread without any protection.

Please, use the way I proposed last time. Relay has a 'tx' struct inside,
which is updated on each received ACK. You need to deliver the lag value
to the TX thread in the same way as the acked vclock is delivered, in the
same message preferably (a generic sketch of that shape follows).
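
This is not Tarantool's cbus, just the shape of the idea as a standalone
sketch with made-up names: the reader side packs everything decoded from one
ACK (the acked position plus the lag) into a single message guarded by a
lock, and the TX side copies that message out as a whole, so it never sees
one field updated and the other not.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* One status message: everything the TX side needs, updated as a unit. */
struct status_msg {
	int64_t acked_lsn;	/* stand-in for the acked vclock */
	double txn_lag;		/* lag computed from the echoed timestamp */
};

static pthread_mutex_t status_lock = PTHREAD_MUTEX_INITIALIZER;
static struct status_msg status;

/* Relay reader side: publish both fields in one critical section. */
static void
status_publish(int64_t lsn, double lag)
{
	pthread_mutex_lock(&status_lock);
	status.acked_lsn = lsn;
	status.txn_lag = lag;
	pthread_mutex_unlock(&status_lock);
}

/* TX side: take a consistent snapshot of the whole message. */
static struct status_msg
status_snapshot(void)
{
	pthread_mutex_lock(&status_lock);
	struct status_msg copy = status;
	pthread_mutex_unlock(&status_lock);
	return copy;
}

int
main(void)
{
	status_publish(151, 0.0011014938354492);
	struct status_msg s = status_snapshot();
	printf("acked_lsn = %lld, lag = %.16g\n",
	       (long long)s.acked_lsn, s.txn_lag);
	return 0;
}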

> @@ -629,6 +659,26 @@ relay_reader_f(va_list ap)
>  			/* vclock is followed while decoding, zeroing it. */
>  			vclock_create(&relay->recv_vclock);
>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> +			/*
> +			 * Replica send us last replicated transaction
> +			 * timestamp which is needed for relay lag
> +			 * monitoring. Note that this transaction has
> +			 * been written to WAL with our current realtime
> +			 * clock value, thus when it get reported back we
> +			 * can compute time spent regardless of the clock
> +			 * value on remote replica.
> +			 *
> +			 * An interesting moment is replica restart - it will
> +			 * send us value 0 after that but we can preserve
> +			 * old reported value here since we *assume* that
> +			 * timestamp is not going backwards on properly
> +			 * set up nodes, otherwise the lag get raised.
> +			 * After all this is a not tamper-proof value.

2. I don't understand. Why does it send value 0? And if it does, why
can't you ignore only zeros? The non-0 values must be valid anyway.

> +			 */
> +			if (relay->txn_acked_tm < xrow.tm) {
> +				relay->txn_acked_tm = xrow.tm;
> +				relay->txn_lag = ev_now(loop()) - xrow.tm;
> +			}
> diff --git a/test/replication/gh-5447-downstream-lag.result b/test/replication/gh-5447-downstream-lag.result
> new file mode 100644
> index 000000000..8586d0ed3
> --- /dev/null
> +++ b/test/replication/gh-5447-downstream-lag.result
> @@ -0,0 +1,93 @@
> +-- test-run result file version 2
> +--
> +-- gh-5447: Test for box.info.replication[n].downstream.lag.
> +-- We need to be sure that if replica start been back of
> +-- master node reports own lagging and cluster admin would
> +-- be able to detect such situation.

3. I couldn't parse the last sentence. Could you use some
punctuation? It might help.

> +--
> +
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica with rpl_master=default, \
> +             script="replication/replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +
> +s = box.schema.space.create('test', {engine = engine})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +--
> +-- The replica should wait some time (wal delay is 1 second
> +-- by default) so we would be able to detect the lag, since
> +-- on local instances the lag is minimal and usually transactions
> +-- are handled instantly.

4. But it is not 1 second. usleep(1000) means 1 millisecond, and it
happens in a loop, so it does not matter much. It works until you
set the delay back to false. That keeps the WAL thread blocked until
you free it. It is not a fixed delay.

> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set("ERRINJ_WAL_DELAY", true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.space.test:insert({1})
> + | ---
> + | - [1]
> + | ...
> +test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)

5. This condition is true even before you did the insert.
And it couldn't change during the insert, because there are no
ACKs: the replica can't write to its WAL because of the delay,
it is blocked in a busy loop.

> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set("ERRINJ_WAL_DELAY", false)
> + | ---
> + | - ok
> + | ...
> +--
> +-- Cleanup everything.

6. You need to revoke the granted rights and drop the space.

> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('cleanup server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...


* Re: [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information
  2021-06-07 19:20 ` [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-07 20:00   ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-07 20:00 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jun 07, 2021 at 09:20:50PM +0200, Vladislav Shpilevoy wrote:
> Hi! Thanks for the patchset!
> 
> Please, provide branch and issue links next time.
> https://github.com/tarantool/tarantool/wiki/Code-review-procedure#sending-the-patch
> 
> I assume your branch is gorcunov/gh-5447-relay-lag-8.

It was there, let me quote it for you. I think you might have simply
missed it at a glance:

 > branch gorcunov/gh-5447-relay-lag-8
 > issue https://github.com/tarantool/tarantool/issues/5447


> It seems after your changes some tests start failing. Looks
> like you didn't clear artifacts of your new test.
> https://github.com/tarantool/tarantool/runs/2765786321

I'll take a look, thanks!

> On 07.06.2021 17:55, Cyrill Gorcunov wrote:
> > Guys, take a look once time permit. Previous version is here
> > 
> > https://lists.tarantool.org/tarantool-patches/20210506214132.533913-1-gorcunov@gmail.com/
> > 
> > I still think we might need some general statistics engine because this
> > lag rather common data and we could move other "last" things like last_row_time
> > there but since I didn't get approve on the idea I keep the timestamp
> > in replica structure.
> 
> If you are talking about your perf metrics engine which we tried to
> design for replication, it was approved. The problem is that there was
> no a really good design on how should it look and work.

Not perf metrics (as we discussed back then, plain perf seems to be more
than enough for most developers, if I'm not mistaken). I rather thought
of my previous RFC where I proposed to have some common and shared "stat"
engine with rw operations (a rough standalone sketch follows the quote):

 > Not for merging yet! I think instead of applier_wal_stat structure we might need some
 > _commonly shared_ statistics structure, probably bound to WAL code and all other
 > threads will update it in a lockless way, because we might need to collect more
 > details on WAL processing in the future. I thought of something like
 > 
 > 	enum {
 > 		WAL_ICNT__APPLIER_TXN_START_TM,
 > 
 > 		WAL_ICNT__MAX,
 > 	};
 > 
 > 	struct wal_stat {
 > 		int64_t	icnt[WAL_ICNT__MAX];
 > 	} wal_st[VCLOCK_MAX];
 > 
 > and introduce
 > 
 > 	wal_st__icnt_read(unsigned id);
 > 	wal_st__icnt_write(unsigned id);
 > 
 > then applier will simply push last timestamp to WAL_ICNT__APPLIER_TXN_START_TM
 > counter, and later when we need to send ACK we use wal_st__icnt_read() to
 > fetch it back. We won't need to allocate some dynamic memory for it but
 > rather use statically preallocated shared between threads.


* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-07 19:21   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-08  8:40     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-08  9:58       ` Cyrill Gorcunov via Tarantool-patches
  2021-06-08 18:15       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-08  8:40 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jun 07, 2021 at 09:21:09PM +0200, Vladislav Shpilevoy wrote:
> >  
> > +double
> > +relay_txn_lag(const struct relay *relay)
> > +{
> > +	return relay->txn_lag;
> 
> 1. As I said in the previous review, you can't read a variable from another
> thread without any protection.

Let me explain why I did so: I really don't like that we have to add another
variable into the relay structure. We already have the lag keeper in the
replica structure, and since the lag value is not any kind of sync point or
some flag whose value changes the program flow logic, we can use a parallel
read from another thread. Moreover, we could use a guaranteed atomic read
operation, at least on x86 (via return *(int64_t *)&relay->txn_lag, though we
must be sure the member is qword aligned). But I presume this trick would
confuse other code readers in the future because it is not obvious, and
without deep knowledge of the arch internals it might give a wrong impression
that such a read is a bug, especially since there are no comments in the code.

> 
> Please, use the way I proposed last time. Relay has 'tx' struct inside,
> which is updated on each received ACK. You need to deliver the lag value
> to TX thread in the same way as the acked vclock is delivered. In the
> same message preferably.

Sure, will do.

> > @@ -629,6 +659,26 @@ relay_reader_f(va_list ap)
> >  			/* vclock is followed while decoding, zeroing it. */
> >  			vclock_create(&relay->recv_vclock);
> >  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> > +			/*
> > +			 * Replica send us last replicated transaction
> > +			 * timestamp which is needed for relay lag
> > +			 * monitoring. Note that this transaction has
> > +			 * been written to WAL with our current realtime
> > +			 * clock value, thus when it get reported back we
> > +			 * can compute time spent regardless of the clock
> > +			 * value on remote replica.
> > +			 *
> > +			 * An interesting moment is replica restart - it will
> > +			 * send us value 0 after that but we can preserve
> > +			 * old reported value here since we *assume* that
> > +			 * timestamp is not going backwards on properly
> > +			 * set up nodes, otherwise the lag get raised.
> > +			 * After all this is a not tamper-proof value.
> 
> 2. I don't understand. Why does it send value 0? And if it does, why
> can't you ignore only zeros? The non-0 values must be valid anyway.

When the replica node gets restarted, applier_txn_start_tm is initialized to
zero during relay structure creation, and since there are no new transactions
applier_txn_start_tm remains set to 0, which the replica sends out. Also I
just realized that keeping the lag inside the relay structure seems to be not
very good: on reconnection the relay is recreated from scratch, so I zap the
previously read timestamp to 0.

IOW, the real situation is the following:

 - if the replica is restarted, but the main node is alive, the lag report
   on the main node is dropped to 0

 - if the main node gets restarted, then the lag report is dropped to 0 as well

I suppose this is expected? I'll update the comment above.

> > +++ b/test/replication/gh-5447-downstream-lag.result
> > @@ -0,0 +1,93 @@
> > +-- test-run result file version 2
> > +--
> > +-- gh-5447: Test for box.info.replication[n].downstream.lag.
> > +-- We need to be sure that if replica start been back of
> > +-- master node reports own lagging and cluster admin would
> > +-- be able to detect such situation.
> 
> 3. I couldn't parse the last sentence. Could you use some
> punctuation? It might help.

Would the following be better? "We need to be sure that slow
ACKs delivery might be catched by monitoring tools".

> > +
> > +--
> > +-- The replica should wait some time (wal delay is 1 second
> > +-- by default) so we would be able to detect the lag, since
> > +-- on local instances the lag is minimal and usually transactions
> > +-- are handled instantly.
> 
> 4. But it is not 1 second. usleep(1000) means 1 millisecond, and it

No, usleep(1000) means exactly 1 second, this system call works with
microseconds, I think you misread it with nanosleep().

> happens in a loop, so it does not matter much. It works until you
> set the delay back to false. That makes WAL thread blocked until
> you free it. It is not a fixed delay.

Not sure I follow you here. We force the wal engine to slow down so that
_each_ write takes at least 1 second; in turn this delays the ACK delivery,
and the calculated lag won't be zero.

> > +box.space.test:insert({1})
> > + | ---
> > + | - [1]
> > + | ...
> > +test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)
> 
> 5. This condition is true even before you did the insert.

Indeed, because of space replication.

> And it couldn't change during insert, because there are no
> ACKs - the replica can't write to WAL because of the delay,
> it is blocked in a busy loop.

Hmm, need to think, thanks!

> 
> > + | ---
> > + | - true
> > + | ...
> > +
> > +test_run:switch('replica')
> > + | ---
> > + | - true
> > + | ...
> > +box.error.injection.set("ERRINJ_WAL_DELAY", false)
> > + | ---
> > + | - ok
> > + | ...
> > +--
> > +-- Cleanup everything.
> 
> 6. You need to revoke the granted rights and drop the space.

+1, thanks!


* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-08  8:40     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-08  9:58       ` Cyrill Gorcunov via Tarantool-patches
  2021-06-08 18:15       ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-08  9:58 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Tue, Jun 08, 2021 at 11:40:33AM +0300, Cyrill Gorcunov wrote:
> > 
> > 4. But it is not 1 second. usleep(1000) means 1 millisecond, and it
> 
> No, usleep(1000) means exactly 1 second, this system call works with
> microseconds, I think you misread it with nanosleep().

Silly me, not sure why, but I always get confused by the micro/milli prefixes :)


* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-08  8:40     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-08  9:58       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-08 18:15       ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 17+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-08 18:15 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 08.06.2021 10:40, Cyrill Gorcunov wrote:
> On Mon, Jun 07, 2021 at 09:21:09PM +0200, Vladislav Shpilevoy wrote:
>>>  
>>> +double
>>> +relay_txn_lag(const struct relay *relay)
>>> +{
>>> +	return relay->txn_lag;
>>
>> 1. As I said in the previous review, you can't read a variable from another
>> thread without any protection.
> 
> Let me explain why I did so - I really don't like that we have to add another
> variable into relay structure: we already have the lag keeper in replica
> structure and since the lag value is not any kind of sync point or some flag
> the value of which changes program flow logic, we can use parallel read from
> another thread. Moreover we could use guaranteed atomic read operation, at
> least on x86 (via return *(int64_t *)relay->txn_lag, though we must be sure
> the member is qword aligned).

It is not only x86 anymore. Assumptions on which data correctness
depends must not be made.

>>> @@ -629,6 +659,26 @@ relay_reader_f(va_list ap)
>>>  			/* vclock is followed while decoding, zeroing it. */
>>>  			vclock_create(&relay->recv_vclock);
>>>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
>>> +			/*
>>> +			 * Replica send us last replicated transaction
>>> +			 * timestamp which is needed for relay lag
>>> +			 * monitoring. Note that this transaction has
>>> +			 * been written to WAL with our current realtime
>>> +			 * clock value, thus when it get reported back we
>>> +			 * can compute time spent regardless of the clock
>>> +			 * value on remote replica.
>>> +			 *
>>> +			 * An interesting moment is replica restart - it will
>>> +			 * send us value 0 after that but we can preserve
>>> +			 * old reported value here since we *assume* that
>>> +			 * timestamp is not going backwards on properly
>>> +			 * set up nodes, otherwise the lag get raised.
>>> +			 * After all this is a not tamper-proof value.
>>
>> 2. I don't understand. Why does it send value 0? And if it does, why
>> can't you ignore only zeros? The non-0 values must be valid anyway.
> 
> IOW, the real situation is the following:
> 
>  - if replica restarted, but main node is alive, the lag report on the
>    main node is dropped to 0
> 
>  - if main node get restarted, then lag report is dropped to 0 as well
> 
> I suppose this is expected? I'll update the comment above.

Dropping to 0 on any reconnect, until data is being replicated again,
is fine.

>>> +++ b/test/replication/gh-5447-downstream-lag.result
>>> @@ -0,0 +1,93 @@
>>> +-- test-run result file version 2
>>> +--
>>> +-- gh-5447: Test for box.info.replication[n].downstream.lag.
>>> +-- We need to be sure that if replica start been back of
>>> +-- master node reports own lagging and cluster admin would
>>> +-- be able to detect such situation.
>>
>> 3. I couldn't parse the last sentence. Could you use some
>> punctuation? It might help.
> 
> Would the following be better? "We need to be sure that slow
> ACKs delivery might be catched by monitoring tools".

Yes. Except 'catched' -> 'caught'.

>>> +
>>> +--
>>> +-- The replica should wait some time (wal delay is 1 second
>>> +-- by default) so we would be able to detect the lag, since
>>> +-- on local instances the lag is minimal and usually transactions
>>> +-- are handled instantly.
>>
>> 4. But it is not 1 second. usleep(1000) means 1 millisecond, and it
> 
> No, usleep(1000) means exactly 1 second, this system call works with
> microseconds, I think you misread it with nanosleep().

1 second has 1 000 000 microseconds. 1000us is 1ms. Not 1s.

>> happens in a loop, so it does not matter much. It works until you
>> set the delay back to false. That makes WAL thread blocked until
>> you free it. It is not a fixed delay.
> 
> Not sure I follow you here. We force wal engine to slow down _each_
> write to take at least 1 second long, in turn this will delay the
> ACK delivery and calculated lag won't be zero.

No, this is not how it works. When you turn on the delay, you block
the WAL thread from doing anything. It is not a delay per transaction.
Please, open ERROR_INJECT_SLEEP() (which is used for ERRINJ_WAL_DELAY)
and see how it works. I paste it below.

	#  define ERROR_INJECT_WHILE(ID, CODE) \
		do { \
			while (errinj(ID, ERRINJ_BOOL)->bparam) \
				CODE; \
		} while (0)

It is a busy loop until you drop the injection back to false.


* Re: [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
  2021-06-07 19:20   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-15  9:36   ` Serge Petrenko via Tarantool-patches
  2021-06-16 13:32     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 17+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-15  9:36 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy



07.06.2021 18:55, Cyrill Gorcunov wrote:
> Applier fiber sends current vclock of the node to remote relay reader,
> pointing current state of fetched WAL data so the relay will know which
> new data should be sent. The packet applier sends carries xrow_header::tm
> field as a zero but we can reuse it to provide information about first
> timestamp in a transaction we wrote to our WAL. Since old instances of
> Tarantool simply ignore this field such extension won't cause any
> problems.
>
> The timestamp will be needed to account lag of downstream replicas
> suitable for information purpose and cluster health monitoring.
>
> We update applier statistics in WAL callbacks but since both
> apply_synchro_row and apply_plain_tx are used not only in real data
> application but in final join stage as well (in this stage we're not
> writing the data yet) the apply_synchro_row is extended with replica_id
> argument which is non zero when applier is subscribed.
>
> The calculation of the downstream lag itself lag will be addressed
> in next patch because sending the timestamp and its observation
> are independent actions.

Hi! Thanks for the patch!

Looks good generally. Please, find one question below.


>
> Part-of #5447
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>   src/box/applier.cc     | 90 +++++++++++++++++++++++++++++++++++-------
>   src/box/replication.cc |  1 +
>   src/box/replication.h  |  5 +++
>   3 files changed, 81 insertions(+), 15 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 33181fdbf..38695a54f 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -163,6 +163,9 @@ applier_writer_f(va_list ap)
>   	struct ev_io io;
>   	coio_create(&io, applier->io.fd);
>   
> +	/* ID is permanent while applier is alive */
> +	uint32_t replica_id = applier->instance_id;
> +
>   	while (!fiber_is_cancelled()) {
>   		/*
>   		 * Tarantool >= 1.7.7 sends periodic heartbeat
> @@ -193,6 +196,16 @@ applier_writer_f(va_list ap)
>   			applier->has_acks_to_send = false;
>   			struct xrow_header xrow;
>   			xrow_encode_vclock(&xrow, &replicaset.vclock);
> +			/*
> +			 * For relay lag statistics we report last
> +			 * written transaction timestamp in tm field.
> +			 *
> +			 * Replica might be dead already so we have to
> +			 * test on each iteration.
> +			 */
> +			struct replica *r = replica_by_id(replica_id);
> +			if (likely(r != NULL))
> +				xrow.tm = r->applier_txn_start_tm;

How could a replica be dead here?
AFAIR we delete a replica only when it's deleted from _cluster. Shouldn't
the applier writer be dead as well by that time?

>   			coio_write_xrow(&io, &xrow);
>   			ERROR_INJECT(ERRINJ_APPLIER_SLOW_ACK, {
>   				fiber_sleep(0.01);
> @@ -490,7 +503,7 @@ static uint64_t
>   applier_read_tx(struct applier *applier, struct stailq *rows, double timeout);
>   
>   static int
> -apply_final_join_tx(struct stailq *rows);
> +apply_final_join_tx(uint32_t replica_id, struct stailq *rows);
>   
>   /**
>    * A helper struct to link xrow objects in a list.
> @@ -535,7 +548,7 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
>   						  next)->row);
>   			break;
>   		}
> -		if (apply_final_join_tx(&rows) != 0)
> +		if (apply_final_join_tx(applier->instance_id, &rows) != 0)
>   			diag_raise();
>   	}
>   
> @@ -751,11 +764,35 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
>   	return 0;
>   }
>   
> +struct replica_cb_data {
> +	/** Replica ID the data belongs to. */
> +	uint32_t replica_id;
> +	/**
> +	 * Timestamp of a transaction to be accounted
> +	 * for relay lag. Usually it is a first row in
> +	 * a transaction.
> +	 */
> +	double txn_start_tm;
> +};
> +
>

-- 
Serge Petrenko



* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
  2021-06-07 19:21   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-15 10:03   ` Serge Petrenko via Tarantool-patches
  2021-06-17  6:55     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 17+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-15 10:03 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy



07.06.2021 18:55, Cyrill Gorcunov wrote:
> We already have `box.replication.upstream.lag` entry for monitoring
> sake. Same time in synchronous replication timeouts are key properties
> for quorum gathering procedure. Thus we would like to know how long
> it took of a transaction to traverse `initiator WAL -> network ->
> remote applier -> ACK` path.

Hi! Thanks for the patch! Please, find a couple of comments below.

>
> Typical output is
>
>   | tarantool> box.info.replication[2].downstream
>   | ---
>   | - status: follow
>   |   idle: 0.61753897101153
>   |   vclock: {1: 147}
>   |   lag: 0
>   | ...
>   | tarantool> box.space.sync:insert{69}
>   | ---
>   | - [69]
>   | ...
>   |
>   | tarantool> box.info.replication[2].downstream
>   | ---
>   | - status: follow
>   |   idle: 0.75324084801832
>   |   vclock: {1: 151}
>   |   lag: 0.0011014938354492
>   | ...
>
> Closes #5447
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
>
> @TarantoolBot document
> Title: Add `box.info.replication[n].downstream.lag` entry
>
> `replication[n].downstream.lag` is the time difference between
> last transactions been written to the WAL journal of the transaction
> initiator and the transaction written to WAL on the `n` replica.
>
> In other words this is a lag in seconds between the main node writes
> data to own WAL and the replica `n` get this data replicated to own
> WAL journal.

This is not true. You describe `upstream.lag` in this paragraph.
Downstream lag is the time difference between the WAL write on the master
side and the receipt of an ACK (confirmation of a WAL write on the replica)
for this transaction, also measured on the master side.

>
> In case a transaction fails to replicate, the lag value won't be
> modified, because only successfully applied transactions are accounted.
> At the same time, if the main node or a replica gets restarted, the lag
> value will be zero until the next successful transaction.
> ---
>   .../unreleased/gh-5447-downstream-lag.md      |  6 ++
>   src/box/lua/info.c                            |  3 +
>   src/box/relay.cc                              | 50 ++++++++++
>   src/box/relay.h                               |  6 ++
>   .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
>   .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
>   6 files changed, 199 insertions(+)
>   create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
>   create mode 100644 test/replication/gh-5447-downstream-lag.result
>   create mode 100644 test/replication/gh-5447-downstream-lag.test.lua
>
> diff --git a/changelogs/unreleased/gh-5447-downstream-lag.md b/changelogs/unreleased/gh-5447-downstream-lag.md
> new file mode 100644
> index 000000000..726175c6c
> --- /dev/null
> +++ b/changelogs/unreleased/gh-5447-downstream-lag.md
> @@ -0,0 +1,6 @@
> +#feature/replication
> +
> + * Introduced `box.info.replication[n].downstream.lag` field to monitor
> +   state of replication. This member represents time spent between
> +   transaction been written to initiator's WAL file and reached WAL
> +   file of a replica (gh-5447).

Same here.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f
  2021-06-15  9:36   ` Serge Petrenko via Tarantool-patches
@ 2021-06-16 13:32     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-17  9:16       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-16 13:32 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tml

On Tue, Jun 15, 2021 at 12:36:02PM +0300, Serge Petrenko wrote:
> > @@ -193,6 +196,16 @@ applier_writer_f(va_list ap)
> >   			applier->has_acks_to_send = false;
> >   			struct xrow_header xrow;
> >   			xrow_encode_vclock(&xrow, &replicaset.vclock);
> > +			/*
> > +			 * For relay lag statistics we report last
> > +			 * written transaction timestamp in tm field.
> > +			 *
> > +			 * Replica might be dead already so we have to
> > +			 * test on each iteration.
> > +			 */
> > +			struct replica *r = replica_by_id(replica_id);
> > +			if (likely(r != NULL))
> > +				xrow.tm = r->applier_txn_start_tm;
> 
> How could a replica be dead here?
> AFAIR we delete a replica only when it's deleted from _cluster. Shouldn't
> the applier writer be dead as well by that time?

Before accessing replica_by_id we're sitting in the event loop trying to
fetch data from the network. This means an admin may clean up the entry
manually before we get back to this code, in which case we would get a
nil dereference (unless I'm missing something).
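
For example (just a sketch, the replica id 2 is hypothetical), nothing
stops an admin from dropping the registration by hand while we are parked
in the event loop:

    -- on the master, while the relay and the applier writer fibers
    -- for that peer may still be running
    box.space._cluster:delete{2}   -- unregisters the replica with id = 2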

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-15 10:03   ` Serge Petrenko via Tarantool-patches
@ 2021-06-17  6:55     ` Cyrill Gorcunov via Tarantool-patches
  2021-06-17  9:01       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-17  6:55 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tml

On Tue, Jun 15, 2021 at 01:03:59PM +0300, Serge Petrenko wrote:
> > 
> > In other words this is a lag in seconds between the main node writes
> > data to own WAL and the replica `n` get this data replicated to own
> > WAL journal.
> 
> This is not true. You describe `upstream.lag` in this paragraph.
> Downstream lag is the time difference between the WAL write on the master
> side and the receipt of an ack (a confirmation of the WAL write on the
> replica) for this transaction, also measured on the master side.

No. The upstream.lag is calculated as

applier_read_tx_row
  ...
  applier->lag = ev_now(loop()) - row->tm;

IOW it is the time between the data being written to the txn initiator's
WAL and the packet being _received_ by the replica, and there is no
guarantee that the packet will be *replicated* at all. In turn, for
downstream we compute the time delta when the data is already written
to the replica's WAL.
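
In operator terms the difference looks like this (a sketch, instance ids
are hypothetical):

    -- on the replica: upstream.lag covers master's WAL write -> the row
    -- merely *received* here; it may still fail to reach the replica's WAL
    box.info.replication[1].upstream.lag
    -- on the master: downstream.lag covers master's WAL write -> the ACK
    -- read back by the relay; the ACK is sent only after the replica
    -- has written the row to its own WAL
    box.info.replication[2].downstream.lag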

	Cyrill

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-17  6:55     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-17  9:01       ` Serge Petrenko via Tarantool-patches
  2021-06-17  9:58         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 17+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-17  9:01 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Vladislav Shpilevoy, tml



17.06.2021 09:55, Cyrill Gorcunov wrote:
> On Tue, Jun 15, 2021 at 01:03:59PM +0300, Serge Petrenko wrote:
>>> In other words this is a lag in seconds between the main node writes
>>> data to own WAL and the replica `n` get this data replicated to own
>>> WAL journal.
>> This is not true. You describe `upstream.lag` in this paragraph.
>> Downstream lag is the time difference between the WAL write on the master
>> side and the receipt of an ack (a confirmation of the WAL write on the
>> replica) for this transaction, also measured on the master side.
> No. The upstream.lag is calculated as
>
> applier_read_tx_row
>    ...
>    applier->lag = ev_now(loop()) - row->tm;

Ok, I meant your description looks similar to upstream.lag.

>
> IOW it is the time between the data being written to the txn initiator's
> WAL and the packet being _received_ by the replica, and there is no
> guarantee that the packet will be *replicated* at all. In turn, for
> downstream we compute the time delta when the data is already written
> to the replica's WAL.
>
> 	Cyrill
>
I think I understand what you mean now. But still, this description is
confusing.

When you're saying it's "the time delta when the data is already written
to the replica's WAL", it sounds like the delta is taken on the replica,
not on the master. And that's not true.
It's also unclear from this description that the packet travel time (the
ack going from the replica to the master) is included in downstream.lag.

So I would reformulate the description to something like

"Downstream.lag is the lag between the main node writes a certain 
transaction
to it's own WAL and the moment it receives an ack for this transaction 
from a replica".


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f
  2021-06-16 13:32     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-17  9:16       ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 17+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-17  9:16 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Vladislav Shpilevoy, tml



16.06.2021 16:32, Cyrill Gorcunov wrote:
> On Tue, Jun 15, 2021 at 12:36:02PM +0300, Serge Petrenko wrote:
>>> @@ -193,6 +196,16 @@ applier_writer_f(va_list ap)
>>>    			applier->has_acks_to_send = false;
>>>    			struct xrow_header xrow;
>>>    			xrow_encode_vclock(&xrow, &replicaset.vclock);
>>> +			/*
>>> +			 * For relay lag statistics we report last
>>> +			 * written transaction timestamp in tm field.
>>> +			 *
>>> +			 * Replica might be dead already so we have to
>>> +			 * test on each iteration.
>>> +			 */
>>> +			struct replica *r = replica_by_id(replica_id);
>>> +			if (likely(r != NULL))
>>> +				xrow.tm = r->applier_txn_start_tm;
>> How could a replica be dead here?
>> AFAIR we delete a replica only when it's deleted from _cluster. Shouldn't
>> the applier writer be dead as well by that time?
> Before accessing replica_by_id we're sitting in the event loop trying to
> fetch data from the network. This means an admin may clean up the entry
> manually before we get back to this code, in which case we would get a
> nil dereference (unless I'm missing something).

I mean that this code you're referring to is executed in a
while (!fiber_is_cancelled()) {...} loop.

IIRC applier->writer gets cancelled as soon as the replica is unregistered.
So even if someone deletes the entry manually, the writer will exit the loop
before getting a nil dereference.

P.S. I couldn't find this anywhere in the code, so let's leave the check.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag
  2021-06-17  9:01       ` Serge Petrenko via Tarantool-patches
@ 2021-06-17  9:58         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 17+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-17  9:58 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tml

On Thu, Jun 17, 2021 at 12:01:32PM +0300, Serge Petrenko wrote:
> I think I understand what you mean now. But still, this description is
> confusing.
> 
> When you're saying it's "time delta when data is already written to a
> replica's WAL" it sounds like the delta is taken on replica, not on master.
> And that's not true.
> It's also unclear from this description that packet travel time (ack from
> replica to master) is included into downstream.lag
> 
> So I would reformulate the description to something like
> 
> "Downstream.lag is the lag between the main node writes a certain
> transaction to it's own WAL and the moment it receives an ack for
> this transaction from a replica".

Thanks, Serge! I'll update the description.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-06-17  9:58 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-07 15:55 [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 1/2] applier: send transaction's first row WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
2021-06-07 19:20   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-15  9:36   ` Serge Petrenko via Tarantool-patches
2021-06-16 13:32     ` Cyrill Gorcunov via Tarantool-patches
2021-06-17  9:16       ` Serge Petrenko via Tarantool-patches
2021-06-07 15:55 ` [Tarantool-patches] [PATCH v8 2/2] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
2021-06-07 19:21   ` Vladislav Shpilevoy via Tarantool-patches
2021-06-08  8:40     ` Cyrill Gorcunov via Tarantool-patches
2021-06-08  9:58       ` Cyrill Gorcunov via Tarantool-patches
2021-06-08 18:15       ` Vladislav Shpilevoy via Tarantool-patches
2021-06-15 10:03   ` Serge Petrenko via Tarantool-patches
2021-06-17  6:55     ` Cyrill Gorcunov via Tarantool-patches
2021-06-17  9:01       ` Serge Petrenko via Tarantool-patches
2021-06-17  9:58         ` Cyrill Gorcunov via Tarantool-patches
2021-06-07 19:20 ` [Tarantool-patches] [PATCH v8 0/2] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
2021-06-07 20:00   ` Cyrill Gorcunov via Tarantool-patches
