Tarantool development patches archive
* [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information
@ 2021-04-30 15:39 Cyrill Gorcunov via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-30 15:39 UTC (permalink / raw)
  To: tml; +Cc: Mons Anderson, Vladislav Shpilevoy

Guys, this is *NOT* for merging but rather to gather comments on the
code structure and overall idea.

Here is the code flow, to refresh your memory:

  MASTER NODE
  ===========

  TX
  ==
  main.sched
  |
  `- box_process_rw
  ^  `- txn_commit
  |     `- alloc xrow
  |        `- journal_write
  |           `- wal_assign_lsn
  |           `- write to disk
  |           `- wal_notify_watchers
  |               |
  +---------------+ wakeup relay thread
                  |
                  v
                RELAY THREAD
                ============
                relay_subscribe_f
                `- relay_reader_f
                |   `- coio_read_xrow_timeout_xc <------------------+
                |                                                   |
                `- relay_process_wal_event                          |
                   `- recover_remaining_wals                        |
                      `- relay_send                                 |
                          |                                         |
                          | read xrows from disk                    |
                          | and send them to replica's              |
                          | applier                                 |
                          |                                         |
                          |                                         |
  REPLICA NODE            |                                         |
  ============            |                                         ^
  TX                      |                                         |
  ==                      |                                         |
  main.sched              |                                         |
  `- applier_apply_tx <---+                                         |
  |  `- apply_synchro_row (if CONFIRM | ROLLBACK)                   |
  |  |  `- journal_write                                            |
  |  |  `- applier->first_row_wal_time from xrow::tm                |
  |  `- apply_plain_tx                                              |
  |     `- txn_commit_try_async                                     |
  |        `- applier_txn_wal_write_cb                              |
  |           `- applier->first_row_wal_time from xrow::tm          |
  |                                                                 |
  `- applier_writer_f                                               |
     `- xrow_encode_vclock_timed(applier->first_row_wal_time)       |
        `- coio_write_xrow -----------------------------------------+

Typical output is something like

 (freshly started)
 |tarantool> box.info.replication
 |---
 |- 1:
 |    id: 1
 |    uuid: f94edca8-71d4-46c9-b9d2-620a6a2bd977
 |    lsn: 121
 |  2:
 |    id: 2
 |    uuid: f6ac84e1-a040-48d9-a9c7-f8147b8e2c9e
 |    lsn: 0
 |    upstream:
 |      status: follow
 |      idle: 0.56554910800332
 |      peer: replicator@127.0.0.1:3302
 |      lag: 0.00021719932556152
 |    downstream:
 |      status: follow
 |      idle: 0.52823433600133
 |      vclock: {1: 121}
 |      lag: 0
 |...

After new data is sent:

 |tarantool> box.space.sync:insert{55}
 |---
 |- [55]
 |...
 | tarantool> box.info.replication
 | ---
 | - 1:
 |     id: 1
 |     uuid: f94edca8-71d4-46c9-b9d2-620a6a2bd977
 |     lsn: 123
 |   2:
 |     id: 2
 |     uuid: f6ac84e1-a040-48d9-a9c7-f8147b8e2c9e
 |     lsn: 0
 |     upstream:
 |       status: follow
 |       idle: 0.96756215799542
 |       peer: replicator@127.0.0.1:3302
 |       lag: 0.0002143383026123
 |     downstream:
 |       status: follow
 |       idle: 0.31903971399879
 |       vclock: {1: 123}
 |       lag: 0.0010807514190674
 | ...

Please take a look at the applier notification structure and naming. Actually
I don't really like the `downstream.lag` name either because this is not a counterpart
for `upstream.lag` as far as I understand but rather the packet traversal time, so maybe
`downstream.wal-lag` would be more suitable? Also in idle cycles downstream.lag
is not changed, which might confuse readers, because `upstream.lag` does.

Anyway, any comments on the code structure and the overall idea would be highly
appreciated. Again, this series is not for merging: there are no docs and no
tests yet, I did only manual testing.

Previous version https://lists.tarantool.org/tarantool-patches/20210201100037.212301-1-gorcunov@gmail.com/

branch: gorcunov/gh-5447-relay-lag-3
issue: https://github.com/tarantool/tarantool/issues/5447

Cyrill Gorcunov (3):
  xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  applier: send first row's WAL time in the applier_writer_f
  relay: provide information about downstream lag

 src/box/applier.cc | 84 ++++++++++++++++++++++++++++++++++++++--------
 src/box/applier.h  |  5 +++
 src/box/lua/info.c |  3 ++
 src/box/relay.cc   | 46 ++++++++++++++++++++++---
 src/box/relay.h    |  3 ++
 src/box/xrow.c     |  5 ++-
 src/box/xrow.h     | 21 ++++++++++--
 7 files changed, 146 insertions(+), 21 deletions(-)


base-commit: 7fd53b4c5264bdbc8f01858409abe52bc38764c8
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-04-30 15:39 [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 15:39 ` Cyrill Gorcunov via Tarantool-patches
  2021-04-30 20:45   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-03 20:21   ` Konstantin Osipov via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-30 15:39 UTC (permalink / raw)
  To: tml; +Cc: Mons Anderson, Vladislav Shpilevoy

We will enhance applier writer with timestamp of last written row.
It is needed to detect downstream lag.

Part-of #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/xrow.c |  5 ++++-
 src/box/xrow.h | 21 +++++++++++++++++++--
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/box/xrow.c b/src/box/xrow.c
index 2e364cea5..46ae23a78 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -1643,7 +1643,9 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid)
 }
 
 int
-xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock)
+xrow_encode_vclock_timed(struct xrow_header *row,
+			 const struct vclock *vclock,
+			 double tm)
 {
 	memset(row, 0, sizeof(*row));
 
@@ -1662,6 +1664,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock)
 	row->body[0].iov_base = buf;
 	row->body[0].iov_len = (data - buf);
 	row->bodycnt = 1;
+	row->tm = tm;
 	row->type = IPROTO_OK;
 	return 0;
 }
diff --git a/src/box/xrow.h b/src/box/xrow.h
index b3c664be2..f48e64b83 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -495,6 +495,20 @@ xrow_decode_register(struct xrow_header *row, struct tt_uuid *instance_uuid,
 				     NULL, NULL);
 }
 
+/**
+ * Encode vector clock with timestamp.
+ * @param row[out] Row to encode into.
+ * @param vclock vector clock to encode.
+ * @param tm timestamp.
+ *
+ * @retval  0 Success.
+ * @retval -1 Memory error.
+ */
+int
+xrow_encode_vclock_timed(struct xrow_header *row,
+			 const struct vclock *vclock,
+			 double tm);
+
 /**
  * Encode end of stream command (a response to JOIN command).
  * @param row[out] Row to encode into.
@@ -503,8 +517,11 @@ xrow_decode_register(struct xrow_header *row, struct tt_uuid *instance_uuid,
  * @retval  0 Success.
  * @retval -1 Memory error.
  */
-int
-xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock);
+static inline int
+xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock)
+{
+	return xrow_encode_vclock_timed(row, vclock, 0);
+}
 
 /**
  * Decode end of stream command (a response to JOIN command).
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f
  2021-04-30 15:39 [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 15:39 ` Cyrill Gorcunov via Tarantool-patches
  2021-04-30 20:49   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
  2021-04-30 20:45 ` [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
  3 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-30 15:39 UTC (permalink / raw)
  To: tml; +Cc: Mons Anderson, Vladislav Shpilevoy

This fiber sends current vclock of the node to remote relay reader.
This packet carries xrow_header::tm field as a zero but we can reuse
it to provide information about first timestamp in a transaction we
wrote to our WAL. Since old instances of Tarantool simply ignore this
field such extension won't cause any problems.

This timestamp will be needed to account lags of downstream replicas
suitable for information purpose and cluster health monitoring.

We update applier statistics in WAL callbacks but since both
apply_synchro_row and apply_plain_tx are used not only in real data
application but in final join stage as well (in this stage we're not
yet writing the data) the apply_synchro_row is extended with a flag
pointing that applier update is needed.

Same time the apply_plain_tx uses asynchronous WAL write completion
and at moment when the write procedure is finished the applier might
be removed from replicaset already thus we use applier's instance
to lookup if it is still alive.

The calculation of the downstream lag itself lag will be addressed
in next patch because sending the timestamp and its observation
are independent actions.

Part-of #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc | 84 ++++++++++++++++++++++++++++++++++++++--------
 src/box/applier.h  |  5 +++
 2 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 33181fdbf..626dc0324 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -192,7 +192,8 @@ applier_writer_f(va_list ap)
 		try {
 			applier->has_acks_to_send = false;
 			struct xrow_header xrow;
-			xrow_encode_vclock(&xrow, &replicaset.vclock);
+			xrow_encode_vclock_timed(&xrow, &replicaset.vclock,
+						 applier->first_row_wal_time);
 			coio_write_xrow(&io, &xrow);
 			ERROR_INJECT(ERRINJ_APPLIER_SLOW_ACK, {
 				fiber_sleep(0.01);
@@ -490,7 +491,7 @@ static uint64_t
 applier_read_tx(struct applier *applier, struct stailq *rows, double timeout);
 
 static int
-apply_final_join_tx(struct stailq *rows);
+apply_final_join_tx(struct applier *applier, struct stailq *rows);
 
 /**
  * A helper struct to link xrow objects in a list.
@@ -535,7 +536,7 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
 						  next)->row);
 			break;
 		}
-		if (apply_final_join_tx(&rows) != 0)
+		if (apply_final_join_tx(applier, &rows) != 0)
 			diag_raise();
 	}
 
@@ -751,11 +752,41 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
 	return 0;
 }
 
+/** Applier WAL related statistics */
+struct awstat {
+	uint32_t instance_id;
+	double first_row_tm;
+};
+
+static void
+awstat_update(struct awstat *awstat)
+{
+	/* Ignore if not needed */
+	if (awstat->instance_id == 0)
+		return;
+
+	/*
+	 * Write to WAL happens in two contexts: as
+	 * synchronous writes and as asynchronous. In
+	 * second case the applier might be already
+	 * stopped and removed.
+	 */
+	struct replica *r = replica_by_id(awstat->instance_id);
+	if (r == NULL && r->applier == NULL)
+		return;
+
+	r->applier->first_row_wal_time = awstat->first_row_tm;
+}
+
 static int
 applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
 	(void) event;
+
+	struct awstat *awstat = (struct awstat *)trigger->data;
+	awstat_update(awstat);
+
 	/* Broadcast the WAL write across all appliers. */
 	trigger_run(&replicaset.applier.on_wal_write, NULL);
 	return 0;
@@ -766,6 +797,8 @@ struct synchro_entry {
 	struct synchro_request *req;
 	/** Fiber created the entry. To wakeup when WAL write is done. */
 	struct fiber *owner;
+	/** WAL bound statistics. */
+	struct awstat awstat;
 	/**
 	 * The base journal entry. It has unsized array and then must be the
 	 * last entry in the structure. But can workaround it via a union
@@ -789,6 +822,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
 	} else {
+		awstat_update(&synchro_entry->awstat);
 		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
@@ -797,7 +831,8 @@ apply_synchro_row_cb(struct journal_entry *entry)
 
 /** Process a synchro request. */
 static int
-apply_synchro_row(struct xrow_header *row)
+apply_synchro_row(struct applier *applier, struct xrow_header *row,
+		  bool use_awstat)
 {
 	assert(iproto_type_is_synchro_request(row->type));
 
@@ -817,6 +852,12 @@ apply_synchro_row(struct xrow_header *row)
 			     apply_synchro_row_cb, &entry);
 	entry.req = &req;
 	entry.owner = fiber();
+	if (use_awstat) {
+		entry.awstat.instance_id = applier->instance_id;
+		entry.awstat.first_row_tm = row->tm;
+	} else {
+		entry.awstat.instance_id = 0;
+	}
 	/*
 	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
 	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
@@ -862,8 +903,9 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
 	return box_raft_process(&req, applier->instance_id);
 }
 
-static inline int
-apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
+static int
+apply_plain_tx(struct applier *applier, struct stailq *rows,
+	       bool skip_conflict, bool use_triggers)
 {
 	/*
 	 * Explicitly begin the transaction so that we can
@@ -931,10 +973,21 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 			goto fail;
 		}
 
+		struct awstat *awstat;
+		awstat = region_alloc_object(&txn->region, typeof(*awstat), &size);
+		if (awstat == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc_object", "awstat");
+			goto fail;
+		}
+
 		trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 		txn_on_rollback(txn, on_rollback);
 
-		trigger_create(on_wal_write, applier_txn_wal_write_cb, NULL, NULL);
+		item = stailq_first_entry(rows, struct applier_tx_row, next);
+		awstat->instance_id = applier->instance_id;
+		awstat->first_row_tm = item->row.tm;
+
+		trigger_create(on_wal_write, applier_txn_wal_write_cb, awstat, NULL);
 		txn_on_wal_write(txn, on_wal_write);
 	}
 
@@ -946,7 +999,7 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 
 /** A simpler version of applier_apply_tx() for final join stage. */
 static int
-apply_final_join_tx(struct stailq *rows)
+apply_final_join_tx(struct applier *applier, struct stailq *rows)
 {
 	struct xrow_header *first_row =
 		&stailq_first_entry(rows, struct applier_tx_row, next)->row;
@@ -957,9 +1010,9 @@ apply_final_join_tx(struct stailq *rows)
 	vclock_follow_xrow(&replicaset.vclock, last_row);
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		assert(first_row == last_row);
-		rc = apply_synchro_row(first_row);
+		rc = apply_synchro_row(applier, first_row, false);
 	} else {
-		rc = apply_plain_tx(rows, false, false);
+		rc = apply_plain_tx(applier, rows, false, false);
 	}
 	fiber_gc();
 	return rc;
@@ -1088,11 +1141,14 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		 * each other.
 		 */
 		assert(first_row == last_row);
-		if ((rc = apply_synchro_row(first_row)) != 0)
+		rc = apply_synchro_row(applier, first_row, true);
+		if (rc != 0)
+			goto finish;
+	} else {
+		rc = apply_plain_tx(applier, rows,
+				    replication_skip_conflict, true);
+		if (rc != 0)
 			goto finish;
-	} else if ((rc = apply_plain_tx(rows, replication_skip_conflict,
-					true)) != 0) {
-		goto finish;
 	}
 	vclock_follow(&replicaset.applier.vclock, last_row->replica_id,
 		      last_row->lsn);
diff --git a/src/box/applier.h b/src/box/applier.h
index 15ca1fcfd..bd98827e7 100644
--- a/src/box/applier.h
+++ b/src/box/applier.h
@@ -93,6 +93,11 @@ struct applier {
 	ev_tstamp last_row_time;
 	/** Number of seconds this replica is behind the remote master */
 	ev_tstamp lag;
+	/**
+	 * WAL time of first applied row in a transaction.
+	 * For relay statistics sake.
+	 */
+	double first_row_wal_time;
 	/** The last box_error_code() logged to avoid log flooding */
 	uint32_t last_logged_errcode;
 	/** Remote instance ID. */
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag
  2021-04-30 15:39 [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 15:39 ` Cyrill Gorcunov via Tarantool-patches
  2021-04-30 20:50   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-30 20:45 ` [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
  3 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-30 15:39 UTC (permalink / raw)
  To: tml; +Cc: Mons Anderson, Vladislav Shpilevoy

We already have `box.replication.upstream.lag` entry for monitoring
sake. Same time in synchronous replication timeouts are key properties
for quorum gathering procedure. Thus we would like to know how long
it took of a transaction to traverse `initiator WAL -> network ->
remote applier -> ACK` path.

In this patch we use new applier's functionality which returns us
the timestamp of first written xrow in a transaction such that we
can calculate the downstream lag.

Closes #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/lua/info.c |  3 +++
 src/box/relay.cc   | 46 ++++++++++++++++++++++++++++++++++++++++++----
 src/box/relay.h    |  3 +++
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 0eb48b823..b72dd9915 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -143,6 +143,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
 		lua_pushnumber(L, ev_monotonic_now(loop()) -
 			       relay_last_row_time(relay));
 		lua_settable(L, -3);
+		lua_pushstring(L, "lag");
+		lua_pushnumber(L, relay_peer_wal_lag(relay));
+		lua_settable(L, -3);
 		break;
 	case RELAY_STOPPED:
 	{
diff --git a/src/box/relay.cc b/src/box/relay.cc
index ff43c2fc7..6d880932a 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -158,6 +158,12 @@ struct relay {
 	struct stailq pending_gc;
 	/** Time when last row was sent to peer. */
 	double last_row_time;
+	/**
+	 * A time difference between moment when we
+	 * wrote the row in a local WAL and peer wrote
+	 * it in own WAL.
+	 */
+	double peer_wal_lag;
 	/** Relay sync state. */
 	enum relay_state state;
 
@@ -217,6 +223,12 @@ relay_last_row_time(const struct relay *relay)
 	return relay->last_row_time;
 }
 
+double
+relay_peer_wal_lag(const struct relay *relay)
+{
+	return relay->peer_wal_lag;
+}
+
 static void
 relay_send(struct relay *relay, struct xrow_header *packet);
 static void
@@ -614,15 +626,41 @@ relay_reader_f(va_list ap)
 	coio_create(&io, relay->io.fd);
 	ibuf_create(&ibuf, &cord()->slabc, 1024);
 	try {
-		while (!fiber_is_cancelled()) {
-			struct xrow_header xrow;
+		struct xrow_header xrow;
+		double prev_tm;
+
+		/*
+		 * Make a first read as a separate action because
+		 * we need previous timestamp from the xrow to
+		 * calculate delta from (to eliminate branching
+		 * in next reads).
+		 */
+		if (!fiber_is_cancelled()) {
 			coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
-					replication_disconnect_timeout());
+				replication_disconnect_timeout());
+			prev_tm = xrow.tm;
+		}
+
+		do {
 			/* vclock is followed while decoding, zeroing it. */
 			vclock_create(&relay->recv_vclock);
 			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
+			/*
+			 * Old instances do not report the timestamp.
+			 * Same time in case of idle cycles the xrow.tm
+			 * is the same so update lag only when new data
+			 * been acked.
+			 */
+			if (prev_tm != xrow.tm) {
+				double delta = ev_now(loop()) - xrow.tm;
+				relay->peer_wal_lag = delta;
+				prev_tm = xrow.tm;
+			}
 			fiber_cond_signal(&relay->reader_cond);
-		}
+
+			coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
+					replication_disconnect_timeout());
+		} while (!fiber_is_cancelled());
 	} catch (Exception *e) {
 		relay_set_error(relay, e);
 		fiber_cancel(relay_f);
diff --git a/src/box/relay.h b/src/box/relay.h
index b32e2ea2a..99e2179cb 100644
--- a/src/box/relay.h
+++ b/src/box/relay.h
@@ -93,6 +93,9 @@ relay_vclock(const struct relay *relay);
 double
 relay_last_row_time(const struct relay *relay);
 
+double
+relay_peer_wal_lag(const struct relay *relay);
+
 /**
  * Send a Raft update request to the relay channel. It is not
  * guaranteed that it will be delivered. The connection may break.
-- 
2.30.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information
  2021-04-30 15:39 [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 20:45 ` Vladislav Shpilevoy via Tarantool-patches
  3 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-30 20:45 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

Hi! Thanks for the patchset!

> Please take a look at the applier notification structure and naming. Actually
> I don't really like the `downstream.lag` name either because this is not a counterpart
> for `upstream.lag` as far as I understand but rather the packet traversal time, so maybe
> `downstream.wal-lag` would be more suitable? Also in idle cycles downstream.lag

For me 'downstream.lag' looks good. `wal-lag` looks bad, because you would need
to put this name in quotes to use it in Lua instead of indexing via '.'.

But another option would be 'downstream.latency'.

> is not changed, which might confuse readers, because `upstream.lag` does.

This is indeed not good. Should be fixed somehow. If there are no rows to ACK,
it could be dropped to 0, for example. Also you could use relay heartbeat timestamps
on the applier side to report just network latency (since there are no rows to
write to WAL).
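
Something like this, perhaps (a rough sketch on top of patch 3; 'sent_vclock'
is hypothetical - whatever tracks the vclock of the last row the relay pushed
to the peer):

	/* In relay_reader_f(), after decoding the ACK vclock. */
	if (vclock_compare(&relay->recv_vclock, sent_vclock) == 0) {
		/* The peer has ACKed everything, nothing is lagging. */
		relay->peer_wal_lag = 0;
	} else if (prev_tm != xrow.tm) {
		relay->peer_wal_lag = ev_now(loop()) - xrow.tm;
		prev_tm = xrow.tm;
	}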

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 20:45   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-30 20:50     ` Cyrill Gorcunov via Tarantool-patches
  2021-05-03 20:21   ` Konstantin Osipov via Tarantool-patches
  1 sibling, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-30 20:45 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

Thanks for the patch!

I am not sure it is needed though. You can simply assign tm after
encode(). Grep `xrow_encode_vclock()` to see examples where sync
is assigned after encode, it looks fine.

Too many changes just for a single usage.
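
To be concrete, something along these lines in applier_writer_f() should be
enough (an untested sketch; the first_row_wal_time field is the one added in
the next patch of the series):

	struct xrow_header xrow;
	xrow_encode_vclock(&xrow, &replicaset.vclock);
	/* Reuse the otherwise unused tm field for the ACK timestamp. */
	xrow.tm = applier->first_row_wal_time;
	coio_write_xrow(&io, &xrow);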

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 20:49   ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-05 13:06     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-30 20:49 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

Thanks for working on this!

On 30.04.2021 17:39, Cyrill Gorcunov wrote:
> This fiber sends current vclock of the node to remote relay reader.
> This packet carries xrow_header::tm field as a zero but we can reuse
> it to provide information about first timestamp in a transaction we
> wrote to our WAL. Since old instances of Tarantool simply ignore this
> field such extension won't cause any problems.
> 
> This timestamp will be needed to account lags of downstream replicas
> suitable for information purpose and cluster health monitoring.
> 
> We update applier statistics in WAL callbacks but since both
> apply_synchro_row and apply_plain_tx are used not only in real data
> application but in final join stage as well (in this stage we're not
> yet writing the data) the apply_synchro_row is extended with a flag
> pointing that applier update is needed.
> 
> Same time the apply_plain_tx uses asynchronous WAL write completion
> and at moment when the write procedure is finished the applier might
> be removed from replicaset already thus we use applier's instance

Did you mean instance id?

> to lookup if it is still alive.
> 
> The calculation of the downstream lag itself lag will be addressed

One of the 'lag' words is redundant.

> in next patch because sending the timestamp and its observation
> are independent actions.

See 6 comments below.

> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 33181fdbf..626dc0324 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -751,11 +752,41 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
>  	return 0;
>  }
>  
> +/** Applier WAL related statistics */
> +struct awstat {

1. Please, let's avoid such hard contractions. At first I
thought you decided to do something related to AWS when I saw
this name.

	struct applier_lag

would be just fine.

> +	uint32_t instance_id;
> +	double first_row_tm;
> +};
> +
> +static void
> +awstat_update(struct awstat *awstat)
> +{
> +	/* Ignore if not needed */
> +	if (awstat->instance_id == 0)
> +		return;

2. Why did you even allocate this stat if it is not needed?
Maybe it would be better to have it NULL then and check
for NULL? AFAIU these are the initial and final join cases.

Did you try the way I proposed about waiting for all applier's
WAL writes to end in applier_stop()? Does it look worse? After
the fiber stop and wal_sync() it would be safe to assume there
are no WAL writes in fly from this applier. But I don't know if
it would look better.

> +
> +	/*
> +	 * Write to WAL happens in two contexts: as
> +	 * synchronous writes and as asynchronous. In
> +	 * second case the applier might be already
> +	 * stopped and removed.> +	 */
> +	struct replica *r = replica_by_id(awstat->instance_id);
> +	if (r == NULL && r->applier == NULL)

3. There is another idea - store the timestamp in struct replica.
Then it is -1 dereference. Although you would need to call
replica_by_id() before each ACK, but one ACK covers multiple
transactions and it would mean less lookups than now.
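
Something like this (just a sketch; the field name inside struct replica is
hypothetical):

	/* In the WAL write callback: */
	struct replica *r = replica_by_id(awstat->instance_id);
	if (r != NULL)
		r->applier_first_row_wal_time = awstat->first_row_tm;

	/* In applier_writer_f(), once per ACK: */
	struct replica *self = replica_by_id(applier->instance_id);
	xrow.tm = self != NULL ? self->applier_first_row_wal_time : 0;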

> +		return;
> +
> +	r->applier->first_row_wal_time = awstat->first_row_tm;

4. In case there was a batch of transactions written to WAL,
the latest one will override the timestamp of the previous ones and
this would make the lag incorrect, because you missed the older
transactions. Exactly like when you tried to take a timestamp of
the last row instead of the first row, but in a bigger scope.
Unless I missed something.

Probably you need to assign a new timestamp only when the old
one is not 0, and reset it to 0 on each sent ACK. Don't know.

> @@ -817,6 +852,12 @@ apply_synchro_row(struct xrow_header *row)
>  			     apply_synchro_row_cb, &entry);
>  	entry.req = &req;
>  	entry.owner = fiber();
> +	if (use_awstat) {

5. You don't really need this flag. Because during joins applier's
instance ID should be zero anyway. Therefore you would assign
stat.instance_id = 0 in this case regardless of the flag.

Also this means you don't need the entire applier struct. You only
need the instance_id as an argument. I am saying this because these
functions apply_* did not depend on a concrete applier object
probably exactly because it could be deleted. Not having an applier
pointer in their arguments could be intentional.

> +		entry.awstat.instance_id = applier->instance_id;
> +		entry.awstat.first_row_tm = row->tm;
> +	} else {
> +		entry.awstat.instance_id = 0;
> +	}
> diff --git a/src/box/applier.h b/src/box/applier.h
> index 15ca1fcfd..bd98827e7 100644
> --- a/src/box/applier.h
> +++ b/src/box/applier.h
> @@ -93,6 +93,11 @@ struct applier {
>  	ev_tstamp last_row_time;
>  	/** Number of seconds this replica is behind the remote master */
>  	ev_tstamp lag;
> +	/**
> +	 * WAL time of first applied row in a transaction.
> +	 * For relay statistics sake.

6. It is not a first applied row on the whole. Only for the latest
transaction written to WAL.

> +	 */
> +	double first_row_wal_time;
>  	/** The last box_error_code() logged to avoid log flooding */
>  	uint32_t last_logged_errcode;
>  	/** Remote instance ID. */
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-04-30 20:45   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-30 20:50     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-30 20:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml, Mons Anderson

On Fri, Apr 30, 2021 at 10:45:48PM +0200, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> I am not sure it is needed though. You can simply assign tm after
> encode(). Grep `xrow_encode_vclock()` to see examples where sync
> is assigned after encode, it looks fine.
> 
> Too many changes just for a single usage.

Sure thing, will update.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
@ 2021-04-30 20:50   ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-30 20:50 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

I appreciate the work you did here!

See 4 comments below.

On 30.04.2021 17:39, Cyrill Gorcunov via Tarantool-patches wrote:
> We already have `box.replication.upstream.lag` entry for monitoring
> sake. Same time in synchronous replication timeouts are key properties
> for quorum gathering procedure. Thus we would like to know how long
> it took of a transaction to traverse `initiator WAL -> network ->
> remote applier -> ACK` path.
> 
> In this patch we use new applier's functionality which returns us
> the timestamp of first written xrow in a transaction such that we
> can calculate the downstream lag.
> 
> Closes #5447

1. There must be a docbot request and a changelog file.

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index ff43c2fc7..6d880932a 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -158,6 +158,12 @@ struct relay {
>  	struct stailq pending_gc;
>  	/** Time when last row was sent to peer. */
>  	double last_row_time;
> +	/**
> +	 * A time difference between moment when we
> +	 * wrote the row in a local WAL and peer wrote
> +	 * it in own WAL.
> +	 */
> +	double peer_wal_lag;

2. It is not true really. It also includes 2 network hops.
I wouldn't call it 'wal'. It is simply lag. Or latency.

> @@ -614,15 +626,41 @@ relay_reader_f(va_list ap)
>  	coio_create(&io, relay->io.fd);
>  	ibuf_create(&ibuf, &cord()->slabc, 1024);
>  	try {
> -		while (!fiber_is_cancelled()) {
> -			struct xrow_header xrow;
> +		struct xrow_header xrow;
> +		double prev_tm;
> +
> +		/*
> +		 * Make a first read as a separate action because
> +		 * we need previous timestamp from the xrow to
> +		 * calculate delta from (to eliminate branching
> +		 * in next reads).
> +		 */
> +		if (!fiber_is_cancelled()) {
>  			coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
> -					replication_disconnect_timeout());
> +				replication_disconnect_timeout());

3. You didn't have to change this line.

> +			prev_tm = xrow.tm;
> +		}
> +
> +		do {
>  			/* vclock is followed while decoding, zeroing it. */
>  			vclock_create(&relay->recv_vclock);
>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> +			/*
> +			 * Old instances do not report the timestamp.
> +			 * Same time in case of idle cycles the xrow.tm
> +			 * is the same so update lag only when new data
> +			 * been acked.
> +			 */
> +			if (prev_tm != xrow.tm) {

4. It also could be reset to zero when there are no rows and the sent/received
vclocks match. Because that means there are no rows to ack, and therefore there
can't be any latency. Or you can use relay heartbeats to update it even when
there are no rows.

> +				double delta = ev_now(loop()) - xrow.tm;
> +				relay->peer_wal_lag = delta;
> +				prev_tm = xrow.tm;
> +			}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
  2021-04-30 20:45   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-03 20:21   ` Konstantin Osipov via Tarantool-patches
  2021-05-03 20:33     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 16+ messages in thread
From: Konstantin Osipov via Tarantool-patches @ 2021-05-03 20:21 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Mons Anderson, Vladislav Shpilevoy

* Cyrill Gorcunov via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/04/30 18:43]:
> We will enhance applier writer with timestamp of last written row.
> It is needed to detect downstream lag.

LSN is already 64 bit and can very well fit a timestamp within a
very reasonable range (e.g nanosecond resolution, timeuuid range).

I think you shouldn't add a new field, just change the way LSN is
generated.
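
Just to illustrate the range (a rough sketch, not an existing API): a signed
64-bit nanosecond counter covers ~292 years, so something like

	#include <stdint.h>
	#include <time.h>

	/* Hypothetical: derive a 64-bit LSN-like value from the wall clock. */
	static int64_t
	lsn_from_clock(void)
	{
		struct timespec ts;
		clock_gettime(CLOCK_REALTIME, &ts);
		return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
	}

would both order the rows and carry the timestamp.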


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-05-03 20:21   ` Konstantin Osipov via Tarantool-patches
@ 2021-05-03 20:33     ` Cyrill Gorcunov via Tarantool-patches
  2021-05-03 20:37       ` Konstantin Osipov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-03 20:33 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml, Mons Anderson, Vladislav Shpilevoy

On Mon, May 03, 2021 at 11:21:09PM +0300, Konstantin Osipov wrote:
> * Cyrill Gorcunov via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/04/30 18:43]:
> > We will enhance applier writer with timestamp of last written row.
> > It is needed to detect downstream lag.
> 
> LSN is already 64 bit and can very well fit a timestamp within a
> very reasonable range (e.g nanosecond resolution, timeuuid range).
> 
> I think you shouldn't add a new field, just change the way LSN is
> generated.

Kostya, I fear I don't get it. Currently we send back current
vclock and in the header we have the "tm" field which has been
unused before this patch. Now we start using it by sending
timestamp there. Which LSN exactly do you propose to change?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-05-03 20:33     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-05-03 20:37       ` Konstantin Osipov via Tarantool-patches
  2021-05-03 20:42         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Konstantin Osipov via Tarantool-patches @ 2021-05-03 20:37 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Mons Anderson, Vladislav Shpilevoy

* Cyrill Gorcunov <gorcunov@gmail.com> [21/05/03 23:35]:
> On Mon, May 03, 2021 at 11:21:09PM +0300, Konstantin Osipov wrote:
> > * Cyrill Gorcunov via Tarantool-patches <tarantool-patches@dev.tarantool.org> [21/04/30 18:43]:
> > > We will enhance applier writer with timestamp of last written row.
> > > It is needed to detect downstream lag.
> > 
> > LSN is already 64 bit and can very well fit a timestamp within a
> > very reasonable range (e.g nanosecond resolution, timeuuid range).
> > 
> > I think you shouldn't add a new field, just change the way LSN is
> > generated.
> 
> Kostya, I fear I don't get it. Currently we send back current
> vclock and in the header we have the "tm" field which has been
> unused before this patch. Now we start using it by sending
> timestamp there. Which LSN exactly do you propose to change?

OK, I didn't get it, I think it's ok to fill it in then.
The idea I had in mind is that all LSN become timestamps.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper
  2021-05-03 20:37       ` Konstantin Osipov via Tarantool-patches
@ 2021-05-03 20:42         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-03 20:42 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tml, Mons Anderson, Vladislav Shpilevoy

On Mon, May 03, 2021 at 11:37:55PM +0300, Konstantin Osipov wrote:
> > 
> > Kostya, I fear I don't get it. Currently we send back current
> > vclock and in the header we have the "tm" field which has been
> > unused before this patch. Now we start using it by sending
> > timestamp there. Which LSN exactly do you propose to change?
> 
> OK, I didn't get it, I think it's ok to fill it in then.
> The idea I had in mind is that all LSN become timestamps.

Aha, ok then.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f
  2021-04-30 20:49   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-05 13:06     ` Cyrill Gorcunov via Tarantool-patches
  2021-05-05 20:47       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-05 13:06 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: Mons Anderson, tml

On Fri, Apr 30, 2021 at 10:49:10PM +0200, Vladislav Shpilevoy wrote:
> > 
> > Same time the apply_plain_tx uses asynchronous WAL write completion
> > and at moment when the write procedure is finished the applier might
> > be removed from replicaset already thus we use applier's instance
> 
> Did you mean instance id?

Yes.

> > to lookup if it is still alive.
> > 
> > The calculation of the downstream lag itself lag will be addressed
> 
> One of the 'lag' words is redundant.

Thanks for catching!

> >  
> > +/** Applier WAL related statistics */
> > +struct awstat {
> 
> 1. Please, let's avoid such hard contractions. At first I
> thought you decided to do something related to AWS when I saw
> this name.
> 
> 	struct applier_lag
> 
> would be just fine.

Sure, will update.

> 
> > +	uint32_t instance_id;
> > +	double first_row_tm;
> > +};
> > +
> > +static void
> > +awstat_update(struct awstat *awstat)
> > +{
> > +	/* Ignore if not needed */
> > +	if (awstat->instance_id == 0)
> > +		return;
> 
> 2. Why did you even allocate this stat if it is not needed?
> Maybe it would be better to have it NULL then and check
> for NULL? AFAIU these are the initial and final join cases.

They are allocated on the stack together with synchro_entry,
so there is no penalty. After your comment I've changed it to

struct synchro_entry {
	/** Request to process when WAL write is done. */
	struct synchro_request *req;
	/** Fiber created the entry. To wakeup when WAL write is done. */
	struct fiber *owner;
	/** WAL bound statistics. */
	struct applier_lag *applier_lag;
	/**
	 * The base journal entry. It has unsized array and then must be the
	 * last entry in the structure. But can workaround it via a union
	 * adding the needed tail as char[].
	 */
	union {
		struct journal_entry base;
		char base_buf[sizeof(base) + sizeof(base.rows[0])];
	};
};

and then allocate applier_lag on the stack and assign a pointer if needed.

static int
apply_synchro_row(struct applier *applier, struct xrow_header *row,
		  bool use_awstat)
{
	...
	struct applier_lag applier_lag;
	struct synchro_entry entry;
	...
	if (use_awstat) {
		applier_lag.instance_id = applier->instance_id;
		applier_lag.first_row_tm = row->tm;

		entry.applier_lag = &applier_lag;
	} else {
		entry.applier_lag = NULL;
	}
}

Strictly speaking there is no difference.

> 
> Did you try the way I proposed about waiting for all applier's
> WAL writes to end in applier_stop()? Does it look worse? After
> the fiber stop and wal_sync() it would be safe to assume there
> are no WAL writes in fly from this applier. But I don't know if
> it would look better.

I thought about it a lot. And you know, I don't really like what we
are to implement

 - currently applier_stop() doesn't wait the journal to finish its
   write. The main applier reader is spinning in !fiber_is_cancelled()
   cycle in a polling way while applier tries to read new data from the
   remote relay peer. If peer doesn't reply for some reason then we throw
   an exception which is catched by a caller code, and the caller tries
   to iterate new cycle testing if fiber is cancelled.

   In case if reconfiguration happens (and timeouts are set to default)
   then we try to prune old appliers calling fiber_join on them which
   means this fiber_join won't exit until new fiber_is_cancelled iteration
   get processed. With default configs it means this could take up to
   replication_disconnect_timeout(), ie 1 second by default.

   In turn if we bound to journal write completion then this gonna be
   uncontrollable because journal may hang as long as it wish and we
   continue spinning in a waiting cycle

 - in applier_stop() we will have to implement some kind of reference
   counting, which would be modified on journal completion and i think
   this makes code even more complex, since we have to add some additional
   logic when applier is allowed to cancel.

> > +
> > +	/*
> > +	 * Write to WAL happens in two contexts: as
> > +	 * synchronous writes and as asynchronous. In
> > +	 * second case the applier might be already
> > +	 * stopped and removed.
> > +	 */
> > +	struct replica *r = replica_by_id(awstat->instance_id);
> > +	if (r == NULL && r->applier == NULL)
> 
> 3. There is another idea - store the timestamp in struct replica.
> Then it is -1 dereference. Although you would need to call
> replica_by_id() before each ACK, but one ACK covers multiple
> transactions and it would mean less lookups than now.

I like this idea, letme try to implement it.

> > +		return;
> > +
> > +	r->applier->first_row_wal_time = awstat->first_row_tm;
> 
> 4. In case there was a batch of transactions written to WAL,
> the latest one will override the timestamp of the previous ones and
> this would make the lag incorrect, because you missed the older
> transactions. Exactly like when you tried to take a timestamp of
> the last row instead of the first row, but in a bigger scope.
> Unless I missed something.

I'm not sure I follow you here. Say we have a batch of transactions.
The update happens on every journal_entry completion, if several
entries are flushed then completion is called in ordered way (one
followed by another). The update happens in same tx thread where
appliers are running which means acks sending procedure is ordered
relatively to updates call. Thus we may have situation where we
complete first entry then we either send it in ack message or
update to a new value and only then send ack. As far as I understand
there might be a gap in fibers scheduling before several journal entries
get completed. Thus the lag calculated on relay side will be bigger for
some moment but on next ack will shrink to latest entry for written
journal entry. Because relay uses current time minus time value
obtained from applier's ack. Hopefully I didn't miss something either.

> Probably you need to assign a new timestamp only when the old
> one is not 0, and reset it to 0 on each sent ACK. Don't know.

Gimme some time to investigate this vague moment.

> 
> > @@ -817,6 +852,12 @@ apply_synchro_row(struct xrow_header *row)
> >  			     apply_synchro_row_cb, &entry);
> >  	entry.req = &req;
> >  	entry.owner = fiber();
> > +	if (use_awstat) {
> 
> 5. You don't really need this flag. Because during joins applier's
> instance ID should be zero anyway. Therefore you would assign
> stat.instance_id = 0 in this case regardless of the flag.
> 
> Also this means you don't need the entire applier struct. You only
> need the instance_id as an argument. I am saying this because these
> functions apply_* did not depend on a concrete applier object
> probably exactly because it could be deleted. Not having an applier
> pointer in their arguments could be intentional.

Sounds reasonable, thanks!

> > @@ -93,6 +93,11 @@ struct applier {
> >  	ev_tstamp last_row_time;
> >  	/** Number of seconds this replica is behind the remote master */
> >  	ev_tstamp lag;
> > +	/**
> > +	 * WAL time of first applied row in a transaction.
> > +	 * For relay statistics sake.
> 
> 6. It is not a first applied row on the whole. Only for the latest
> transaction written to WAL.

+1

Thanks a huge for comments, Vlad!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f
  2021-05-05 13:06     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-05-05 20:47       ` Vladislav Shpilevoy via Tarantool-patches
  2021-05-05 22:19         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 16+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-05-05 20:47 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Mons Anderson, tml

Hi! Thanks for the response!

>>> +	uint32_t instance_id;
>>> +	double first_row_tm;
>>> +};
>>> +
>>> +static void
>>> +awstat_update(struct awstat *awstat)
>>> +{
>>> +	/* Ignore if not needed */
>>> +	if (awstat->instance_id == 0)
>>> +		return;
>>
>> 2. Why did you even allocate this stat if it is not needed?
>> Maybe it would be better to have it NULL then and check
>> for NULL? AFAIU these are the initial and final join cases.
> 
> They are allocated on the stack together with synchro_entry,
> so there is no penalty.

They are allocated on the region for plain transactions, which
are way more important to optimize.

>> Did you try the way I proposed about waiting for all applier's
>> WAL writes to end in applier_stop()? Does it look worse? After
>> the fiber stop and wal_sync() it would be safe to assume there
>> are no WAL writes in fly from this applier. But I don't know if
>> it would look better.
> 
> I thought about it a lot. And you know, I don't really like what we
> are to implement
> 
>  - currently applier_stop() doesn't wait the journal to finish its
>    write. The main applier reader is spinning in !fiber_is_cancelled()
>    cycle in a polling way while applier tries to read new data from the
>    remote relay peer. If peer doesn't reply for some reason then we throw
>    an exception which is catched by a caller code, and the caller tries
>    to iterate new cycle testing if fiber is cancelled.

I do not follow here. You needed to wait only for the pending journal
writes. It has nothing to do with reading new data from anywhere, does
it?

>  - in applier_stop() we will have to implement some kind of reference
>    counting, which would be modified on journal completion and i think
>    this makes code even more complex, since we have to add some additional
>    logic when applier is allowed to cancel.

We don't really. You only need to call wal_sync() after which you can be
sure all the WAL writes started before this call now are finished. At
least AFAR it works. But these refs might be needed for assertions, so
yes, maybe not a good idea.

>>> +		return;
>>> +
>>> +	r->applier->first_row_wal_time = awstat->first_row_tm;
>>
>> 4. In case there was a batch of transactions written to WAL,
>> the latest one will override the timestamp of the previous ones and
>> this would make the lag incorrect, because you missed the older
>> transactions. Exactly like when you tried to take a timestamp of
>> the last row instead of the first row, but in a bigger scope.
>> Unless I missed something.
> 
> I'm not sure I follow you here. Say we have a batch of transactions.
> The update happens on every journal_entry completion, if several
> entries are flushed then completion is called in ordered way (one
> followed by another).

Exactly. The latest update will override the previous timestamps.

> The update happens in same tx thread where
> appliers are running which means acks sending procedure is ordered
> relatively to updates call. Thus we may have situation where we
> > complete first entry then we either send it in ack message or
> update to a new value and only then send ack.

In one event loop iteration many transactions might be submitted to
WAL by the applier. And they will end also in one event loop iteration
later. There won't even be any yields between their completions.
Yes, first of them will wakeup the ack-sender fiber, but the next ones
will override the first timestamp. After a yield the ACK will contain
the newest timestamp ignoring the timestamps of the older transactions.

> As far as I understand
> there might be a gap in fibers scheduling before several journal entries
> get completed. Thus the lag calculated on relay side will be bigger for
> > some moment but on next ack will shrink to latest entry for written
> journal entry.

On the contrary, it will be smaller on the other side than it is.
Because you send the ACK with the latest timestamp instead of the
oldest one. Smaller would be misleading about the real latency.

Unless I don't understand something.

> Because relay uses current time minus time value
> obtained from applier's ack. Hopefully I didn't miss something either.
> 
>> Probably you need to assign a new timestamp only when the old
>> one is not 0, and reset it to 0 on each sent ACK. Don't know.
> 
> Gimme some time to investigate this vague moment.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f
  2021-05-05 20:47       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-05-05 22:19         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-05-05 22:19 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: Mons Anderson, tml

On Wed, May 05, 2021 at 10:47:23PM +0200, Vladislav Shpilevoy wrote:
> >>> +static void
> >>> +awstat_update(struct awstat *awstat)
> >>> +{
> >>> +	/* Ignore if not needed */
> >>> +	if (awstat->instance_id == 0)
> >>> +		return;
> >>
> >> 2. Why did you even allocate this stat if it is not needed?
> >> Maybe it would be better to have it NULL then and check
> >> for NULL? AFAIU these are the initial and final join cases.
> > 
> > They are allocated on the stack together with synchro_entry,
> > so there is no penalty.
> 
> They are allocated on the region for plain transactions, which
> are way more important to optimize.

I thought you meant the synchro entry. Anyway, I changed the code so
that this is no longer called for synchro entries in the "join" stages.
I'll send a new version for review (hopefully soon).

> 
> >> Did you try the way I proposed about waiting for all applier's
> >> WAL writes to end in applier_stop()? Does it look worse? After
> >> the fiber stop and wal_sync() it would be safe to assume there
> >> are no WAL writes in fly from this applier. But I don't know if
> >> it would look better.
> > 
> > I thought about it a lot. And you know, I don't really like what we
> > are to implement
> > 
> >  - currently applier_stop() doesn't wait the journal to finish its
> >    write. The main applier reader is spinning in !fiber_is_cancelled()
> >    cycle in a polling way while applier tries to read new data from the
> >    remote relay peer. If peer doesn't reply for some reason then we throw
> >    an exception which is catched by a caller code, and the caller tries
> >    to iterate new cycle testing if fiber is cancelled.
> 
> I do not follow here. You needed to wait only for the pending journal
> writes. It has nothing to do with reading new data from anywhere, does
> it?

The pending writes may take too much time. And in case we need to stop the
applier, such a "wait" procedure will bring a penalty: right now, in the worst
case, applier_stop() waits up to 1 second (with default configs) before the
applier reader is joined. And we don't know yet for which reasons (besides cfg
reconfiguration) we will need to stop appliers in the future. This is the main
reason why I dropped the idea of introducing a "wait until async writes
complete" approach.

> 
> >  - in applier_stop() we will have to implement some kind of reference
> >    counting, which would be modified on journal completion and i think
> >    this makes code even more complex, since we have to add some additional
> >    logic when applier is allowed to cancel.
> 
> We don't really. You only need to call wal_sync() after which you can be
> sure all the WAL writes started before this call now are finished. At
> least AFAR it works. But these refs might be needed for assertions, so
> yes, maybe not a good idea.

Yes, wal_sync could do the trick (I must confess I forgot about it). So if
we put aside the code with refs+assertion how would the final code flow
look? I think it might be something like

 applier_reader
  apply_plain_tx
    save tm in applier::some-member
 applier_writer
  read applier::some-member

 applier_stop
  wal_sync

but for debugging purposes we will definitely need the binding with refs,
which is not that suitable.

> >>> +		return;
> >>> +
> >>> +	r->applier->first_row_wal_time = awstat->first_row_tm;
> >>
> >> 4. In case there was a batch of transactions written to WAL,
> >> the latest one will override the timestamp of the previous ones and
> >> this would make the lag incorrect, because you missed the older
> >> transactions. Exactly like when you tried to take a timestamp of
> >> the last row instead of the first row, but in a bigger scope.
> >> Unless I missed something.
> > 
> > I'm not sure I follow you here. Say we have a batch of transactions.
> > The update happens on every journal_entry completion, if several
> > entries are flushed then completion is called in ordered way (one
> > followed by another).
> 
> Exactly. The latest update will override the previous timestamps.
> 
> > The update happens in same tx thread where
> > appliers are running which means acks sending procedure is ordered
> > relatively to updates call. Thus we may have situation where we
> > complete first entry then we either send it in ack message either
> > update to a new value and only then send ack.
> 
> In one event loop iteration many transactions might be submitted to
> WAL by the applier. And they will end also in one event loop iteration
> later. There won't even be any yields between their completions.
>
> Yes, first of them will wakeup the ack-sender fiber, but the next ones
> will override the first timestamp. After a yield the ACK will contain
> the newest timestamp ignoring the timestamps of the older transactions.

Seems I get what bothers you. Let me think about it.

> 
> > As far as I understand
> > there might be a gap in fibers scheduling before several journal entries
> > get completed. Thus the lag calculated on relay side will be bigger for
> > some moment but on next ack will shrink to latest entry for writtent
> > journal entry.
> 
> On the contrary, it will be smaller on the other side than it is.
> Because you send the ACK with the latest timestamp instead of the
> oldest one. Smaller would be misleading about the real latency.

Agreed, need to think. Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-05-05 22:19 UTC | newest]

Thread overview: 16+ messages
2021-04-30 15:39 [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Cyrill Gorcunov via Tarantool-patches
2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 1/3] xrow: allow to pass timestamp via xrow_encode_vclock_timed helper Cyrill Gorcunov via Tarantool-patches
2021-04-30 20:45   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-30 20:50     ` Cyrill Gorcunov via Tarantool-patches
2021-05-03 20:21   ` Konstantin Osipov via Tarantool-patches
2021-05-03 20:33     ` Cyrill Gorcunov via Tarantool-patches
2021-05-03 20:37       ` Konstantin Osipov via Tarantool-patches
2021-05-03 20:42         ` Cyrill Gorcunov via Tarantool-patches
2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 2/3] applier: send first row's WAL time in the applier_writer_f Cyrill Gorcunov via Tarantool-patches
2021-04-30 20:49   ` Vladislav Shpilevoy via Tarantool-patches
2021-05-05 13:06     ` Cyrill Gorcunov via Tarantool-patches
2021-05-05 20:47       ` Vladislav Shpilevoy via Tarantool-patches
2021-05-05 22:19         ` Cyrill Gorcunov via Tarantool-patches
2021-04-30 15:39 ` [Tarantool-patches] [RFC v3 3/3] relay: provide information about downstream lag Cyrill Gorcunov via Tarantool-patches
2021-04-30 20:50   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-30 20:45 ` [Tarantool-patches] [RFC v3 0/3] relay: provide downstream lag information Vladislav Shpilevoy via Tarantool-patches
