Guys, take a look once time permits. The previous version is here:
https://lists.tarantool.org/tarantool-patches/20210430153940.121271-1-gorcunov@gmail.com/

v4 (by Vlad):
 - add a test case
 - add a docbot request
 - dropped xrow_encode_vclock_timed; we use an open-coded assignment
   for the tm value when sending an ACK
 - renamed struct awstat to applier_wal_stat. Vlad, I think this is a
   better name than "applier_lag" because this is statistics on WAL --
   we simply track remote WAL propagation here, so a more general name
   is better for grep's sake and for future extensions
 - instead of passing the applier structure we pass replica_id; the
   real keeper of this statistics moves into the "replica" structure
   and thus is unbound from the applier itself
 - for synchro entries we pass a pointer to the applier_wal_stat
   instead of using replica_id = 0 as a sign that we don't need to
   update statistics for the initial and final join cases
 - to write and read statistics we provide the wal_stat_update and
   wal_stat_ack helpers to cover the case where a single ACK spans
   several transactions

branch: gorcunov/gh-5447-relay-lag-4
issue: https://github.com/tarantool/tarantool/issues/5447

Cyrill Gorcunov (2):
  applier: send transaction's first row WAL time in the applier_writer_f
  relay: provide information about downstream lag

 .../unreleased/gh-5447-downstream-lag.md      |  6 ++
 src/box/applier.cc                            | 92 +++++++++++++++---
 src/box/applier.h                             | 14 +++
 src/box/lua/info.c                            |  3 +
 src/box/relay.cc                              | 18 ++++
 src/box/relay.h                               |  6 ++
 src/box/replication.cc                        |  3 +
 src/box/replication.h                         |  4 +
 .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
 .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
 10 files changed, 267 insertions(+), 13 deletions(-)
 create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
 create mode 100644 test/replication/gh-5447-downstream-lag.result
 create mode 100644 test/replication/gh-5447-downstream-lag.test.lua

base-commit: 4500547d7a177c0b49bff35c922286b4f21c8719
-- 
2.30.2
The applier fiber sends the current vclock of the node to the remote
relay reader, indicating the current state of fetched WAL data, so
the relay knows which new data should be sent. The packet the applier
sends carries the xrow_header::tm field as zero, but we can reuse it
to provide the timestamp of the first row of the last transaction we
wrote to our WAL. Since old Tarantool instances simply ignore this
field, such an extension won't cause any problems.

The timestamp is needed to compute the lag of downstream replicas,
which is useful for informational purposes and cluster health
monitoring.

We update the applier statistics in WAL callbacks, but since both
apply_synchro_row and apply_plain_tx are used not only for real data
application but in the final join stage as well (where we are not yet
writing the data), apply_synchro_row is extended with a replica_id
argument which is non-zero when the applier is subscribed.

The calculation of the downstream lag itself is addressed in the next
patch, because sending the timestamp and observing it are independent
actions.

The main structure managing timestamp handling is called
`applier_wal_stat` because it is essentially remote WAL statistics,
and we might need to extend it with more fields in the future. Also,
since we try to minimize branching, we keep it inside the replica
structure, and the read/write procedures are called
wal_stat_ack/wal_stat_update. Special handling is needed due to WAL
batch processing: several transactions might complete before the ACK
packet is sent back to the replica.
Part-of #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc     | 92 ++++++++++++++++++++++++++++++++++++------
 src/box/applier.h      | 14 +++++++
 src/box/replication.cc |  3 ++
 src/box/replication.h  |  4 ++
 4 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 33181fdbf..3038ce05f 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -148,6 +148,37 @@ applier_check_sync(struct applier *applier)
 	}
 }
 
+/**
+ * Update WAL statistics once a transaction is written.
+ */
+static void
+wal_stat_update(struct applier_wal_stat *wal_st)
+{
+	struct replica *r = replica_by_id(wal_st->replica_id);
+	if (likely(r != NULL)) {
+		assert(wal_st->replica_id == r->wal_st.replica_id);
+		if (r->wal_st.first_row_tm == 0)
+			r->wal_st.first_row_tm = wal_st->first_row_tm;
+	}
+}
+
+/**
+ * Encode the timestamp for ACK sending. We drop the value on purpose
+ * because it is a one-time action, and if transactions are written
+ * in a batch we need to account the oldest timestamp to show the
+ * real relay lag.
+ */
+static void
+wal_stat_ack(uint32_t replica_id, struct xrow_header *xrow)
+{
+	struct replica *r = replica_by_id(replica_id);
+	if (likely(r != NULL)) {
+		assert(replica_id == r->wal_st.replica_id);
+		xrow->tm = r->wal_st.first_row_tm;
+		r->wal_st.first_row_tm = 0;
+	}
+}
+
 /*
  * Fiber function to write vclock to replication master.
  * To track connection status, replica answers master
@@ -193,6 +224,7 @@ applier_writer_f(va_list ap)
 			applier->has_acks_to_send = false;
 			struct xrow_header xrow;
 			xrow_encode_vclock(&xrow, &replicaset.vclock);
+			wal_stat_ack(applier->instance_id, &xrow);
 			coio_write_xrow(&io, &xrow);
 			ERROR_INJECT(ERRINJ_APPLIER_SLOW_ACK, {
 				fiber_sleep(0.01);
@@ -490,7 +522,7 @@ static uint64_t
 applier_read_tx(struct applier *applier, struct stailq *rows,
 		double timeout);
 static int
-apply_final_join_tx(struct stailq *rows);
+apply_final_join_tx(uint32_t replica_id, struct stailq *rows);
 
 /**
  * A helper struct to link xrow objects in a list.
@@ -535,7 +567,7 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
 					  next)->row);
 			break;
 		}
-		if (apply_final_join_tx(&rows) != 0)
+		if (apply_final_join_tx(applier->instance_id, &rows) != 0)
 			diag_raise();
 	}
 
@@ -756,6 +788,11 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 {
 	(void) trigger;
 	(void) event;
+
+	struct applier_wal_stat *wal_st =
+		(struct applier_wal_stat *)trigger->data;
+	wal_stat_update(wal_st);
+
 	/* Broadcast the WAL write across all appliers. */
 	trigger_run(&replicaset.applier.on_wal_write, NULL);
 	return 0;
@@ -766,6 +803,8 @@ struct synchro_entry {
 	struct synchro_request *req;
 	/** Fiber created the entry. To wakeup when WAL write is done. */
 	struct fiber *owner;
+	/** WAL bound statistics. */
+	struct applier_wal_stat *wal_st;
 	/**
 	 * The base journal entry. It has unsized array and then must be the
 	 * last entry in the structure. But can workaround it via a union
@@ -789,6 +828,8 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	if (entry->res < 0) {
 		applier_rollback_by_wal_io();
 	} else {
+		if (synchro_entry->wal_st != NULL)
+			wal_stat_update(synchro_entry->wal_st);
 		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
@@ -797,7 +838,7 @@
 /** Process a synchro request.
  */
 static int
-apply_synchro_row(struct xrow_header *row)
+apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 {
 	assert(iproto_type_is_synchro_request(row->type));
 
@@ -805,6 +846,7 @@ apply_synchro_row(struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
+	struct applier_wal_stat wal_st;
 	struct synchro_entry entry;
 	/*
 	 * Rows array is cast from *[] to **, because otherwise g++ complains
@@ -817,6 +859,15 @@ apply_synchro_row(struct xrow_header *row)
 			     apply_synchro_row_cb, &entry);
 	entry.req = &req;
 	entry.owner = fiber();
+
+	if (replica_id != 0) {
+		wal_st.replica_id = replica_id;
+		wal_st.first_row_tm = row->tm;
+		entry.wal_st = &wal_st;
+	} else {
+		entry.wal_st = NULL;
+	}
+
 	/*
 	 * The WAL write is blocking. Otherwise it might happen that a CONFIRM
 	 * or ROLLBACK is sent to WAL, and it would empty the limbo, but before
@@ -862,8 +913,9 @@ applier_handle_raft(struct applier *applier, struct xrow_header *row)
 	return box_raft_process(&req, applier->instance_id);
 }
 
-static inline int
-apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
+static int
+apply_plain_tx(uint32_t replica_id, struct stailq *rows,
+	       bool skip_conflict, bool use_triggers)
 {
 	/*
 	 * Explicitly begin the transaction so that we can
@@ -931,10 +983,21 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 			goto fail;
 		}
 
+		struct applier_wal_stat *wal_st;
+		wal_st = region_alloc_object(&txn->region, typeof(*wal_st),
+					     &size);
+		if (wal_st == NULL) {
+			diag_set(OutOfMemory, size, "region_alloc_object",
+				 "wal_st");
+			goto fail;
+		}
+
 		trigger_create(on_rollback, applier_txn_rollback_cb, NULL, NULL);
 		txn_on_rollback(txn, on_rollback);
 
-		trigger_create(on_wal_write, applier_txn_wal_write_cb, NULL, NULL);
+		item = stailq_first_entry(rows, struct applier_tx_row, next);
+		wal_st->replica_id = replica_id;
+		wal_st->first_row_tm = item->row.tm;
+
+		trigger_create(on_wal_write, applier_txn_wal_write_cb, wal_st,
+			       NULL);
 		txn_on_wal_write(txn,
 				 on_wal_write);
 	}
 
@@ -946,7 +1009,7 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 
 /** A simpler version of applier_apply_tx() for final join stage. */
 static int
-apply_final_join_tx(struct stailq *rows)
+apply_final_join_tx(uint32_t replica_id, struct stailq *rows)
 {
 	struct xrow_header *first_row =
 		&stailq_first_entry(rows, struct applier_tx_row, next)->row;
@@ -957,9 +1020,9 @@ apply_final_join_tx(struct stailq *rows)
 	vclock_follow_xrow(&replicaset.vclock, last_row);
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		assert(first_row == last_row);
-		rc = apply_synchro_row(first_row);
+		rc = apply_synchro_row(replica_id, first_row);
 	} else {
-		rc = apply_plain_tx(rows, false, false);
+		rc = apply_plain_tx(replica_id, rows, false, false);
 	}
 	fiber_gc();
 	return rc;
@@ -1088,11 +1151,14 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		 * each other.
 		 */
 		assert(first_row == last_row);
-		if ((rc = apply_synchro_row(first_row)) != 0)
+		rc = apply_synchro_row(applier->instance_id, first_row);
+		if (rc != 0)
+			goto finish;
+	} else {
+		rc = apply_plain_tx(applier->instance_id, rows,
+				    replication_skip_conflict, true);
+		if (rc != 0)
 			goto finish;
-	} else if ((rc = apply_plain_tx(rows, replication_skip_conflict,
-					true)) != 0) {
-		goto finish;
 	}
 	vclock_follow(&replicaset.applier.vclock, last_row->replica_id,
 		      last_row->lsn);
diff --git a/src/box/applier.h b/src/box/applier.h
index 15ca1fcfd..c0352894e 100644
--- a/src/box/applier.h
+++ b/src/box/applier.h
@@ -133,6 +133,20 @@ struct applier {
 	struct vclock remote_vclock_at_subscribe;
 };
 
+/**
+ * WAL related statistics.
+ */
+struct applier_wal_stat {
+	/** Replica ID which initiated the transaction. */
+	uint32_t replica_id;
+	/**
+	 * Timestamp of the first row in a transaction
+	 * batch which is not yet ACK'ed. For relay
+	 * statistics.
+	 */
+	double first_row_tm;
+};
+
 /**
  * Start a client to a remote master using a background fiber.
  *
diff --git a/src/box/replication.cc b/src/box/replication.cc
index aefb812b3..157944f05 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -184,6 +184,7 @@ replica_new(void)
 	trigger_create(&replica->on_applier_state,
 		       replica_on_applier_state_f, NULL, NULL);
 	replica->applier_sync_state = APPLIER_DISCONNECTED;
+	memset(&replica->wal_st, 0, sizeof(replica->wal_st));
 	latch_create(&replica->order_latch);
 	return replica;
 }
@@ -234,6 +235,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 	assert(replica_id < VCLOCK_MAX);
 	assert(replica->id == REPLICA_ID_NIL); /* replica id is read-only */
 	replica->id = replica_id;
+	replica->wal_st.replica_id = replica_id;
 
 	if (tt_uuid_is_equal(&INSTANCE_UUID, &replica->uuid)) {
 		/* Assign local replica id */
@@ -281,6 +283,7 @@ replica_clear_id(struct replica *replica)
 		instance_id = REPLICA_ID_NIL;
 	}
 	replica->id = REPLICA_ID_NIL;
+	replica->wal_st.replica_id = REPLICA_ID_NIL;
 
 	say_info("removed replica %s", tt_uuid_str(&replica->uuid));
 	/*
diff --git a/src/box/replication.h b/src/box/replication.h
index 2ad1cbf66..a40582cd3 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -331,6 +331,10 @@ struct replica {
 	 * separate from applier.
 	 */
 	enum applier_state applier_sync_state;
+	/**
+	 * Applier's WAL related statistics.
+	 */
+	struct applier_wal_stat wal_st;
 	/* The latch is used to order replication requests. */
 	struct latch order_latch;
 };
-- 
2.30.2
We already have the `box.replication.upstream.lag` entry for
monitoring purposes. At the same time, in synchronous replication
timeouts are key properties of the quorum gathering procedure. Thus
we would like to know how long it took for a transaction to traverse
the `initiator WAL -> network -> remote applier -> ACK` path.

Typical output is

 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.61753897101153
 |   vclock: {1: 147}
 |   lag: 0
 | ...
 | tarantool> box.space.sync:insert{69}
 | ---
 | - [69]
 | ...
 |
 | tarantool> box.info.replication[2].downstream
 | ---
 | - status: follow
 |   idle: 0.75324084801832
 |   vclock: {1: 151}
 |   lag: 0.0011014938354492
 | ...

Closes #5447

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add `box.info.replication[n].downstream.lag` entry

`replication[n].downstream.lag` is the time difference between the
moment the last transaction was written to the WAL journal of the
transaction initiator and the moment the transaction was written to
the WAL on replica `n`. In other words, it is the lag in seconds
between the main node writing data to its own WAL and replica `n`
getting this data replicated to its own WAL journal.

This value is not constant: if there is no data to replicate (i.e.
when the nodes sit in an idle cycle, or a new heartbeat message is
received) it is reset back to zero.
---
 .../unreleased/gh-5447-downstream-lag.md      |  6 ++
 src/box/lua/info.c                            |  3 +
 src/box/relay.cc                              | 18 ++++
 src/box/relay.h                               |  6 ++
 .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
 .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
 6 files changed, 167 insertions(+)
 create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
 create mode 100644 test/replication/gh-5447-downstream-lag.result
 create mode 100644 test/replication/gh-5447-downstream-lag.test.lua

diff --git a/changelogs/unreleased/gh-5447-downstream-lag.md b/changelogs/unreleased/gh-5447-downstream-lag.md
new file mode 100644
index 000000000..3b11c50ee
--- /dev/null
+++ b/changelogs/unreleased/gh-5447-downstream-lag.md
@@ -0,0 +1,6 @@
+#feature/replication
+
+ * Introduce `box.info.replication[n].downstream.lag` to monitor state
+   of replication which is especially important for synchronous spaces
+   where malfunctioning replicas may prevent quorum from gathering votes
+   to commit a transaction.
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index 0eb48b823..5b7c25f28 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -143,6 +143,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
 		lua_pushnumber(L, ev_monotonic_now(loop()) -
 			       relay_last_row_time(relay));
 		lua_settable(L, -3);
+		lua_pushstring(L, "lag");
+		lua_pushnumber(L, relay_lag(relay));
+		lua_settable(L, -3);
 		break;
 	case RELAY_STOPPED:
 	{
diff --git a/src/box/relay.cc b/src/box/relay.cc
index ff43c2fc7..1501e53c7 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -158,6 +158,12 @@ struct relay {
 	struct stailq pending_gc;
 	/** Time when last row was sent to peer. */
 	double last_row_time;
+	/**
+	 * A time difference between the moment when we
+	 * wrote a row to the local WAL and when we received
+	 * an ACK that the peer has it replicated.
+	 */
+	double lag;
 	/** Relay sync state.
	 */
 	enum relay_state state;
 
@@ -217,6 +223,12 @@ relay_last_row_time(const struct relay *relay)
 	return relay->last_row_time;
 }
 
+double
+relay_lag(const struct relay *relay)
+{
+	return relay->lag;
+}
+
 static void
 relay_send(struct relay *relay, struct xrow_header *packet);
 static void
@@ -621,6 +633,12 @@ relay_reader_f(va_list ap)
 			/* vclock is followed while decoding, zeroing it. */
 			vclock_create(&relay->recv_vclock);
 			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
+			if (xrow.tm != 0) {
+				double delta = ev_now(loop()) - xrow.tm;
+				relay->lag = delta;
+			} else {
+				relay->lag = 0;
+			}
 			fiber_cond_signal(&relay->reader_cond);
 		}
 	} catch (Exception *e) {
diff --git a/src/box/relay.h b/src/box/relay.h
index b32e2ea2a..0f7d8d1bc 100644
--- a/src/box/relay.h
+++ b/src/box/relay.h
@@ -93,6 +93,12 @@ relay_vclock(const struct relay *relay);
 double
 relay_last_row_time(const struct relay *relay);
 
+/**
+ * Returns the relay's lag.
+ */
+double
+relay_lag(const struct relay *relay);
+
 /**
  * Send a Raft update request to the relay channel. It is not
  * guaranteed that it will be delivered. The connection may break.
diff --git a/test/replication/gh-5447-downstream-lag.result b/test/replication/gh-5447-downstream-lag.result
new file mode 100644
index 000000000..8586d0ed3
--- /dev/null
+++ b/test/replication/gh-5447-downstream-lag.result
@@ -0,0 +1,93 @@
+-- test-run result file version 2
+--
+-- gh-5447: Test for box.info.replication[n].downstream.lag.
+-- We need to be sure that if a replica falls behind, the
+-- master node reports its lag and a cluster admin is able
+-- to detect such a situation.
+--
+
+fiber = require('fiber')
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, \
+              script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+--
+-- The replica should wait some time (WAL delay is 1 second
+-- by default) so that we are able to detect the lag, since
+-- on local instances the lag is minimal and usually
+-- transactions are handled instantly.
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.space.test:insert({1})
+ | ---
+ | - [1]
+ | ...
+test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+ | ---
+ | - ok
+ | ...
+--
+-- Cleanup everything.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5447-downstream-lag.test.lua b/test/replication/gh-5447-downstream-lag.test.lua
new file mode 100644
index 000000000..650e8e215
--- /dev/null
+++ b/test/replication/gh-5447-downstream-lag.test.lua
@@ -0,0 +1,41 @@
+--
+-- gh-5447: Test for box.info.replication[n].downstream.lag.
+-- We need to be sure that if a replica falls behind, the
+-- master node reports its lag and a cluster admin is able
+-- to detect such a situation.
+--
+
+fiber = require('fiber')
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+box.schema.user.grant('guest', 'replication')
+
+test_run:cmd('create server replica with rpl_master=default, \
+              script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+--
+-- The replica should wait some time (WAL delay is 1 second
+-- by default) so that we are able to detect the lag, since
+-- on local instances the lag is minimal and usually
+-- transactions are handled instantly.
+test_run:switch('replica')
+box.error.injection.set("ERRINJ_WAL_DELAY", true)
+
+test_run:switch('default')
+box.space.test:insert({1})
+test_run:wait_cond(function() return box.info.replication[2].downstream.lag ~= 0 end, 10)
+
+test_run:switch('replica')
+box.error.injection.set("ERRINJ_WAL_DELAY", false)
+--
+-- Cleanup everything.
+test_run:switch('default')
+
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
-- 
2.30.2
Thanks for working on this! See 3 comments below.

> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 33181fdbf..3038ce05f 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -789,6 +828,8 @@ apply_synchro_row_cb(struct journal_entry *entry)
>  	if (entry->res < 0) {
>  		applier_rollback_by_wal_io();
>  	} else {
> +		if (synchro_entry->wal_st != NULL)
> +			wal_stat_update(synchro_entry->wal_st);

1. You have 3 branches here: first in apply_synchro_row() where you
   check replica_id != 0, then here, and then in wal_stat_update()
   which checks whether the replica exists. You could make it 2 if
   you set wal_st in apply_synchro_row() unconditionally. There is no
   replica with replica_id 0 anyway, so wal_stat_update() won't do
   anything in that case.

>  		txn_limbo_process(&txn_limbo, synchro_entry->req);
>  		trigger_run(&replicaset.applier.on_wal_write, NULL);
>  	}
> @@ -1088,11 +1151,14 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>  		 * each other.
>  		 */
>  		assert(first_row == last_row);
> -		if ((rc = apply_synchro_row(first_row)) != 0)
> +		rc = apply_synchro_row(applier->instance_id, first_row);
> +		if (rc != 0)
> +			goto finish;
> +	} else {
> +		rc = apply_plain_tx(applier->instance_id, rows,
> +				    replication_skip_conflict, true);
> +		if (rc != 0)
>  			goto finish;

2. These 2 error checks are identical. You can move them out of the
   if-else.

> -	} else if ((rc = apply_plain_tx(rows, replication_skip_conflict,
> -					true)) != 0) {
> -		goto finish;
>  	}
>  	vclock_follow(&replicaset.applier.vclock, last_row->replica_id,
>  		      last_row->lsn);
> diff --git a/src/box/replication.h b/src/box/replication.h
> index 2ad1cbf66..a40582cd3 100644
> --- a/src/box/replication.h
> +++ b/src/box/replication.h
> @@ -331,6 +331,10 @@ struct replica {
>  	 * separate from applier.
>  	 */
>  	enum applier_state applier_sync_state;
> +	/**
> +	 * Applier's WAL related statistics.
> +	 */
> +	struct applier_wal_stat wal_st;

3.
The stat object also stores replica_id, which is useless to store in
struct replica -- you know its ID anyway, there is already
replica->id. I would suggest storing only the timestamp and moving
the stat declaration to the .cc file.
I appreciate the work you did here! See 6 comments below.

On 06.05.2021 23:41, Cyrill Gorcunov wrote:
> We already have `box.replication.upstream.lag` entry for monitoring
> sake. Same time in synchronous replication timeouts are key properties
> for quorum gathering procedure. Thus we would like to know how long
> it took of a transaction to traverse `initiator WAL -> network ->
> remote applier -> ACK` path.
>
> Typical ouput is

1. ouput -> output.

>  .../unreleased/gh-5447-downstream-lag.md      |  6 ++
>  src/box/lua/info.c                            |  3 +
>  src/box/relay.cc                              | 18 ++++
>  src/box/relay.h                               |  6 ++
>  .../replication/gh-5447-downstream-lag.result | 93 +++++++++++++++++++
>  .../gh-5447-downstream-lag.test.lua           | 41 ++++++++
>  6 files changed, 167 insertions(+)
>  create mode 100644 changelogs/unreleased/gh-5447-downstream-lag.md
>  create mode 100644 test/replication/gh-5447-downstream-lag.result
>  create mode 100644 test/replication/gh-5447-downstream-lag.test.lua
>
> diff --git a/changelogs/unreleased/gh-5447-downstream-lag.md b/changelogs/unreleased/gh-5447-downstream-lag.md
> new file mode 100644
> index 000000000..3b11c50ee
> --- /dev/null
> +++ b/changelogs/unreleased/gh-5447-downstream-lag.md
> @@ -0,0 +1,6 @@
> +#feature/replication
> +
> + * Introduce `box.info.replication[n].downstream.lag` to monitor state

2. It was recently decided we must use Past Simple in changelogs.

> +   of replication which is especially important for synchronous spaces
> +   where malfunctioning replicas may prevent quorum from gathering votes
> +   to commit a transaction.

3. Please, add a reference to the ticket in the form `(gh-####)`.
> diff --git a/src/box/lua/info.c b/src/box/lua/info.c
> index 0eb48b823..5b7c25f28 100644
> --- a/src/box/lua/info.c
> +++ b/src/box/lua/info.c
> @@ -143,6 +143,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
>  		lua_pushnumber(L, ev_monotonic_now(loop()) -
>  			       relay_last_row_time(relay));
>  		lua_settable(L, -3);
> +		lua_pushstring(L, "lag");
> +		lua_pushnumber(L, relay_lag(relay));
> +		lua_settable(L, -3);
>  		break;
>  	case RELAY_STOPPED:
>  	{
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index ff43c2fc7..1501e53c7 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -158,6 +158,12 @@ struct relay {
>  	struct stailq pending_gc;
>  	/** Time when last row was sent to peer. */
>  	double last_row_time;
> +	/**
> +	 * A time difference between moment when we
> +	 * wrote a row in the local WAL and received
> +	 * an ACK that peer has it replicated.
> +	 */
> +	double lag;

4. The lag is updated in the relay thread, therefore you can't simply
   read it in the TX thread like you do in the diff block above. Try
   to deliver it to TX the same way as the acked vclock is delivered.

>  	/** Relay sync state. */
>  	enum relay_state state;
>
> @@ -621,6 +633,12 @@ relay_reader_f(va_list ap)
>  			/* vclock is followed while decoding, zeroing it. */
>  			vclock_create(&relay->recv_vclock);
>  			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
> +			if (xrow.tm != 0) {
> +				double delta = ev_now(loop()) - xrow.tm;

5. What if the acked row didn't come from the remote peer but from a
   third node? Node1 -> Node2 -> Node3. Does Node2, before sending
   the row to Node3, update its timestamp? Otherwise you are going to
   send an ACK to Node2 with a timestamp of Node1, which won't make
   much sense. I don't know. I see xlog_write_entry() updates the
   timestamp, but please, double-check it. In the debugger or via
   printfs.

> +				relay->lag = delta;
> +			} else {
> +				relay->lag = 0;

6. Hm.
This means that if I sent a synchronous transaction and its quorum
collection takes longer than the replication timeout, I would see the
lag as 0. Moreover, it will constantly blink between 0 and seconds if
I have many synchronous transactions working slowly. I am not 100%
sure this looks right. In particular, it won't be usable for any
monitoring -- the results won't be readable.

I would rather expect the lag to grow until I receive an ACK, which
bumps it to an actual value, and to drop to 0 only when there are no
rows left to wait an ACK for. Or maybe never drop it to 0, like the
applier does? Dropping to 0 when there are no rows means it will
blink between 0 and seconds again if the synchronous transactions are
long but infrequent, so that there is often no row to ACK.

Did you try to ask Vlad G., who is probably going to actually use it?
I tried to list the options we have in
https://github.com/tarantool/tarantool/issues/5447#issuecomment-840047693
and asked Vlad what he thinks. He also promised to ask more people
who use Tarantool actively.