[Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations

Serge Petrenko sergepetrenko at tarantool.org
Mon Aug 16 18:15:22 MSK 2021


upstream.lag is the delta between the moment a row was written to the
master's journal and the moment the replica received it. It is an
important metric for checking whether the replica has fallen too far
behind the master.

Not all rows coming from the master have a valid creation time. For
example, Raft system messages don't have one, and we can't assign a
correct time to them: these messages do not originate from the journal,
and assigning the current time to them would cause jumps in the
upstream.lag value.

Stop updating upstream.lag for rows that have no creation time
assigned.

The upstream.lag calculation changes were meant to fix the flaky
replication/errinj.test:

 Test failed! Result content mismatch:
 --- replication/errinj.result	Fri Aug 13 15:15:35 2021
 +++ /tmp/tnt/rejects/replication/errinj.reject	Fri Aug 13 15:40:39 2021
 @@ -310,7 +310,7 @@
  ...
  box.info.replication[1].upstream.lag < 1
  ---
 -- true
 +- false
  ...

But the changes were not enough: the test may now see the initial lag
value (TIMEOUT_INFINITY) if it checks before the first row is received.
So fix the test as well by waiting until upstream.lag drops below 1.
---
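A minimal standalone sketch of the guard described above, for readers
who want to see the behaviour in isolation. The names row_sketch,
applier_sketch and update_lag are hypothetical stand-ins and are not
part of the Tarantool sources; the current time is passed in explicitly
instead of being read from the event loop.

```cpp
// Hypothetical stand-in for a received row: tm is the creation time
// stamped by the master, or 0 when the row carries none.
struct row_sketch {
	double tm;
};

// Hypothetical stand-in for the applier state holding the lag.
struct applier_sketch {
	double lag;
};

// Update the lag only for rows that carry a creation time.
// Rows without one leave the previous lag value untouched.
static void
update_lag(applier_sketch *applier, const row_sketch *row, double now)
{
	if (row->tm > 0)
		applier->lag = now - row->tm;
}
```

With this guard, a row whose tm is 0 keeps whatever lag was stored
before, so the lag stays at its initial value until the first
timestamped row arrives. That is why the test has to wait.
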
 src/box/applier.cc               | 3 ++-
 test/replication/errinj.result   | 5 ++++-
 test/replication/errinj.test.lua | 5 ++++-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 902d0bc72..9256078e1 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -664,7 +664,8 @@ applier_read_tx_row(struct applier *applier, double timeout)
 
 	coio_read_xrow_timeout_xc(coio, ibuf, row, timeout);
 
-	applier->lag = ev_now(loop()) - row->tm;
+	if (row->tm > 0)
+		applier->lag = ev_now(loop()) - row->tm;
 	applier->last_row_time = ev_monotonic_now(loop());
 	return tx_row;
 }
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 9d13f6aa7..ec251182f 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -308,7 +308,10 @@ box.info.replication[1].upstream.lag > 0
 ---
 - true
 ...
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+    return box.info.replication[1].upstream.lag < 1\
+end)
 ---
 - true
 ...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 19234ab35..7f6535ec1 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -130,7 +130,10 @@ test_run:cmd("switch replica")
 while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
 box.info.replication[1].upstream.status
 box.info.replication[1].upstream.lag > 0
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+    return box.info.replication[1].upstream.lag < 1\
+end)
 -- wait for ack timeout
 test_run:wait_upstream(1, {status='disconnected', message_re='unexpected EOF'})
 
-- 
2.30.1 (Apple Git-130)