[Tarantool-patches] [PATCH] applier: fix upstream.lag calculations
Serge Petrenko
sergepetrenko at tarantool.org
Sat Aug 14 11:03:23 MSK 2021
14.08.2021 09:42, Vitaliia Ioffe wrote:
> Hi Sergey,
> I’m sorry to say it, but this fix is not a fix. I have to
> point out that there were failed tests:
>
> [037] replication/errinj.test.lua memtx [ fail ]
>
> [037] replication/errinj.test.lua vinyl [ fail ]
> You can find it here:
> https://github.com/tarantool/tarantool/runs/3322606890
> --
> Vitaliia Ioffe
Don't be sorry, I didn't check the patch thoroughly enough.
Applied the following diff and reworded the patch a bit.
Everything should be fine now.
===================================
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 9d13f6aa7..ec251182f 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -308,7 +308,10 @@ box.info.replication[1].upstream.lag > 0
---
- true
...
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+ return box.info.replication[1].upstream.lag < 1\
+end)
---
- true
...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 19234ab35..7f6535ec1 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -130,7 +130,10 @@ test_run:cmd("switch replica")
while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
box.info.replication[1].upstream.status
box.info.replication[1].upstream.lag > 0
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+ return box.info.replication[1].upstream.lag < 1\
+end)
-- wait for ack timeout
test_run:wait_upstream(1, {status='disconnected', message_re='unexpected EOF'})
===================================
>
> Friday, 13 August 2021, 17:25 +03:00 from Serge Petrenko
> <sergepetrenko at tarantool.org>:
> upstream.lag is the delta between the moment when a row was written to
> master's journal and the moment when it was received by the replica.
> It's an important metric to check whether the replica has fallen too far
> behind master.
>
> Not all the rows coming from master have a valid time of creation. For
> example, RAFT system messages don't have one, and we can't assign
> correct time to them: these messages do not originate from the journal,
> and assigning current time to them would lead to jumps in upstream.lag
> results.
>
> Stop updating upstream.lag for rows which don't have creation time
> assigned.
>
> This also fixes the flaky replication/errinj.test.lua
> ---
> https://github.com/tarantool/tarantool/tree/sp/applier-lag-fix
>
> src/box/applier.cc | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 902d0bc72..9256078e1 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -664,7 +664,8 @@ applier_read_tx_row(struct applier *applier, double timeout)
>
> coio_read_xrow_timeout_xc(coio, ibuf, row, timeout);
>
> - applier->lag = ev_now(loop()) - row->tm;
> + if (row->tm > 0)
> + applier->lag = ev_now(loop()) - row->tm;
> applier->last_row_time = ev_monotonic_now(loop());
> return tx_row;
> }
> --
> 2.30.1 (Apple Git-130)
>
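For illustration only, the effect of the guard in the quoted patch can be
sketched in isolation. The struct and function names below
(row_sketch, applier_sketch, applier_sketch_update_lag) are hypothetical
stand-ins, not the real definitions from applier.cc, and the current time is
passed in as a plain argument instead of being read from ev_now(loop()).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a row header: only the field used here. */
struct row_sketch {
	double tm; /* creation time set by master; 0 when unset */
};

/* Hypothetical stand-in for the applier: only the field used here. */
struct applier_sketch {
	double lag; /* last computed lag, in seconds */
};

/*
 * Update the lag only for rows that carry a creation time.
 * Rows with tm == 0 (e.g. Raft system messages, per the commit
 * message) leave the previous value untouched.
 */
void
applier_sketch_update_lag(struct applier_sketch *applier,
			  const struct row_sketch *row, double now)
{
	assert(applier != NULL && row != NULL);
	if (row->tm > 0)
		applier->lag = now - row->tm;
}
```

With this sketch, a row stamped at 1000.0 and received at 1000.25 sets the lag
to 0.25, and a following row with tm == 0 keeps it at 0.25. Without the guard
the second row would set the lag to the full current timestamp.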
--
Serge Petrenko