Tarantool development patches archive
* [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests
@ 2021-08-16 15:15 Serge Petrenko via Tarantool-patches
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations Serge Petrenko via Tarantool-patches
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-08-16 15:15 UTC
  To: v.ioffe, kyukhin; +Cc: tarantool-patches

This patchset fixes flaky replication/errinj and replication/election_basic
tests.

Branch: https://github.com/tarantool/tarantool/tree/sp/election-basic-flaky-fix

Serge Petrenko (2):
  applier: fix upstream.lag calculations
  replication: fix flaky election_basic test

 src/box/applier.cc                       | 3 ++-
 test/replication/election_basic.result   | 6 ++++++
 test/replication/election_basic.test.lua | 3 +++
 test/replication/errinj.result           | 5 ++++-
 test/replication/errinj.test.lua         | 5 ++++-
 5 files changed, 19 insertions(+), 3 deletions(-)

-- 
2.30.1 (Apple Git-130)



* [Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations
  2021-08-16 15:15 [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests Serge Petrenko via Tarantool-patches
@ 2021-08-16 15:15 ` Serge Petrenko via Tarantool-patches
  2021-08-16 16:25   ` Vitaliia Ioffe via Tarantool-patches
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 2/2] replication: fix flaky election_basic test Serge Petrenko via Tarantool-patches
  2021-08-17  7:21 ` [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests Kirill Yukhin via Tarantool-patches
  2 siblings, 1 reply; 5+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-08-16 15:15 UTC
  To: v.ioffe, kyukhin; +Cc: tarantool-patches

upstream.lag is the delta between the moment when a row was written to
master's journal and the moment when it was received by the replica.
It's an important metric to check whether the replica has fallen too far
behind master.

Not all the rows coming from master have a valid time of creation. For
example, RAFT system messages don't have one, and we can't assign
correct time to them: these messages do not originate from the journal,
and assigning current time to them would lead to jumps in upstream.lag
results.

Stop updating upstream.lag for rows which don't have creation time
assigned.
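
As an illustration of the rule above, here is a minimal standalone sketch.
The struct and function names (row_sketch, applier_sketch,
applier_update_lag) are hypothetical stand-ins, not Tarantool's real types,
and the sketch assumes that a row without a creation time carries tm == 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real structs live in Tarantool's sources. */
struct row_sketch {
	double tm;	/* creation time; assumed 0 when the row has none */
};

struct applier_sketch {
	double lag;	/* last computed upstream lag, in seconds */
};

/* Update the lag only for rows that carry a creation time. */
static void
applier_update_lag(struct applier_sketch *applier,
		   const struct row_sketch *row, double now)
{
	assert(applier != NULL && row != NULL);
	if (row->tm > 0)
		applier->lag = now - row->tm;
}
```

With a row stamped 100.0 and received at 100.25 the lag becomes 0.25. A
following row with tm == 0 received at 200.0 leaves the lag at 0.25, where
an unguarded subtraction would have produced 200.0.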

The upstream.lag calculation changes were meant to fix the flaky
replication/errinj.test:

 Test failed! Result content mismatch:
 --- replication/errinj.result	Fri Aug 13 15:15:35 2021
 +++ /tmp/tnt/rejects/replication/errinj.reject	Fri Aug 13 15:40:39 2021
 @@ -310,7 +310,7 @@
  ...
  box.info.replication[1].upstream.lag < 1
  ---
 -- true
 +- false
  ...

But these changes alone were not enough: since rows without a creation
time no longer update the lag, the test may now see the initial lag value
(TIMEOUT_INFINITY) until the first row is received.
So fix the test as well by waiting until upstream.lag becomes < 1.
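
The waiting pattern can be sketched as a bounded polling loop. This is
only a rough model of the idea, not the test-run implementation of
wait_cond: the names wait_cond_sketch, lag_source and lag_below_one are
hypothetical, and a real helper would sleep or yield between attempts
instead of spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*cond_fn)(void *arg);

/* Poll cond up to max_attempts times; return true once it holds. */
static bool
wait_cond_sketch(cond_fn cond, void *arg, int max_attempts)
{
	assert(cond != NULL);
	for (int i = 0; i < max_attempts; i++) {
		if (cond(arg))
			return true;
		/* A real helper would sleep or yield here. */
	}
	return false;
}

/* Hypothetical lag source: huge until the first row "arrives". */
struct lag_source {
	double lag;
	int reads;
};

static bool
lag_below_one(void *arg)
{
	struct lag_source *src = arg;
	/* Simulate the first timestamped row arriving on the third read. */
	if (++src->reads >= 3)
		src->lag = 0.01;
	return src->lag < 1.0;
}
```

A single check right after the replica reaches 'follow' corresponds to
max_attempts == 1 and fails while the lag is still at its initial value;
polling gives the first row time to arrive.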
---
 src/box/applier.cc               | 3 ++-
 test/replication/errinj.result   | 5 ++++-
 test/replication/errinj.test.lua | 5 ++++-
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 902d0bc72..9256078e1 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -664,7 +664,8 @@ applier_read_tx_row(struct applier *applier, double timeout)
 
 	coio_read_xrow_timeout_xc(coio, ibuf, row, timeout);
 
-	applier->lag = ev_now(loop()) - row->tm;
+	if (row->tm > 0)
+		applier->lag = ev_now(loop()) - row->tm;
 	applier->last_row_time = ev_monotonic_now(loop());
 	return tx_row;
 }
diff --git a/test/replication/errinj.result b/test/replication/errinj.result
index 9d13f6aa7..ec251182f 100644
--- a/test/replication/errinj.result
+++ b/test/replication/errinj.result
@@ -308,7 +308,10 @@ box.info.replication[1].upstream.lag > 0
 ---
 - true
 ...
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+    return box.info.replication[1].upstream.lag < 1\
+end)
 ---
 - true
 ...
diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
index 19234ab35..7f6535ec1 100644
--- a/test/replication/errinj.test.lua
+++ b/test/replication/errinj.test.lua
@@ -130,7 +130,10 @@ test_run:cmd("switch replica")
 while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
 box.info.replication[1].upstream.status
 box.info.replication[1].upstream.lag > 0
-box.info.replication[1].upstream.lag < 1
+-- Upstream lag is huge until the first row is received.
+test_run:wait_cond(function()\
+    return box.info.replication[1].upstream.lag < 1\
+end)
 -- wait for ack timeout
 test_run:wait_upstream(1, {status='disconnected', message_re='unexpected EOF'})
 
-- 
2.30.1 (Apple Git-130)



* [Tarantool-patches] [PATCH v2 2/2] replication: fix flaky election_basic test
  2021-08-16 15:15 [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests Serge Petrenko via Tarantool-patches
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations Serge Petrenko via Tarantool-patches
@ 2021-08-16 15:15 ` Serge Petrenko via Tarantool-patches
  2021-08-17  7:21 ` [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests Kirill Yukhin via Tarantool-patches
  2 siblings, 0 replies; 5+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-08-16 15:15 UTC
  To: v.ioffe, kyukhin; +Cc: tarantool-patches

Found the following error in our CI:

 Test failed! Result content mismatch:
 --- replication/election_basic.result	Fri Aug 13 13:50:26 2021
 +++ /build/usr/src/debug/tarantool-2.9.0.276/test/var/rejects/replication/election_basic.reject	Sat Aug 14 08:14:17 2021
 @@ -116,6 +116,7 @@
   | ...
  box.ctl.demote()
   | ---
 + | - error: box.ctl.demote does not support simultaneous invocations
   | ...
  --

Even though box.ctl.demote() or box.ctl.promote() isn't called above the
failing line, promote() is issued internally once the instance becomes
the leader.

Wait until the previous promote has finished
(i.e. box.info.synchro.queue.owner is set to the instance's own id).
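
The race can be pictured with a toy state model. This is an assumption-laden
sketch, not Tarantool's implementation: instance_sketch and its functions
are hypothetical, and the model only captures that an internal promote
starts when the instance becomes leader and that demote is refused while
that promote is still running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one instance; all names here are hypothetical. */
struct instance_sketch {
	int id;
	bool promote_in_progress;
	int queue_owner;	/* 0 means the synchro queue has no owner */
};

/* Becoming the leader starts an internal promote. */
static void
become_leader(struct instance_sketch *inst)
{
	assert(inst != NULL);
	inst->promote_in_progress = true;
}

/* The promote finishes once the instance owns the synchro queue. */
static void
finish_promote(struct instance_sketch *inst)
{
	assert(inst != NULL);
	inst->promote_in_progress = false;
	inst->queue_owner = inst->id;
}

/* Return 0 on success, -1 if a promote is still running. */
static int
demote(struct instance_sketch *inst)
{
	assert(inst != NULL);
	if (inst->promote_in_progress)
		return -1;
	inst->queue_owner = 0;
	return 0;
}
```

In this model, calling demote() right after become_leader() fails, while
calling it after queue_owner equals the instance id succeeds, which is the
condition the test now waits for.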
---
 test/replication/election_basic.result   | 6 ++++++
 test/replication/election_basic.test.lua | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index 5da57e87d..382aeef60 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -95,6 +95,12 @@ test_run:wait_cond(function() return box.info.election.state == 'leader' end)
  | ---
  | - true
  | ...
+test_run:wait_cond(function()\
+    return box.info.synchro.queue.owner == box.info.id\
+end)
+ | ---
+ | - true
+ | ...
 assert(box.info.election.term > term)
  | ---
  | - true
diff --git a/test/replication/election_basic.test.lua b/test/replication/election_basic.test.lua
index 3b3a3e7e5..47f3d318e 100644
--- a/test/replication/election_basic.test.lua
+++ b/test/replication/election_basic.test.lua
@@ -35,6 +35,9 @@ assert(box.info.election.leader == 0)
 box.cfg{election_timeout = 1000}
 box.cfg{election_mode = 'candidate'}
 test_run:wait_cond(function() return box.info.election.state == 'leader' end)
+test_run:wait_cond(function()\
+    return box.info.synchro.queue.owner == box.info.id\
+end)
 assert(box.info.election.term > term)
 assert(box.info.election.vote == box.info.id)
 assert(box.info.election.leader == box.info.id)
-- 
2.30.1 (Apple Git-130)



* Re: [Tarantool-patches]  [PATCH v2 1/2] applier: fix upstream.lag calculations
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations Serge Petrenko via Tarantool-patches
@ 2021-08-16 16:25   ` Vitaliia Ioffe via Tarantool-patches
  0 siblings, 0 replies; 5+ messages in thread
From: Vitaliia Ioffe via Tarantool-patches @ 2021-08-16 16:25 UTC
  To: Serge Petrenko; +Cc: tarantool-patches



QA LGTM
 
 
--
Vitaliia Ioffe
 
  
>Monday, 16 August 2021, 18:15 +03:00 from Serge Petrenko <sergepetrenko@tarantool.org>:
> 
>upstream.lag is the delta between the moment when a row was written to
>master's journal and the moment when it was received by the replica.
>It's an important metric to check whether the replica has fallen too far
>behind master.
>
>Not all the rows coming from master have a valid time of creation. For
>example, RAFT system messages don't have one, and we can't assign
>correct time to them: these messages do not originate from the journal,
>and assigning current time to them would lead to jumps in upstream.lag
>results.
>
>Stop updating upstream.lag for rows which don't have creation time
>assigned.
>
>The upstream.lag calculation changes were meant to fix the flaky
>replication/errinj.test:
>
> Test failed! Result content mismatch:
> --- replication/errinj.result Fri Aug 13 15:15:35 2021
> +++ /tmp/tnt/rejects/replication/errinj.reject Fri Aug 13 15:40:39 2021
> @@ -310,7 +310,7 @@
>  ...
>  box.info.replication[1].upstream.lag < 1
>  ---
> -- true
> +- false
>  ...
>
>But the changes were not enough, because now the test
>may see the initial lag value (TIMEOUT_INFINITY).
>So fix the test as well by waiting until upstream.lag becomes < 1.
>---
> src/box/applier.cc | 3 ++-
> test/replication/errinj.result | 5 ++++-
> test/replication/errinj.test.lua | 5 ++++-
> 3 files changed, 10 insertions(+), 3 deletions(-)
>
>diff --git a/src/box/applier.cc b/src/box/applier.cc
>index 902d0bc72..9256078e1 100644
>--- a/src/box/applier.cc
>+++ b/src/box/applier.cc
>@@ -664,7 +664,8 @@ applier_read_tx_row(struct applier *applier, double timeout)
> 
> 	coio_read_xrow_timeout_xc(coio, ibuf, row, timeout);
> 
>-	applier->lag = ev_now(loop()) - row->tm;
>+	if (row->tm > 0)
>+		applier->lag = ev_now(loop()) - row->tm;
> 	applier->last_row_time = ev_monotonic_now(loop());
> 	return tx_row;
> }
>diff --git a/test/replication/errinj.result b/test/replication/errinj.result
>index 9d13f6aa7..ec251182f 100644
>--- a/test/replication/errinj.result
>+++ b/test/replication/errinj.result
>@@ -308,7 +308,10 @@ box.info.replication[1].upstream.lag > 0
> ---
> - true
> ...
>-box.info.replication[1].upstream.lag < 1
>+-- Upstream lag is huge until the first row is received.
>+test_run:wait_cond(function()\
>+    return box.info.replication[1].upstream.lag < 1\
>+end)
> ---
> - true
> ...
>diff --git a/test/replication/errinj.test.lua b/test/replication/errinj.test.lua
>index 19234ab35..7f6535ec1 100644
>--- a/test/replication/errinj.test.lua
>+++ b/test/replication/errinj.test.lua
>@@ -130,7 +130,10 @@ test_run:cmd("switch replica")
> while box.info.replication[1].upstream.status ~= 'follow' do fiber.sleep(0.0001) end
> box.info.replication[1].upstream.status
> box.info.replication[1].upstream.lag > 0
>-box.info.replication[1].upstream.lag < 1
>+-- Upstream lag is huge until the first row is received.
>+test_run:wait_cond(function()\
>+    return box.info.replication[1].upstream.lag < 1\
>+end)
> -- wait for ack timeout
> test_run:wait_upstream(1, {status='disconnected', message_re='unexpected EOF'})
> 
>--
>2.30.1 (Apple Git-130)
 



* Re: [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests
  2021-08-16 15:15 [Tarantool-patches] [PATCH v2 0/2] fix a couple of flaky tests Serge Petrenko via Tarantool-patches
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 1/2] applier: fix upstream.lag calculations Serge Petrenko via Tarantool-patches
  2021-08-16 15:15 ` [Tarantool-patches] [PATCH v2 2/2] replication: fix flaky election_basic test Serge Petrenko via Tarantool-patches
@ 2021-08-17  7:21 ` Kirill Yukhin via Tarantool-patches
  2 siblings, 0 replies; 5+ messages in thread
From: Kirill Yukhin via Tarantool-patches @ 2021-08-17  7:21 UTC
  To: Serge Petrenko; +Cc: tarantool-patches

Hello,

On 16 Aug 18:15, Serge Petrenko wrote:
> This patchset fixes flaky replication/errinj and replication/election_basic
> tests.
> 
> Branch: https://github.com/tarantool/tarantool/tree/sp/election-basic-flaky-fix
> 
> Serge Petrenko (2):
>   applier: fix upstream.lag calculations
>   replication: fix flaky election_basic test

I've checked your patchset into 2.7, 2.8 and master.

--
Regards, Kirill Yukhin

