From: Serge Petrenko
Date: Thu, 27 Feb 2020 17:13:31 +0300
In-Reply-To: <1bb673a1-8acb-d13a-edc4-4ad13a93a13a@tarantool.org>
References: <1bb673a1-8acb-d13a-edc4-4ad13a93a13a@tarantool.org>
Subject: Re: [Tarantool-patches] [PATCH v4 4/4] replication: do not relay rows coming from a remote instance back to it
List-Id: Tarantool development patches
To: Vladislav Shpilevoy
Cc: kirichenkoga@gmail.com, tarantool-patches@dev.tarantool.org

> On 27 Feb 2020, at 02:54, Vladislav Shpilevoy <v.shpilevoy@tarantool.org> wrote:
>
> Thanks for the patch!
>

Hi! Thanks for the review!

Please find my comments and the new diff below.

> See 4 comments below.
>
>>     replication: do not relay rows coming from a remote instance back to it
>>
>>     We have a mechanism for restoring rows originating from an instance that
>>     suffered a sudden power loss: remote masters resend the instance's rows
>>     received before a certain point in time, defined by the remote master's
>>     vclock at the moment of subscribe.
>>     However, this is only useful during initial replication configuration,
>>     when an instance has just recovered, so that it can receive what it has
>>     relayed but hasn't yet synced to disk.
>>     In other cases, when an instance is operating normally and master-master
>>     replication is configured, the mechanism described above may lead to the
>>     instance re-applying its own rows, coming from a master it has just
>>     subscribed to.
>>     To fix the problem, do not relay rows coming from a remote instance if
>>     the instance has already recovered.
>>
>>     Closes #4739
>>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 911353425..73ffc0d68 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -866,8 +866,13 @@ applier_subscribe(struct applier *applier)
>>  	struct vclock vclock;
>>  	vclock_create(&vclock);
>>  	vclock_copy(&vclock, &replicaset.vclock);
>> +	/*
>> +	 * Stop accepting local rows coming from a remote
>> +	 * instance as soon as local WAL starts accepting writes.
>> +	 */
>> +	unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;
>
> 1. I was always wondering, what if the instance got orphaned after it
> started accepting writes? WAL is fully functional, it syncs whatever is
> needed, and then a resubscribe happens. Can this break anything?
>
>>  	xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,
>> -				 &vclock, replication_anon, 0);
>> +				 &vclock, replication_anon, id_filter);
>>  	coio_write_xrow(coio, &row);
>>
>>  	/* Read SUBSCRIBE response */
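A note on what the new SUBSCRIBE argument carries: id_filter is a bitmask
keyed by replica id, so a non-orphan instance asks the master not to send
back rows the instance authored itself. Below is a minimal sketch of how a
sender could consult such a mask; the names are made up for illustration
and are not the actual relay code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical row header, for illustration only. */
struct row_header {
	uint32_t replica_id;	/* id of the instance that authored the row */
};

/*
 * Return true if the row must not be relayed: bit N set in id_filter
 * means "do not send rows authored by replica N back to the subscriber".
 */
static bool
row_is_filtered_out(const struct row_header *row, uint32_t id_filter)
{
	return (id_filter & (1U << row->replica_id)) != 0;
}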
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 27bff662a..35ba7b072 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -278,8 +278,13 @@ tx_schedule_commit(struct cmsg *msg)
>>  		/* Closes the input valve. */
>>  		stailq_concat(&writer->rollback, &batch->rollback);
>>  	}
>> +
>> +	ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });
>>  	/* Update the tx vclock to the latest written by wal. */
>>  	vclock_copy(&replicaset.vclock, &batch->vclock);
>> +#ifndef NDEBUG
>> +skip_update:
>> +#endif
>
> 2. Consider this hack which I just invented. In that way you won't
> depend on ERRINJ and NDEBUG interconnection.
>
> ====================
> @@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)
>  	ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });
>  	/* Update the tx vclock to the latest written by wal. */
>  	vclock_copy(&replicaset.vclock, &batch->vclock);
> -#ifndef NDEBUG
> -skip_update:
> -#endif
> +	ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});
>  	tx_schedule_queue(&batch->commit);
>  	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
>  }
> ====================

Good one, applied.
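A side note on why the trick works: both the goto and its target label now
live inside ERROR_INJECT() invocations, so in release builds, where error
injection is compiled out, they disappear together and no unused-label
warning is left behind; the #ifndef NDEBUG guard becomes unnecessary. A
simplified sketch of the mechanism, assuming a conventional errinj-style
macro -- this is not the literal definition from errinj.h:

/*
 * Simplified sketch only. error_injection_is_set() is a made-up helper
 * standing in for the real error-injection lookup.
 */
#ifdef NDEBUG
#define ERROR_INJECT(id, code)	/* expands to nothing in release builds */
#else
#define ERROR_INJECT(id, code) do {	\
	if (error_injection_is_set(id))	\
		code			\
} while (0)
#endif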
>
> Talking of the injection itself - don't know really. Perhaps
> it would be better to add a delay to the wal_write_to_disk()
> function, to its very end, after wal_notify_watchers(). In
> that case relay will wake up, send whatever it wants, and TX
> won't update the vclock until you let wal_write_to_disk()
> finish. Seems more natural this way.

I tried to add a sleep first. It's impossible to sleep in tx_schedule_commit(),
since it's processed in the tx_prio endpoint, where yielding is impossible.
I also tried to add a sleep at the end of wal_write_to_disk(), just like you
suggest. This didn't work out either. I'll give you more details in the evening,
when I give it another try. I'll send a follow-up if I succeed with adding a sleep.
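For reference only, a rough sketch of what the suggested delay at the tail
of wal_write_to_disk() could look like. It is untested and, per the above,
did not work out in practice; the injection name and the polling helper
below are made up for illustration.

	/*
	 * Hypothetical sketch, not a working patch. Placed at the very
	 * end of wal_write_to_disk(), after wal_notify_watchers(): the
	 * relay is already woken up and may send the freshly written
	 * rows, while TX does not see the updated vclock until the test
	 * clears the injection and the loop exits.
	 */
	ERROR_INJECT(ERRINJ_WAL_VCLOCK_DELAY, {
		/* wal_delay_errinj_is_set() is a made-up helper. */
		while (wal_delay_errinj_is_set())
			usleep(1000);
	});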
>
>>  	tx_schedule_queue(&batch->commit);
>>  	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
>>  }
>> diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result
>> new file mode 100644
>> index 000000000..7dc2f7118
>> --- /dev/null
>> +++ b/test/replication/gh-4739-vclock-assert.result
>> @@ -0,0 +1,82 @@
>> +-- test-run result file version 2
>> +env = require('test_run')
>> + | ---
>> + | ...
>> +test_run = env.new()
>> + | ---
>> + | ...
>> +
>> +SERVERS = {'rebootstrap1', 'rebootstrap2'}
>> + | ---
>> + | ...
>> +test_run:create_cluster(SERVERS, "replication")
>> + | ---
>> + | ...
>> +test_run:wait_fullmesh(SERVERS)
>> + | ---
>> + | ...
>> +
>> +test_run:cmd('switch rebootstrap1')
>> + | ---
>> + | - true
>> + | ...
>> +fiber = require('fiber')
>> + | ---
>> + | ...
>> +-- Stop updating replicaset vclock to simulate a situation, when
>> +-- a row is already relayed to the remote master, but the local
>> +-- vclock update hasn't happened yet.
>> +box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', true)
>> + | ---
>> + | - ok
>> + | ...
>> +lsn = box.info.lsn
>> + | ---
>> + | ...
>> +box.space._schema:replace{'something'}
>> + | ---
>> + | - ['something']
>> + | ...
>> +-- Vclock isn't updated.
>> +box.info.lsn == lsn
>> + | ---
>> + | - true
>> + | ...
>> +
>> +-- Wait until the remote instance gets the row.
>> +while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\
>> +    fiber.sleep(0.01)\
>> +end
>
> 3. There is a cool thing which I discovered relatively recently:
> test_run:wait_cond(). It does the fiber sleep and while loop for you, and
> has a finite timeout, so such a test won't hang for 10 minutes
> in Travis in case of a problem.

Thanks!

>
>> + | ---
>> + | ...
>> +
>> +-- Restart the remote instance. This will make the first instance
>> +-- resubscribe without entering orphan mode.
>> +test_run:cmd('restart server rebootstrap2')
>> + | ---
>> + | - true
>> + | ...
>> +test_run:cmd('switch rebootstrap1')
>> + | ---
>> + | - true
>> + | ...
>> +-- Wait until resubscribe is sent
>> +fiber.sleep(2 * box.cfg.replication_timeout)
>
> 4. Don't we collect any statistics on replication requests, just
> like we do in box.stat()? Perhaps box.stat.net() can help? To
> wait properly. Maybe just do test_run:wait_cond() for status 'sync'?

wait_cond for 'sync' is enough. Applied.

>
>> + | ---
>> + | ...
>> +box.info.replication[2].upstream.status
>> + | ---
>> + | - sync
>> + | ...
>> +
>> +box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)
>> + | ---
>> + | - ok
>> + | ...
>> +test_run:cmd('switch default')
>> + | ---
>> + | - true
>> + | ...
>> +test_run:drop_cluster(SERVERS)
>> + | ---
>> + | ...

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 73ffc0d68..78f3d8a73 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -870,7 +870,7 @@ applier_subscribe(struct applier *applier)
 	 * Stop accepting local rows coming from a remote
 	 * instance as soon as local WAL starts accepting writes.
 	 */
-	unsigned int id_filter = box_is_orphan() ? 0 : 1 << instance_id;
+	uint32_t id_filter = box_is_orphan() ? 0 : 1 << instance_id;
 	xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,
 				 &vclock, replication_anon, id_filter);
 	coio_write_xrow(coio, &row);
diff --git a/src/box/wal.c b/src/box/wal.c
index 35ba7b072..bf127b259 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -282,9 +282,7 @@ tx_schedule_commit(struct cmsg *msg)
 	ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, { goto skip_update; });
 	/* Update the tx vclock to the latest written by wal. */
 	vclock_copy(&replicaset.vclock, &batch->vclock);
-#ifndef NDEBUG
-skip_update:
-#endif
+	ERROR_INJECT(ERRINJ_REPLICASET_VCLOCK_UPDATE, {skip_update:;});
 	tx_schedule_queue(&batch->commit);
 	mempool_free(&writer->msg_pool, container_of(msg, struct wal_msg, base));
 }
diff --git a/test/replication/gh-4739-vclock-assert.result b/test/replication/gh-4739-vclock-assert.result
index 7dc2f7118..a612826a0 100644
--- a/test/replication/gh-4739-vclock-assert.result
+++ b/test/replication/gh-4739-vclock-assert.result
@@ -44,10 +44,11 @@ box.info.lsn == lsn
  | ...
 
 -- Wait until the remote instance gets the row.
-while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\
-    fiber.sleep(0.01)\
-end
+test_run:wait_cond(function()\
+    return test_run:get_vclock('rebootstrap2')[box.info.id] > lsn\
+end, 10)
  | ---
+ | - true
  | ...
 
 -- Restart the remote instance. This will make the first instance
@@ -61,14 +62,12 @@ test_run:cmd('switch rebootstrap1')
  | - true
  | ...
 -- Wait until resubscribe is sent
-fiber.sleep(2 * box.cfg.replication_timeout)
- | ---
- | ...
-box.info.replication[2].upstream.status
+test_run:wait_cond(function()\
+    return box.info.replication[2].upstream.status == 'sync'\
+end, 10)
  | ---
- | - sync
+ | - true
  | ...
-
 box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)
  | ---
  | - ok
diff --git a/test/replication/gh-4739-vclock-assert.test.lua b/test/replication/gh-4739-vclock-assert.test.lua
index 26dc781e2..b6a7caf3b 100644
--- a/test/replication/gh-4739-vclock-assert.test.lua
+++ b/test/replication/gh-4739-vclock-assert.test.lua
@@ -17,18 +17,18 @@ box.space._schema:replace{'something'}
 box.info.lsn == lsn
 
 -- Wait until the remote instance gets the row.
-while test_run:get_vclock('rebootstrap2')[box.info.id] == lsn do\
-    fiber.sleep(0.01)\
-end
+test_run:wait_cond(function()\
+    return test_run:get_vclock('rebootstrap2')[box.info.id] > lsn\
+end, 10)
 
 -- Restart the remote instance. This will make the first instance
 -- resubscribe without entering orphan mode.
 test_run:cmd('restart server rebootstrap2')
 test_run:cmd('switch rebootstrap1')
 -- Wait until resubscribe is sent
-fiber.sleep(2 * box.cfg.replication_timeout)
-box.info.replication[2].upstream.status
-
+test_run:wait_cond(function()\
+    return box.info.replication[2].upstream.status == 'sync'\
+end, 10)
 box.error.injection.set('ERRINJ_REPLICASET_VCLOCK_UPDATE', false)
 test_run:cmd('switch default')
 test_run:drop_cluster(SERVERS)

--
Serge Petrenko
sergepetrenko@tarantool.org