[Tarantool-patches] [PATCH v2 1/7] replication: fix a hang on final join retry

Serge Petrenko sergepetrenko at tarantool.org
Wed Mar 24 15:24:11 MSK 2021


Since the introduction of synchronous replication, final join can fail
on the master side when it cannot gather acks for a transaction around
_cluster registration.

In this case the replica receives an error: either ER_SYNC_ROLLBACK or
ER_SYNC_QUORUM_TIMEOUT. Either error makes the applier retry final join,
but with the wrong state, APPLIER_REGISTER, which should only be used on
an anonymous replica. This led to a hang in the fiber executing box.cfg,
because it waited for the APPLIER_JOINED state, which was never entered.

Part-of #5566
---
 src/box/applier.cc | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 5a88a013e..326cf18d2 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -551,7 +551,7 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
 }
 
 static void
-applier_register(struct applier *applier)
+applier_register(struct applier *applier, bool was_anon)
 {
 	/* Send REGISTER request */
 	struct ev_io *coio = &applier->io;
@@ -566,9 +566,16 @@ applier_register(struct applier *applier)
 	row.type = IPROTO_REGISTER;
 	coio_write_xrow(coio, &row);
 
-	applier_set_state(applier, APPLIER_REGISTER);
+	/*
+	 * Register may serve as a retry for final join. Set corresponding
+	 * states to unblock anyone who's waiting for final join to start or
+	 * end.
+	 */
+	applier_set_state(applier, was_anon ? APPLIER_REGISTER :
+					      APPLIER_FINAL_JOIN);
 	applier_wait_register(applier, 0);
-	applier_set_state(applier, APPLIER_REGISTERED);
+	applier_set_state(applier, was_anon ? APPLIER_REGISTERED :
+					      APPLIER_JOINED);
 	applier_set_state(applier, APPLIER_READY);
 }
 
@@ -1303,6 +1310,14 @@ applier_f(va_list ap)
 		return -1;
 	session_set_type(session, SESSION_TYPE_APPLIER);
 
+	/*
+	 * The instance saves replication_anon value on bootstrap.
+	 * If a freshly started instance sees it has received
+	 * REPLICASET_UUID but hasn't yet registered, it must be an
+	 * anonymous replica, hence the default value 'true'.
+	 */
+	bool was_anon = true;
+
 	/* Re-connect loop */
 	while (!fiber_is_cancelled()) {
 		try {
@@ -1316,6 +1331,7 @@ applier_f(va_list ap)
 				 * The join will pause the applier
 				 * until WAL is created.
 				 */
+				was_anon = replication_anon;
 				if (replication_anon)
 					applier_fetch_snapshot(applier);
 				else
@@ -1324,11 +1340,10 @@ applier_f(va_list ap)
 			if (instance_id == REPLICA_ID_NIL &&
 			    !replication_anon) {
 				/*
-				 * The instance transitioned
-				 * from anonymous. Register it
-				 * now.
+				 * The instance transitioned from anonymous or
+				 * is retrying final join.
 				 */
-				applier_register(applier);
+				applier_register(applier, was_anon);
 			}
 			applier_subscribe(applier);
 			/*
-- 
2.24.3 (Apple Git-128)
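
The state selection described in the commit message can be modelled
outside of the applier. The following is a minimal standalone sketch, not
the code from applier.cc: ApplierState, RegisterStates, register_states()
and track_was_anon() are hypothetical names introduced here for
illustration, and the real applier sets these states through
applier_set_state() inside applier_register().

```cpp
#include <cassert>

// Simplified stand-ins for the applier states named in the patch.
// The real enum in applier.h has more members; this is an assumption-level
// model for illustration only.
enum class ApplierState {
	REGISTER,
	REGISTERED,
	FINAL_JOIN,
	JOINED,
	READY,
};

// The pair of states entered around the wait for the REGISTER response.
struct RegisterStates {
	ApplierState before_wait;
	ApplierState after_wait;
};

// Hypothetical helper mirroring the two ternaries added by the patch:
// a formerly anonymous replica goes through REGISTER/REGISTERED, while a
// final join retry goes through FINAL_JOIN/JOINED so that anyone waiting
// for final join to start or end is unblocked.
inline RegisterStates
register_states(bool was_anon)
{
	if (was_anon)
		return {ApplierState::REGISTER, ApplierState::REGISTERED};
	return {ApplierState::FINAL_JOIN, ApplierState::JOINED};
}

// Hypothetical helper modelling how was_anon is tracked in applier_f():
// it defaults to true and is overwritten with the replication_anon value
// only if this process performed the join itself.
inline bool
track_was_anon(bool performed_join, bool replication_anon)
{
	bool was_anon = true;
	if (performed_join)
		was_anon = replication_anon;
	return was_anon;
}
```

For a non-anonymous instance whose final join failed,
track_was_anon(true, false) yields false, so register_states() selects
FINAL_JOIN and then JOINED, which is the state the fiber executing
box.cfg waits for.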


