To: tml
Date: Wed, 17 Mar 2021 21:57:43 +0300
Message-Id: <20210317185743.964278-1-gorcunov@gmail.com>
Subject: [Tarantool-patches] [RFC] gc/xlog: delay xlog cleanup until relays are subscribed
From: Cyrill Gorcunov via Tarantool-patches
Reply-To: Cyrill Gorcunov
Cc: Mons Anderson, Vladislav Shpilevoy

If a replica manages to fall far behind the master node (so that a
number of xlog files are present after the master's last snapshot),
then once the master node is restarted it may clean up the xlogs
needed by the replica to subscribe in a fast way, and the replica
will instead have to rejoin, reading a large amount of data back.

Let's try to address this by delaying xlog file cleanup until the
replicas are subscribed and the relays are up and running. For this
sake the cleanup fiber starts spinning in a nop cycle ("paused" mode)
and we use a delay counter, which the relays decrement once they
subscribe.
This implies that if the `_cluster` system space is not empty upon
restart and a registered replica has somehow vanished completely and
won't ever come back, then the node administrator has to drop this
replica from `_cluster` manually.

Note that this delayed cleanup start doesn't prevent the WAL engine
from removing old files if there is no space left on the storage
device. The WAL will simply drop the old data without question.

Closes #5806

Signed-off-by: Cyrill Gorcunov
---
Guys, take a look please. This is an RFC on purpose: I would like to
gather more comments. Should we provide an option to force-start a
node with cleanup enabled, etc.?

 src/box/box.cc         |  1 +
 src/box/gc.c           | 61 +++++++++++++++++++++++++++++++++++++++---
 src/box/gc.h           | 26 ++++++++++++++++++
 src/box/lua/info.c     |  8 ++++++
 src/box/relay.cc       | 23 +++++++++++++---
 src/box/replication.cc |  2 ++
 6 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index b4a1a5e07..a39f56cd9 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -3045,6 +3045,7 @@ box_cfg_xc(void)
 		bootstrap(&instance_uuid, &replicaset_uuid,
 			  &is_bootstrap_leader);
 	}
+	gc_delay_unref();
 	fiber_gc();
 
 	bootstrap_journal_guard.is_active = false;
diff --git a/src/box/gc.c b/src/box/gc.c
index 9af4ef958..1b1093dd7 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -46,6 +46,7 @@
 #include
 #include
 
+#include "cfg.h"
 #include "diag.h"
 #include "errcode.h"
 #include "fiber.h"
@@ -107,6 +108,18 @@ gc_init(void)
 	/* Don't delete any files until recovery is complete. */
 	gc.min_checkpoint_count = INT_MAX;
 
+	gc.delay_ref = 0;
+	if (cfg_geti("replication_anon") != 0) {
+		/*
+		 * Anonymous replicas are read only,
+		 * so no need to keep XLOGs.
+		 */
+		gc.cleanup_is_paused = false;
+	} else {
+		say_info("gc: wal/engine cleanup is paused");
+		gc.cleanup_is_paused = true;
+	}
+
 	vclock_create(&gc.vclock);
 	rlist_create(&gc.checkpoints);
 	gc_tree_new(&gc.consumers);
@@ -238,6 +251,23 @@ static int
 gc_cleanup_fiber_f(va_list ap)
 {
 	(void)ap;
+
+	/*
+	 * Stage 1 (optional): in case we're booting
+	 * up with cleanup paused, wait in a separate
+	 * cycle to minimize branching on stage 2.
+	 */
+	if (gc.cleanup_is_paused) {
+		while (!fiber_is_cancelled()) {
+			fiber_sleep(TIMEOUT_INFINITY);
+			if (!gc.cleanup_is_paused)
+				break;
+		}
+	}
+
+	/*
+	 * Stage 2: a regular cleanup cycle.
+	 */
 	while (!fiber_is_cancelled()) {
 		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
 		if (delta == 0) {
@@ -253,6 +283,29 @@ gc_cleanup_fiber_f(va_list ap)
 	return 0;
 }
 
+void
+gc_delay_ref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref >= 0);
+		gc.delay_ref++;
+	}
+}
+
+void
+gc_delay_unref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref > 0);
+		gc.delay_ref--;
+		if (gc.delay_ref == 0) {
+			say_info("gc: wal/engine cleanup is resumed");
+			gc.cleanup_is_paused = false;
+			fiber_wakeup(gc.cleanup_fiber);
+		}
+	}
+}
+
 /**
  * Trigger asynchronous garbage collection.
  */
@@ -278,9 +331,11 @@ gc_schedule_cleanup(void)
 static void
 gc_wait_cleanup(void)
 {
-	int64_t scheduled = gc.cleanup_scheduled;
-	while (gc.cleanup_completed < scheduled)
-		fiber_cond_wait(&gc.cleanup_cond);
+	if (!gc.cleanup_is_paused) {
+		int64_t scheduled = gc.cleanup_scheduled;
+		while (gc.cleanup_completed < scheduled)
+			fiber_cond_wait(&gc.cleanup_cond);
+	}
 }
 
 void
diff --git a/src/box/gc.h b/src/box/gc.h
index 2a568c5f9..780da6540 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -147,6 +147,20 @@ struct gc_state {
 	 * taken at that moment of time.
 	 */
 	int64_t cleanup_completed, cleanup_scheduled;
+	/**
+	 * A counter to wait until all replicas have managed
+	 * to subscribe, so that we can enable the cleanup
+	 * fiber to remove old XLOGs.
+	 * Otherwise some replicas might be
+	 * far behind the master node, and after the master
+	 * node has been restarted they would have to reread
+	 * all the data back due to XlogGapError, i.e. too
+	 * early deleted XLOGs.
+	 */
+	int64_t delay_ref;
+	/**
+	 * When set, the cleanup fiber is paused.
+	 */
+	bool cleanup_is_paused;
 	/**
 	 * Set if there's a fiber making a checkpoint right now.
 	 */
@@ -206,6 +220,18 @@ gc_init(void);
 void
 gc_free(void);
 
+/**
+ * Increment a reference to the delay counter.
+ */
+void
+gc_delay_ref(void);
+
+/**
+ * Decrement a reference from the delay counter.
+ */
+void
+gc_delay_unref(void);
+
 /**
  * Advance the garbage collector vclock to the given position.
  * Deactivate WAL consumers that need older data.
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index c4c9fa0a0..3230b5bbb 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -445,6 +445,14 @@ lbox_info_gc_call(struct lua_State *L)
 	lua_pushboolean(L, gc.checkpoint_is_in_progress);
 	lua_settable(L, -3);
 
+	lua_pushstring(L, "cleanup_is_paused");
+	lua_pushboolean(L, gc.cleanup_is_paused);
+	lua_settable(L, -3);
+
+	lua_pushstring(L, "delay_ref");
+	luaL_pushint64(L, gc.delay_ref);
+	lua_settable(L, -3);
+
 	lua_pushstring(L, "checkpoints");
 	lua_newtable(L);
 
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 41f949e8e..ee2e1fec8 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
 	}
 }
 
+static void
+relay_gc_delay_unref(struct cmsg *msg)
+{
+	(void)msg;
+	gc_delay_unref();
+}
+
 /**
  * A libev callback invoked when a relay client socket is ready
  * for read. This currently only happens when the client closes
@@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
 	 */
 	relay_send_heartbeat(relay);
 
+	/*
+	 * Now we can resume wal/engine gc as the relay
+	 * is up and running.
+	 */
+	if (!relay->replica->anon) {
+		static const struct cmsg_hop gc_delay_route[] = {
+			{relay_gc_delay_unref, NULL}
+		};
+		struct cmsg gc_delay_msg;
+		cmsg_init(&gc_delay_msg, gc_delay_route);
+		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
+	}
+
 	/*
 	 * Run the event loop until the connection is broken
 	 * or an error occurs.
@@ -771,9 +791,6 @@ relay_subscribe_f(va_list ap)
 			cpipe_push(&relay->tx_pipe, &relay->status_msg.msg);
 	}
 
-	if (!relay->replica->anon)
-		relay_send_is_raft_enabled(relay, &raft_enabler, false);
-
 	/*
 	 * Clear garbage collector trigger and WAL watcher.
 	 * trigger_clear() does nothing in case the triggers
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1fa8843e7..aefb812b3 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 			 tt_uuid_str(&replica->uuid));
 	}
 	replicaset.replica_by_id[replica_id] = replica;
+	gc_delay_ref();
 	++replicaset.registered_count;
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
@@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica)
 	replicaset.replica_by_id[replica->id] = NULL;
 	assert(replicaset.registered_count > 0);
 	--replicaset.registered_count;
+	gc_delay_unref();
 	if (replica->id == instance_id) {
 		/* See replica_check_id(). */
 		assert(replicaset.is_joining);
-- 
2.30.2