[Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed

Tarantool development patches archive
 help / color / mirror / Atom feed

From: Cyrill Gorcunov via Tarantool-patches <tarantool-patches@dev.tarantool.org>
To: tml <tarantool-patches@dev.tarantool.org>
Cc: Vladislav Shpilevoy <v.shpilevoy@tarantool.org>,
	Mons Anderson <v.perepelitsa@corp.mail.ru>
Subject: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
Date: Thu, 18 Mar 2021 21:41:37 +0300	[thread overview]
Message-ID: <20210318184138.1077807-2-gorcunov@gmail.com> (raw)
In-Reply-To: <20210318184138.1077807-1-gorcunov@gmail.com>

In case if replica managed to be far behind the master node
(so there are a number of xlog files present after the last
master's snapshot) then once master node get restarted it
may clean up the xlogs needed by the replica to subscribe
in a fast way and instead the replica will have to rejoin
reading a number of data back.

Lets try to address this by delaying xlog files cleanup
until replicas are got subscribed and relays are up
and running. For this sake we start with cleanup fiber
spinning in nop cycle ("paused" mode) and use a delay
counter to wait until relays decrement them.

This implies that if `_cluster` system space is not empty
upon restart and the registered replica somehow vanished
completely and won't ever come back, then the node
administrator has to drop this replica from `_cluster`
manually.

Note that this delayed cleanup start doesn't prevent
WAL engine from removing old files if there is no
space left on a storage device. The WAL will simply
drop old data without a question.

We need to take into account that some administrators
might not need this functionality at all, for this
sake we introduce "wal_cleanup_delay" configuration
option which allows to enable or disable the delay.

Closes #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add wal_cleanup_delay configuration parameter

The `wal_cleanup_delay` option defines a delay in seconds
before write ahead log files (`*.xlog`) are getting started
to prune upon a node restart.

This option is isnored in case if a node is running as
an anonymous replica (`replication_anon = true`). Similarly
if replication is unused or there is no plans to use
replication at all then this option should not be considered.

An initial problem to solve is the case where a node is operating
so fast that its replicas do not manage to reach the node state
and in case if the node is restarted at this moment (for various
reasons, for example due to power outage) then unfetched `*.xlog`
files might be pruned during restart. In result replicas will not
find these files on the main node and have to reread all data back
which is a very expensive procedure.

Since replicas are tracked via `_cluster` system space this we use
its content to count subscribed replicas and when all of them are
up and running the cleanup procedure is automatically enabled even
if `wal_cleanup_delay` is not expired.

The `wal_cleanup_delay` should be set to:

 - `-1` to wait infinitely until all existing replicas are subscribed;
 - `0` to disable the cleanup delay;
 - `>0` to wait for specified number of seconds.

By default it is set to `14400` seconds (ie `4` hours).

In case if registered replica is lost forever and timeout is set to
infinity then a preferred way to enable cleanup procedure is not setting
up a small timeout value but rather to delete this replica from `_cluster`
space manually.

Note that the option does *not* prevent WAL engine from removing
old `*.xlog` files if there is no space left on a storage device,
WAL engine can remove them in a force way.

Current state of `*.xlog` garbage collector can be found in
`box.info.gc()` output. For example

``` Lua
 tarantool> box.info.gc()
 ---
   ...
   cleanup_is_paused: false
   delay_ref: 0
```

The `cleanup_is_paused` shows if cleanup fiber is paused or not,
and `delay_ref` represents number of replicas to be subscribed
if the cleanup fiber is in `pause` mode.
---
 src/box/box.cc                  | 11 +++++
 src/box/gc.c                    | 75 +++++++++++++++++++++++++++++++--
 src/box/gc.h                    | 30 +++++++++++++
 src/box/lua/info.c              |  8 ++++
 src/box/lua/load_cfg.lua        |  3 ++
 src/box/relay.cc                | 20 +++++++++
 src/box/replication.cc          |  2 +
 test/app-tap/init_script.result |  1 +
 test/box/admin.result           |  2 +
 test/box/cfg.result             |  4 ++
 10 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index b4a1a5e07..ee017c2dc 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
 	return (enum wal_mode) mode;
 }
 
+static void
+box_check_wal_cleanup_delay(int tmo)
+{
+	if (tmo < 0 && tmo != -1) {
+		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
+			  "the value must be either >= 0 or -1");
+	}
+}
+
 static void
 box_check_readahead(int readahead)
 {
@@ -899,6 +908,7 @@ box_check_config(void)
 	box_check_checkpoint_count(cfg_geti("checkpoint_count"));
 	box_check_wal_max_size(cfg_geti64("wal_max_size"));
 	box_check_wal_mode(cfg_gets("wal_mode"));
+	box_check_wal_cleanup_delay(cfg_geti("wal_cleanup_delay"));
 	if (box_check_memory_quota("memtx_memory") < 0)
 		diag_raise();
 	box_check_memtx_min_tuple_size(cfg_geti64("memtx_min_tuple_size"));
@@ -3045,6 +3055,7 @@ box_cfg_xc(void)
 		bootstrap(&instance_uuid, &replicaset_uuid,
 			  &is_bootstrap_leader);
 	}
+	gc_delay_unref();
 	fiber_gc();
 
 	bootstrap_journal_guard.is_active = false;
diff --git a/src/box/gc.c b/src/box/gc.c
index 9af4ef958..595b98bdf 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -46,6 +46,7 @@
 #include <small/rlist.h>
 #include <tarantool_ev.h>
 
+#include "cfg.h"
 #include "diag.h"
 #include "errcode.h"
 #include "fiber.h"
@@ -107,6 +108,25 @@ gc_init(void)
 	/* Don't delete any files until recovery is complete. */
 	gc.min_checkpoint_count = INT_MAX;
 
+	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
+	gc.delay_ref = 0;
+
+	if (cfg_geti("replication_anon") != 0) {
+		/*
+		 * Anonymous replicas are read only
+		 * so no need to keep XLOGs.
+		 */
+		gc.cleanup_is_paused = false;
+	} else if (gc.wal_cleanup_delay == 0) {
+		/*
+		 * The delay is disabled explicitly.
+		 */
+		gc.cleanup_is_paused = false;
+	} else {
+		say_info("gc: wal/engine cleanup is paused");
+		gc.cleanup_is_paused = true;
+	}
+
 	vclock_create(&gc.vclock);
 	rlist_create(&gc.checkpoints);
 	gc_tree_new(&gc.consumers);
@@ -238,6 +258,30 @@ static int
 gc_cleanup_fiber_f(va_list ap)
 {
 	(void)ap;
+
+	/*
+	 * Stage 1 (optional): in case if we're booting
+	 * up with cleanup disabled lets do wait in a
+	 * separate cycle to minimize branching on stage 2.
+	 */
+	if (gc.cleanup_is_paused) {
+		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
+			TIMEOUT_INFINITY : gc.wal_cleanup_delay;
+		while (!fiber_is_cancelled()) {
+			if (fiber_yield_timeout(tmo)) {
+				say_info("gc: wal/engine cleanup is resumed "
+					 "due to timeout expiration");
+				gc.cleanup_is_paused = false;
+				break;
+			}
+			if (!gc.cleanup_is_paused)
+				break;
+		}
+	}
+
+	/*
+	 * Stage 2: a regular cleanup cycle.
+	 */
 	while (!fiber_is_cancelled()) {
 		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
 		if (delta == 0) {
@@ -253,6 +297,29 @@ gc_cleanup_fiber_f(va_list ap)
 	return 0;
 }
 
+void
+gc_delay_ref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref >= 0);
+		gc.delay_ref++;
+	}
+}
+
+void
+gc_delay_unref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref > 0);
+		gc.delay_ref--;
+		if (gc.delay_ref == 0) {
+			say_info("gc: wal/engine cleanup is resumed");
+			gc.cleanup_is_paused = false;
+			fiber_wakeup(gc.cleanup_fiber);
+		}
+	}
+}
+
 /**
  * Trigger asynchronous garbage collection.
  */
@@ -278,9 +345,11 @@ gc_schedule_cleanup(void)
 static void
 gc_wait_cleanup(void)
 {
-	int64_t scheduled = gc.cleanup_scheduled;
-	while (gc.cleanup_completed < scheduled)
-		fiber_cond_wait(&gc.cleanup_cond);
+	if (!gc.cleanup_is_paused) {
+		int64_t scheduled = gc.cleanup_scheduled;
+		while (gc.cleanup_completed < scheduled)
+			fiber_cond_wait(&gc.cleanup_cond);
+	}
 }
 
 void
diff --git a/src/box/gc.h b/src/box/gc.h
index 2a568c5f9..9e73d39b0 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -147,6 +147,24 @@ struct gc_state {
 	 * taken at that moment of time.
 	 */
 	int64_t cleanup_completed, cleanup_scheduled;
+	/**
+	 * A counter to wait until all replicas are managed to
+	 * subscribe so that we can enable cleanup fiber to
+	 * remove old XLOGs. Otherwise some replicas might be
+	 * far behind the master node and after the master
+	 * node been restarted they will have to reread all
+	 * data back due to XlogGapError, ie too early deleted
+	 * XLOGs.
+	 */
+	int64_t delay_ref;
+	/**
+	 * Delay timeout in seconds.
+	 */
+	int32_t wal_cleanup_delay;
+	/**
+	 * When set the cleanup fiber is paused.
+	 */
+	bool cleanup_is_paused;
 	/**
 	 * Set if there's a fiber making a checkpoint right now.
 	 */
@@ -206,6 +224,18 @@ gc_init(void);
 void
 gc_free(void);
 
+/**
+ * Increment a reference to delay counter.
+ */
+void
+gc_delay_ref(void);
+
+/**
+ * Decrement a reference from the delay counter.
+ */
+void
+gc_delay_unref(void);
+
 /**
  * Advance the garbage collector vclock to the given position.
  * Deactivate WAL consumers that need older data.
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index c4c9fa0a0..3230b5bbb 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -445,6 +445,14 @@ lbox_info_gc_call(struct lua_State *L)
 	lua_pushboolean(L, gc.checkpoint_is_in_progress);
 	lua_settable(L, -3);
 
+	lua_pushstring(L, "cleanup_is_paused");
+	lua_pushboolean(L, gc.cleanup_is_paused);
+	lua_settable(L, -3);
+
+	lua_pushstring(L, "delay_ref");
+	luaL_pushint64(L, gc.delay_ref);
+	lua_settable(L, -3);
+
 	lua_pushstring(L, "checkpoints");
 	lua_newtable(L);
 
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index aac216932..1ca9d5e22 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -72,6 +72,7 @@ local default_cfg = {
     wal_mode            = "write",
     wal_max_size        = 256 * 1024 * 1024,
     wal_dir_rescan_delay= 2,
+    wal_cleanup_delay   = 4 * 3600,
     force_recovery      = false,
     replication         = nil,
     instance_uuid       = nil,
@@ -154,6 +155,7 @@ local template_cfg = {
     wal_mode            = 'string',
     wal_max_size        = 'number',
     wal_dir_rescan_delay= 'number',
+    wal_cleanup_delay   = 'number',
     force_recovery      = 'boolean',
     replication         = 'string, number, table',
     instance_uuid       = 'string',
@@ -378,6 +380,7 @@ local dynamic_cfg_skip_at_load = {
     replication_skip_conflict = true,
     replication_anon        = true,
     wal_dir_rescan_delay    = true,
+    wal_cleanup_delay       = true,
     custom_proc_title       = true,
     force_recovery          = true,
     instance_uuid           = true,
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 41f949e8e..5d09690ad 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
 	}
 }
 
+static void
+relay_gc_delay_unref(struct cmsg *msg)
+{
+	(void)msg;
+	gc_delay_unref();
+}
+
 /**
  * A libev callback invoked when a relay client socket is ready
  * for read. This currently only happens when the client closes
@@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
 	 */
 	relay_send_heartbeat(relay);
 
+	/*
+	 * Now we can resume wal/engine gc as relay
+	 * is up and running.
+	 */
+	if (!relay->replica->anon) {
+		static const struct cmsg_hop gc_delay_route[] = {
+			{relay_gc_delay_unref, NULL}
+		};
+		struct cmsg gc_delay_msg;
+		cmsg_init(&gc_delay_msg, gc_delay_route);
+		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
+	}
+
 	/*
 	 * Run the event loop until the connection is broken
 	 * or an error occurs.
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1fa8843e7..aefb812b3 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 						   tt_uuid_str(&replica->uuid));
 	}
 	replicaset.replica_by_id[replica_id] = replica;
+	gc_delay_ref();
 	++replicaset.registered_count;
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
@@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica)
 	replicaset.replica_by_id[replica->id] = NULL;
 	assert(replicaset.registered_count > 0);
 	--replicaset.registered_count;
+	gc_delay_unref();
 	if (replica->id == instance_id) {
 		/* See replica_check_id(). */
 		assert(replicaset.is_joining);
diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 76fe2ea27..33d861ecd 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -53,6 +53,7 @@ vinyl_run_count_per_level:2
 vinyl_run_size_ratio:3.5
 vinyl_timeout:60
 vinyl_write_threads:4
+wal_cleanup_delay:14400
 wal_dir:.
 wal_dir_rescan_delay:2
 wal_max_size:268435456
diff --git a/test/box/admin.result b/test/box/admin.result
index c9a6ff9e4..3443dbf96 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -127,6 +127,8 @@ cfg_filter(box.cfg)
     - 60
   - - vinyl_write_threads
     - 4
+  - - wal_cleanup_delay
+    - 14400
   - - wal_dir
     - <hidden>
   - - wal_dir_rescan_delay
diff --git a/test/box/cfg.result b/test/box/cfg.result
index ae37b28f0..87be6053d 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -115,6 +115,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
@@ -232,6 +234,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
-- 
2.30.2

next prev parent reply	other threads:[~2021-03-18 18:42 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-18 18:41 [Tarantool-patches] [PATCH 0/2] " Cyrill Gorcunov via Tarantool-patches
2021-03-18 18:41 ` Cyrill Gorcunov via Tarantool-patches [this message]
2021-03-18 23:04   ` [Tarantool-patches] [PATCH 1/2] " Vladislav Shpilevoy via Tarantool-patches
2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
2021-03-22  9:05         ` Serge Petrenko via Tarantool-patches
2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
2021-03-19 13:57       ` Konstantin Osipov via Tarantool-patches
2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
2021-03-19 15:14     ` Cyrill Gorcunov via Tarantool-patches
2021-03-18 18:41 ` [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-19 12:14     ` Cyrill Gorcunov via Tarantool-patches
2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210318184138.1077807-2-gorcunov@gmail.com \
    --to=tarantool-patches@dev.tarantool.org \
    --cc=gorcunov@gmail.com \
    --cc=v.perepelitsa@corp.mail.ru \
    --cc=v.shpilevoy@tarantool.org \
    --subject='Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox