Tarantool development patches archive
* [Tarantool-patches] [PATCH 0/2] gc/xlog: delay xlog cleanup until relays are subscribed
@ 2021-03-18 18:41 Cyrill Gorcunov via Tarantool-patches
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 1/2] " Cyrill Gorcunov via Tarantool-patches
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
  0 siblings, 2 replies; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-18 18:41 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy, Mons Anderson

Guys, please take a look once time permits.

issue https://github.com/tarantool/tarantool/issues/5806
branch gorcunov/gh-5806-xlog-gc

Cyrill Gorcunov (2):
  gc/xlog: delay xlog cleanup until relays are subscribed
  test: add a test for wal_cleanup_delay option

 src/box/box.cc                                |  11 +
 src/box/gc.c                                  |  75 +++-
 src/box/gc.h                                  |  30 ++
 src/box/lua/info.c                            |   8 +
 src/box/lua/load_cfg.lua                      |   3 +
 src/box/relay.cc                              |  20 ++
 src/box/replication.cc                        |   2 +
 test/app-tap/init_script.result               |   1 +
 test/box/admin.result                         |   2 +
 test/box/cfg.result                           |   4 +
 test/replication/gh-5806-master.lua           |  13 +
 test/replication/gh-5806-slave.lua            |  13 +
 test/replication/gh-5806-xlog-cleanup.result  | 336 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 148 ++++++++
 14 files changed, 663 insertions(+), 3 deletions(-)
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-slave.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua


base-commit: a0ade20662e0d55c912a72656e6dc75defd7c7f4
-- 
2.30.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-18 18:41 [Tarantool-patches] [PATCH 0/2] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
@ 2021-03-18 18:41 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
  1 sibling, 2 replies; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-18 18:41 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy, Mons Anderson

If a replica falls far behind the master node (so that a
number of xlog files have accumulated since the last master
snapshot), then once the master node gets restarted it may
clean up the xlogs the replica needs to subscribe quickly,
and instead the replica will have to rejoin, re-reading a
large amount of data.

Let's address this by delaying xlog file cleanup until
the replicas are subscribed and the relays are up and
running. For this sake the cleanup fiber starts spinning
in a nop cycle ("paused" mode) and waits on a delay
counter which the relays decrement as they subscribe.

This implies that if the `_cluster` system space is not
empty upon restart and a registered replica has vanished
completely and won't ever come back, then the node
administrator has to drop this replica from `_cluster`
manually.

Note that the delayed cleanup start doesn't prevent the
WAL engine from removing old files if there is no space
left on the storage device; in that case the WAL simply
drops old data without asking.

Some administrators might not need this functionality
at all, so we introduce the "wal_cleanup_delay"
configuration option, which allows the delay to be
enabled or disabled.

Closes #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>

@TarantoolBot document
Title: Add wal_cleanup_delay configuration parameter

The `wal_cleanup_delay` option defines a delay in seconds
before write ahead log files (`*.xlog`) are getting started
to prune upon a node restart.

This option is isnored in case if a node is running as
an anonymous replica (`replication_anon = true`). Similarly
if replication is unused or there is no plans to use
replication at all then this option should not be considered.

An initial problem to solve is the case where a node is operating
so fast that its replicas do not manage to reach the node state
and in case if the node is restarted at this moment (for various
reasons, for example due to power outage) then unfetched `*.xlog`
files might be pruned during restart. In result replicas will not
find these files on the main node and have to reread all data back
which is a very expensive procedure.

Since replicas are tracked via `_cluster` system space this we use
its content to count subscribed replicas and when all of them are
up and running the cleanup procedure is automatically enabled even
if `wal_cleanup_delay` is not expired.

The `wal_cleanup_delay` should be set to:

 - `-1` to wait infinitely until all existing replicas are subscribed;
 - `0` to disable the cleanup delay;
 - `>0` to wait for specified number of seconds.

By default it is set to `14400` seconds (ie `4` hours).

In case if registered replica is lost forever and timeout is set to
infinity then a preferred way to enable cleanup procedure is not setting
up a small timeout value but rather to delete this replica from `_cluster`
space manually.
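
For illustration, a minimal usage sketch (the delay value and the
replica id `2` below are just examples, not taken from a real setup):

``` Lua
 -- Wait up to one hour for the registered replicas to subscribe
 -- before the xlog cleanup is allowed to start.
 box.cfg{wal_cleanup_delay = 3600}

 -- If a registered replica is gone forever, drop its record from
 -- the `_cluster` space manually (replica id 2 is hypothetical),
 -- so the cleanup can resume without waiting for the timeout.
 box.space._cluster:delete(2)
```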

Note that the option does *not* prevent WAL engine from removing
old `*.xlog` files if there is no space left on a storage device,
WAL engine can remove them in a force way.

Current state of `*.xlog` garbage collector can be found in
`box.info.gc()` output. For example

``` Lua
 tarantool> box.info.gc()
 ---
   ...
   cleanup_is_paused: false
   delay_ref: 0
```

The `cleanup_is_paused` shows if cleanup fiber is paused or not,
and `delay_ref` represents number of replicas to be subscribed
if the cleanup fiber is in `pause` mode.
---
 src/box/box.cc                  | 11 +++++
 src/box/gc.c                    | 75 +++++++++++++++++++++++++++++++--
 src/box/gc.h                    | 30 +++++++++++++
 src/box/lua/info.c              |  8 ++++
 src/box/lua/load_cfg.lua        |  3 ++
 src/box/relay.cc                | 20 +++++++++
 src/box/replication.cc          |  2 +
 test/app-tap/init_script.result |  1 +
 test/box/admin.result           |  2 +
 test/box/cfg.result             |  4 ++
 10 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index b4a1a5e07..ee017c2dc 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
 	return (enum wal_mode) mode;
 }
 
+static void
+box_check_wal_cleanup_delay(int tmo)
+{
+	if (tmo < 0 && tmo != -1) {
+		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
+			  "the value must be either >= 0 or -1");
+	}
+}
+
 static void
 box_check_readahead(int readahead)
 {
@@ -899,6 +908,7 @@ box_check_config(void)
 	box_check_checkpoint_count(cfg_geti("checkpoint_count"));
 	box_check_wal_max_size(cfg_geti64("wal_max_size"));
 	box_check_wal_mode(cfg_gets("wal_mode"));
+	box_check_wal_cleanup_delay(cfg_geti("wal_cleanup_delay"));
 	if (box_check_memory_quota("memtx_memory") < 0)
 		diag_raise();
 	box_check_memtx_min_tuple_size(cfg_geti64("memtx_min_tuple_size"));
@@ -3045,6 +3055,7 @@ box_cfg_xc(void)
 		bootstrap(&instance_uuid, &replicaset_uuid,
 			  &is_bootstrap_leader);
 	}
+	gc_delay_unref();
 	fiber_gc();
 
 	bootstrap_journal_guard.is_active = false;
diff --git a/src/box/gc.c b/src/box/gc.c
index 9af4ef958..595b98bdf 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -46,6 +46,7 @@
 #include <small/rlist.h>
 #include <tarantool_ev.h>
 
+#include "cfg.h"
 #include "diag.h"
 #include "errcode.h"
 #include "fiber.h"
@@ -107,6 +108,25 @@ gc_init(void)
 	/* Don't delete any files until recovery is complete. */
 	gc.min_checkpoint_count = INT_MAX;
 
+	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
+	gc.delay_ref = 0;
+
+	if (cfg_geti("replication_anon") != 0) {
+		/*
+		 * Anonymous replicas are read only
+		 * so no need to keep XLOGs.
+		 */
+		gc.cleanup_is_paused = false;
+	} else if (gc.wal_cleanup_delay == 0) {
+		/*
+		 * The delay is disabled explicitly.
+		 */
+		gc.cleanup_is_paused = false;
+	} else {
+		say_info("gc: wal/engine cleanup is paused");
+		gc.cleanup_is_paused = true;
+	}
+
 	vclock_create(&gc.vclock);
 	rlist_create(&gc.checkpoints);
 	gc_tree_new(&gc.consumers);
@@ -238,6 +258,30 @@ static int
 gc_cleanup_fiber_f(va_list ap)
 {
 	(void)ap;
+
+	/*
+	 * Stage 1 (optional): in case if we're booting
+	 * up with cleanup disabled lets do wait in a
+	 * separate cycle to minimize branching on stage 2.
+	 */
+	if (gc.cleanup_is_paused) {
+		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
+			TIMEOUT_INFINITY : gc.wal_cleanup_delay;
+		while (!fiber_is_cancelled()) {
+			if (fiber_yield_timeout(tmo)) {
+				say_info("gc: wal/engine cleanup is resumed "
+					 "due to timeout expiration");
+				gc.cleanup_is_paused = false;
+				break;
+			}
+			if (!gc.cleanup_is_paused)
+				break;
+		}
+	}
+
+	/*
+	 * Stage 2: a regular cleanup cycle.
+	 */
 	while (!fiber_is_cancelled()) {
 		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
 		if (delta == 0) {
@@ -253,6 +297,29 @@ gc_cleanup_fiber_f(va_list ap)
 	return 0;
 }
 
+void
+gc_delay_ref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref >= 0);
+		gc.delay_ref++;
+	}
+}
+
+void
+gc_delay_unref(void)
+{
+	if (gc.cleanup_is_paused) {
+		assert(gc.delay_ref > 0);
+		gc.delay_ref--;
+		if (gc.delay_ref == 0) {
+			say_info("gc: wal/engine cleanup is resumed");
+			gc.cleanup_is_paused = false;
+			fiber_wakeup(gc.cleanup_fiber);
+		}
+	}
+}
+
 /**
  * Trigger asynchronous garbage collection.
  */
@@ -278,9 +345,11 @@ gc_schedule_cleanup(void)
 static void
 gc_wait_cleanup(void)
 {
-	int64_t scheduled = gc.cleanup_scheduled;
-	while (gc.cleanup_completed < scheduled)
-		fiber_cond_wait(&gc.cleanup_cond);
+	if (!gc.cleanup_is_paused) {
+		int64_t scheduled = gc.cleanup_scheduled;
+		while (gc.cleanup_completed < scheduled)
+			fiber_cond_wait(&gc.cleanup_cond);
+	}
 }
 
 void
diff --git a/src/box/gc.h b/src/box/gc.h
index 2a568c5f9..9e73d39b0 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -147,6 +147,24 @@ struct gc_state {
 	 * taken at that moment of time.
 	 */
 	int64_t cleanup_completed, cleanup_scheduled;
+	/**
+	 * A counter to wait until all replicas are managed to
+	 * subscribe so that we can enable cleanup fiber to
+	 * remove old XLOGs. Otherwise some replicas might be
+	 * far behind the master node and after the master
+	 * node been restarted they will have to reread all
+	 * data back due to XlogGapError, ie too early deleted
+	 * XLOGs.
+	 */
+	int64_t delay_ref;
+	/**
+	 * Delay timeout in seconds.
+	 */
+	int32_t wal_cleanup_delay;
+	/**
+	 * When set the cleanup fiber is paused.
+	 */
+	bool cleanup_is_paused;
 	/**
 	 * Set if there's a fiber making a checkpoint right now.
 	 */
@@ -206,6 +224,18 @@ gc_init(void);
 void
 gc_free(void);
 
+/**
+ * Increment a reference to delay counter.
+ */
+void
+gc_delay_ref(void);
+
+/**
+ * Decrement a reference from the delay counter.
+ */
+void
+gc_delay_unref(void);
+
 /**
  * Advance the garbage collector vclock to the given position.
  * Deactivate WAL consumers that need older data.
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index c4c9fa0a0..3230b5bbb 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -445,6 +445,14 @@ lbox_info_gc_call(struct lua_State *L)
 	lua_pushboolean(L, gc.checkpoint_is_in_progress);
 	lua_settable(L, -3);
 
+	lua_pushstring(L, "cleanup_is_paused");
+	lua_pushboolean(L, gc.cleanup_is_paused);
+	lua_settable(L, -3);
+
+	lua_pushstring(L, "delay_ref");
+	luaL_pushint64(L, gc.delay_ref);
+	lua_settable(L, -3);
+
 	lua_pushstring(L, "checkpoints");
 	lua_newtable(L);
 
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index aac216932..1ca9d5e22 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -72,6 +72,7 @@ local default_cfg = {
     wal_mode            = "write",
     wal_max_size        = 256 * 1024 * 1024,
     wal_dir_rescan_delay= 2,
+    wal_cleanup_delay   = 4 * 3600,
     force_recovery      = false,
     replication         = nil,
     instance_uuid       = nil,
@@ -154,6 +155,7 @@ local template_cfg = {
     wal_mode            = 'string',
     wal_max_size        = 'number',
     wal_dir_rescan_delay= 'number',
+    wal_cleanup_delay   = 'number',
     force_recovery      = 'boolean',
     replication         = 'string, number, table',
     instance_uuid       = 'string',
@@ -378,6 +380,7 @@ local dynamic_cfg_skip_at_load = {
     replication_skip_conflict = true,
     replication_anon        = true,
     wal_dir_rescan_delay    = true,
+    wal_cleanup_delay       = true,
     custom_proc_title       = true,
     force_recovery          = true,
     instance_uuid           = true,
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 41f949e8e..5d09690ad 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
 	}
 }
 
+static void
+relay_gc_delay_unref(struct cmsg *msg)
+{
+	(void)msg;
+	gc_delay_unref();
+}
+
 /**
  * A libev callback invoked when a relay client socket is ready
  * for read. This currently only happens when the client closes
@@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
 	 */
 	relay_send_heartbeat(relay);
 
+	/*
+	 * Now we can resume wal/engine gc as relay
+	 * is up and running.
+	 */
+	if (!relay->replica->anon) {
+		static const struct cmsg_hop gc_delay_route[] = {
+			{relay_gc_delay_unref, NULL}
+		};
+		struct cmsg gc_delay_msg;
+		cmsg_init(&gc_delay_msg, gc_delay_route);
+		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
+	}
+
 	/*
 	 * Run the event loop until the connection is broken
 	 * or an error occurs.
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 1fa8843e7..aefb812b3 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 						   tt_uuid_str(&replica->uuid));
 	}
 	replicaset.replica_by_id[replica_id] = replica;
+	gc_delay_ref();
 	++replicaset.registered_count;
 	say_info("assigned id %d to replica %s",
 		 replica->id, tt_uuid_str(&replica->uuid));
@@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica)
 	replicaset.replica_by_id[replica->id] = NULL;
 	assert(replicaset.registered_count > 0);
 	--replicaset.registered_count;
+	gc_delay_unref();
 	if (replica->id == instance_id) {
 		/* See replica_check_id(). */
 		assert(replicaset.is_joining);
diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 76fe2ea27..33d861ecd 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -53,6 +53,7 @@ vinyl_run_count_per_level:2
 vinyl_run_size_ratio:3.5
 vinyl_timeout:60
 vinyl_write_threads:4
+wal_cleanup_delay:14400
 wal_dir:.
 wal_dir_rescan_delay:2
 wal_max_size:268435456
diff --git a/test/box/admin.result b/test/box/admin.result
index c9a6ff9e4..3443dbf96 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -127,6 +127,8 @@ cfg_filter(box.cfg)
     - 60
   - - vinyl_write_threads
     - 4
+  - - wal_cleanup_delay
+    - 14400
   - - wal_dir
     - <hidden>
   - - wal_dir_rescan_delay
diff --git a/test/box/cfg.result b/test/box/cfg.result
index ae37b28f0..87be6053d 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -115,6 +115,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
@@ -232,6 +234,8 @@ cfg_filter(box.cfg)
  |     - 60
  |   - - vinyl_write_threads
  |     - 4
+ |   - - wal_cleanup_delay
+ |     - 14400
  |   - - wal_dir
  |     - <hidden>
  |   - - wal_dir_rescan_delay
-- 
2.30.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option
  2021-03-18 18:41 [Tarantool-patches] [PATCH 0/2] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 1/2] " Cyrill Gorcunov via Tarantool-patches
@ 2021-03-18 18:41 ` Cyrill Gorcunov via Tarantool-patches
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-18 18:41 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy, Mons Anderson

Part-of #5806

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/gh-5806-master.lua           |  13 +
 test/replication/gh-5806-slave.lua            |  13 +
 test/replication/gh-5806-xlog-cleanup.result  | 336 ++++++++++++++++++
 .../replication/gh-5806-xlog-cleanup.test.lua | 148 ++++++++
 4 files changed, 510 insertions(+)
 create mode 100644 test/replication/gh-5806-master.lua
 create mode 100644 test/replication/gh-5806-slave.lua
 create mode 100644 test/replication/gh-5806-xlog-cleanup.result
 create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua

diff --git a/test/replication/gh-5806-master.lua b/test/replication/gh-5806-master.lua
new file mode 100644
index 000000000..0404965d3
--- /dev/null
+++ b/test/replication/gh-5806-master.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+
+function func_xlog_snap(space, value)
+    space:insert(value)
+    box.snapshot()
+end
+
+box.cfg({
+    listen              = os.getenv("LISTEN"),
+    wal_cleanup_delay   = tonumber(arg[1]) or 0,
+})
diff --git a/test/replication/gh-5806-slave.lua b/test/replication/gh-5806-slave.lua
new file mode 100644
index 000000000..917dbb1ae
--- /dev/null
+++ b/test/replication/gh-5806-slave.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+require('console').listen(os.getenv('ADMIN'))
+
+function func_xlog_snap(space, value)
+    space:insert(value)
+    box.snapshot()
+end
+
+box.cfg({
+    listen              = os.getenv("LISTEN"),
+    replication         = os.getenv("MASTER"),
+})
diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
new file mode 100644
index 000000000..97355a8bf
--- /dev/null
+++ b/test/replication/gh-5806-xlog-cleanup.result
@@ -0,0 +1,336 @@
+-- test-run result file version 2
+--
+-- gh-5806: defer xlog cleanup to keep xlogs until
+-- replicas present in "_cluster" are connected.
+-- Otherwise we are getting XlogGapError since
+-- master might go far forwad from replica and
+-- replica won't be able to connect without full
+-- rebootstrap.
+--
+
+fiber = require('fiber')
+ | ---
+ | ...
+test_run = require('test_run').new()
+ | ---
+ | ...
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+
+--
+-- Case 1.
+--
+-- First lets make sure we're getting XlogGapError in
+-- case if wal_cleanup_delay is not used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/gh-5806-slave.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+s = box.schema.space.create('testtemp', {temporary = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end
+ | ---
+ | ...
+
+--
+-- Stop the replica node and generate
+-- first range of xlogs on the master.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+for i=1,2 do func_xlog_snap(box.space.test, {i}) end
+ | ---
+ | ...
+
+--
+-- Restart the masted and generate the
+-- next range of xlogs.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with wait_load=True')
+ | ---
+ | - true
+ | ...
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+for i=3,4 do func_xlog_snap(box.space.test, {i}) end
+ | ---
+ | ...
+
+--
+-- Restart master node and the replica then.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with wait_load=True')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=False, wait_load=False')
+ | ---
+ | - true
+ | ...
+
+--
+-- Wait error to appear.
+while test_run:grep_log("master", "XlogGapError") == nil do fiber.sleep(0.01) end
+ | ---
+ | ...
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+
+--
+-- Case 2.
+--
+-- Lets make sure we're not getting XlogGapError in
+-- case if wal_cleanup_delay is used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+
+engine = test_run:get_cfg('engine')
+ | ---
+ | ...
+s = box.schema.space.create('test', {engine = engine})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/gh-5806-slave.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.cfg{checkpoint_count = 1}
+ | ---
+ | ...
+s = box.schema.space.create('testtemp', {temporary = true})
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end
+ | ---
+ | ...
+
+--
+-- Stop the replica node and generate
+-- first range of xlogs on the master.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+for i=1,2 do func_xlog_snap(box.space.test, {i}) end
+ | ---
+ | ...
+
+--
+-- Restart the masted and generate the
+-- next range of xlogs.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+for i=3,4 do func_xlog_snap(box.space.test, {i}) end
+ | ---
+ | ...
+
+--
+-- Restart master node and the replica then.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica with wait=True, wait_load=True')
+ | ---
+ | - true
+ | ...
+
+assert(test_run:grep_log("master", "XlogGapError") == nil)
+ | ---
+ | - true
+ | ...
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-5806-xlog-cleanup.test.lua b/test/replication/gh-5806-xlog-cleanup.test.lua
new file mode 100644
index 000000000..6232e3964
--- /dev/null
+++ b/test/replication/gh-5806-xlog-cleanup.test.lua
@@ -0,0 +1,148 @@
+--
+-- gh-5806: defer xlog cleanup to keep xlogs until
+-- replicas present in "_cluster" are connected.
+-- Otherwise we are getting XlogGapError since
+-- master might go far forwad from replica and
+-- replica won't be able to connect without full
+-- rebootstrap.
+--
+
+fiber = require('fiber')
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+--
+-- Case 1.
+--
+-- First lets make sure we're getting XlogGapError in
+-- case if wal_cleanup_delay is not used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master with wait=True, wait_load=True')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+
+engine = test_run:get_cfg('engine')
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/gh-5806-slave.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+test_run:switch('replica')
+box.cfg{checkpoint_count = 1}
+s = box.schema.space.create('testtemp', {temporary = true})
+_ = s:create_index('pk')
+for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end
+
+--
+-- Stop the replica node and generate
+-- first range of xlogs on the master.
+test_run:switch('default')
+test_run:cmd('stop server replica')
+
+test_run:switch('master')
+for i=1,2 do func_xlog_snap(box.space.test, {i}) end
+
+--
+-- Restart the masted and generate the
+-- next range of xlogs.
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('start server master with wait_load=True')
+test_run:switch('master')
+for i=3,4 do func_xlog_snap(box.space.test, {i}) end
+
+--
+-- Restart master node and the replica then.
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('start server master with wait_load=True')
+test_run:cmd('start server replica with wait=False, wait_load=False')
+
+--
+-- Wait error to appear.
+while test_run:grep_log("master", "XlogGapError") == nil do fiber.sleep(0.01) end
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server master')
+test_run:cmd('delete server replica')
+
+--
+-- Case 2.
+--
+-- Lets make sure we're not getting XlogGapError in
+-- case if wal_cleanup_delay is used.
+--
+
+test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+
+test_run:switch('master')
+box.schema.user.grant('guest', 'replication')
+
+--
+-- Keep small number of snaps to force cleanup
+-- procedure be more intensive.
+box.cfg{checkpoint_count = 1}
+
+engine = test_run:get_cfg('engine')
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+
+test_run:switch('default')
+test_run:cmd('create server replica with rpl_master=master,\
+              script="replication/gh-5806-slave.lua"')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+test_run:switch('replica')
+box.cfg{checkpoint_count = 1}
+s = box.schema.space.create('testtemp', {temporary = true})
+_ = s:create_index('pk')
+for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end
+
+--
+-- Stop the replica node and generate
+-- first range of xlogs on the master.
+test_run:switch('default')
+test_run:cmd('stop server replica')
+
+test_run:switch('master')
+for i=1,2 do func_xlog_snap(box.space.test, {i}) end
+
+--
+-- Restart the masted and generate the
+-- next range of xlogs.
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:switch('master')
+for i=3,4 do func_xlog_snap(box.space.test, {i}) end
+
+--
+-- Restart master node and the replica then.
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('start server master with args="3600", wait=True, wait_load=True')
+test_run:cmd('start server replica with wait=True, wait_load=True')
+
+assert(test_run:grep_log("master", "XlogGapError") == nil)
+
+--
+-- Cleanup.
+test_run:cmd('stop server master')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server master')
+test_run:cmd('delete server replica')
-- 
2.30.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 1/2] " Cyrill Gorcunov via Tarantool-patches
@ 2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
  2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
  2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
  1 sibling, 2 replies; 14+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-18 23:04 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

Hi! Thanks for the patch!

Generally looks fine except details.


AFAIU, we go for the timeout option for the only reason that
there might be non-fullmesh topologies, when a replica is in
_cluster, but does not have relays to other nodes.

If we would know the topology, we could tell which nodes will
relay to who, and would be able to detect when a replica needs
to keep the logs, and when don't. Not even the entire topology.
Just the info from the adjacent nodes. From our own box.cfg.replication,
and from box.cfg.replication of these nodes.

I still don't understand what was wrong with implementing some kind
of topology discovery for this 2-level topology tree. For instance
when applier on a replica is connected, the remote node sends us a
flag whether it is going to connect back. If the flag is false, we don't
keep the logs for that instance.

See 13 comments below.

> @TarantoolBot document
> Title: Add wal_cleanup_delay configuration parameter

1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
with the existing 'gc' name, which is already exposed in box.info.gc.
It is not called box.info.cleanup, so I would propose to use 'gc'
here too.

Another issue I described in the first patchset's discussion. It is
not really a delay. Because it is simply ignored when the replication
feels like it. It must either have 'replication' prefix to designate
it is not just a fixated timeout to keep the WALs and it depends on
the replication, or at least it must mention "max". To designate it
is not the exact strict timeout for keeping the logs.

> The `wal_cleanup_delay` option defines a delay in seconds
> before write ahead log files (`*.xlog`) are getting started
> to prune upon a node restart.
> 
> This option is isnored in case if a node is running as

2. isnored -> ignored.

> an anonymous replica (`replication_anon = true`). Similarly
> if replication is unused or there is no plans to use
> replication at all then this option should not be considered.
> 
> An initial problem to solve is the case where a node is operating
> so fast that its replicas do not manage to reach the node state
> and in case if the node is restarted at this moment (for various
> reasons, for example due to power outage) then unfetched `*.xlog`
> files might be pruned during restart. In result replicas will not
> find these files on the main node and have to reread all data back
> which is a very expensive procedure.
> 
> Since replicas are tracked via `_cluster` system space this we use
> its content to count subscribed replicas and when all of them are
> up and running the cleanup procedure is automatically enabled even
> if `wal_cleanup_delay` is not expired.
> 
> The `wal_cleanup_delay` should be set to:
> 
>  - `-1` to wait infinitely until all existing replicas are subscribed;
>  - `0` to disable the cleanup delay;
>  - `>0` to wait for specified number of seconds.
> 
> By default it is set to `14400` seconds (ie `4` hours).
> 
> In case if registered replica is lost forever and timeout is set to
> infinity then a preferred way to enable cleanup procedure is not setting
> up a small timeout value but rather to delete this replica from `_cluster`
> space manually.
> 
> Note that the option does *not* prevent WAL engine from removing
> old `*.xlog` files if there is no space left on a storage device,
> WAL engine can remove them in a force way.
> 
> Current state of `*.xlog` garbage collector can be found in
> `box.info.gc()` output. For example
> 
> ``` Lua
>  tarantool> box.info.gc()
>  ---
>    ...
>    cleanup_is_paused: false

3. 'gc.cleanup' is a tautology. If you want to expose the
flag, it better be simply 'is_paused' IMO.

>    delay_ref: 0

4. Can't the user get the same by looking at

	_cluster size - box.info.gc().consumers count

?
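
Something like this, as a sketch (assuming a registered replica
that has not subscribed yet simply has no gc consumer):

``` Lua
 -- number of still-pending replicas, roughly:
 pending = box.space._cluster:count() - #box.info.gc().consumers
```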

> ```
> 
> The `cleanup_is_paused` shows if cleanup fiber is paused or not,
> and `delay_ref` represents number of replicas to be subscribed
> if the cleanup fiber is in `pause` mode.
> ---
>  src/box/box.cc                  | 11 +++++
>  src/box/gc.c                    | 75 +++++++++++++++++++++++++++++++--
>  src/box/gc.h                    | 30 +++++++++++++
>  src/box/lua/info.c              |  8 ++++
>  src/box/lua/load_cfg.lua        |  3 ++
>  src/box/relay.cc                | 20 +++++++++
>  src/box/replication.cc          |  2 +
>  test/app-tap/init_script.result |  1 +
>  test/box/admin.result           |  2 +
>  test/box/cfg.result             |  4 ++
>  10 files changed, 153 insertions(+), 3 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index b4a1a5e07..ee017c2dc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
>  	return (enum wal_mode) mode;
>  }
>  
> +static void
> +box_check_wal_cleanup_delay(int tmo)
> +{
> +	if (tmo < 0 && tmo != -1) {
> +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> +			  "the value must be either >= 0 or -1");

3. As you could notice, we try not to use C++ exceptions in the new
code. Lets use diag_set(). For example, like the new option
wal_max_queue_size does.

4. Can you move cfg_geti() into this function? I see no reason why
do you need to pass it as a parameter. The function is used in
a single place.

> +	}
> +}
> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
>  		bootstrap(&instance_uuid, &replicaset_uuid,
>  			  &is_bootstrap_leader);
>  	}
> +	gc_delay_unref();

5. Why?

>  	fiber_gc();
>  
>  	bootstrap_journal_guard.is_active = false;
> diff --git a/src/box/gc.c b/src/box/gc.c
> index 9af4ef958..595b98bdf 100644
> --- a/src/box/gc.c
> +++ b/src/box/gc.c
> @@ -107,6 +108,25 @@ gc_init(void)
>  	/* Don't delete any files until recovery is complete. */
>  	gc.min_checkpoint_count = INT_MAX;
>  
> +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");

6. Please, pass the options as arguments of gc_init(). Having the
global options disseminated over the sources does not simplify
understanding of which subsystems depend on which options.

That would be consistent with not using cfg_get* functions almost
anywhere except a few big 'main' files, and one sql file, which
is also wrong, but is a mess in many other ways as well.

> +	gc.delay_ref = 0;
> +
> +	if (cfg_geti("replication_anon") != 0) {

7. GC does not really need to know if the instance is anon. This
could be both controlled by wal_cleanup_delay cfg and the caller
of gc_init(): box_cfg_xc(). Otherwise you made GC depend on
replication details in the code, which is not great. We must keep
the dependencies as low as possible.

In box_cfg_xc you can get both wal_cleanup_delay and replication_anon
and check if the anon is set, then you call gc_init(0). If it is
false, you call gc_init(wal_cleanup_delay). GC only knows about some
delay obtained from its owner.

> +		/*
> +		 * Anonymous replicas are read only
> +		 * so no need to keep XLOGs.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else if (gc.wal_cleanup_delay == 0) {
> +		/*
> +		 * The delay is disabled explicitly.
> +		 */
> +		gc.cleanup_is_paused = false;
> +	} else {
> +		say_info("gc: wal/engine cleanup is paused");
> +		gc.cleanup_is_paused = true;
> +	}
> +
>  	vclock_create(&gc.vclock);
>  	rlist_create(&gc.checkpoints);
>  	gc_tree_new(&gc.consumers);
> @@ -238,6 +258,30 @@ static int
>  gc_cleanup_fiber_f(va_list ap)
>  {
>  	(void)ap;
> +
> +	/*
> +	 * Stage 1 (optional): in case if we're booting
> +	 * up with cleanup disabled lets do wait in a
> +	 * separate cycle to minimize branching on stage 2.
> +	 */
> +	if (gc.cleanup_is_paused) {
> +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;

8. Please, call it 'timeout'. 'tmo' looks super weird, and is never
used anywhere except this patch.

9. What if I see the timeout was too big, and I call
box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it on
restart. Can you support a dynamic update? It should be easy. You
need to store the gc deadline instead of the delay. So when you change
the timeout, you can check whether, with the new timeout, the deadline
is already reached, and then can start GC earlier.

If I increase the timeout, GC could be paused again, or simply
ignored, or banned. But the decrease is more important, as I think.
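
To illustrate the scenario I mean, a sketch (assuming the option
becomes dynamic):

``` Lua
 -- set at boot; later turns out to be too big
 box.cfg{wal_cleanup_delay = 4 * 3600}
 -- lowering it at runtime should resume the cleanup right away,
 -- because the stored deadline is already in the past
 box.cfg{wal_cleanup_delay = 0}
```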

> +		while (!fiber_is_cancelled()) {
> +			if (fiber_yield_timeout(tmo)) {
> +				say_info("gc: wal/engine cleanup is resumed "
> +					 "due to timeout expiration");

10. Other gc logs don't have 'gc' prefix. I would propose to follow
it in the new logs too.

> +				gc.cleanup_is_paused = false;
> +				break;
> +			}
> +			if (!gc.cleanup_is_paused)
> +				break;
> +		}
> +	}
> +
> +	/*
> +	 * Stage 2: a regular cleanup cycle.
> +	 */
>  	while (!fiber_is_cancelled()) {
>  		int64_t delta = gc.cleanup_scheduled - gc.cleanup_completed;
>  		if (delta == 0) {
> @@ -253,6 +297,29 @@ gc_cleanup_fiber_f(va_list ap)
>  	return 0;
>  }
>  
> +void
> +gc_delay_ref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref >= 0);
> +		gc.delay_ref++;
> +	}
> +}
> +
> +void
> +gc_delay_unref(void)
> +{
> +	if (gc.cleanup_is_paused) {
> +		assert(gc.delay_ref > 0);
> +		gc.delay_ref--;

11. If it is not paused and GC started earlier due to a
timeout, the user will see 'delay_ref' in box.info even
after all the replicas are connected.

> +		if (gc.delay_ref == 0) {
> +			say_info("gc: wal/engine cleanup is resumed");
> +			gc.cleanup_is_paused = false;
> +			fiber_wakeup(gc.cleanup_fiber);
> +		}
> +	}
> +}
> +
>  /**
>   * Trigger asynchronous garbage collection.
>   */
> @@ -278,9 +345,11 @@ gc_schedule_cleanup(void)
>  static void
>  gc_wait_cleanup(void)
>  {
> -	int64_t scheduled = gc.cleanup_scheduled;
> -	while (gc.cleanup_completed < scheduled)
> -		fiber_cond_wait(&gc.cleanup_cond);
> +	if (!gc.cleanup_is_paused) {
> +		int64_t scheduled = gc.cleanup_scheduled;
> +		while (gc.cleanup_completed < scheduled)
> +			fiber_cond_wait(&gc.cleanup_cond);
> +	}

12. The function is called 'wait_cleanup', but it does not really
wait in case GC is paused. It looks wrong.

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 41f949e8e..5d09690ad 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
>  	}
>  }
>  
> +static void
> +relay_gc_delay_unref(struct cmsg *msg)
> +{
> +	(void)msg;
> +	gc_delay_unref();

13. You don't need a separate callback, and don't need to call it
from the relay thread. Relay already works with GC - it calls
gc_consumer_register() before starting the thread. You can do the
unref in the same place. Starting from the consumer registration
the logs are going to be kept anyway.

Looks like it would be simpler if it works.

> +}
> +
>  /**
>   * A libev callback invoked when a relay client socket is ready
>   * for read. This currently only happens when the client closes
> @@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
>  	 */
>  	relay_send_heartbeat(relay);
>  
> +	/*
> +	 * Now we can resume wal/engine gc as relay
> +	 * is up and running.
> +	 */
> +	if (!relay->replica->anon) {
> +		static const struct cmsg_hop gc_delay_route[] = {
> +			{relay_gc_delay_unref, NULL}
> +		};
> +		struct cmsg gc_delay_msg;
> +		cmsg_init(&gc_delay_msg, gc_delay_route);
> +		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
> +	}
> +
>  	/*
>  	 * Run the event loop until the connection is broken
>  	 * or an error occurs.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
@ 2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-19 12:14     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-18 23:04 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson

Thanks for the patch!

See 8 comments below.

On 18.03.2021 19:41, Cyrill Gorcunov via Tarantool-patches wrote:
> Part-of #5806
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  test/replication/gh-5806-master.lua           |  13 +
>  test/replication/gh-5806-slave.lua            |  13 +
>  test/replication/gh-5806-xlog-cleanup.result  | 336 ++++++++++++++++++
>  .../replication/gh-5806-xlog-cleanup.test.lua | 148 ++++++++
>  4 files changed, 510 insertions(+)
>  create mode 100644 test/replication/gh-5806-master.lua
>  create mode 100644 test/replication/gh-5806-slave.lua
>  create mode 100644 test/replication/gh-5806-xlog-cleanup.result
>  create mode 100644 test/replication/gh-5806-xlog-cleanup.test.lua
> 
> diff --git a/test/replication/gh-5806-master.lua b/test/replication/gh-5806-master.lua
> new file mode 100644
> index 000000000..0404965d3
> --- /dev/null
> +++ b/test/replication/gh-5806-master.lua
> @@ -0,0 +1,13 @@
> +#!/usr/bin/env tarantool
> +
> +require('console').listen(os.getenv('ADMIN'))
> +
> +function func_xlog_snap(space, value)

1. This fails luacheck (which I don't like having in the tests,
but still we must keep it green):

Checking test/replication/gh-5806-master.lua      1 warning

    test/replication/gh-5806-master.lua:5:10: (W111) setting non-standard global variable func_xlog_snap

Checking test/replication/gh-5806-slave.lua       1 warning

    test/replication/gh-5806-slave.lua:5:10: (W111) setting non-standard global variable func_xlog_snap

Also why does it have 'func' prefix? It is a function
obviously, we don't add 'func' prefix to all functions.

> +    space:insert(value)
> +    box.snapshot()
> +end
> diff --git a/test/replication/gh-5806-xlog-cleanup.result b/test/replication/gh-5806-xlog-cleanup.result
> new file mode 100644
> index 000000000..97355a8bf
> --- /dev/null
> +++ b/test/replication/gh-5806-xlog-cleanup.result
> @@ -0,0 +1,336 @@
> +-- test-run result file version 2
> +--
> +-- gh-5806: defer xlog cleanup to keep xlogs until
> +-- replicas present in "_cluster" are connected.
> +-- Otherwise we are getting XlogGapError since
> +-- master might go far forwad from replica and

2. forwad -> forward.

> +-- replica won't be able to connect without full
> +-- rebootstrap.
> +--
> +
> +fiber = require('fiber')
> + | ---
> + | ...
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +
> +--
> +-- Case 1.
> +--
> +-- First lets make sure we're getting XlogGapError in
> +-- case if wal_cleanup_delay is not used.
> +--
> +
> +test_run:cmd('create server master with script="replication/gh-5806-master.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master with wait=True, wait_load=True')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +box.schema.user.grant('guest', 'replication')
> + | ---
> + | ...
> +
> +--
> +-- Keep small number of snaps to force cleanup
> +-- procedure be more intensive.
> +box.cfg{checkpoint_count = 1}
> + | ---
> + | ...
> +
> +engine = test_run:get_cfg('engine')
> + | ---
> + | ...
> +s = box.schema.space.create('test', {engine = engine})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('create server replica with rpl_master=master,\
> +              script="replication/gh-5806-slave.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica with wait=True, wait_load=True')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.cfg{checkpoint_count = 1}
> + | ---
> + | ...
> +s = box.schema.space.create('testtemp', {temporary = true})
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end

3. Honestly, it would look much simpler if it would be just 4
lines with 2 inserts and 2 snapshots.
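
I.e. roughly this, just spelled out:

``` Lua
 box.space.testtemp:insert{1}
 box.snapshot()
 box.space.testtemp:insert{2}
 box.snapshot()
```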

4. Why do you do rw requests both on the replica and master?
And why do you need 2 spaces?

> + | ---
> + | ...
> +
> +--
> +-- Stop the replica node and generate
> +-- first range of xlogs on the master.
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +for i=1,2 do func_xlog_snap(box.space.test, {i}) end
> + | ---
> + | ...
> +
> +--
> +-- Restart the masted and generate the

5. masted -> master.

> +-- next range of xlogs.
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master with wait_load=True')

6. Does 'restart server master' command work?

> + | ---
> + | - true
> + | ...
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +for i=3,4 do func_xlog_snap(box.space.test, {i}) end
> + | ---
> + | ...
> +
> +--
> +-- Restart master node and the replica then.

7. Why do you need to restart the master 2 times?

> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server master')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server master with wait_load=True')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica with wait=False, wait_load=False')
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Wait error to appear.
> +while test_run:grep_log("master", "XlogGapError") == nil do fiber.sleep(0.01) end

8. We have test_run:wait_log().
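
For example, something like this should do (the exact signature is
from memory, so treat it as a sketch):

``` Lua
 -- wait up to 100 seconds for the message in the master's log
 test_run:wait_log('master', 'XlogGapError', nil, 100)
```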

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
  2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-19 11:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: Mons Anderson, tml

On Fri, Mar 19, 2021 at 12:04:09AM +0100, Vladislav Shpilevoy wrote:
> 
> AFAIU, we go for the timeout option for the only reason that
> there might be non-fullmesh topologies, when a replica is in
> _cluster, but does not have relays to other nodes.

Exactly.

> If we would know the topology, we could tell which nodes will
> relay to who, and would be able to detect when a replica needs
> to keep the logs, and when don't. Not even the entire topology.
> Just the info from the adjacent nodes. From our own box.cfg.replication,
> and from box.cfg.replication of these nodes.
> 
> I still don't understand what was wrong with implementing some kind
> of topology discovery for this 2-level topology tree. For instance
> when applier on a replica is connected, the remote node sends us a
> flag whether it is going to connect back. If the flag is false, we don't
> keep the logs for that instance.

There is nothing wrong with it and I think we should do it. Could you
please elaborate on the details? You mean to extend the applier protocol
data, so that it would send not only the vclock but also a flag telling
whether it is going to set up a relay?

> See 13 comments below.
> 
> > @TarantoolBot document
> > Title: Add wal_cleanup_delay configuration parameter
> 
> 1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
> with the existing 'gc' name, which is already exposed in box.info.gc.
> It is not called box.info.cleanup, so I would propose to use 'gc'
> here too.
> 
> Another issue I described in the first patchset's discussion. It is
> not really a delay. Because it is simply ignored when the replication
> feels like it. It must either have 'replication' prefix to designate
> it is not just a fixated timeout to keep the WALs and it depends on
> the replication, or at least it must mention "max". To designate it
> is not the exact strict timeout for keeping the logs.

Vlad, personally I don't mind naming this option whatever you like,
just gimme a name and I'll use it.

> > The `wal_cleanup_delay` option defines a delay in seconds
> > before write ahead log files (`*.xlog`) are getting started
> > to prune upon a node restart.
> > 
> > This option is isnored in case if a node is running as
> 
> 2. isnored -> ignored.

Ouch, I happened to miss this while running aspell. Thanks!

> > 
> > Current state of `*.xlog` garbage collector can be found in
> > `box.info.gc()` output. For example
> > 
> > ``` Lua
> >  tarantool> box.info.gc()
> >  ---
> >    ...
> >    cleanup_is_paused: false
> 
> 3. 'gc.cleanup' is a tautology. If you want to expose the
> flag, it better be simply 'is_paused' IMO.

OK

> >    delay_ref: 0
> 
> 4. Can't the user get the same by looking at
> 
> 	_cluster size - box.info.gc().consumers count
> 
> ?

I would prefer to keep this as a separate output,
simply because I need an internal variable for the gc
code itself, and reading it is way more convenient
than fetching the _cluster size and the consumers and
doing some math afterwards. Moreover, I need a way to
find out the _internal_ counter value for debugging.

> > +static void
> > +box_check_wal_cleanup_delay(int tmo)
> > +{
> > +	if (tmo < 0 && tmo != -1) {
> > +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> > +			  "the value must be either >= 0 or -1");
> 
> 3. As you could notice, we try not to use C++ exceptions in the new
> code. Lets use diag_set(). For example, like the new option
> wal_max_queue_size does.

OK

> 
> 4. Can you move cfg_geti() into this function? I see no reason why
> do you need to pass it as a parameter. The function is used in
> a single place.

Sure

> > +	}
> > +}
> > @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
> >  		bootstrap(&instance_uuid, &replicaset_uuid,
> >  			  &is_bootstrap_leader);
> >  	}
> > +	gc_delay_unref();
> 
> 5. Why?

For the case where you don't have replicas at all: you need
to unref yourself from the counting so that the counter shrinks
to zero and gc gets enabled.

> 
> >  	fiber_gc();
> >  
> >  	bootstrap_journal_guard.is_active = false;
> > diff --git a/src/box/gc.c b/src/box/gc.c
> > index 9af4ef958..595b98bdf 100644
> > --- a/src/box/gc.c
> > +++ b/src/box/gc.c
> > @@ -107,6 +108,25 @@ gc_init(void)
> >  	/* Don't delete any files until recovery is complete. */
> >  	gc.min_checkpoint_count = INT_MAX;
> >  
> > +	gc.wal_cleanup_delay = cfg_geti("wal_cleanup_delay");
> 
> 6. Please, pass the options as arguments of gc_init(). Having the
> global options disseminated over the sources does not simplify
> understanding of which subsystems depend on which options.
> 
> That would be consistent with not using cfg_get* functions almost
> anywhere except a few big 'main' files, and one sql file, which
> is also wrong, but is a mess in many other ways as well.

OK

> 
> > +	gc.delay_ref = 0;
> > +
> > +	if (cfg_geti("replication_anon") != 0) {
> 
> 7. GC does not really need to know if the instance is anon. This
> could be both controlled by wal_cleanup_delay cfg and the caller
> of gc_init(): box_cfg_xc(). Otherwise you made GC depend on
> replication details in the code, which is not great. We must keep
> the dependencies as low as possible.
> 
> In box_cfg_xc you can get both wal_cleanup_delay and replication_anon
> and check if the anon is set, then you call gc_init(0). If it is
> false, you call gc_init(wal_cleanup_delay). GC only knows about some
> delay obtained from its owner.

OK

> > +
> > +	/*
> > +	 * Stage 1 (optional): in case if we're booting
> > +	 * up with cleanup disabled lets do wait in a
> > +	 * separate cycle to minimize branching on stage 2.
> > +	 */
> > +	if (gc.cleanup_is_paused) {
> > +		ev_tstamp tmo = gc.wal_cleanup_delay == -1 ?
> > +			TIMEOUT_INFINITY : gc.wal_cleanup_delay;
> 
> 8. Please, call it 'timeout'. 'tmo' looks super weird, and is never
> used anywhere except this patch.

TMO is a well-known abbreviation in the network stack, but sure, I will rename it.

> 
> 9. What if I see the timeout was too big, and I call
> box.cfg{wal_cleanup_delay = 0}? It seems I can only reduce it on
> restart. Can you support a dynamic update? It should be easy. You
> need to store the gc deadline instead of the delay. So when you change
> the timeout, you can check whether, with the new timeout, the deadline
> is already reached, and then can start GC earlier.

Thanks, will make it dynamic.

> If I increase the timeout, GC could be paused again, or simply
> ignored, or banned. But the decrease is more important, as I think.
> 
> > +		while (!fiber_is_cancelled()) {
> > +			if (fiber_yield_timeout(tmo)) {
> > +				say_info("gc: wal/engine cleanup is resumed "
> > +					 "due to timeout expiration");
> 
> 10. Other gc logs don't have 'gc' prefix. I would propose to follow
> it in the new logs too.

This makes grepping the logs easier, but sure, I will drop it.

> > +void
> > +gc_delay_unref(void)
> > +{
> > +	if (gc.cleanup_is_paused) {
> > +		assert(gc.delay_ref > 0);
> > +		gc.delay_ref--;
> 
> 11. If it is not paused and GC started earlier due to a
> timeout, the user will see 'delay_ref' in box.info even
> after all the replicas are connected.

Yes, and this is going to be a sign that we exited due to the
timeout. Such output should not be treated as an error,
but if you prefer I can zap the counter for this case.

> > +	if (!gc.cleanup_is_paused) {
> > +		int64_t scheduled = gc.cleanup_scheduled;
> > +		while (gc.cleanup_completed < scheduled)
> > +			fiber_cond_wait(&gc.cleanup_cond);
> > +	}
> 
> 12. The function is called 'wait_cleanup', but it does not really
> wait in case GC is paused. It looks wrong.

Hard to say. The paused state is rather an internal state, and
when someone is waiting for the cleanup to complete but the cleanup
is turned off, then I consider it pretty natural to exit
immediately. And I moved this "if" into the function itself
simply because we may use this helper in the future and we should
not get stuck forever in "waiting" if the cleanup is disabled.

I can move this "if" to the caller side though, if you prefer.

> >  
> > +static void
> > +relay_gc_delay_unref(struct cmsg *msg)
> > +{
> > +	(void)msg;
> > +	gc_delay_unref();
> 
> 13. You don't need a separate callback, and don't need to call it
> from the relay thread. Relay already works with GC - it calls
> gc_consumer_register() before starting the thread. You can do the
> unref in the same place. Starting from the consumer registration
> the logs are going to be kept anyway.
> 
> Looks like it would be simpler if it works.

Actually, initially I did it this way, right before creating
the relay thread. But I think this is not good, and here is why:
when the relay is starting it may fail in a number of places
(the thread itself is not created; the thread is created but
then fiber creation fails with an exception), and I think we
should decrement the reference only when we are pretty sure that
there won't be new errors inside the relay cycle.

What would happen when, say, the relay fiber triggers an
error? iproto will write an error and, as far as I understand,
the replica will try to reconnect. Thus we should keep the
logs until the relay is subscribed for sure.

	Cyrill

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-19 12:14     ` Cyrill Gorcunov via Tarantool-patches
  2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-19 12:14 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml, Mons Anderson

On Fri, Mar 19, 2021 at 12:04:11AM +0100, Vladislav Shpilevoy wrote:
> > +
> > +function func_xlog_snap(space, value)
> 
> 1. This fails luacheck (which I don't like having in the tests,
> but still we must keep it green):
> 
> Checking test/replication/gh-5806-master.lua      1 warning
> 
>     test/replication/gh-5806-master.lua:5:10: (W111) setting non-standard global variable func_xlog_snap
> 
> Checking test/replication/gh-5806-slave.lua       1 warning
> 
>     test/replication/gh-5806-slave.lua:5:10: (W111) setting non-standard global variable func_xlog_snap

Thanks, Vlad! Actually I don't understand what this warning means. Should I make
these functions local or what?

> 
> Also why does it have 'func' prefix? It is a function
> obviously, we don't add 'func' prefix to all functions.

OK

> > +--
> > +-- gh-5806: defer xlog cleanup to keep xlogs until
> > +-- replicas present in "_cluster" are connected.
> > +-- Otherwise we are getting XlogGapError since
> > +-- master might go far forwad from replica and
> 
> 2. forwad -> forward.

Thanks!

> > + | ---
> > + | ...
> > +for i=1,2 do func_xlog_snap(box.space.testtemp, {i}) end
> 
> 3. Honestly, it would look much simpler if it would be just 4
> lines with 2 inserts and 2 snapshots.

Sure, I can make it so.

> 
> 4. Why do you do rw requests both on the replica and master?

To intermix the data so that the LSNs would be different.

> And why do you need 2 spaces?

This allows writing xlog data specifically on the replica, and
only in this case did I manage to reach XlogGapError (otherwise
the replica reported that it was too old and restarted the rejoin
procedure).

> > +
> > +--
> > +-- Restart the masted and generate the
> 
> 5. masted -> master.

Thanks:)

> 
> > +-- next range of xlogs.
> > +test_run:switch('default')
> > + | ---
> > + | - true
> > + | ...
> > +test_run:cmd('stop server master')
> > + | ---
> > + | - true
> > + | ...
> > +test_run:cmd('start server master with wait_load=True')
> 
> 6. Does 'restart server master' command work?

I must confess I didn't know about this command. Will try.

> > +--
> > +-- Restart master node and the replica then.
> 
> 7. Why do you need to restart the master 2 times?

To make sure the GC passed a few iterations. Maybe
we could manage with one as well. Thanks for the
point, I will try.

>> > +--
> > +-- Wait error to appear.
> > +while test_run:grep_log("master", "XlogGapError") == nil do fiber.sleep(0.01) end
> 
> 8. We have test_run:wait_log().

Didn't know, thanks!

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
  2021-03-19 13:57       ` Konstantin Osipov via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-03-19 13:40 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Cyrill Gorcunov, tml; +Cc: Mons Anderson



19.03.2021 02:04, Vladislav Shpilevoy пишет:
> Hi! Thanks for the patch!
>
> Generally looks fine except details.
>
>
> AFAIU, we go for the timeout option for the only reason that
> there might be non-fullmesh typologies, when a replica is in
> _cluster, but does not have relays to other nodes.
>
> If we would know the topology, we could tell which nodes will
> relay to who, and would be able to detect when a replica needs
> to keep the logs, and when don't. Not even the entire topology.
> Just the info from the adjacent nodes. From our own box.cfg.replication,
> and from box.cfg.replication of these nodes.
>
> I still don't understand what was wrong with implementing some kind
> of topology discovery for this 2-level topology tree. For instance
> when applier on a replica is connected, the remote node sends us a
> flag whether it is going to connect back. If the flag is false, we don't
> keep the logs for that instance.

IMO this topology discovery is not so easy to implement. And we've chosen
this particular fix approach instead of 'persistent GC state' for its
simplicity.

How do you know whether a node is going to connect back? It has a
corresponding entry in box.cfg.replication, sure, but how does it understand
that this entry (URI) corresponds to the replica that has just connected?
IIRC we had similar problems with the inability to tell who's who judging
solely by the URI in some other fix. I don't remember which one exactly.

Moreover, you may have something strange like a cascade topology.
Say, there are 1 <- 2 <- 3 servers, with arrows showing replica-to-master
connection. When 2 is up, how can it understand that it should wait for 3?

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-18 18:41 ` [Tarantool-patches] [PATCH 1/2] " Cyrill Gorcunov via Tarantool-patches
  2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
  2021-03-19 15:14     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 14+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-03-19 13:50 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Mons Anderson, Vladislav Shpilevoy


18.03.2021 21:41, Cyrill Gorcunov пишет:
> In case if replica managed to be far behind the master node
> (so there are a number of xlog files present after the last

...

>   src/box/replication.cc          |  2 +
>   test/app-tap/init_script.result |  1 +
>   test/box/admin.result           |  2 +
>   test/box/cfg.result             |  4 ++
>   10 files changed, 153 insertions(+), 3 deletions(-)

Hi!  Thanks for the patch!

Please find 2 minor comments below.

>
> diff --git a/src/box/box.cc b/src/box/box.cc
> index b4a1a5e07..ee017c2dc 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -754,6 +754,15 @@ box_check_wal_mode(const char *mode_name)
>   	return (enum wal_mode) mode;
>   }
>   
> +static void
> +box_check_wal_cleanup_delay(int tmo)
> +{
> +	if (tmo < 0 && tmo != -1) {
> +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> +			  "the value must be either >= 0 or -1");
> +	}
> +}
> +


1. AFAICS all other delay and timeout parameters are 'double'.
    Let's make this one also a double.
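
    I.e. something like this, I suppose (a sketch; note that the exact
    comparison with -1 then becomes a floating-point equality check, which
    is fine for a literal -1 but worth keeping in mind):

    static void
    box_check_wal_cleanup_delay(double delay)
    {
        if (delay < 0 && delay != -1) {
            tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
                      "the value must be either >= 0 or -1");
        }
    }

    /* ... and in box_check_config(): */
    box_check_wal_cleanup_delay(cfg_getd("wal_cleanup_delay"));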

>   static void
>   box_check_readahead(int readahead)
>   {
> @@ -899,6 +908,7 @@ box_check_config(void)
>   	box_check_checkpoint_count(cfg_geti("checkpoint_count"));
>   	box_check_wal_max_size(cfg_geti64("wal_max_size"));
>   	box_check_wal_mode(cfg_gets("wal_mode"));
> +	box_check_wal_cleanup_delay(cfg_geti("wal_cleanup_delay"));
>   	if (box_check_memory_quota("memtx_memory") < 0)
>   		diag_raise();
>   	box_check_memtx_min_tuple_size(cfg_geti64("memtx_min_tuple_size"));
> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
>   		bootstrap(&instance_uuid, &replicaset_uuid,
>   			  &is_bootstrap_leader);
>   	}
> +	gc_delay_unref();
>   	fiber_gc();
>   
>   	bootstrap_journal_guard.is_active = false;

...

> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 41f949e8e..5d09690ad 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -668,6 +668,13 @@ relay_send_is_raft_enabled(struct relay *relay,
>   	}
>   }
>   
> +static void
> +relay_gc_delay_unref(struct cmsg *msg)
> +{
> +	(void)msg;
> +	gc_delay_unref();
> +}
> +
>   /**
>    * A libev callback invoked when a relay client socket is ready
>    * for read. This currently only happens when the client closes
> @@ -721,6 +728,19 @@ relay_subscribe_f(va_list ap)
>   	 */
>   	relay_send_heartbeat(relay);
>   
> +	/*
> +	 * Now we can resume wal/engine gc as relay
> +	 * is up and running.
> +	 */
> +	if (!relay->replica->anon) {
> +		static const struct cmsg_hop gc_delay_route[] = {
> +			{relay_gc_delay_unref, NULL}
> +		};
> +		struct cmsg gc_delay_msg;
> +		cmsg_init(&gc_delay_msg, gc_delay_route);
> +		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
> +	}
> +

2. I agree with Vlad here. No need to call gc_delay_unref() from the
    relay thread. We need to wait until a corresponding gc consumer is
    registered. This is done in relay_subscribe(), which is in tx thread.

    In fact, you may as well move the gc_delay_unref() call directly to
    gc_consumer_register(). This would look even better, IMO.
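
    Roughly like this (a sketch only; note that gc_consumer_register()
    also serves non-relay consumers, e.g. backup, so whether the unref
    belongs in such a generic place is debatable):

    struct gc_consumer *
    gc_consumer_register(const struct vclock *vclock, const char *format, ...)
    {
        /* ... existing registration code ... */

        /*
         * The consumer pins the logs from here on, so the
         * "wait for relays" delay reference can be dropped.
         */
        gc_delay_unref();
        return consumer;
    }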

>   	/*
>   	 * Run the event loop until the connection is broken
>   	 * or an error occurs.
> diff --git a/src/box/replication.cc b/src/box/replication.cc
> index 1fa8843e7..aefb812b3 100644
> --- a/src/box/replication.cc
> +++ b/src/box/replication.cc
> @@ -250,6 +250,7 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
>   						   tt_uuid_str(&replica->uuid));
>   	}
>   	replicaset.replica_by_id[replica_id] = replica;
> +	gc_delay_ref();
>   	++replicaset.registered_count;
>   	say_info("assigned id %d to replica %s",
>   		 replica->id, tt_uuid_str(&replica->uuid));
> @@ -273,6 +274,7 @@ replica_clear_id(struct replica *replica)
>   	replicaset.replica_by_id[replica->id] = NULL;
>   	assert(replicaset.registered_count > 0);
>   	--replicaset.registered_count;
> +	gc_delay_unref();
>   	if (replica->id == instance_id) {
>   		/* See replica_check_id(). */
>   		assert(replicaset.is_joining);
>
...

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
@ 2021-03-19 13:57       ` Konstantin Osipov via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Konstantin Osipov via Tarantool-patches @ 2021-03-19 13:57 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Mons Anderson, tml, Vladislav Shpilevoy

* Serge Petrenko <sergepetrenko@tarantool.org> [21/03/19 16:42]:
> IMO this topology discovery is not so easy to implement. And we've chosen
> this
> particular fix approach instead of 'persistent GC state' for its simplicity.

The topology discovery was necessary before there was raft. In
raft, one need not discover the topology: there is a configuration
change log entry for a topology change. Oops, there are no configuration
changes in tarantool raft...

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
@ 2021-03-19 15:14     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-03-19 15:14 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Mons Anderson, tml, Vladislav Shpilevoy

On Fri, Mar 19, 2021 at 04:50:38PM +0300, Serge Petrenko wrote:
> > +static void
> > +box_check_wal_cleanup_delay(int tmo)
> > +{
> > +	if (tmo < 0 && tmo != -1) {
> > +		tnt_raise(ClientError, ER_CFG, "wal_cleanup_delay",
> > +			  "the value must be either >= 0 or -1");
> > +	}
> > +}
> > +
> 
> 
> 1. AFAICS all other delay and timeout parameters are 'double'.
>    Let's make this one also a double.

Sure

> > +	/*
> > +	 * Now we can resume wal/engine gc as relay
> > +	 * is up and running.
> > +	 */
> > +	if (!relay->replica->anon) {
> > +		static const struct cmsg_hop gc_delay_route[] = {
> > +			{relay_gc_delay_unref, NULL}
> > +		};
> > +		struct cmsg gc_delay_msg;
> > +		cmsg_init(&gc_delay_msg, gc_delay_route);
> > +		cpipe_push(&relay->tx_pipe, &gc_delay_msg);
> > +	}
> > +
> 
> 2. I agree with Vlad here. No need to call gc_delay_unref() from the relay
> thread.
>    We need to wait until a corresponding gc consumer is registered. This is
> done
>    in relay_subscribe(), which is in tx thread.
> 
>    In fact, you may as well move the gc_delay_unref() call directly to
> gc_consumer_register().
>    This would look even better, IMO.

I don't mind moving it inside gc_consumer_register() itself. Still, this
bothers me a bit: gc_consumer_register() might be used not only by
relays, since it is a general API; would it be OK in the long term?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
  2021-03-22  9:05         ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 14+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-19 22:17 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: Mons Anderson, tml

>> If we would know the topology, we could tell which nodes will
>> relay to who, and would be able to detect when a replica needs
>> to keep the logs, and when don't. Not even the entire topology.
>> Just the info from the adjacent nodes. From our own box.cfg.replication,
>> and from box.cfg.replication of these nodes.
>>
>> I still don't understand what was wrong with implementing some kind
>> of topology discovery for this 2-level topology tree. For instance
>> when applier on a replica is connected, the remote node sends us a
>> flag whether it is going to connect back. If the flag is false, we don't
>> keep the logs for that instance.
> 
> There is nothing wrong with it, and I think we should do it. Could you
> please elaborate on the details? You mean to extend the applier protocol
> data, so that it would send not only the vclock but also a flag telling
> whether it is going to set up a relay?

Nevermind, it won't work for certain topologies. The only way is to
have it in _cluster, and make _cluster synchronous space. But this is
a big separate task, so probably timeout is fine for now.

>> See 13 comments below.
>>
>>> @TarantoolBot document
>>> Title: Add wal_cleanup_delay configuration parameter
>>
>> 1. There are 2 issues with the name. Firstly, 'cleanup' conflicts
>> with the existing 'gc' name, which is already exposed in box.info.gc.
>> It is not called box.info.cleanup, so I would propose to use 'gc'
>> here too.
>>
>> Another issue I described in the first patchset's discussion. It is
>> not really a delay. Because it is simply ignored when the replication
>> feels like it. It must either have 'replication' prefix to designate
>> it is not just a fixated timeout to keep the WALs and it depends on
>> the replication, or at least it must mention "max". To designate it
>> is not the exact strict timeout for keeping the logs.
> 
> Vlad, personally I don't mind naming this option whatever you like,
> just gimme a name and I'll use it.

I still don't like the name, but I am outvoted. We won't even add
'max' to it. The current name stays, unfortunately.

>>> +	}
>>> +}
>>> @@ -3045,6 +3055,7 @@ box_cfg_xc(void)
>>>  		bootstrap(&instance_uuid, &replicaset_uuid,
>>>  			  &is_bootstrap_leader);
>>>  	}
>>> +	gc_delay_unref();
>>
>> 5. Why?
> 
> For the case where you don't have replicas at all: you need
> to unref yourself from the counting, so the counter shrinks to
> zero, enabling gc.

Let's add a comment about this.
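
E.g. something along these lines (the wording is just a suggestion):

    /*
     * The instance itself is registered in _cluster and thus
     * holds one delay reference, but it obviously does not
     * relay to itself. Drop that reference here, so that an
     * instance with no replicas does not keep the cleanup
     * paused until the wal_cleanup_delay timeout fires.
     */
    gc_delay_unref();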

>>> +void
>>> +gc_delay_unref(void)
>>> +{
>>> +	if (gc.cleanup_is_paused) {
>>> +		assert(gc.delay_ref > 0);
>>> +		gc.delay_ref--;
>>
>> 11. If it is not paused and GC started earlier due to a
>> timeout, the user will see 'delay_ref' in box.info even
>> after all the replicas are connected.
> 
> Yes, and this is going to be a sign that we exited due to the
> timeout. Such output should not be treated as an error,
> but if you prefer I can zap the counter for this case.

I would prefer not to have delay_ref in the public
monitoring at all, but you should ask Mons about it now.

>>> +	if (!gc.cleanup_is_paused) {
>>> +		int64_t scheduled = gc.cleanup_scheduled;
>>> +		while (gc.cleanup_completed < scheduled)
>>> +			fiber_cond_wait(&gc.cleanup_cond);
>>> +	}
>>
>> 12. The function is called 'wait_cleanup', but it does not really
>> wait in case GC is paused. It looks wrong.
> 
> Hard to say. The paused state is rather an internal state, and
> when someone is waiting for a cleanup to complete but cleanup
> is turned off, I consider it pretty natural to exit
> immediately. I moved this "if" into the function itself
> simply because we may use this helper in the future, and we should
> not get stuck forever "waiting" if cleanup is disabled.
>
> I can move this "if" to the caller side though, if you prefer.

Probably this would be better. Because the function is called
'wait cleanup', not 'try wait or return immediately'.

The comment in the place where the function is called says

	it guarantees that by the time box.snapshot()
	returns, all outdated checkpoint files have been removed

It is misleading now.
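
For instance, a sketch of the caller side (which, if I am not mistaken,
is gc_checkpoint() in the current code), assuming the check moves out of
gc_wait_cleanup():

    /*
     * Wait for background garbage collection triggered by this
     * checkpoint to complete, so that by the time box.snapshot()
     * returns all outdated files have been removed. When cleanup
     * is intentionally paused there is nothing to wait for.
     */
    if (!gc.cleanup_is_paused)
        gc_wait_cleanup();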

>>> +static void
>>> +relay_gc_delay_unref(struct cmsg *msg)
>>> +{
>>> +	(void)msg;
>>> +	gc_delay_unref();
>>
>> 13. You don't need a separate callback, and don't need to call it
>> from the relay thread. Relay already works with GC - it calls
>> gc_consumer_register() before starting the thread. You can do the
>> unref in the same place. Starting from the consumer registration
>> the logs are going to be kept anyway.
>>
>> Looks like it would be simpler if it works.
> 
> Actually, initially I did it this way, right before creating
> the relay thread. But I think this is not good, and here is why:
> when the relay is starting it may fail in a number of places
> (the thread itself is not created; the thread is created but
> then fiber creation fails with an exception), and I think we
> should decrement the reference only when we are pretty sure that
> there won't be new errors inside the relay cycle.

Why? New errors won't break anything. They can also happen after
you do the unref in your patch.

> What would happen when, say, the relay fiber triggers an
> error? iproto will write an error and, as far as I understand,
> the replica will try to reconnect. Thus we should keep the
> logs until the relay is subscribed for sure.

GC consumer keeps the logs while the replica tries to reconnect.

But what I don't understand now - how does it work if the struct
replica objects create the consumer objects right in replica_set_id()?

Look at relay_subscribe(). If the replica is not anon, then
'replica->id != REPLICA_ID_NIL' is true (because there is an assertion).
It means replica_set_id() was already called. And this means replica->gc
is already not NULL. Therefore the check "replica->gc == NULL && !replica->anon"
is never true. Am I missing something?

Doesn't it mean all the _cluster nodes have a GC consumer on the
current node, and nothing is deleted until all the nodes connect?
Regardless of your patch. I feel like I am missing something important,
because before the patch we somehow get xlog gap errors, which means
the consumers are dropped somewhere. Can you investigate?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option
  2021-03-19 12:14     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-03-19 22:17 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml, Mons Anderson

On 19.03.2021 13:14, Cyrill Gorcunov via Tarantool-patches wrote:
> On Fri, Mar 19, 2021 at 12:04:11AM +0100, Vladislav Shpilevoy wrote:
>>> +
>>> +function func_xlog_snap(space, value)
>>
>> 1. This fails luacheck (which I don't like having in the tests,
>> but still we must keep it green):
>>
>> Checking test/replication/gh-5806-master.lua      1 warning
>>
>>     test/replication/gh-5806-master.lua:5:10: (W111) setting non-standard global variable func_xlog_snap
>>
>> Checking test/replication/gh-5806-slave.lua       1 warning
>>
>>     test/replication/gh-5806-slave.lua:5:10: (W111) setting non-standard global variable func_xlog_snap
> 
> Thanks, Vlad! Actually I don't understand what this warning means. Should I make
> these functions local or what?

Yeah, we can't declare global variables/functions. When you want
a function, you must either assign it to an index in _G, or have
it in a separate module and add it via 'require()' in the test's
body. But it should not matter now if you delete this function.

>> 4. Why do you do rw requests both on the replica and master?
> 
> To intermix the data so that the LSNs would be different.

Why should they be different (all of this had better be explained in the comments)?

>> And why do you need 2 spaces?
> 
> This allows writing xlog data specifically on the replica, and
> only in this case did I manage to reach XlogGapError (otherwise
> the replica reported that it was too old and restarted the rejoin
> procedure).

Aha, so it can't rejoin automatically, because this would lead to
data loss. I see now. Please, document it in the test.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/2] gc/xlog: delay xlog cleanup until relays are subscribed
  2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-03-22  9:05         ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 14+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-03-22  9:05 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Cyrill Gorcunov; +Cc: Mons Anderson, tml



20.03.2021 01:17, Vladislav Shpilevoy пишет:
> But what I don't understand now - how does it work if the struct
> replica objects create the consumer objects right in replica_set_id()?
>
> Look at relay_subscribe(). If the replica is not anon, then
> 'replica->id != REPLICA_ID_NIL' is true (because there is an assertion).
> It means replica_set_id() was already called. And this means replica->gc
> is already not NULL. Therefore the check "replica->gc == NULL && !replica->anon"
> is never true. Am I missing something?
replica_set_id() only sets a gc consumer for a previously anonymous replica,
which has just registered. This is likely not needed at all, because
box_process_register() already creates a gc consumer and later replaces the
replica's gc consumer, if any, with the freshly created one. So this piece
of code in replica_set_id() is extraneous.

It only makes sense when an anonymous replica was connected to some
instance, then registered on another instance, and the first instance
received the anonymous replica's _cluster registration via replication.
This is still unneeded, because once a previously anonymous replica
resubscribes, it'll get a gc consumer like everyone else does.

This is quite misleading, indeed. It took me a while to understand
what's happening. I'll probably return with a patch removing this piece
of replica_set_id() altogether.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-03-22  9:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-18 18:41 [Tarantool-patches] [PATCH 0/2] gc/xlog: delay xlog cleanup until relays are subscribed Cyrill Gorcunov via Tarantool-patches
2021-03-18 18:41 ` [Tarantool-patches] [PATCH 1/2] " Cyrill Gorcunov via Tarantool-patches
2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-19 11:03     ` Cyrill Gorcunov via Tarantool-patches
2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
2021-03-22  9:05         ` Serge Petrenko via Tarantool-patches
2021-03-19 13:40     ` Serge Petrenko via Tarantool-patches
2021-03-19 13:57       ` Konstantin Osipov via Tarantool-patches
2021-03-19 13:50   ` Serge Petrenko via Tarantool-patches
2021-03-19 15:14     ` Cyrill Gorcunov via Tarantool-patches
2021-03-18 18:41 ` [Tarantool-patches] [PATCH 2/2] test: add a test for wal_cleanup_delay option Cyrill Gorcunov via Tarantool-patches
2021-03-18 23:04   ` Vladislav Shpilevoy via Tarantool-patches
2021-03-19 12:14     ` Cyrill Gorcunov via Tarantool-patches
2021-03-19 22:17       ` Vladislav Shpilevoy via Tarantool-patches
