Tarantool development patches archive
* [PATCH 0/6] WAL garbage collection and checkpointing fixes
@ 2018-11-25 13:48 Vladimir Davydov
  2018-11-25 13:48 ` [PATCH 1/6] vclock: allow to use const vclock as search key Vladimir Davydov
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch set fixes a couple of issues related to WAL checkpointing and
garbage collection (see [1, 2]) and does some refactoring necessary for
the implementation of the checkpoint-on-WAL-threshold feature (see [3]).
The branch with the code is available at [4].

[1] https://github.com/tarantool/tarantool/issues/3822
[2] https://github.com/tarantool/tarantool/issues/3830
[3] https://github.com/tarantool/tarantool/issues/1082
[4] https://github.com/tarantool/tarantool/commits/dv/wal-gc-fixes

Vladimir Davydov (6):
  vclock: allow to use const vclock as search key
  engine: pass vclock instead of lsn to collect_garbage callback
  box: do not rotate WAL when replica subscribes
  box: use replicaset.vclock in replica join/subscribe
  wal: separate checkpoint and flush paths
  wal: remove files needed for recovery from backup checkpoints on
    ENOSPC

 src/box/box.cc                                     |  45 +++---
 src/box/engine.c                                   |   9 +-
 src/box/engine.h                                   |   9 +-
 src/box/gc.c                                       |  22 +--
 src/box/gc.h                                       |  14 --
 src/box/memtx_engine.c                             |   5 +-
 src/box/vclock.c                                   |   4 +-
 src/box/vclock.h                                   |   5 +-
 src/box/vinyl.c                                    |  14 +-
 src/box/vy_log.c                                   |  36 ++---
 src/box/vy_log.h                                   |   4 +-
 src/box/wal.c                                      | 171 ++++++++++++---------
 src/box/wal.h                                      |  34 ++--
 test/replication/gc.result                         |   8 +-
 test/replication/gc.test.lua                       |   6 +-
 test/replication/gc_no_space.result                |  51 +++++-
 test/replication/gc_no_space.test.lua              |  27 +++-
 test/replication/show_error_on_disconnect.result   |  12 +-
 test/replication/show_error_on_disconnect.test.lua |  10 +-
 test/replication/sync.result                       |  37 +++++
 test/replication/sync.test.lua                     |  12 ++
 21 files changed, 326 insertions(+), 209 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] vclock: allow to use const vclock as search key
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-26 17:38   ` Konstantin Osipov
  2018-11-25 13:48 ` [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback Vladimir Davydov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

vclockset_psearch() and friends never modify the vclock used as a search
key, but the compiler will not allow passing a const struct vclock to any
of those functions. Fix this.
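For illustration only - these are simplified stand-ins, not Tarantool's
actual rb-tree API - the point is that a psearch-style lookup only reads
its key, so the key parameter can be declared const and callers holding
a const vclock need no cast:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature vclock: the real one tracks an LSN per
 * replica; a single signature is enough to show the point. */
struct vclock {
	long long signature;
};

static long long
vclock_sum(const struct vclock *v)
{
	return v->signature;
}

/* psearch-style lookup over a sorted array: return the greatest
 * element <= key, or NULL.  The key is only read, so it can be
 * passed by const pointer - exactly what rb_gen_ext_key() makes
 * possible for the generated vclockset_psearch(). */
static const struct vclock *
vclockset_psearch_sketch(const struct vclock *set, size_t n,
			 const struct vclock *key)
{
	const struct vclock *match = NULL;
	for (size_t i = 0; i < n; i++) {
		if (vclock_sum(&set[i]) <= vclock_sum(key))
			match = &set[i];
		else
			break;
	}
	return match;
}
```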
---
 src/box/vclock.c | 4 +++-
 src/box/vclock.h | 5 +++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/box/vclock.c b/src/box/vclock.c
index b5eb2800..3c4b3204 100644
--- a/src/box/vclock.c
+++ b/src/box/vclock.c
@@ -206,4 +206,6 @@ vclockset_node_compare(const struct vclock *a, const struct vclock *b)
 	return res;
 }
 
-rb_gen(, vclockset_, vclockset_t, struct vclock, link, vclockset_node_compare);
+rb_gen_ext_key(, vclockset_, vclockset_t, struct vclock, link,
+	       vclockset_node_compare, const struct vclock *,
+	       vclockset_node_compare);
diff --git a/src/box/vclock.h b/src/box/vclock.h
index 111e2916..d90f5697 100644
--- a/src/box/vclock.h
+++ b/src/box/vclock.h
@@ -271,7 +271,8 @@ vclock_compare(const struct vclock *a, const struct vclock *b)
  * @brief vclockset - a set of vclocks
  */
 typedef rb_tree(struct vclock) vclockset_t;
-rb_proto(, vclockset_, vclockset_t, struct vclock);
+rb_proto_ext_key(, vclockset_, vclockset_t, struct vclock,
+		 const struct vclock *);
 
 /**
  * A proximity search in a set of vclock objects.
@@ -283,7 +284,7 @@ rb_proto(, vclockset_, vclockset_t, struct vclock);
  * @return a vclock that <= than \a key
  */
 static inline struct vclock *
-vclockset_match(vclockset_t *set, struct vclock *key)
+vclockset_match(vclockset_t *set, const struct vclock *key)
 {
 	struct vclock *match = vclockset_psearch(set, key);
 	/**
-- 
2.11.0


* [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
  2018-11-25 13:48 ` [PATCH 1/6] vclock: allow to use const vclock as search key Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-26 17:41   ` Konstantin Osipov
  2018-11-25 13:48 ` [PATCH 3/6] box: do not rotate WAL when replica subscribes Vladimir Davydov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

First, this is consistent with other engine callbacks, such as
checkpoint or backup.

Second, a vclock can be used as a search key in a vclock set,
which in turn makes the code more straightforward - see, e.g., how
this patch simplifies vy_log_prev_checkpoint().
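To sketch the simplification (a hypothetical model using a sorted array
of checkpoint signatures in place of the real vclockset): the old code
scanned the whole set linearly, while the new code does a single psearch
plus at most one predecessor step; both compute the same "previous
checkpoint":

```c
#include <assert.h>
#include <stddef.h>

/* Old way: linear scan from the newest checkpoint down to the
 * first one whose signature is strictly less than the key. */
static long long
prev_checkpoint_scan(const long long *set, size_t n, long long key)
{
	long long ret = -1;
	for (size_t i = n; i-- > 0; ) {
		if (set[i] < key) {
			ret = set[i];
			break;
		}
	}
	return ret;
}

/* New way: one psearch (greatest element <= key), then one
 * predecessor step if the search landed exactly on the key. */
static long long
prev_checkpoint_psearch(const long long *set, size_t n, long long key)
{
	size_t pos = 0; /* index of the match, plus one */
	for (size_t i = 0; i < n; i++) {
		if (set[i] <= key)
			pos = i + 1;
		else
			break;
	}
	if (pos == 0)
		return -1;
	if (set[pos - 1] == key) /* landed on the key itself */
		return pos >= 2 ? set[pos - 2] : -1;
	return set[pos - 1];
}
```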
---
 src/box/box.cc         |  2 +-
 src/box/engine.c       |  9 +++++----
 src/box/engine.h       |  9 +++++----
 src/box/gc.c           |  7 ++-----
 src/box/memtx_engine.c |  5 +++--
 src/box/vinyl.c        |  9 ++++-----
 src/box/vy_log.c       | 36 ++++++++++++++++--------------------
 src/box/vy_log.h       |  4 ++--
 src/box/wal.c          | 41 +++++++++++++++++++++++------------------
 src/box/wal.h          |  9 +++++----
 10 files changed, 66 insertions(+), 65 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index ab6658fb..21b84991 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2129,7 +2129,7 @@ box_cfg_xc(void)
 	enum wal_mode wal_mode = box_check_wal_mode(cfg_gets("wal_mode"));
 	if (wal_init(wal_mode, cfg_gets("wal_dir"), wal_max_rows,
 		     wal_max_size, &INSTANCE_UUID, &replicaset.vclock,
-		     vclock_sum(&first_checkpoint->vclock))) {
+		     &first_checkpoint->vclock) != 0) {
 		diag_raise();
 	}
 	gc_set_wal_watcher();
diff --git a/src/box/engine.c b/src/box/engine.c
index 2a30dcdd..409479a3 100644
--- a/src/box/engine.c
+++ b/src/box/engine.c
@@ -148,11 +148,11 @@ engine_abort_checkpoint(void)
 }
 
 int
-engine_collect_garbage(int64_t lsn)
+engine_collect_garbage(const struct vclock *vclock)
 {
 	struct engine *engine;
 	engine_foreach(engine) {
-		if (engine->vtab->collect_garbage(engine, lsn) < 0)
+		if (engine->vtab->collect_garbage(engine, vclock) < 0)
 			return -1;
 	}
 	return 0;
@@ -317,10 +317,11 @@ generic_engine_abort_checkpoint(struct engine *engine)
 }
 
 int
-generic_engine_collect_garbage(struct engine *engine, int64_t lsn)
+generic_engine_collect_garbage(struct engine *engine,
+			       const struct vclock *vclock)
 {
 	(void)engine;
-	(void)lsn;
+	(void)vclock;
 	return 0;
 }
 
diff --git a/src/box/engine.h b/src/box/engine.h
index 5b96c744..87c84e51 100644
--- a/src/box/engine.h
+++ b/src/box/engine.h
@@ -156,7 +156,7 @@ struct engine_vtab {
 	void (*abort_checkpoint)(struct engine *);
 	/**
 	 * Remove files that are not needed to recover
-	 * from checkpoint with @lsn or newer.
+	 * from checkpoint @vclock or newer.
 	 *
 	 * If this function returns a non-zero value, garbage
 	 * collection is aborted, i.e. this method isn't called
@@ -166,7 +166,8 @@ struct engine_vtab {
 	 * fails to delete a snapshot file, because we recover
 	 * checkpoint list by scanning the snapshot directory.
 	 */
-	int (*collect_garbage)(struct engine *engine, int64_t lsn);
+	int (*collect_garbage)(struct engine *engine,
+			       const struct vclock *vclock);
 	/**
 	 * Backup callback. It is supposed to call @cb for each file
 	 * that needs to be backed up in order to restore from the
@@ -337,7 +338,7 @@ void
 engine_abort_checkpoint(void);
 
 int
-engine_collect_garbage(int64_t lsn);
+engine_collect_garbage(const struct vclock *vclock);
 
 int
 engine_backup(const struct vclock *vclock, engine_backup_cb cb, void *cb_arg);
@@ -369,7 +370,7 @@ int generic_engine_begin_checkpoint(struct engine *);
 int generic_engine_wait_checkpoint(struct engine *, const struct vclock *);
 void generic_engine_commit_checkpoint(struct engine *, const struct vclock *);
 void generic_engine_abort_checkpoint(struct engine *);
-int generic_engine_collect_garbage(struct engine *, int64_t);
+int generic_engine_collect_garbage(struct engine *, const struct vclock *);
 int generic_engine_backup(struct engine *, const struct vclock *,
 			  engine_backup_cb, void *);
 void generic_engine_memory_stat(struct engine *, struct engine_memory_stat *);
diff --git a/src/box/gc.c b/src/box/gc.c
index cefe1553..55c36d15 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -201,9 +201,6 @@ gc_run(void)
 	if (!run_engine_gc && !run_wal_gc)
 		return; /* nothing to do */
 
-	int64_t wal_lsn = vclock_sum(vclock);
-	int64_t checkpoint_lsn = vclock_sum(&checkpoint->vclock);
-
 	/*
 	 * Engine callbacks may sleep, because they use coio for
 	 * removing files. Make sure we won't try to remove the
@@ -220,14 +217,14 @@ gc_run(void)
 	 */
 	int rc = 0;
 	if (run_engine_gc)
-		rc = engine_collect_garbage(checkpoint_lsn);
+		rc = engine_collect_garbage(&checkpoint->vclock);
 	/*
 	 * Run wal_collect_garbage() even if we don't need to
 	 * delete any WAL files, because we have to apprise
 	 * the WAL thread of the oldest checkpoint signature.
 	 */
 	if (rc == 0)
-		wal_collect_garbage(wal_lsn, checkpoint_lsn);
+		wal_collect_garbage(vclock, &checkpoint->vclock);
 	latch_unlock(&gc.latch);
 }
 
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index 28afb324..9e05ecf6 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -822,7 +822,7 @@ memtx_engine_abort_checkpoint(struct engine *engine)
 }
 
 static int
-memtx_engine_collect_garbage(struct engine *engine, int64_t lsn)
+memtx_engine_collect_garbage(struct engine *engine, const struct vclock *vclock)
 {
 	struct memtx_engine *memtx = (struct memtx_engine *)engine;
 	/*
@@ -833,7 +833,8 @@ memtx_engine_collect_garbage(struct engine *engine, int64_t lsn)
 	 * That said, we have to abort garbage collection if we
 	 * fail to delete a snap file.
 	 */
-	if (xdir_collect_garbage(&memtx->snap_dir, lsn, XDIR_GC_USE_COIO) != 0)
+	if (xdir_collect_garbage(&memtx->snap_dir, vclock_sum(vclock),
+				 XDIR_GC_USE_COIO) != 0)
 		return -1;
 
 	return 0;
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index ce81c6ad..1794489d 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3286,21 +3286,20 @@ vy_gc(struct vy_env *env, struct vy_recovery *recovery,
 }
 
 static int
-vinyl_engine_collect_garbage(struct engine *engine, int64_t lsn)
+vinyl_engine_collect_garbage(struct engine *engine, const struct vclock *vclock)
 {
 	struct vy_env *env = vy_env(engine);
 
 	/* Cleanup old metadata log files. */
-	vy_log_collect_garbage(lsn);
+	vy_log_collect_garbage(vclock);
 
 	/* Cleanup run files. */
-	int64_t signature = vy_log_signature();
-	struct vy_recovery *recovery = vy_recovery_new(signature, 0);
+	struct vy_recovery *recovery = vy_recovery_new(vy_log_signature(), 0);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for garbage collection");
 		return 0;
 	}
-	vy_gc(env, recovery, VY_GC_DROPPED, lsn);
+	vy_gc(env, recovery, VY_GC_DROPPED, vclock_sum(vclock));
 	vy_recovery_delete(recovery);
 	return 0;
 }
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 8a8f9d70..c9e0713c 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -202,22 +202,16 @@ vy_log_filename(int64_t signature)
 }
 
 /**
- * Return the lsn of the checkpoint that was taken
- * before the given lsn.
+ * Return the vclock of the checkpoint that was taken
+ * before the given vclock.
  */
-static int64_t
-vy_log_prev_checkpoint(int64_t lsn)
+static const struct vclock *
+vy_log_prev_checkpoint(const struct vclock *vclock)
 {
-	int64_t ret = -1;
-	for (struct vclock *vclock = vclockset_last(&vy_log.dir.index);
-	     vclock != NULL;
-	     vclock = vclockset_prev(&vy_log.dir.index, vclock)) {
-		if (vclock_sum(vclock) < lsn) {
-			ret = vclock_sum(vclock);
-			break;
-		}
-	}
-	return ret;
+	struct vclock *prev = vclockset_psearch(&vy_log.dir.index, vclock);
+	if (prev != NULL && vclock_sum(prev) == vclock_sum(vclock))
+		prev = vclockset_prev(&vy_log.dir.index, prev);
+	return prev;
 }
 
 /** An snprint-style function to print a log record. */
@@ -1076,14 +1070,16 @@ fail:
 }
 
 void
-vy_log_collect_garbage(int64_t signature)
+vy_log_collect_garbage(const struct vclock *vclock)
 {
 	/*
 	 * Always keep the previous file, because
 	 * it is still needed for backups.
 	 */
-	signature = vy_log_prev_checkpoint(signature);
-	xdir_collect_garbage(&vy_log.dir, signature, XDIR_GC_USE_COIO);
+	vclock = vy_log_prev_checkpoint(vclock);
+	if (vclock == NULL)
+		return;
+	xdir_collect_garbage(&vy_log.dir, vclock_sum(vclock), XDIR_GC_USE_COIO);
 }
 
 int64_t
@@ -1099,10 +1095,10 @@ vy_log_backup_path(const struct vclock *vclock)
 	 * Use the previous log file, because the current one
 	 * contains records written after the last checkpoint.
 	 */
-	int64_t lsn = vy_log_prev_checkpoint(vclock_sum(vclock));
-	if (lsn < 0)
+	vclock = vy_log_prev_checkpoint(vclock);
+	if (vclock == NULL)
 		return NULL;
-	const char *path = vy_log_filename(lsn);
+	const char *path = vy_log_filename(vclock_sum(vclock));
 	if (access(path, F_OK) == -1 && errno == ENOENT)
 		return NULL; /* vinyl not used */
 	return path;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 7718d9c6..70e25245 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -455,10 +455,10 @@ vy_log_rotate(const struct vclock *vclock);
 
 /**
  * Remove metadata log files that are not needed to recover
- * from the snapshot with the given signature or newer.
+ * from checkpoint @vclock.
  */
 void
-vy_log_collect_garbage(int64_t signature);
+vy_log_collect_garbage(const struct vclock *vclock);
 
 /**
  * Return the signature of the newest vylog to the time.
diff --git a/src/box/wal.c b/src/box/wal.c
index 673762a6..cbf2c121 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -120,11 +120,11 @@ struct wal_writer
 	 */
 	struct vclock vclock;
 	/**
-	 * Signature of the oldest checkpoint available on the instance.
+	 * VClock of the oldest checkpoint available on the instance.
 	 * The WAL writer must not delete WAL files that are needed to
 	 * recover from it even if it is running out of disk space.
 	 */
-	int64_t checkpoint_lsn;
+	struct vclock checkpoint_vclock;
 	/** The current WAL file. */
 	struct xlog current_wal;
 	/**
@@ -295,7 +295,8 @@ static void
 wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 		  const char *wal_dirname, int64_t wal_max_rows,
 		  int64_t wal_max_size, const struct tt_uuid *instance_uuid,
-		  const struct vclock *vclock, int64_t checkpoint_lsn)
+		  const struct vclock *vclock,
+		  const struct vclock *checkpoint_vclock)
 {
 	writer->wal_mode = wal_mode;
 	writer->wal_max_rows = wal_max_rows;
@@ -311,11 +312,8 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 	stailq_create(&writer->rollback);
 	cmsg_init(&writer->in_rollback, NULL);
 
-	/* Create and fill writer->vclock. */
-	vclock_create(&writer->vclock);
 	vclock_copy(&writer->vclock, vclock);
-
-	writer->checkpoint_lsn = checkpoint_lsn;
+	vclock_copy(&writer->checkpoint_vclock, checkpoint_vclock);
 	rlist_create(&writer->watchers);
 }
 
@@ -421,14 +419,14 @@ wal_open(struct wal_writer *writer)
 int
 wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
 	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
-	 const struct vclock *vclock, int64_t first_checkpoint_lsn)
+	 const struct vclock *vclock, const struct vclock *first_checkpoint_vclock)
 {
 	assert(wal_max_rows > 1);
 
 	struct wal_writer *writer = &wal_writer_singleton;
 	wal_writer_create(writer, wal_mode, wal_dirname, wal_max_rows,
 			  wal_max_size, instance_uuid, vclock,
-			  first_checkpoint_lsn);
+			  first_checkpoint_vclock);
 
 	/*
 	 * Scan the WAL directory to build an index of all
@@ -546,8 +544,8 @@ wal_checkpoint(struct vclock *vclock, bool rotate)
 struct wal_gc_msg
 {
 	struct cbus_call_msg base;
-	int64_t wal_lsn;
-	int64_t checkpoint_lsn;
+	const struct vclock *wal_vclock;
+	const struct vclock *checkpoint_vclock;
 };
 
 static int
@@ -555,20 +553,21 @@ wal_collect_garbage_f(struct cbus_call_msg *data)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_gc_msg *msg = (struct wal_gc_msg *)data;
-	writer->checkpoint_lsn = msg->checkpoint_lsn;
-	xdir_collect_garbage(&writer->wal_dir, msg->wal_lsn, 0);
+	vclock_copy(&writer->checkpoint_vclock, msg->checkpoint_vclock);
+	xdir_collect_garbage(&writer->wal_dir, vclock_sum(msg->wal_vclock), 0);
 	return 0;
 }
 
 void
-wal_collect_garbage(int64_t wal_lsn, int64_t checkpoint_lsn)
+wal_collect_garbage(const struct vclock *wal_vclock,
+		    const struct vclock *checkpoint_vclock)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->wal_mode == WAL_NONE)
 		return;
 	struct wal_gc_msg msg;
-	msg.wal_lsn = wal_lsn;
-	msg.checkpoint_lsn = checkpoint_lsn;
+	msg.wal_vclock = wal_vclock;
+	msg.checkpoint_vclock = checkpoint_vclock;
 	bool cancellable = fiber_set_cancellable(false);
 	cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe, &msg.base,
 		  wal_collect_garbage_f, NULL, TIMEOUT_INFINITY);
@@ -643,6 +642,12 @@ wal_fallocate(struct wal_writer *writer, size_t len)
 	struct errinj *errinj = errinj(ERRINJ_WAL_FALLOCATE, ERRINJ_INT);
 
 	/*
+	 * Max LSN that can be collected in case of ENOSPC -
+	 * we must not delete WALs necessary for recovery.
+	 */
+	int64_t gc_lsn = vclock_sum(&writer->checkpoint_vclock);
+
+	/*
 	 * The actual write size can be greater than the sum size
 	 * of encoded rows (compression, fixheaders). Double the
 	 * given length to get a rough upper bound estimate.
@@ -662,7 +667,7 @@ retry:
 	}
 	if (errno != ENOSPC)
 		goto error;
-	if (!xdir_has_garbage(&writer->wal_dir, writer->checkpoint_lsn))
+	if (!xdir_has_garbage(&writer->wal_dir, gc_lsn))
 		goto error;
 
 	if (warn_no_space) {
@@ -674,7 +679,7 @@ retry:
 	struct diag diag;
 	diag_create(&diag);
 	diag_move(diag_get(), &diag);
-	if (xdir_collect_garbage(&writer->wal_dir, writer->checkpoint_lsn,
+	if (xdir_collect_garbage(&writer->wal_dir, gc_lsn,
 				 XDIR_GC_REMOVE_ONE) != 0) {
 		diag_move(&diag, diag_get());
 		goto error;
diff --git a/src/box/wal.h b/src/box/wal.h
index d4a37c55..808af76e 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -58,7 +58,7 @@ wal_thread_start();
 int
 wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
 	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
-	 const struct vclock *vclock, int64_t first_checkpoint_lsn);
+	 const struct vclock *vclock, const struct vclock *first_checkpoint_vclock);
 
 void
 wal_thread_stop();
@@ -175,14 +175,15 @@ int
 wal_checkpoint(struct vclock *vclock, bool rotate);
 
 /**
- * Remove all WAL files whose signature is less than @wal_lsn.
- * Update the oldest checkpoint signature with @checkpoint_lsn.
+ * Remove all WAL files whose signature is less than @wal_vclock.
+ * Update the oldest checkpoint signature with @checkpoint_vclock.
  * WAL thread will delete WAL files that are not needed to
  * recover from the oldest checkpoint if it runs out of disk
  * space.
  */
 void
-wal_collect_garbage(int64_t wal_lsn, int64_t checkpoint_lsn);
+wal_collect_garbage(const struct vclock *wal_vclock,
+		    const struct vclock *checkpoint_vclock);
 
 void
 wal_init_vy_log();
-- 
2.11.0


* [PATCH 3/6] box: do not rotate WAL when replica subscribes
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
  2018-11-25 13:48 ` [PATCH 1/6] vclock: allow to use const vclock as search key Vladimir Davydov
  2018-11-25 13:48 ` [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-26 17:50   ` Konstantin Osipov
  2018-11-25 13:48 ` [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe Vladimir Davydov
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Because this is pointless and confusing. This "feature" was silently
introduced by commit f2bccc18485d ("Use WAL vclock instead of TX vclock
in most places"). Let's revert this change. This will allow us to
clearly separate WAL checkpointing from WAL flushing, which will in turn
facilitate the implementation of the checkpoint-on-WAL-threshold feature.

There are two problems here, however. First, not rotating the log breaks
the expectations of the replication/gc test: as a consequence, an xlog
file doesn't get deleted in time. This happens because we don't delete
xlogs relayed to a replica after the join stage is complete - we only do
it during the subscribe stage - and if we don't rotate the WAL on
subscribe, the garbage collector won't be invoked. This is actually a
bug - we should advance the WAL consumer associated with a replica once
the join stage is complete. This patch fixes it, but the fix unveils
another problem - this time in the WAL garbage collection procedure.

It turns out that, when passed a vclock, the WAL garbage collection
procedure removes all WAL files that were created before that vclock.
This isn't quite correct: if a consumer is in the middle of a WAL file,
we must not delete that file, but we do. This works only as long as
consumers never track vclocks pointing inside WAL files - currently they
are advanced only when a WAL file is closed, so they naturally point at
the beginning of the next WAL file. However, if we want to advance the
consumer associated with a replica when the join stage ends (which is
what the previous paragraph is about), we might advance it to the middle
of a WAL file. If that happens, the WAL garbage collector might remove a
file that is actually in use by a replica. Fix this as well.
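The fixed deletion rule can be sketched as follows (a hypothetical
model: WAL files are identified by the signature at which they start,
kept sorted ascending; all names are illustrative). The collector must
keep the file a consumer is positioned inside, and may drop everything
only when the last WAL is sealed and fully read:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Return the signature up to which WAL files may be collected,
 * or -1 if nothing may be deleted.  wal_start[] holds the start
 * signatures of the existing WAL files, sorted ascending. */
static long long
wal_gc_cutoff(const long long *wal_start, size_t n,
	      long long consumer, long long writer_sig,
	      bool last_wal_open)
{
	/*
	 * The last WAL is sealed and all consumers have read it
	 * in full: every file, including the last one, is garbage.
	 */
	if (!last_wal_open && consumer >= writer_sig)
		return consumer;
	/*
	 * Otherwise find the most recent WAL file that contains
	 * the consumer's position (greatest start <= consumer);
	 * only files strictly older than it may be deleted.
	 */
	long long match = -1;
	for (size_t i = 0; i < n; i++) {
		if (wal_start[i] <= consumer)
			match = wal_start[i];
		else
			break;
	}
	return match;
}
```

With the pre-patch behavior, a consumer at signature 150 inside a file
starting at 100 would let the collector delete everything below 150,
including that very file; the psearch step above is what prevents that.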
---
 src/box/box.cc                                     | 19 ++++++++++---------
 src/box/wal.c                                      | 21 ++++++++++++++++++++-
 src/box/wal.h                                      |  4 +++-
 test/replication/gc.result                         |  8 +-------
 test/replication/gc.test.lua                       |  6 +-----
 test/replication/gc_no_space.result                |  4 ++--
 test/replication/gc_no_space.test.lua              |  4 ++--
 test/replication/show_error_on_disconnect.result   | 12 +++++++-----
 test/replication/show_error_on_disconnect.test.lua | 10 ++++++++--
 9 files changed, 54 insertions(+), 34 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 21b84991..8a1a2668 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1511,21 +1511,22 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	 */
 	box_on_join(&instance_uuid);
 
-	replica = replica_by_uuid(&instance_uuid);
-	assert(replica != NULL);
+	/* Remember master's vclock after the last request */
+	struct vclock stop_vclock;
+	wal_checkpoint(&stop_vclock, false);
 
-	/* Register the replica as a WAL consumer. */
+	/*
+	 * Register the replica as a WAL consumer so that
+	 * it can resume SUBSCRIBE where FINAL JOIN ends.
+	 */
+	replica = replica_by_uuid(&instance_uuid);
 	if (replica->gc != NULL)
 		gc_consumer_unregister(replica->gc);
-	replica->gc = gc_consumer_register(&start_vclock, "replica %s",
+	replica->gc = gc_consumer_register(&stop_vclock, "replica %s",
 					   tt_uuid_str(&instance_uuid));
 	if (replica->gc == NULL)
 		diag_raise();
 
-	/* Remember master's vclock after the last request */
-	struct vclock stop_vclock;
-	wal_checkpoint(&stop_vclock, false);
-
 	/* Send end of initial stage data marker */
 	xrow_encode_vclock_xc(&row, &stop_vclock);
 	row.sync = header->sync;
@@ -1608,7 +1609,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 	 */
 	struct xrow_header row;
 	struct vclock current_vclock;
-	wal_checkpoint(&current_vclock, true);
+	wal_checkpoint(&current_vclock, false);
 	xrow_encode_vclock_xc(&row, &current_vclock);
 	/*
 	 * Identify the message with the replica id of this
diff --git a/src/box/wal.c b/src/box/wal.c
index cbf2c121..11aae5fc 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -553,8 +553,27 @@ wal_collect_garbage_f(struct cbus_call_msg *data)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_gc_msg *msg = (struct wal_gc_msg *)data;
+	const struct vclock *vclock = msg->wal_vclock;
+
+	if (!xlog_is_open(&writer->current_wal) &&
+	    vclock_sum(vclock) >= vclock_sum(&writer->vclock)) {
+		/*
+		 * The last available WAL file has been sealed and
+		 * all registered consumers have done reading it.
+		 * We can delete it now.
+		 */
+	} else {
+		/*
+		 * Find the most recent WAL file that contains rows
+		 * required by registered consumers and delete all
+		 * older WAL files.
+		 */
+		vclock = vclockset_psearch(&writer->wal_dir.index, vclock);
+	}
+	if (vclock != NULL)
+		xdir_collect_garbage(&writer->wal_dir, vclock_sum(vclock), 0);
+
 	vclock_copy(&writer->checkpoint_vclock, msg->checkpoint_vclock);
-	xdir_collect_garbage(&writer->wal_dir, vclock_sum(msg->wal_vclock), 0);
 	return 0;
 }
 
diff --git a/src/box/wal.h b/src/box/wal.h
index 808af76e..e4094b1e 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -175,7 +175,9 @@ int
 wal_checkpoint(struct vclock *vclock, bool rotate);
 
 /**
- * Remove all WAL files whose signature is less than @wal_vclock.
+ * Remove WAL files that are not needed by consumers reading
+ * rows at @wal_vclock or newer.
+ *
  * Update the oldest checkpoint signature with @checkpoint_vclock.
  * WAL thread will delete WAL files that are not needed to
  * recover from the oldest checkpoint if it runs out of disk
diff --git a/test/replication/gc.result b/test/replication/gc.result
index f6266004..72fdcf77 100644
--- a/test/replication/gc.result
+++ b/test/replication/gc.result
@@ -308,13 +308,7 @@ box.snapshot()
 ---
 - true
 ...
-xlog_count = #fio.glob('./master/*.xlog')
----
-...
--- the replica may have managed to download all data
--- from xlog #1 before it was stopped, in which case
--- it's OK to collect xlog #1
-xlog_count == 3 or xlog_count == 2 or fio.listdir('./master')
+#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
 ---
 - true
 ...
diff --git a/test/replication/gc.test.lua b/test/replication/gc.test.lua
index 30108249..2707d5e2 100644
--- a/test/replication/gc.test.lua
+++ b/test/replication/gc.test.lua
@@ -140,11 +140,7 @@ box.snapshot()
 _ = s:auto_increment{}
 box.snapshot()
 #box.info.gc().checkpoints == 1 or box.info.gc()
-xlog_count = #fio.glob('./master/*.xlog')
--- the replica may have managed to download all data
--- from xlog #1 before it was stopped, in which case
--- it's OK to collect xlog #1
-xlog_count == 3 or xlog_count == 2 or fio.listdir('./master')
+#fio.glob('./master/*.xlog') == 2 or fio.listdir('./master')
 
 -- The xlog should only be deleted after the replica
 -- is unregistered.
diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index 8e663cdf..ceea8ab3 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -152,7 +152,7 @@ s:auto_increment{}
 ---
 - [5]
 ...
-check_wal_count(7)
+check_wal_count(5)
 ---
 - true
 ...
@@ -168,7 +168,7 @@ check_snap_count(2)
 -- Inject a ENOSPC error and check that the WAL thread deletes
 -- old WAL files to prevent the user from seeing the error.
 --
-errinj.set('ERRINJ_WAL_FALLOCATE', 3)
+errinj.set('ERRINJ_WAL_FALLOCATE', 2)
 ---
 - ok
 ...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index 4bab2b0e..be2e3229 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -68,7 +68,7 @@ s:auto_increment{}
 box.snapshot()
 s:auto_increment{}
 
-check_wal_count(7)
+check_wal_count(5)
 check_snap_count(2)
 #box.info.gc().consumers -- 3
 
@@ -76,7 +76,7 @@ check_snap_count(2)
 -- Inject a ENOSPC error and check that the WAL thread deletes
 -- old WAL files to prevent the user from seeing the error.
 --
-errinj.set('ERRINJ_WAL_FALLOCATE', 3)
+errinj.set('ERRINJ_WAL_FALLOCATE', 2)
 s:auto_increment{} -- success
 errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 
diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
index c5a91c00..af082203 100644
--- a/test/replication/show_error_on_disconnect.result
+++ b/test/replication/show_error_on_disconnect.result
@@ -46,17 +46,18 @@ box.snapshot()
 ---
 - ok
 ...
-test_run:cmd("switch default")
+-- Manually remove all xlogs on master_quorum2 to break replication to master_quorum1.
+fio = require('fio')
 ---
-- true
 ...
-fio = require('fio')
+for _, path in ipairs(fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog'))) do fio.unlink(path) end
 ---
 ...
-fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master_quorum2/%020d.xlog', 5)))
+box.space.test:insert{3}
 ---
-- true
+- [3]
 ...
+-- Check error reporting.
 test_run:cmd("switch master_quorum1")
 ---
 - true
@@ -90,6 +91,7 @@ box.space.test:select()
 ---
 - - [1]
   - [2]
+  - [3]
 ...
 other_id = box.info.id % 2 + 1
 ---
diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
index 64a75025..40e9dbc5 100644
--- a/test/replication/show_error_on_disconnect.test.lua
+++ b/test/replication/show_error_on_disconnect.test.lua
@@ -5,6 +5,7 @@
 --
 test_run = require('test_run').new()
 SERVERS = {'master_quorum1', 'master_quorum2'}
+
 -- Deploy a cluster.
 test_run:create_cluster(SERVERS)
 test_run:wait_fullmesh(SERVERS)
@@ -16,9 +17,14 @@ box.space.test:insert{1}
 box.snapshot()
 box.space.test:insert{2}
 box.snapshot()
-test_run:cmd("switch default")
+
+-- Manually remove all xlogs on master_quorum2 to break replication to master_quorum1.
 fio = require('fio')
-fio.unlink(fio.pathjoin(fio.abspath("."), string.format('master_quorum2/%020d.xlog', 5)))
+for _, path in ipairs(fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog'))) do fio.unlink(path) end
+
+box.space.test:insert{3}
+
+-- Check error reporting.
 test_run:cmd("switch master_quorum1")
 box.cfg{replication = repl}
 require('fiber').sleep(0.1)
-- 
2.11.0


* [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-11-25 13:48 ` [PATCH 3/6] box: do not rotate WAL when replica subscribes Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-26 17:54   ` Konstantin Osipov
  2018-11-25 13:48 ` [PATCH 5/6] wal: separate checkpoint and flush paths Vladimir Davydov
  2018-11-25 13:48 ` [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC Vladimir Davydov
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Again, this is something that was introduced by commit f2bccc18485d
("Use WAL vclock instead of TX vclock in most places") without any
justification.

TX has its own copy of the current vclock - there's absolutely no need
to inquire it from the WAL thread. Actually, we already use the TX-local
vclock in box_process_vote(). There's no reason to treat join/subscribe
any differently. Moreover, it's even harmful - there may be a gap at the
end of a WAL file, in which case the WAL vclock will be slightly ahead
of the TX vclock, so if a replica tried to subscribe it would never
finish syncing, see #3830.

Closes #3830
---
 src/box/box.cc                 | 10 +++-------
 test/replication/sync.result   | 37 +++++++++++++++++++++++++++++++++++++
 test/replication/sync.test.lua | 12 ++++++++++++
 3 files changed, 52 insertions(+), 7 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8a1a2668..8d6e966e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1513,7 +1513,7 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 
 	/* Remember master's vclock after the last request */
 	struct vclock stop_vclock;
-	wal_checkpoint(&stop_vclock, false);
+	vclock_copy(&stop_vclock, &replicaset.vclock);
 
 	/*
 	 * Register the replica as a WAL consumer so that
@@ -1540,9 +1540,7 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	say_info("final data sent.");
 
 	/* Send end of WAL stream marker */
-	struct vclock current_vclock;
-	wal_checkpoint(&current_vclock, false);
-	xrow_encode_vclock_xc(&row, &current_vclock);
+	xrow_encode_vclock_xc(&row, &replicaset.vclock);
 	row.sync = header->sync;
 	coio_write_xrow(io, &row);
 }
@@ -1608,9 +1606,7 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 	 * and identify ourselves with our own replica id.
 	 */
 	struct xrow_header row;
-	struct vclock current_vclock;
-	wal_checkpoint(&current_vclock, false);
-	xrow_encode_vclock_xc(&row, &current_vclock);
+	xrow_encode_vclock_xc(&row, &replicaset.vclock);
 	/*
 	 * Identify the message with the replica id of this
 	 * instance, this is the only way for a replica to find
diff --git a/test/replication/sync.result b/test/replication/sync.result
index 4c0ad9c3..dc3a6f69 100644
--- a/test/replication/sync.result
+++ b/test/replication/sync.result
@@ -311,6 +311,43 @@ test_run:cmd("stop server replica")
 ---
 - true
 ...
+-- gh-3830: Sync fails if there's a gap at the end of the master's WAL.
+box.error.injection.set('ERRINJ_WAL_WRITE_DISK', true)
+---
+- ok
+...
+box.space.test:replace{123456789}
+---
+- error: Failed to write to disk
+...
+box.error.injection.set('ERRINJ_WAL_WRITE_DISK', false)
+---
+- ok
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.status -- running
+---
+- running
+...
+box.info.ro -- false
+---
+- false
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
 test_run:cmd("cleanup server replica")
 ---
 - true
diff --git a/test/replication/sync.test.lua b/test/replication/sync.test.lua
index ee82fc58..bc714735 100644
--- a/test/replication/sync.test.lua
+++ b/test/replication/sync.test.lua
@@ -158,6 +158,18 @@ test_run:grep_log('replica', 'ER_CFG.*')
 
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
+
+-- gh-3830: Sync fails if there's a gap at the end of the master's WAL.
+box.error.injection.set('ERRINJ_WAL_WRITE_DISK', true)
+box.space.test:replace{123456789}
+box.error.injection.set('ERRINJ_WAL_WRITE_DISK', false)
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.info.status -- running
+box.info.ro -- false
+
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 
 box.space.test:drop()
-- 
2.11.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 5/6] wal: separate checkpoint and flush paths
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-11-25 13:48 ` [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-26 17:58   ` Konstantin Osipov
  2018-11-25 13:48 ` [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC Vladimir Davydov
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, wal_checkpoint() is used for two purposes. First, to make a
checkpoint (rotate = true). Second, to flush all pending WAL requests
(rotate = false). Since checkpointing has to fail when a cascading
rollback is in progress, flushing has to fail too, which is confusing.
Let's separate the two paths.
---
 src/box/box.cc  |  5 ++--
 src/box/vinyl.c |  5 +---
 src/box/wal.c   | 93 ++++++++++++++++++++++++++-------------------------------
 src/box/wal.h   | 15 +++++++---
 4 files changed, 56 insertions(+), 62 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8d6e966e..5ea2f014 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2190,10 +2190,9 @@ box_checkpoint()
 		goto end;
 
 	struct vclock vclock;
-	if ((rc = wal_checkpoint(&vclock, true))) {
-		tnt_error(ClientError, ER_CHECKPOINT_ROLLBACK);
+	if ((rc = wal_checkpoint(&vclock)))
 		goto end;
-	}
+
 	rc = engine_commit_checkpoint(&vclock);
 end:
 	if (rc)
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 1794489d..05df1329 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1014,10 +1014,7 @@ vy_abort_writers_for_ddl(struct vy_env *env, struct vy_lsm *lsm)
 	 * Wait for prepared transactions to complete
 	 * (we can't abort them as they reached WAL).
 	 */
-	struct vclock unused;
-	if (wal_checkpoint(&unused, false) != 0)
-		return -1;
-
+	wal_flush();
 	return 0;
 }
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 11aae5fc..3e6c1e7f 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -461,29 +461,35 @@ wal_thread_stop()
 		wal_writer_destroy(&wal_writer_singleton);
 }
 
-struct wal_checkpoint
+void
+wal_flush(void)
 {
-	struct cmsg base;
-	struct vclock *vclock;
-	struct fiber *fiber;
-	bool rotate;
-	int res;
+	cbus_flush(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe, NULL);
+}
+
+struct wal_checkpoint_msg {
+	struct cbus_call_msg base;
+	struct vclock vclock;
 };
 
-void
-wal_checkpoint_f(struct cmsg *data)
+static int
+wal_checkpoint_f(struct cbus_call_msg *data)
 {
-	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
+	struct wal_checkpoint_msg *msg = (struct wal_checkpoint_msg *)data;
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->in_rollback.route != NULL) {
-		/* We're rolling back a failed write. */
-		msg->res = -1;
-		return;
+		/*
+		 * We're rolling back a failed write and so
+		 * can't make a checkpoint - see the comment
+		 * in wal_checkpoint() for the explanation.
+		 */
+		diag_set(ClientError, ER_CHECKPOINT_ROLLBACK);
+		return -1;
 	}
 	/*
 	 * Avoid closing the current WAL if it has no rows (empty).
 	 */
-	if (msg->rotate && xlog_is_open(&writer->current_wal) &&
+	if (xlog_is_open(&writer->current_wal) &&
 	    vclock_sum(&writer->current_wal.meta.vclock) !=
 	    vclock_sum(&writer->vclock)) {
 
@@ -492,53 +498,38 @@ wal_checkpoint_f(struct cmsg *data)
 		 * The next WAL will be created on the first write.
 		 */
 	}
-	vclock_copy(msg->vclock, &writer->vclock);
-}
-
-void
-wal_checkpoint_done_f(struct cmsg *data)
-{
-	struct wal_checkpoint *msg = (struct wal_checkpoint *) data;
-	fiber_wakeup(msg->fiber);
+	vclock_copy(&msg->vclock, &writer->vclock);
+	return 0;
 }
 
 int
-wal_checkpoint(struct vclock *vclock, bool rotate)
+wal_checkpoint(struct vclock *vclock)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (! stailq_empty(&writer->rollback)) {
-		/*
-		 * The writer rollback queue is not empty,
-		 * roll back this transaction immediately.
-		 * This is to ensure we do not accidentally
-		 * commit a transaction which has seen changes
-		 * that will be rolled back.
-		 */
-		say_error("Aborting transaction %llu during "
-			  "cascading rollback",
-			  vclock_sum(&writer->vclock));
-		return -1;
-	}
 	if (writer->wal_mode == WAL_NONE) {
 		vclock_copy(vclock, &writer->vclock);
 		return 0;
 	}
-	static struct cmsg_hop wal_checkpoint_route[] = {
-		{wal_checkpoint_f, &wal_thread.tx_prio_pipe},
-		{wal_checkpoint_done_f, NULL},
-	};
-	vclock_create(vclock);
-	struct wal_checkpoint msg;
-	cmsg_init(&msg.base, wal_checkpoint_route);
-	msg.vclock = vclock;
-	msg.fiber = fiber();
-	msg.rotate = rotate;
-	msg.res = 0;
-	cpipe_push(&wal_thread.wal_pipe, &msg.base);
-	fiber_set_cancellable(false);
-	fiber_yield();
-	fiber_set_cancellable(true);
-	return msg.res;
+	if (!stailq_empty(&writer->rollback)) {
+		/*
+		 * If cascading rollback is in progress, in-memory
+		 * indexes can contain changes scheduled for rollback.
+		 * If we made a checkpoint, we could write them to
+		 * the snapshot. So we abort checkpointing in this
+		 * case.
+		 */
+		diag_set(ClientError, ER_CHECKPOINT_ROLLBACK);
+		return -1;
+	}
+	struct wal_checkpoint_msg msg;
+	bool cancellable = fiber_set_cancellable(false);
+	int rc = cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe,
+			   &msg.base, wal_checkpoint_f, NULL, TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+	if (rc != 0)
+		return -1;
+	vclock_copy(vclock, &msg.vclock);
+	return 0;
 }
 
 struct wal_gc_msg
diff --git a/src/box/wal.h b/src/box/wal.h
index e4094b1e..7ca27f1a 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -166,13 +166,20 @@ wal_mode();
 
 /**
  * Wait till all pending changes to the WAL are flushed.
- * Rotates the WAL.
- *
- * @param[out] vclock WAL vclock
+ */
+void
+wal_flush(void);
+
+/**
+ * Prepare WAL for checkpointing.
  *
+ * This function flushes all pending changes and rotates the
+ * current WAL. The vclock of the last record written to the
+ * rotated WAL is returned in @vclock. This is the vclock that
+ * is supposed to be used to identify the new checkpoint.
  */
 int
-wal_checkpoint(struct vclock *vclock, bool rotate);
+wal_checkpoint(struct vclock *vclock);
 
 /**
  * Remove WAL files that are not needed by consumers reading
-- 
2.11.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC
  2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-11-25 13:48 ` [PATCH 5/6] wal: separate checkpoint and flush paths Vladimir Davydov
@ 2018-11-25 13:48 ` Vladimir Davydov
  2018-11-28 16:14   ` Vladimir Davydov
  5 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-25 13:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Tarantool always keeps the box.cfg.checkpoint_count latest checkpoints.
For the sake of redundancy, it also never deletes WAL files needed for
recovery from any of them, even if it gets ENOSPC while trying to write
to the WAL. This patch changes that behavior: now the WAL thread is
allowed to delete backup WAL files in case of an emergency ENOSPC -
after all, that's better than stopping operation.

Closes #3822
---
 src/box/box.cc                        | 17 +++++++------
 src/box/gc.c                          | 17 ++++++++-----
 src/box/gc.h                          | 14 ----------
 src/box/wal.c                         | 48 +++++++++++++++++++++++------------
 src/box/wal.h                         | 22 ++++++++--------
 test/replication/gc_no_space.result   | 47 +++++++++++++++++++++++++++++-----
 test/replication/gc_no_space.test.lua | 23 ++++++++++++-----
 7 files changed, 121 insertions(+), 67 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 5ea2f014..72788f82 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2104,6 +2104,8 @@ box_cfg_xc(void)
 		/* Bootstrap a new master */
 		bootstrap(&instance_uuid, &replicaset_uuid,
 			  &is_bootstrap_leader);
+		checkpoint = gc_last_checkpoint();
+		assert(checkpoint != NULL);
 	}
 	fiber_gc();
 
@@ -2117,16 +2119,13 @@ box_cfg_xc(void)
 		}
 	}
 
-	struct gc_checkpoint *first_checkpoint = gc_first_checkpoint();
-	assert(first_checkpoint != NULL);
-
 	/* Start WAL writer */
 	int64_t wal_max_rows = box_check_wal_max_rows(cfg_geti64("rows_per_wal"));
 	int64_t wal_max_size = box_check_wal_max_size(cfg_geti64("wal_max_size"));
 	enum wal_mode wal_mode = box_check_wal_mode(cfg_gets("wal_mode"));
 	if (wal_init(wal_mode, cfg_gets("wal_dir"), wal_max_rows,
 		     wal_max_size, &INSTANCE_UUID, &replicaset.vclock,
-		     &first_checkpoint->vclock) != 0) {
+		     &checkpoint->vclock) != 0) {
 		diag_raise();
 	}
 	gc_set_wal_watcher();
@@ -2190,15 +2189,17 @@ box_checkpoint()
 		goto end;
 
 	struct vclock vclock;
-	if ((rc = wal_checkpoint(&vclock)))
+	if ((rc = wal_begin_checkpoint(&vclock)))
+		goto end;
+
+	if ((rc = engine_commit_checkpoint(&vclock)))
 		goto end;
 
-	rc = engine_commit_checkpoint(&vclock);
+	wal_commit_checkpoint(&vclock);
+	gc_add_checkpoint(&vclock);
 end:
 	if (rc)
 		engine_abort_checkpoint();
-	else
-		gc_add_checkpoint(&vclock);
 
 	latch_unlock(&schema_lock);
 	box_checkpoint_is_in_progress = false;
diff --git a/src/box/gc.c b/src/box/gc.c
index 55c36d15..05773b91 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -218,13 +218,8 @@ gc_run(void)
 	int rc = 0;
 	if (run_engine_gc)
 		rc = engine_collect_garbage(&checkpoint->vclock);
-	/*
-	 * Run wal_collect_garbage() even if we don't need to
-	 * delete any WAL files, because we have to apprise
-	 * the WAL thread of the oldest checkpoint signature.
-	 */
-	if (rc == 0)
-		wal_collect_garbage(vclock, &checkpoint->vclock);
+	if (run_wal_gc && rc == 0)
+		wal_collect_garbage(vclock);
 	latch_unlock(&gc.latch);
 }
 
@@ -236,6 +231,14 @@ gc_process_wal_event(struct wal_watcher_msg *msg)
 {
 	assert((msg->events & WAL_EVENT_GC) != 0);
 
+	/*
+	 * In case of emergency ENOSPC, the WAL thread may delete
+	 * WAL files needed to restore from backup checkpoints,
+	 * which would be kept by the garbage collector otherwise.
+	 * Bring the garbage collector vclock up to date.
+	 */
+	vclock_copy(&gc.vclock, &msg->gc_vclock);
+
 	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
 	while (consumer != NULL &&
 	       vclock_sum(&consumer->vclock) < vclock_sum(&msg->gc_vclock)) {
diff --git a/src/box/gc.h b/src/box/gc.h
index e1241baa..eab19ba3 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -156,20 +156,6 @@ extern struct gc_state gc;
 	rlist_foreach_entry(ref, &(checkpoint)->refs, in_refs)
 
 /**
- * Return the first (oldest) checkpoint known to the garbage
- * collector. If there's no checkpoint, return NULL.
- */
-static inline struct gc_checkpoint *
-gc_first_checkpoint(void)
-{
-	if (rlist_empty(&gc.checkpoints))
-		return NULL;
-
-	return rlist_first_entry(&gc.checkpoints, struct gc_checkpoint,
-				 in_checkpoints);
-}
-
-/**
  * Return the last (newest) checkpoint known to the garbage
  * collector. If there's no checkpoint, return NULL.
  */
diff --git a/src/box/wal.c b/src/box/wal.c
index 3e6c1e7f..f0b19b7c 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -120,7 +120,7 @@ struct wal_writer
 	 */
 	struct vclock vclock;
 	/**
-	 * VClock of the oldest checkpoint available on the instance.
+	 * VClock of the most recent successfully created checkpoint.
 	 * The WAL writer must not delete WAL files that are needed to
 	 * recover from it even if it is running out of disk space.
 	 */
@@ -419,14 +419,14 @@ wal_open(struct wal_writer *writer)
 int
 wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
 	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
-	 const struct vclock *vclock, const struct vclock *first_checkpoint_vclock)
+	 const struct vclock *vclock, const struct vclock *checkpoint_vclock)
 {
 	assert(wal_max_rows > 1);
 
 	struct wal_writer *writer = &wal_writer_singleton;
 	wal_writer_create(writer, wal_mode, wal_dirname, wal_max_rows,
 			  wal_max_size, instance_uuid, vclock,
-			  first_checkpoint_vclock);
+			  checkpoint_vclock);
 
 	/*
 	 * Scan the WAL directory to build an index of all
@@ -473,7 +473,7 @@ struct wal_checkpoint_msg {
 };
 
 static int
-wal_checkpoint_f(struct cbus_call_msg *data)
+wal_begin_checkpoint_f(struct cbus_call_msg *data)
 {
 	struct wal_checkpoint_msg *msg = (struct wal_checkpoint_msg *)data;
 	struct wal_writer *writer = &wal_writer_singleton;
@@ -481,7 +481,7 @@ wal_checkpoint_f(struct cbus_call_msg *data)
 		/*
 		 * We're rolling back a failed write and so
 		 * can't make a checkpoint - see the comment
-		 * in wal_checkpoint() for the explanation.
+		 * in wal_begin_checkpoint() for the explanation.
 		 */
 		diag_set(ClientError, ER_CHECKPOINT_ROLLBACK);
 		return -1;
@@ -503,7 +503,7 @@ wal_checkpoint_f(struct cbus_call_msg *data)
 }
 
 int
-wal_checkpoint(struct vclock *vclock)
+wal_begin_checkpoint(struct vclock *vclock)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->wal_mode == WAL_NONE) {
@@ -524,7 +524,8 @@ wal_checkpoint(struct vclock *vclock)
 	struct wal_checkpoint_msg msg;
 	bool cancellable = fiber_set_cancellable(false);
 	int rc = cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe,
-			   &msg.base, wal_checkpoint_f, NULL, TIMEOUT_INFINITY);
+			   &msg.base, wal_begin_checkpoint_f, NULL,
+			   TIMEOUT_INFINITY);
 	fiber_set_cancellable(cancellable);
 	if (rc != 0)
 		return -1;
@@ -532,19 +533,37 @@ wal_checkpoint(struct vclock *vclock)
 	return 0;
 }
 
+static int
+wal_commit_checkpoint_f(struct cbus_call_msg *data)
+{
+	struct wal_checkpoint_msg *msg = (struct wal_checkpoint_msg *)data;
+	struct wal_writer *writer = &wal_writer_singleton;
+	vclock_copy(&writer->checkpoint_vclock, &msg->vclock);
+	return 0;
+}
+
+void
+wal_commit_checkpoint(const struct vclock *vclock)
+{
+	struct wal_checkpoint_msg msg;
+	vclock_copy(&msg.vclock, vclock);
+	bool cancellable = fiber_set_cancellable(false);
+	cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe,
+		  &msg.base, wal_commit_checkpoint_f, NULL, TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+}
+
 struct wal_gc_msg
 {
 	struct cbus_call_msg base;
-	const struct vclock *wal_vclock;
-	const struct vclock *checkpoint_vclock;
+	const struct vclock *vclock;
 };
 
 static int
 wal_collect_garbage_f(struct cbus_call_msg *data)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
-	struct wal_gc_msg *msg = (struct wal_gc_msg *)data;
-	const struct vclock *vclock = msg->wal_vclock;
+	const struct vclock *vclock = ((struct wal_gc_msg *)data)->vclock;
 
 	if (!xlog_is_open(&writer->current_wal) &&
 	    vclock_sum(vclock) >= vclock_sum(&writer->vclock)) {
@@ -564,20 +583,17 @@ wal_collect_garbage_f(struct cbus_call_msg *data)
 	if (vclock != NULL)
 		xdir_collect_garbage(&writer->wal_dir, vclock_sum(vclock), 0);
 
-	vclock_copy(&writer->checkpoint_vclock, msg->checkpoint_vclock);
 	return 0;
 }
 
 void
-wal_collect_garbage(const struct vclock *wal_vclock,
-		    const struct vclock *checkpoint_vclock)
+wal_collect_garbage(const struct vclock *vclock)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->wal_mode == WAL_NONE)
 		return;
 	struct wal_gc_msg msg;
-	msg.wal_vclock = wal_vclock;
-	msg.checkpoint_vclock = checkpoint_vclock;
+	msg.vclock = vclock;
 	bool cancellable = fiber_set_cancellable(false);
 	cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe, &msg.base,
 		  wal_collect_garbage_f, NULL, TIMEOUT_INFINITY);
diff --git a/src/box/wal.h b/src/box/wal.h
index 7ca27f1a..5f3a66ce 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -58,7 +58,7 @@ wal_thread_start();
 int
 wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
 	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
-	 const struct vclock *vclock, const struct vclock *first_checkpoint_vclock);
+	 const struct vclock *vclock, const struct vclock *checkpoint_vclock);
 
 void
 wal_thread_stop();
@@ -179,20 +179,22 @@ wal_flush(void);
  * is supposed to be used to identify the new checkpoint.
  */
 int
-wal_checkpoint(struct vclock *vclock);
+wal_begin_checkpoint(struct vclock *vclock);
+
+/**
+ * This function is called upon successful checkpoint creation.
+ * It updates the WAL thread's version of the last checkpoint
+ * vclock.
+ */
+void
+wal_commit_checkpoint(const struct vclock *vclock);
 
 /**
  * Remove WAL files that are not needed by consumers reading
- * rows at @wal_vclock or newer.
- *
- * Update the oldest checkpoint signature with @checkpoint_vclock.
- * WAL thread will delete WAL files that are not needed to
- * recover from the oldest checkpoint if it runs out of disk
- * space.
+ * rows at @vclock or newer.
  */
 void
-wal_collect_garbage(const struct vclock *wal_vclock,
-		    const struct vclock *checkpoint_vclock);
+wal_collect_garbage(const struct vclock *vclock);
 
 void
 wal_init_vy_log();
diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index ceea8ab3..5c64bea4 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -160,10 +160,21 @@ check_snap_count(2)
 ---
 - true
 ...
-#box.info.gc().consumers -- 3
+gc = box.info.gc()
+---
+...
+#gc.consumers -- 3
 ---
 - 3
 ...
+#gc.checkpoints -- 2
+---
+- 2
+...
+gc.signature == gc.consumers[1].signature
+---
+- true
+...
 --
 -- Inject a ENOSPC error and check that the WAL thread deletes
 -- old WAL files to prevent the user from seeing the error.
@@ -188,15 +199,28 @@ check_snap_count(2)
 ---
 - true
 ...
-#box.info.gc().consumers -- 1
+gc = box.info.gc()
+---
+...
+#gc.consumers -- 1
 ---
 - 1
 ...
+#gc.checkpoints -- 2
+---
+- 2
+...
+gc.signature == gc.consumers[1].signature
+---
+- true
+...
 --
 -- Check that the WAL thread never deletes WAL files that are
--- needed for recovery from a checkpoint.
+-- needed for recovery from the last checkpoint, but may delete
+-- older WAL files that would be kept otherwise for recovery
+-- from backup checkpoints.
 --
-errinj.set('ERRINJ_WAL_FALLOCATE', 2)
+errinj.set('ERRINJ_WAL_FALLOCATE', 3)
 ---
 - ok
 ...
@@ -208,7 +232,7 @@ errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 ---
 - 0
 ...
-check_wal_count(2)
+check_wal_count(1)
 ---
 - true
 ...
@@ -216,10 +240,21 @@ check_snap_count(2)
 ---
 - true
 ...
-#box.info.gc().consumers -- 0
+gc = box.info.gc()
+---
+...
+#gc.consumers -- 0
 ---
 - 0
 ...
+#gc.checkpoints -- 2
+---
+- 2
+...
+gc.signature == gc.checkpoints[2].signature
+---
+- true
+...
 s:drop()
 ---
 ...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index be2e3229..7f5ab803 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -70,7 +70,10 @@ s:auto_increment{}
 
 check_wal_count(5)
 check_snap_count(2)
-#box.info.gc().consumers -- 3
+gc = box.info.gc()
+#gc.consumers -- 3
+#gc.checkpoints -- 2
+gc.signature == gc.consumers[1].signature
 
 --
 -- Inject a ENOSPC error and check that the WAL thread deletes
@@ -82,19 +85,27 @@ errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 
 check_wal_count(3)
 check_snap_count(2)
-#box.info.gc().consumers -- 1
+gc = box.info.gc()
+#gc.consumers -- 1
+#gc.checkpoints -- 2
+gc.signature == gc.consumers[1].signature
 
 --
 -- Check that the WAL thread never deletes WAL files that are
--- needed for recovery from a checkpoint.
+-- needed for recovery from the last checkpoint, but may delete
+-- older WAL files that would be kept otherwise for recovery
+-- from backup checkpoints.
 --
-errinj.set('ERRINJ_WAL_FALLOCATE', 2)
+errinj.set('ERRINJ_WAL_FALLOCATE', 3)
 s:auto_increment{} -- failure
 errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 
-check_wal_count(2)
+check_wal_count(1)
 check_snap_count(2)
-#box.info.gc().consumers -- 0
+gc = box.info.gc()
+#gc.consumers -- 0
+#gc.checkpoints -- 2
+gc.signature == gc.checkpoints[2].signature
 
 s:drop()
 box.schema.user.revoke('guest', 'replication')
-- 
2.11.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/6] vclock: allow to use const vclock as search key
  2018-11-25 13:48 ` [PATCH 1/6] vclock: allow to use const vclock as search key Vladimir Davydov
@ 2018-11-26 17:38   ` Konstantin Osipov
  2018-11-27  9:56     ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-26 17:38 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> vclockset_psearch() and friends never modify the vclock used as a
> search key, but the compiler will not allow passing a const struct
> vclock to any of those functions. Fix this.

Hi, 

Please change rb_gen() signature if possible instead.

> ---
>  src/box/vclock.c | 4 +++-
>  src/box/vclock.h | 5 +++--
>  2 files changed, 6 insertions(+), 3 deletions(-)

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback
  2018-11-25 13:48 ` [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback Vladimir Davydov
@ 2018-11-26 17:41   ` Konstantin Osipov
  2018-11-27  9:56     ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-26 17:41 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> First, this is consistent with other engine callbacks, such as
> checkpoint or backup.
> 
> Second, a vclock can be used as a search key in a vclock set,
> which in turn can make code more straightforward, e.g. look how
> this patch simplifies vy_log_prev_checkpoint().

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] box: do not rotate WAL when replica subscribes
  2018-11-25 13:48 ` [PATCH 3/6] box: do not rotate WAL when replica subscribes Vladimir Davydov
@ 2018-11-26 17:50   ` Konstantin Osipov
  2018-11-27  9:57     ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-26 17:50 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:

OK to push.
> Because this is pointless and confusing. This "feature" was silently
> introduced by commit f2bccc18485d ("Use WAL vclock instead of TX vclock
> in most places"). Let's revert this change. This will allow us to
> clearly separate WAL checkpointing from WAL flushing, which will in turn
> facilitate implementation of the checkpoint-on-WAL-threshold feature.
> 
> There are two problems here, however. First, not rotating the log breaks
> expectations of replication/gc test: an xlog file doesn't get deleted in
> time as a consequence. This happens, because we don't delete xlogs
> relayed to a replica after join stage is complete - we only do it during
> subscribe stage - and if we don't rotate WAL on subscribe the garbage
> collector won't be invoked. This is actually a bug - we should advance
> the WAL consumer associated with a replica once join stage is complete.
> This patch fixes it, but it unveils another problem - this time in the
> WAL garbage collection procedure.
> 
> Turns out, when passed a vclock, the WAL garbage collection procedure
> removes all WAL files that were created before the vclock. Apparently,
> this isn't quite correct - if a consumer is in the middle of a WAL file,
> we must not delete the WAL file, but we do. This works as long as
> consumers never track vclocks inside WAL files - currently they are
> advanced only when a WAL file is closed and naturally they are advanced
> to the beginning of the next WAL file. However, if we want to advance
> the consumer associated with a replica when join stage ends (this is
> what the previous paragraph is about), it might occur that we will
> advance it to the middle of a WAL file. If that happens the WAL garbage
> collector might remove a file which is actually in use by a replica.
> Fix this as well.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe
  2018-11-25 13:48 ` [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe Vladimir Davydov
@ 2018-11-26 17:54   ` Konstantin Osipov
  2018-11-27  9:57     ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-26 17:54 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> Again, this is something that was introduced by commit f2bccc18485d
> ("Use WAL vclock instead of TX vclock in most places") without any
> justification.
> 
> TX has its own copy of the current vclock - there's absolutely no need
> to inquire it from the WAL thread. Actually, we already use TX local
> vclock in box_process_vote(). No reason to treat join/subscribe any
> different. Moreover, it's even harmful - there may be a gap at the end
> of a WAL file, in which case WAL vclock will be slightly ahead of TX
> vclock so that should a replica try to subscribe it would never finish
> syncing, see #3830.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/6] wal: separate checkpoint and flush paths
  2018-11-25 13:48 ` [PATCH 5/6] wal: separate checkpoint and flush paths Vladimir Davydov
@ 2018-11-26 17:58   ` Konstantin Osipov
  2018-11-26 20:19     ` Vladimir Davydov
  0 siblings, 1 reply; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-26 17:58 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> Currently, wal_checkpoint() is used for two purposes. First, to make a
> checkpoint (rotate = true). Second, to flush all pending WAL requests
> (rotate = false). Since checkpointing has to fail if cascading rollback
> is in progress so does flushing. This is confusing. Let's separate the
> two paths.

I agree with the two paths, but wal_flush() as a name is
confusing, since no flushing is happening. 

Let's call it wal_sync() or wal_wait() or something similar.

wal_checkpoint() could also be made more obvious and renamed to
wal_rotate().

The patch is otherwise trivial and OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 5/6] wal: separate checkpoint and flush paths
  2018-11-26 17:58   ` Konstantin Osipov
@ 2018-11-26 20:19     ` Vladimir Davydov
  2018-11-28 16:46       ` Konstantin Osipov
  0 siblings, 1 reply; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-26 20:19 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Nov 26, 2018 at 08:58:34PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> > Currently, wal_checkpoint() is used for two purposes. First, to make a
> > checkpoint (rotate = true). Second, to flush all pending WAL requests
> > (rotate = false). Since checkpointing has to fail if cascading rollback
> > is in progress so does flushing. This is confusing. Let's separate the
> > two paths.
> 
> I agree with the two paths, but wal_flush() as a name is
> confusing, since no flushing is happening. 

This function makes sure that all requests submitted to the WAL are
processed before it returns, i.e. it effectively flushes the WAL cpipe.
Its name is consistent with cbus_flush, which it uses under the hood.

> 
> Let's call it wal_sync() or wal_wait() or something similar.
> 
> wal_checkpoint() could also be made more obvious and renamed to
> wal_rotate().

There will be more to WAL checkpointing than simply rotating the current
WAL file. For example, in patch #6 I update WAL's view of the last
checkpoint vclock there, and as part of the checkpoint_wal_threshold
option implementation I will also reset the size of WAL files written
since the last checkpoint. In fact, I'll need two functions to do
all those things properly - one to be called when checkpointing is
started (this is where wal_checkpoint() is currently called), another -
when it completes. I'm planning to call them wal_begin_checkpoint() and
wal_commit_checkpoint() to match the corresponding engine methods (again
see patch #6).


* Re: [PATCH 1/6] vclock: allow to use const vclock as search key
  2018-11-26 17:38   ` Konstantin Osipov
@ 2018-11-27  9:56     ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-27  9:56 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Nov 26, 2018 at 08:38:30PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> > vclockset_psearch() and friends never modify the vclock used as a search
> > key, but the compiler will not allow passing a const struct vclock to
> > any of those functions. Fix this.
> 
> Hi, 
> 
> Please change rb_gen() signature if possible instead.

Done.


* Re: [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback
  2018-11-26 17:41   ` Konstantin Osipov
@ 2018-11-27  9:56     ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-27  9:56 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Nov 26, 2018 at 08:41:47PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> > First, this is consistent with other engine callbacks, such as
> > checkpoint or backup.
> > 
> > Second, a vclock can be used as a search key in a vclock set,
> > which in turn can make code more straightforward, e.g. look how
> > this patch simplifies vy_log_prev_checkpoint().
> 
> OK to push.

Pushed to 2.1


* Re: [PATCH 3/6] box: do not rotate WAL when replica subscribes
  2018-11-26 17:50   ` Konstantin Osipov
@ 2018-11-27  9:57     ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-27  9:57 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Nov 26, 2018 at 08:50:37PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> 
> OK to push.
> > Because this is pointless and confusing. This "feature" was silently
> > introduced by commit f2bccc18485d ("Use WAL vclock instead of TX vclock
> > in most places"). Let's revert this change. This will allow us to
> > clearly separate WAL checkpointing from WAL flushing, which will in turn
> > facilitate implementation of the checkpoint-on-WAL-threshold feature.
> > 
> > There are two problems here, however. First, not rotating the log breaks
> > expectations of replication/gc test: an xlog file doesn't get deleted in
> > time as a consequence. This happens because we don't delete xlogs
> > relayed to a replica after join stage is complete - we only do it during
> > subscribe stage - and if we don't rotate WAL on subscribe the garbage
> > collector won't be invoked. This is actually a bug - we should advance
> > the WAL consumer associated with a replica once join stage is complete.
> > This patch fixes it, but it unveils another problem - this time in the
> > WAL garbage collection procedure.
> > 
> > Turns out, when passed a vclock, the WAL garbage collection procedure
> > removes all WAL files that were created before the vclock. Apparently,
> > this isn't quite correct - if a consumer is in the middle of a WAL file,
> > we must not delete the WAL file, but we do. This works as long as
> > consumers never track vclocks inside WAL files - currently they are
> > advanced only when a WAL file is closed and naturally they are advanced
> > to the beginning of the next WAL file. However, if we want to advance
> > the consumer associated with a replica when join stage ends (this is
> > what the previous paragraph is about), it might occur that we will
> > advance it to the middle of a WAL file. If that happens the WAL garbage
> > collector might remove a file which is actually in use by a replica.
> > Fix this as well.

Pushed to 2.1


* Re: [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe
  2018-11-26 17:54   ` Konstantin Osipov
@ 2018-11-27  9:57     ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-27  9:57 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Nov 26, 2018 at 08:54:58PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 10:27]:
> > Again, this is something that was introduced by commit f2bccc18485d
> > ("Use WAL vclock instead of TX vclock in most places") without any
> > justification.
> > 
> > TX has its own copy of the current vclock - there's absolutely no need
> > to inquire it from the WAL thread. Actually, we already use TX local
> > vclock in box_process_vote(). No reason to treat join/subscribe any
> > different. Moreover, it's even harmful - there may be a gap at the end
> > of a WAL file, in which case the WAL vclock will be slightly ahead of
> > the TX vclock, so that should a replica try to subscribe, it would
> > never finish syncing; see #3830.
> 
> OK to push.

Pushed to 2.1


* Re: [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC
  2018-11-25 13:48 ` [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC Vladimir Davydov
@ 2018-11-28 16:14   ` Vladimir Davydov
  0 siblings, 0 replies; 19+ messages in thread
From: Vladimir Davydov @ 2018-11-28 16:14 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Ignore this pls. I will resend this patch and patch #5 as a part of
another patch set.


* Re: [PATCH 5/6] wal: separate checkpoint and flush paths
  2018-11-26 20:19     ` Vladimir Davydov
@ 2018-11-28 16:46       ` Konstantin Osipov
  0 siblings, 0 replies; 19+ messages in thread
From: Konstantin Osipov @ 2018-11-28 16:46 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/11/26 23:21]:
> > I agree with the two paths, but wal_flush() as a name is
> > confusing, since no flushing is happening. 
> 
> This function makes sure that all requests submitted to the WAL are
> processed before it returns, i.e. it effectively flushes the WAL cpipe.
> Its name is consistent with cbus_flush, which it uses under the hood.

WAL is a disk object, not a memory object, so "flush" can easily be
confused with flushing all in-memory data to disk.

> > Let's call it wal_sync() or wal_wait() or something similar.
> > 
> > wal_checkpoint() could also be made more obvious and renamed to
> > wal_rotate().
> 
> There will be more to WAL checkpointing than simply rotating the current
> WAL file. For example, in patch #6 I update WAL's view of the last
> checkpoint vclock there, and as part of the checkpoint_wal_threshold
> option implementation I will also reset the size of WAL files written
> since the last checkpoint. In fact, I'll need two functions to do
> all those things properly - one to be called when checkpointing is
> started (this is where wal_checkpoint() is currently called), another -
> when it completes. I'm planning to call them wal_begin_checkpoint() and
> wal_commit_checkpoint() to match the corresponding engine methods (again
> see patch #6).

OK.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


end of thread, other threads:[~2018-11-28 16:46 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-25 13:48 [PATCH 0/6] WAL garbage collection and checkpointing fixes Vladimir Davydov
2018-11-25 13:48 ` [PATCH 1/6] vclock: allow to use const vclock as search key Vladimir Davydov
2018-11-26 17:38   ` Konstantin Osipov
2018-11-27  9:56     ` Vladimir Davydov
2018-11-25 13:48 ` [PATCH 2/6] engine: pass vclock instead of lsn to collect_garbage callback Vladimir Davydov
2018-11-26 17:41   ` Konstantin Osipov
2018-11-27  9:56     ` Vladimir Davydov
2018-11-25 13:48 ` [PATCH 3/6] box: do not rotate WAL when replica subscribes Vladimir Davydov
2018-11-26 17:50   ` Konstantin Osipov
2018-11-27  9:57     ` Vladimir Davydov
2018-11-25 13:48 ` [PATCH 4/6] box: use replicaset.vclock in replica join/subscribe Vladimir Davydov
2018-11-26 17:54   ` Konstantin Osipov
2018-11-27  9:57     ` Vladimir Davydov
2018-11-25 13:48 ` [PATCH 5/6] wal: separate checkpoint and flush paths Vladimir Davydov
2018-11-26 17:58   ` Konstantin Osipov
2018-11-26 20:19     ` Vladimir Davydov
2018-11-28 16:46       ` Konstantin Osipov
2018-11-25 13:48 ` [PATCH 6/6] wal: remove files needed for recovery from backup checkpoints on ENOSPC Vladimir Davydov
2018-11-28 16:14   ` Vladimir Davydov
