[PATCH 0/5] Delete old WAL files if running out of disk space

Tarantool development patches archive

* [PATCH 0/5] Delete old WAL files if running out of disk space
@ 2018-10-07 20:27 Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
                   ` (5 more replies)
  0 siblings, 6 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

If a replica permanently stops working for some reason, it will pin WAL
files it would need to resume until it is deleted from the _cluster
system space or the master is restarted. This happens in production when
an admin drops a replica and forgets to remove it from the master, and
this is quite annoying, because it may result in ENOSPC errors on the
master.

This patch set attempts to mitigate this problem by making the WAL
thread delete old WAL files and shoot off old replicas automatically
when it runs out of disk space.

https://github.com/tarantool/tarantool/issues/3397
https://github.com/tarantool/tarantool/commits/dv/gh-3397-wal-auto-deletion


Vladimir Davydov (5):
  xlog: fix filename in error messages
  wal: preallocate disk space before writing rows
  xlog: allow to limit number of files deleted by xdir_collect_garbage
  wal: notify watchers about wal file removal
  wal: delete old wal files when running out of disk space

 CMakeLists.txt                        |   1 +
 src/box/box.cc                        |   9 +-
 src/box/gc.c                          |  67 +++++++++-
 src/box/gc.h                          |  31 +++++
 src/box/journal.c                     |   1 +
 src/box/journal.h                     |   4 +
 src/box/memtx_engine.c                |   2 +-
 src/box/relay.cc                      |   8 +-
 src/box/txn.c                         |   1 +
 src/box/vy_log.c                      |   2 +-
 src/box/wal.c                         | 122 ++++++++++++++----
 src/box/wal.h                         |  35 +++--
 src/box/xlog.c                        | 121 ++++++++++++++++--
 src/box/xlog.h                        |  43 ++++++-
 src/box/xrow.h                        |  13 ++
 src/errinj.h                          |   1 +
 src/trivia/config.h.cmake             |   1 +
 test/box/errinj.result                |   2 +
 test/replication/gc_no_space.result   | 234 ++++++++++++++++++++++++++++++++++
 test/replication/gc_no_space.test.lua | 103 +++++++++++++++
 test/replication/suite.ini            |   2 +-
 21 files changed, 739 insertions(+), 64 deletions(-)
 create mode 100644 test/replication/gc_no_space.result
 create mode 100644 test/replication/gc_no_space.test.lua

-- 
2.11.0


* [PATCH 1/5] xlog: fix filename in error messages
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
@ 2018-10-07 20:27 ` Vladimir Davydov
  2018-10-12  8:19   ` Vladimir Davydov
  2018-10-16 19:07   ` [tarantool-patches] " Konstantin Osipov
  2018-10-07 20:27 ` [PATCH 2/5] wal: preallocate disk space before writing rows Vladimir Davydov
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

 - xlog_rename() doesn't strip xlog->filename of inprogress suffix so
   write errors will mistakenly report the filename as inprogress.
 - xlog_create() uses a name without inprogress suffix for error
   reporting while it actually creates an inprogress file.
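
For illustration only, the first item boils down to cutting the suffix
off a mutable filename buffer once the rename has succeeded. The sketch
below is not the code from the patch: strip_inprogress() is a
hypothetical helper, and it assumes the suffix is the literal string
".inprogress".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: cut a trailing ".inprogress" off a mutable
 * filename buffer. Returns true if the suffix was found and removed,
 * false if the name did not end with it (the buffer is left as is).
 */
static bool
strip_inprogress(char *filename)
{
	static const char suffix[] = ".inprogress";
	assert(filename != NULL);
	size_t len = strlen(filename);
	size_t suffix_len = sizeof(suffix) - 1;
	if (len < suffix_len ||
	    strcmp(filename + len - suffix_len, suffix) != 0)
		return false;
	/* Terminate the string where the suffix starts. */
	filename[len - suffix_len] = '\0';
	return true;
}
```

After such a call, any later error message that prints the buffer
reports the final file name rather than the temporary one.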
---
 src/box/xlog.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/box/xlog.c b/src/box/xlog.c
index d399a726..de5e52f7 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -744,6 +744,7 @@ xlog_rename(struct xlog *l)
 		return -1;
 	}
 	l->is_inprogress = false;
+	filename[suffix - filename] = '\0';
 	return 0;
 }
 
@@ -820,8 +821,9 @@ xlog_create(struct xlog *xlog, const char *name, int flags,
 	 */
 	xlog->fd = open(xlog->filename, flags, 0644);
 	if (xlog->fd < 0) {
-		say_syserror("open, [%s]", name);
-		diag_set(SystemError, "failed to create file '%s'", name);
+		say_syserror("open, [%s]", xlog->filename);
+		diag_set(SystemError, "failed to create file '%s'",
+			 xlog->filename);
 		goto err_open;
 	}
 
@@ -834,7 +836,8 @@ xlog_create(struct xlog *xlog, const char *name, int flags,
 
 	/* Write metadata */
 	if (fio_writen(xlog->fd, meta_buf, meta_len) < 0) {
-		diag_set(SystemError, "%s: failed to write xlog meta", name);
+		diag_set(SystemError, "%s: failed to write xlog meta",
+			 xlog->filename);
 		goto err_write;
 	}
 
-- 
2.11.0


* [PATCH 2/5] wal: preallocate disk space before writing rows
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
@ 2018-10-07 20:27 ` Vladimir Davydov
  2018-10-16 19:09   ` [tarantool-patches] " Konstantin Osipov
  2018-10-07 20:27 ` [PATCH 3/5] xlog: allow to limit number of files deleted by xdir_collect_garbage Vladimir Davydov
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

This patch introduces a new xlog method xlog_fallocate() that makes
sure that the requested amount of disk space is available at the current
write position. It does that with posix_fallocate(). The new method is
called before writing anything to WAL. In order not to invoke a system
call too often, xlog_fallocate() allocates more than requested.

The primary reason why I'm doing this is that I want to have a single
and clearly defined point in the code to handle ENOSPC errors, where I
could delete old WALs and retry (this is what #3397 is about). I could
probably handle ENOSPC returned by xlog_tx_commit(), but that would look
suspicious, because this function can write half a transaction before it
hits ENOSPC, after which it truncates the file back. It's unclear what
happens if, for instance, a replication thread reads that transient
data.

Anyway, preallocating disk space in big chunks is a worthwhile feature
by itself, because it should reduce the number of writes to the inode
table.
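
The sizing policy can be sketched in isolation as two pure functions.
This is a simplified model rather than the patch itself: the names
prealloc_extra() and prealloc_consume() are made up for the example, the
constants mirror XLOG_FALLOCATE_MIN and XLOG_FALLOCATE_FACTOR from the
diff below, and the actual posix_fallocate() call is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
	/* Mirrors XLOG_FALLOCATE_MIN: never allocate less than this. */
	PREALLOC_MIN = 128 * 1024,
	/* Mirrors XLOG_FALLOCATE_FACTOR: over-allocate by this factor. */
	PREALLOC_FACTOR = 8,
};

/*
 * Given the number of bytes already preallocated (alloc_len) and the
 * size of the upcoming write (len), return how many extra bytes to
 * request from the file system, or 0 if no call is needed.
 */
static size_t
prealloc_extra(size_t alloc_len, size_t len)
{
	assert(len <= SIZE_MAX / PREALLOC_FACTOR);
	if (alloc_len > len)
		return 0;
	size_t extra = len * PREALLOC_FACTOR - alloc_len;
	return extra > PREALLOC_MIN ? extra : PREALLOC_MIN;
}

/* Account for a completed write of `written` bytes. */
static size_t
prealloc_consume(size_t alloc_len, size_t written)
{
	return alloc_len > written ? alloc_len - written : 0;
}
```

With these numbers a 100-byte transaction triggers a single 128 KB
allocation, and subsequent small transactions only decrement the
counter until the reserve is used up.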

Needed for #3397
---
 CMakeLists.txt            |  1 +
 src/box/journal.c         |  1 +
 src/box/journal.h         |  4 +++
 src/box/txn.c             |  1 +
 src/box/wal.c             | 27 ++++++++++++++
 src/box/xlog.c            | 91 ++++++++++++++++++++++++++++++++++++++++++-----
 src/box/xlog.h            | 20 +++++++++++
 src/box/xrow.h            | 13 +++++++
 src/trivia/config.h.cmake |  1 +
 9 files changed, 151 insertions(+), 8 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index bf68d187..3d12f3ff 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -80,6 +80,7 @@ check_symbol_exists(fdatasync unistd.h HAVE_FDATASYNC)
 check_symbol_exists(pthread_yield pthread.h HAVE_PTHREAD_YIELD)
 check_symbol_exists(sched_yield sched.h HAVE_SCHED_YIELD)
 check_symbol_exists(posix_fadvise fcntl.h HAVE_POSIX_FADVISE)
+check_symbol_exists(posix_fallocate fcntl.h HAVE_POSIX_FALLOCATE)
 check_symbol_exists(mremap sys/mman.h HAVE_MREMAP)
 
 check_function_exists(sync_file_range HAVE_SYNC_FILE_RANGE)
diff --git a/src/box/journal.c b/src/box/journal.c
index fd4f9539..99a521c3 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -66,6 +66,7 @@ journal_entry_new(size_t n_rows)
 		diag_set(OutOfMemory, size, "region", "struct journal_entry");
 		return NULL;
 	}
+	entry->len = 0;
 	entry->n_rows = n_rows;
 	entry->res = -1;
 	entry->fiber = fiber();
diff --git a/src/box/journal.h b/src/box/journal.h
index 1d64a7bd..fc495547 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -59,6 +59,10 @@ struct journal_entry {
 	 */
 	struct fiber *fiber;
 	/**
+	 * Max size the rows are going to take when encoded.
+	 */
+	size_t len;
+	/**
 	 * The number of rows in the request.
 	 */
 	int n_rows;
diff --git a/src/box/txn.c b/src/box/txn.c
index 17d97d76..9b465561 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -272,6 +272,7 @@ txn_write_to_wal(struct txn *txn)
 		if (stmt->row == NULL)
 			continue; /* A read (e.g. select) request */
 		*row++ = stmt->row;
+		req->len += xrow_len_max(stmt->row);
 	}
 	assert(row == req->rows + req->n_rows);
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 2a1353b0..91c16fb0 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -126,6 +126,11 @@ struct wal_writer
 
 struct wal_msg {
 	struct cmsg base;
+	/**
+	 * Max size the committed requests are going to take when
+	 * written to disk.
+	 */
+	size_t len;
 	/** Input queue, on output contains all committed requests. */
 	struct stailq commit;
 	/**
@@ -168,6 +173,7 @@ static void
 wal_msg_create(struct wal_msg *batch)
 {
 	cmsg_init(&batch->base, wal_request_route);
+	batch->len = 0;
 	stailq_create(&batch->commit);
 	stailq_create(&batch->rollback);
 }
@@ -603,6 +609,20 @@ wal_opt_rotate(struct wal_writer *writer)
 	return 0;
 }
 
+/**
+ * Make sure there's enough disk space to write @len bytes
+ * of data to the current WAL.
+ */
+static int
+wal_fallocate(struct wal_writer *writer, size_t len)
+{
+	if (xlog_fallocate(&writer->current_wal, len) < 0) {
+		diag_log();
+		return -1;
+	}
+	return 0;
+}
+
 static void
 wal_writer_clear_bus(struct cmsg *msg)
 {
@@ -689,6 +709,12 @@ wal_write_to_disk(struct cmsg *msg)
 		return wal_writer_begin_rollback(writer);
 	}
 
+	/* Ensure there's enough disk space before writing anything. */
+	if (wal_fallocate(writer, wal_msg->len) != 0) {
+		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
+		return wal_writer_begin_rollback(writer);
+	}
+
 	/*
 	 * This code tries to write queued requests (=transactions) using as
 	 * few I/O syscalls and memory copies as possible. For this reason
@@ -858,6 +884,7 @@ wal_write(struct journal *journal, struct journal_entry *entry)
 		stailq_add_tail_entry(&batch->commit, entry, fifo);
 		cpipe_push(&wal_thread.wal_pipe, &batch->base);
 	}
+	batch->len += entry->len;
 	wal_thread.wal_pipe.n_input += entry->n_rows * XROW_IOVMAX;
 	cpipe_flush_input(&wal_thread.wal_pipe);
 	/**
diff --git a/src/box/xlog.c b/src/box/xlog.c
index de5e52f7..292c462c 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -76,6 +76,22 @@ enum {
 	 * Maybe this should be a configuration option.
 	 */
 	XLOG_TX_COMPRESS_THRESHOLD = 2 * 1024,
+	/**
+	 * Minimal number of bytes of disk space to allocate
+	 * with xlog_fallocate(). Obviously, we want to invoke
+	 * fallocate() as rare as possible to avoid overhead
+	 * associated with a system call, however at the same
+	 * time we do not want to call it to allocate too big
+	 * chunks, because this may increase tx latency.
+	 */
+	XLOG_FALLOCATE_MIN = 128 * 1024,
+	/**
+	 * Allocate at least XLOG_FALLOCATE_FACTOR * size bytes
+	 * when xlog_fallocate(size) is called so that we do
+	 * not incur the overhead of an extra syscall per each
+	 * committed transaction.
+	 */
+	XLOG_FALLOCATE_FACTOR = 8,
 };
 
 /* {{{ struct xlog_meta */
@@ -988,6 +1004,48 @@ xdir_create_xlog(struct xdir *dir, struct xlog *xlog,
 	return 0;
 }
 
+/*
+ * Simplify recovery after a temporary write failure:
+ * truncate the file to the best known good write position.
+ */
+static void
+xlog_write_error(struct xlog *log)
+{
+	if (lseek(log->fd, log->offset, SEEK_SET) < 0 ||
+	    ftruncate(log->fd, log->offset) != 0)
+		panic_syserror("failed to truncate xlog after write error");
+	log->alloc_len = 0;
+}
+
+ssize_t
+xlog_fallocate(struct xlog *log, size_t len)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+	if (log->alloc_len > len)
+		return log->alloc_len;
+
+	len = len * XLOG_FALLOCATE_FACTOR - log->alloc_len;
+	len = MAX(len, XLOG_FALLOCATE_MIN);
+	off_t offset = log->offset + log->alloc_len;
+
+	int rc = posix_fallocate(log->fd, offset, len);
+	if (rc != 0) {
+		xlog_write_error(log);
+		errno = rc;
+		diag_set(SystemError, "%s: can't allocate disk space",
+			 log->filename);
+		return -1;
+	}
+
+	log->alloc_len += len;
+	return log->alloc_len;
+#else
+	(void)log;
+	(void)len;
+	return 0;
+#endif /* HAVE_POSIX_FALLOCATE */
+}
+
 /**
  * Write a sequence of uncompressed xrow objects.
  *
@@ -1168,17 +1226,14 @@ xlog_tx_write(struct xlog *log)
 	});
 
 	obuf_reset(&log->obuf);
-	/*
-	 * Simplify recovery after a temporary write failure:
-	 * truncate the file to the best known good write
-	 * position.
-	 */
 	if (written < 0) {
-		if (lseek(log->fd, log->offset, SEEK_SET) < 0 ||
-		    ftruncate(log->fd, log->offset) != 0)
-			panic_syserror("failed to truncate xlog after write error");
+		xlog_write_error(log);
 		return -1;
 	}
+	if (log->alloc_len > (size_t)written)
+		log->alloc_len -= written;
+	else
+		log->alloc_len = 0;
 	log->offset += written;
 	log->rows += log->tx_rows;
 	log->tx_rows = 0;
@@ -1376,6 +1431,17 @@ xlog_write_eof(struct xlog *l)
 		diag_set(ClientError, ER_INJECTION, "xlog write injection");
 		return -1;
 	});
+
+	/*
+	 * Free disk space preallocated with xlog_fallocate().
+	 * Don't write the eof marker if this fails, otherwise
+	 * we'll get "data after eof marker" error on recovery.
+	 */
+	if (l->alloc_len > 0 && ftruncate(l->fd, l->offset) < 0) {
+		diag_set(SystemError, "ftruncate() failed");
+		return -1;
+	}
+
 	if (fio_writen(l->fd, &eof_marker, sizeof(eof_marker)) < 0) {
 		diag_set(SystemError, "write() failed");
 		return -1;
@@ -1791,6 +1857,15 @@ xlog_cursor_next_tx(struct xlog_cursor *i)
 		return -1;
 	if (rc > 0)
 		return 1;
+	if (load_u32(i->rbuf.rpos) == 0) {
+		/*
+		 * Space preallocated with xlog_fallocate().
+		 * Treat as eof and clear the buffer.
+		 */
+		i->read_offset -= ibuf_used(&i->rbuf);
+		ibuf_reset(&i->rbuf);
+		return 1;
+	}
 	if (load_u32(i->rbuf.rpos) == eof_marker) {
 		/* eof marker found */
 		goto eof_found;
diff --git a/src/box/xlog.h b/src/box/xlog.h
index c2ac4774..f9243935 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -303,6 +303,11 @@ struct xlog {
 	/** The current offset in the log file, for writing. */
 	off_t offset;
 	/**
+	 * Size of disk space preallocated at @offset with
+	 * xlog_fallocate().
+	 */
+	size_t alloc_len;
+	/**
 	 * Output buffer, works as row accumulator for
 	 * compression.
 	 */
@@ -423,6 +428,21 @@ int
 xlog_rename(struct xlog *l);
 
 /**
+ * Try to allocate at least @size bytes of disk space at the end
+ * of the given xlog file. This function can be used in order to
+ * ensure that the following write of @size bytes will not fail
+ * with ENOSPC.
+ *
+ * On success, this function returns the number of bytes available
+ * for writing. If fallocate is not supported by the underlying OS,
+ * it returns 0.
+ *
+ * On error, it returns -1 and sets diag and errno.
+ */
+ssize_t
+xlog_fallocate(struct xlog *log, size_t size);
+
+/**
  * Write a row to xlog, 
  *
  * @retval count of writen bytes
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 3fc007a8..b73b1f2f 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -68,6 +68,19 @@ struct xrow_header {
 };
 
 /**
+ * Return the max size which the given row is going to take when
+ * encoded into a binary packet.
+ */
+static inline size_t
+xrow_len_max(struct xrow_header *row)
+{
+	size_t len = XROW_HEADER_LEN_MAX;
+	for (int i = 0; i < row->bodycnt; i++)
+		len += row->body[i].iov_len;
+	return len;
+}
+
+/**
  * Encode xrow into a binary packet
  *
  * @param header xrow
diff --git a/src/trivia/config.h.cmake b/src/trivia/config.h.cmake
index 66ddba99..64ae8b61 100644
--- a/src/trivia/config.h.cmake
+++ b/src/trivia/config.h.cmake
@@ -166,6 +166,7 @@
 #cmakedefine HAVE_PTHREAD_YIELD 1
 #cmakedefine HAVE_SCHED_YIELD 1
 #cmakedefine HAVE_POSIX_FADVISE 1
+#cmakedefine HAVE_POSIX_FALLOCATE 1
 #cmakedefine HAVE_MREMAP 1
 
 #cmakedefine HAVE_PRCTL_H 1
-- 
2.11.0


* [PATCH 3/5] xlog: allow to limit number of files deleted by xdir_collect_garbage
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 2/5] wal: preallocate disk space before writing rows Vladimir Davydov
@ 2018-10-07 20:27 ` Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 4/5] wal: notify watchers about wal file removal Vladimir Davydov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

Add an extra argument to xdir_collect_garbage() that specifies the max
number of files to delete. It will be used to delete old WAL files one
by one when running out of disk space.
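
The new contract (stop after max_files deletions, treat a non-positive
limit as "no limit", return the number of deleted files) can be modelled
without touching the file system. In the sketch below struct file_index
and index_collect_garbage() are hypothetical stand-ins for the xdir
index and xdir_collect_garbage(); a sorted array of signatures replaces
the vclock set.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <string.h>

enum { FILE_INDEX_MAX = 16 };

/* Hypothetical stand-in for the xdir index: signatures, ascending. */
struct file_index {
	int64_t sig[FILE_INDEX_MAX];
	int n;
};

/*
 * Drop files whose signature is less than `signature`, oldest first.
 * If max_files > 0, stop after that many files. Returns the number
 * of files dropped.
 */
static int
index_collect_garbage(struct file_index *idx, int64_t signature,
		      int max_files)
{
	assert(idx->n >= 0 && idx->n <= FILE_INDEX_MAX);
	if (max_files <= 0)
		max_files = INT_MAX;
	int count = 0;
	while (count < max_files && count < idx->n &&
	       idx->sig[count] < signature)
		count++;
	/* Shift the surviving entries to the front of the array. */
	memmove(idx->sig, idx->sig + count,
		(size_t)(idx->n - count) * sizeof(idx->sig[0]));
	idx->n -= count;
	return count;
}
```

A caller that wants to free space gradually passes max_files = 1 and
checks the return value: 0 means there is nothing left to delete below
the given signature.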

Needed for #3397
---
 src/box/memtx_engine.c |  2 +-
 src/box/vy_log.c       |  2 +-
 src/box/wal.c          |  2 +-
 src/box/xlog.c         | 13 ++++++++++---
 src/box/xlog.h         |  8 +++++++-
 5 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index ae1f5a0e..04dc052b 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -834,7 +834,7 @@ memtx_engine_collect_garbage(struct engine *engine, int64_t lsn)
 	 * That said, we have to abort garbage collection if we
 	 * fail to delete a snap file.
 	 */
-	if (xdir_collect_garbage(&memtx->snap_dir, lsn, true) != 0)
+	if (xdir_collect_garbage(&memtx->snap_dir, lsn, -1, true) < 0)
 		return -1;
 
 	return 0;
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index fc8ede59..bb454eaa 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -1078,7 +1078,7 @@ vy_log_collect_garbage(int64_t signature)
 	 * it is still needed for backups.
 	 */
 	signature = vy_log_prev_checkpoint(signature);
-	xdir_collect_garbage(&vy_log.dir, signature, true);
+	xdir_collect_garbage(&vy_log.dir, signature, -1, true);
 }
 
 int64_t
diff --git a/src/box/wal.c b/src/box/wal.c
index 91c16fb0..2728318a 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -538,7 +538,7 @@ static int
 wal_collect_garbage_f(struct cbus_call_msg *data)
 {
 	int64_t lsn = ((struct wal_gc_msg *)data)->lsn;
-	xdir_collect_garbage(&wal_writer_singleton.wal_dir, lsn, false);
+	xdir_collect_garbage(&wal_writer_singleton.wal_dir, lsn, -1, false);
 	return 0;
 }
 
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 292c462c..1a6ead7f 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -669,10 +669,16 @@ xdir_format_filename(struct xdir *dir, int64_t signature,
 }
 
 int
-xdir_collect_garbage(struct xdir *dir, int64_t signature, bool use_coio)
+xdir_collect_garbage(struct xdir *dir, int64_t signature,
+		     int max_files, bool use_coio)
 {
+	if (max_files <= 0)
+		max_files = INT_MAX;
+
+	int count = 0;
 	struct vclock *vclock;
-	while ((vclock = vclockset_first(&dir->index)) != NULL &&
+	while (count < max_files &&
+	       (vclock = vclockset_first(&dir->index)) != NULL &&
 	       vclock_sum(vclock) < signature) {
 		char *filename = xdir_format_filename(dir, vclock_sum(vclock),
 						      NONE);
@@ -694,8 +700,9 @@ xdir_collect_garbage(struct xdir *dir, int64_t signature, bool use_coio)
 			say_info("removed %s", filename);
 		vclockset_remove(&dir->index, vclock);
 		free(vclock);
+		count++;
 	}
-	return 0;
+	return count;
 }
 
 void
diff --git a/src/box/xlog.h b/src/box/xlog.h
index f9243935..2233c022 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -180,10 +180,16 @@ xdir_format_filename(struct xdir *dir, int64_t signature,
 
 /**
  * Remove files whose signature is less than specified.
+ * If @max_files > 0, stop after deleting @max_files files.
  * If @use_coio is set, files are deleted by coio threads.
+ *
+ * On success, this function returns the number of deleted
+ * files. If it failed to delete a file, it returns -1 and
+ * sets diag.
  */
 int
-xdir_collect_garbage(struct xdir *dir, int64_t signature, bool use_coio);
+xdir_collect_garbage(struct xdir *dir, int64_t signature,
+		     int max_files, bool use_coio);
 
 /**
  * Remove inprogress files in the specified directory.
-- 
2.11.0


* [PATCH 4/5] wal: notify watchers about wal file removal
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-10-07 20:27 ` [PATCH 3/5] xlog: allow to limit number of files deleted by xdir_collect_garbage Vladimir Davydov
@ 2018-10-07 20:27 ` Vladimir Davydov
  2018-10-07 20:27 ` [PATCH 5/5] wal: delete old wal files when running out of disk space Vladimir Davydov
  2018-10-16 19:05 ` [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if " Konstantin Osipov
  5 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

We will use this event to kill consumers when the WAL thread removes
a WAL file on ENOSPC error.
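
A rough model of how a watcher could react to the new event: ignore
messages without the GC bit, then mark every consumer that is behind
the oldest remaining WAL as inactive. The event bits mirror the
wal_event enum and gc_lsn mirrors the field added to wal_watcher_msg;
struct consumer and on_wal_event() are made up for this example, and a
sorted array replaces the consumer tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the wal_event bit mask. */
enum {
	EVENT_WRITE	= 1 << 0,
	EVENT_ROTATE	= 1 << 1,
	EVENT_GC	= 1 << 2,
};

/* Hypothetical consumer record. */
struct consumer {
	int64_t lsn;		/* oldest LSN the consumer still needs */
	bool is_inactive;
};

/*
 * Handle a WAL event for an array of consumers sorted by LSN in
 * ascending order. Returns the number of consumers deactivated.
 */
static int
on_wal_event(struct consumer *consumers, int n, unsigned events,
	     int64_t gc_lsn)
{
	assert(n >= 0);
	if ((events & EVENT_GC) == 0)
		return 0;
	int count = 0;
	for (int i = 0; i < n && consumers[i].lsn < gc_lsn; i++) {
		if (consumers[i].is_inactive)
			continue;
		consumers[i].is_inactive = true;
		count++;
	}
	return count;
}
```

Because the array is sorted, the loop stops at the first consumer whose
LSN is still covered by the remaining files.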

Needed for #3397
---
 src/box/relay.cc |  8 ++++++--
 src/box/wal.c    | 31 ++++++++++++++++++-------------
 src/box/wal.h    | 20 +++++++++++++++-----
 src/box/xlog.h   | 15 +++++++++++++++
 4 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/src/box/relay.cc b/src/box/relay.cc
index d5df487e..8f1ba6ac 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -396,9 +396,10 @@ relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
 }
 
 static void
-relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
+relay_process_wal_event(struct wal_watcher_msg *msg)
 {
-	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
+	struct relay *relay = container_of(msg->watcher, struct relay,
+					   wal_watcher);
 	if (relay->state != RELAY_FOLLOW) {
 		/*
 		 * Do not try to send anything to the replica
@@ -406,6 +407,9 @@ relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
 		 */
 		return;
 	}
+	unsigned events = msg->events;
+	if ((events & (WAL_EVENT_WRITE | WAL_EVENT_ROTATE)) == 0)
+		return;
 	try {
 		recover_remaining_wals(relay->r, &relay->stream, NULL,
 				       (events & WAL_EVENT_ROTATE) != 0);
diff --git a/src/box/wal.c b/src/box/wal.c
index 2728318a..20b85f43 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -55,6 +55,9 @@ wal_write(struct journal *, struct journal_entry *);
 static int64_t
 wal_write_in_wal_mode_none(struct journal *, struct journal_entry *);
 
+static void
+wal_notify_watchers(struct wal_writer *writer, unsigned events);
+
 /* WAL thread. */
 struct wal_thread {
 	/** 'wal' thread doing the writes. */
@@ -537,8 +540,10 @@ struct wal_gc_msg
 static int
 wal_collect_garbage_f(struct cbus_call_msg *data)
 {
+	struct wal_writer *writer = &wal_writer_singleton;
 	int64_t lsn = ((struct wal_gc_msg *)data)->lsn;
-	xdir_collect_garbage(&wal_writer_singleton.wal_dir, lsn, -1, false);
+	xdir_collect_garbage(&writer->wal_dir, lsn, -1, false);
+	wal_notify_watchers(writer, WAL_EVENT_GC);
 	return 0;
 }
 
@@ -556,9 +561,6 @@ wal_collect_garbage(int64_t lsn)
 	fiber_set_cancellable(cancellable);
 }
 
-static void
-wal_notify_watchers(struct wal_writer *writer, unsigned events);
-
 /**
  * If there is no current WAL, try to open it, and close the
  * previous WAL. We close the previous WAL only after opening
@@ -1008,7 +1010,10 @@ wal_watcher_notify(struct wal_watcher *watcher, unsigned events)
 {
 	assert(!rlist_empty(&watcher->next));
 
-	if (watcher->msg.cmsg.route != NULL) {
+	struct wal_watcher_msg *msg = &watcher->msg;
+	struct wal_writer *writer = &wal_writer_singleton;
+
+	if (msg->cmsg.route != NULL) {
 		/*
 		 * If the notification message is still en route,
 		 * mark the watcher to resend it as soon as it
@@ -1018,19 +1023,19 @@ wal_watcher_notify(struct wal_watcher *watcher, unsigned events)
 		return;
 	}
 
-	watcher->msg.events = events;
-	cmsg_init(&watcher->msg.cmsg, watcher->route);
-	cpipe_push(&watcher->watcher_pipe, &watcher->msg.cmsg);
+	msg->events = events;
+	msg->gc_lsn = xdir_first_vclock(&writer->wal_dir, NULL);
+	if (msg->gc_lsn < 0)
+		msg->gc_lsn = vclock_sum(&writer->vclock);
+	cmsg_init(&msg->cmsg, watcher->route);
+	cpipe_push(&watcher->watcher_pipe, &msg->cmsg);
 }
 
 static void
 wal_watcher_notify_perform(struct cmsg *cmsg)
 {
 	struct wal_watcher_msg *msg = (struct wal_watcher_msg *) cmsg;
-	struct wal_watcher *watcher = msg->watcher;
-	unsigned events = msg->events;
-
-	watcher->cb(watcher, events);
+	msg->watcher->cb(msg);
 }
 
 static void
@@ -1083,7 +1088,7 @@ wal_watcher_detach(void *arg)
 
 void
 wal_set_watcher(struct wal_watcher *watcher, const char *name,
-		void (*watcher_cb)(struct wal_watcher *, unsigned events),
+		void (*watcher_cb)(struct wal_watcher_msg *),
 		void (*process_cb)(struct cbus_endpoint *))
 {
 	assert(journal_is_initialized(&wal_writer_singleton.base));
diff --git a/src/box/wal.h b/src/box/wal.h
index 8ef1fb1d..4867ec3b 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -63,10 +63,18 @@ wal_init(enum wal_mode wal_mode, const char *wal_dirname,
 void
 wal_thread_stop();
 
+/**
+ * A notification message sent from the WAL to a watcher
+ * when a WAL event occurs.
+ */
 struct wal_watcher_msg {
 	struct cmsg cmsg;
+	/** Pointer to the watcher this message is for. */
 	struct wal_watcher *watcher;
+	/** Bit mask of events, see wal_event. */
 	unsigned events;
+	/** Signature of the oldest stored WAL row. */
+	int64_t gc_lsn;
 };
 
 enum wal_event {
@@ -74,13 +82,15 @@ enum wal_event {
 	WAL_EVENT_WRITE		= (1 << 0),
 	/** A new WAL is created. */
 	WAL_EVENT_ROTATE	= (1 << 1),
+	/** One or more old WALs have been deleted. */
+	WAL_EVENT_GC		= (1 << 2),
 };
 
 struct wal_watcher {
 	/** Link in wal_writer::watchers. */
 	struct rlist next;
 	/** The watcher callback function. */
-	void (*cb)(struct wal_watcher *, unsigned events);
+	void (*cb)(struct wal_watcher_msg *);
 	/** Pipe from the watcher to WAL. */
 	struct cpipe wal_pipe;
 	/** Pipe from WAL to the watcher. */
@@ -114,16 +124,16 @@ struct wal_watcher {
  * @param watcher     WAL watcher to register.
  * @param name        Name of the cbus endpoint at the caller's cord.
  * @param watcher_cb  Callback to invoke from the caller's cord
- *                    upon receiving a WAL event. Apart from the
- *                    watcher itself, it takes a bit mask of events.
- *                    Events are described in wal_event enum.
+ *                    upon receiving a WAL event. It takes an object
+ *                    of type wal_watcher_msg that stores a pointer
+ *                    to the watcher and information about the event.
  * @param process_cb  Function called to process cbus messages
  *                    while the watcher is being attached or NULL
  *                    if the cbus loop is running elsewhere.
  */
 void
 wal_set_watcher(struct wal_watcher *watcher, const char *name,
-		void (*watcher_cb)(struct wal_watcher *, unsigned events),
+		void (*watcher_cb)(struct wal_watcher_msg *),
 		void (*process_cb)(struct cbus_endpoint *));
 
 /**
diff --git a/src/box/xlog.h b/src/box/xlog.h
index 2233c022..75bc610b 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -198,6 +198,21 @@ void
 xdir_collect_inprogress(struct xdir *xdir);
 
 /**
+ * Return LSN and vclock (unless @vclock is NULL) of the oldest
+ * file in a directory or -1 if the directory is empty.
+ */
+static inline int64_t
+xdir_first_vclock(struct xdir *xdir, struct vclock *vclock)
+{
+	struct vclock *first = vclockset_first(&xdir->index);
+	if (first == NULL)
+		return -1;
+	if (vclock != NULL)
+		vclock_copy(vclock, first);
+	return vclock_sum(first);
+}
+
+/**
  * Return LSN and vclock (unless @vclock is NULL) of the newest
  * file in a directory or -1 if the directory is empty.
  */
-- 
2.11.0


* [PATCH 5/5] wal: delete old wal files when running out of disk space
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-10-07 20:27 ` [PATCH 4/5] wal: notify watchers about wal file removal Vladimir Davydov
@ 2018-10-07 20:27 ` Vladimir Davydov
  2018-10-16 19:05 ` [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if " Konstantin Osipov
  5 siblings, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-07 20:27 UTC
  To: kostja; +Cc: tarantool-patches

Now if the WAL thread fails to preallocate disk space needed to commit
a transaction, it will delete old WAL files until it succeeds or it
deletes all files that are not needed for local recovery from the oldest
checkpoint. After it deletes a file, it notifies the garbage collector
via the WAL watcher interface. The latter then deactivates consumers
that would need deleted files.
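
The retry logic described above has roughly the following shape. This
is a self-contained simulation, not the wal.c code: struct fake_disk
and all three functions are hypothetical, and the "disk" is just a set
of counters, so the example runs without any I/O.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a disk that holds old WAL files. */
struct fake_disk {
	long free_bytes;	/* space currently available */
	long file_size;		/* size of every old WAL file */
	int old_files;		/* files not needed for local recovery */
	int deleted;		/* files deleted so far */
};

/* Reserve len bytes or fail with ENOSPC. */
static int
fake_fallocate(struct fake_disk *d, long len)
{
	assert(len >= 0);
	if (d->free_bytes < len) {
		errno = ENOSPC;
		return -1;
	}
	d->free_bytes -= len;
	return 0;
}

/* Delete the oldest deletable file; false if none is left. */
static bool
fake_delete_oldest(struct fake_disk *d)
{
	if (d->old_files == 0)
		return false;
	d->old_files--;
	d->deleted++;
	d->free_bytes += d->file_size;
	return true;
}

/*
 * Try to reserve len bytes. On ENOSPC delete old files one by one
 * and retry; give up once nothing deletable remains.
 */
static int
fallocate_with_gc(struct fake_disk *d, long len)
{
	while (fake_fallocate(d, len) != 0) {
		if (errno != ENOSPC)
			return -1;
		if (!fake_delete_oldest(d)) {
			errno = ENOSPC;
			return -1;
		}
	}
	return 0;
}
```

Deleting one file per iteration keeps the number of removed files to
the minimum needed for the allocation to succeed.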

The user doesn't see an ENOSPC error if the WAL thread successfully
allocates disk space after deleting old files. Here's what's printed
to the log when this happens:

  wal/101/main C> ran out of disk space, try to delete old WAL files
  wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000005.xlog
  wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000006.xlog
  wal/101/main I> removed /home/vlad/src/tarantool/test/var/001_replication/master/00000000000000000007.xlog
  main/105/main C> deactivated WAL consumer replica 82d0fa3f-6881-4bc5-a2c0-a0f5dcf80120 at {1: 5}
  main/105/main C> deactivated WAL consumer replica 98dce0a8-1213-4824-b31e-c7e3c4eaf437 at {1: 7}

Closes #3397
---
 src/box/box.cc                        |   9 +-
 src/box/gc.c                          |  67 +++++++++-
 src/box/gc.h                          |  31 +++++
 src/box/wal.c                         |  74 ++++++++---
 src/box/wal.h                         |  15 ++-
 src/box/xlog.c                        |   8 ++
 src/errinj.h                          |   1 +
 test/box/errinj.result                |   2 +
 test/replication/gc_no_space.result   | 234 ++++++++++++++++++++++++++++++++++
 test/replication/gc_no_space.test.lua | 103 +++++++++++++++
 test/replication/suite.ini            |   2 +-
 11 files changed, 514 insertions(+), 32 deletions(-)
 create mode 100644 test/replication/gc_no_space.result
 create mode 100644 test/replication/gc_no_space.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 7e32b9fc..409897f6 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2093,14 +2093,19 @@ box_cfg_xc(void)
 		}
 	}
 
+	struct gc_checkpoint *first_checkpoint = gc_first_checkpoint();
+	assert(first_checkpoint != NULL);
+
 	/* Start WAL writer */
 	int64_t wal_max_rows = box_check_wal_max_rows(cfg_geti64("rows_per_wal"));
 	int64_t wal_max_size = box_check_wal_max_size(cfg_geti64("wal_max_size"));
 	enum wal_mode wal_mode = box_check_wal_mode(cfg_gets("wal_mode"));
-	if (wal_init(wal_mode, cfg_gets("wal_dir"), &INSTANCE_UUID,
-		      &replicaset.vclock, wal_max_rows, wal_max_size)) {
+	if (wal_init(wal_mode, cfg_gets("wal_dir"), wal_max_rows,
+		     wal_max_size, &INSTANCE_UUID, &replicaset.vclock,
+		     vclock_sum(&first_checkpoint->vclock))) {
 		diag_raise();
 	}
+	gc_set_wal_watcher();
 
 	rmean_cleanup(rmean_box);
 
diff --git a/src/box/gc.c b/src/box/gc.c
index becb5d09..c078a419 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -48,6 +48,7 @@
 #include "say.h"
 #include "latch.h"
 #include "vclock.h"
+#include "cbus.h"
 #include "engine.h"		/* engine_collect_garbage() */
 #include "wal.h"		/* wal_collect_garbage() */
 
@@ -102,9 +103,24 @@ gc_init(void)
 	latch_create(&gc.latch);
 }
 
+static void
+gc_process_wal_event(struct wal_watcher_msg *);
+
+void
+gc_set_wal_watcher(void)
+{
+	wal_set_watcher(&gc.wal_watcher, "tx", gc_process_wal_event,
+			cbus_process);
+}
+
 void
 gc_free(void)
 {
+	/*
+	 * Can't clear the WAL watcher as the event loop isn't
+	 * running when this function is called.
+	 */
+
 	/* Free checkpoints. */
 	struct gc_checkpoint *checkpoint, *next_checkpoint;
 	rlist_foreach_entry_safe(checkpoint, &gc.checkpoints, in_checkpoints,
@@ -175,6 +191,9 @@ gc_run(void)
 	if (!run_engine_gc && !run_wal_gc)
 		return; /* nothing to do */
 
+	int64_t wal_lsn = vclock_sum(vclock);
+	int64_t checkpoint_lsn = vclock_sum(&checkpoint->vclock);
+
 	/*
 	 * Engine callbacks may sleep, because they use coio for
 	 * removing files. Make sure we won't try to remove the
@@ -191,12 +210,45 @@ gc_run(void)
 	 */
 	int rc = 0;
 	if (run_engine_gc)
-		rc = engine_collect_garbage(vclock_sum(&checkpoint->vclock));
-	if (run_wal_gc && rc == 0)
-		wal_collect_garbage(vclock_sum(vclock));
+		rc = engine_collect_garbage(checkpoint_lsn);
+	/*
+	 * Run wal_collect_garbage() even if we don't need to
+	 * delete any WAL files to apprise the WAL thread of
+	 * the oldest checkpoint signature.
+	 */
+	if (rc == 0)
+		wal_collect_garbage(wal_lsn, checkpoint_lsn);
 	latch_unlock(&gc.latch);
 }
 
+/**
+ * Deactivate consumers that need files deleted by the WAL thread.
+ */
+static void
+gc_process_wal_event(struct wal_watcher_msg *msg)
+{
+	if ((msg->events & WAL_EVENT_GC) == 0)
+		return;
+
+	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
+	while (consumer != NULL &&
+	       vclock_sum(&consumer->vclock) < msg->gc_lsn) {
+		struct gc_consumer *next = gc_tree_next(&gc.consumers,
+							consumer);
+		assert(!consumer->is_inactive);
+		consumer->is_inactive = true;
+		gc_tree_remove(&gc.consumers, consumer);
+
+		char *vclock_str = vclock_to_string(&consumer->vclock);
+		say_crit("deactivated WAL consumer %s at %s",
+			 consumer->name, vclock_str);
+		free(vclock_str);
+
+		consumer = next;
+	}
+	gc_run();
+}
+
 void
 gc_set_min_checkpoint_count(int min_checkpoint_count)
 {
@@ -279,14 +331,19 @@ gc_consumer_register(const struct vclock *vclock, const char *format, ...)
 void
 gc_consumer_unregister(struct gc_consumer *consumer)
 {
-	gc_tree_remove(&gc.consumers, consumer);
+	if (!consumer->is_inactive) {
+		gc_tree_remove(&gc.consumers, consumer);
+		gc_run();
+	}
 	gc_consumer_delete(consumer);
-	gc_run();
 }
 
 void
 gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock)
 {
+	if (consumer->is_inactive)
+		return;
+
 	int64_t signature = vclock_sum(vclock);
 	int64_t prev_signature = vclock_sum(&consumer->vclock);
 
diff --git a/src/box/gc.h b/src/box/gc.h
index a5392cef..e1241baa 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -36,6 +36,7 @@
 
 #include "vclock.h"
 #include "latch.h"
+#include "wal.h"
 #include "trivia/util.h"
 
 #if defined(__cplusplus)
@@ -89,6 +90,11 @@ struct gc_consumer {
 	char name[GC_NAME_MAX];
 	/** The vclock tracked by this consumer. */
 	struct vclock vclock;
+	/**
+	 * This flag is set if a WAL needed by this consumer was
+	 * deleted by the WAL thread on ENOSPC.
+	 */
+	bool is_inactive;
 };
 
 typedef rb_tree(struct gc_consumer) gc_tree_t;
@@ -120,6 +126,11 @@ struct gc_state {
 	 * garbage collection callbacks.
 	 */
 	struct latch latch;
+	/**
+	 * WAL event watcher. Needed to shoot off stale consumers
+	 * when a WAL file is deleted due to ENOSPC.
+	 */
+	struct wal_watcher wal_watcher;
 };
 extern struct gc_state gc;
 
@@ -145,6 +156,20 @@ extern struct gc_state gc;
 	rlist_foreach_entry(ref, &(checkpoint)->refs, in_refs)
 
 /**
+ * Return the first (oldest) checkpoint known to the garbage
+ * collector. If there's no checkpoint, return NULL.
+ */
+static inline struct gc_checkpoint *
+gc_first_checkpoint(void)
+{
+	if (rlist_empty(&gc.checkpoints))
+		return NULL;
+
+	return rlist_first_entry(&gc.checkpoints, struct gc_checkpoint,
+				 in_checkpoints);
+}
+
+/**
  * Return the last (newest) checkpoint known to the garbage
  * collector. If there's no checkpoint, return NULL.
  */
@@ -165,6 +190,12 @@ void
 gc_init(void);
 
 /**
+ * Set WAL watcher. Called after WAL is initialized.
+ */
+void
+gc_set_wal_watcher(void);
+
+/**
  * Destroy the garbage collection state.
  */
 void
diff --git a/src/box/wal.c b/src/box/wal.c
index 20b85f43..6e7a6b3f 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -111,6 +111,12 @@ struct wal_writer
 	 * with this LSN and LSN becomes "real".
 	 */
 	struct vclock vclock;
+	/**
+	 * Signature of the oldest checkpoint available on the instance.
+	 * The WAL writer must not delete WAL files that are needed to
+	 * recover from it even if it is running out of disk space.
+	 */
+	int64_t checkpoint_lsn;
 	/** The current WAL file. */
 	struct xlog current_wal;
 	/**
@@ -282,9 +288,9 @@ tx_schedule_rollback(struct cmsg *msg)
  */
 static void
 wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
-		  const char *wal_dirname, const struct tt_uuid *instance_uuid,
-		  struct vclock *vclock, int64_t wal_max_rows,
-		  int64_t wal_max_size)
+		  const char *wal_dirname, int64_t wal_max_rows,
+		  int64_t wal_max_size, const struct tt_uuid *instance_uuid,
+		  const struct vclock *vclock, int64_t checkpoint_lsn)
 {
 	writer->wal_mode = wal_mode;
 	writer->wal_max_rows = wal_max_rows;
@@ -304,6 +310,7 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 	vclock_create(&writer->vclock);
 	vclock_copy(&writer->vclock, vclock);
 
+	writer->checkpoint_lsn = checkpoint_lsn;
 	rlist_create(&writer->watchers);
 }
 
@@ -407,16 +414,16 @@ wal_open(struct wal_writer *writer)
  *        mode are closed. WAL thread has been started.
  */
 int
-wal_init(enum wal_mode wal_mode, const char *wal_dirname,
-	 const struct tt_uuid *instance_uuid, struct vclock *vclock,
-	 int64_t wal_max_rows, int64_t wal_max_size)
+wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
+	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
+	 const struct vclock *vclock, int64_t first_checkpoint_lsn)
 {
 	assert(wal_max_rows > 1);
 
 	struct wal_writer *writer = &wal_writer_singleton;
-
-	wal_writer_create(writer, wal_mode, wal_dirname, instance_uuid,
-			  vclock, wal_max_rows, wal_max_size);
+	wal_writer_create(writer, wal_mode, wal_dirname, wal_max_rows,
+			  wal_max_size, instance_uuid, vclock,
+			  first_checkpoint_lsn);
 
 	/*
 	 * Scan the WAL directory to build an index of all
@@ -534,27 +541,30 @@ wal_checkpoint(struct vclock *vclock, bool rotate)
 struct wal_gc_msg
 {
 	struct cbus_call_msg base;
-	int64_t lsn;
+	int64_t wal_lsn;
+	int64_t checkpoint_lsn;
 };
 
 static int
 wal_collect_garbage_f(struct cbus_call_msg *data)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
-	int64_t lsn = ((struct wal_gc_msg *)data)->lsn;
-	xdir_collect_garbage(&writer->wal_dir, lsn, -1, false);
+	struct wal_gc_msg *msg = (struct wal_gc_msg *)data;
+	writer->checkpoint_lsn = msg->checkpoint_lsn;
+	xdir_collect_garbage(&writer->wal_dir, msg->wal_lsn, -1, false);
 	wal_notify_watchers(writer, WAL_EVENT_GC);
 	return 0;
 }
 
 void
-wal_collect_garbage(int64_t lsn)
+wal_collect_garbage(int64_t wal_lsn, int64_t checkpoint_lsn)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->wal_mode == WAL_NONE)
 		return;
 	struct wal_gc_msg msg;
-	msg.lsn = lsn;
+	msg.wal_lsn = wal_lsn;
+	msg.checkpoint_lsn = checkpoint_lsn;
 	bool cancellable = fiber_set_cancellable(false);
 	cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_prio_pipe, &msg.base,
 		  wal_collect_garbage_f, NULL, TIMEOUT_INFINITY);
@@ -614,15 +624,43 @@ wal_opt_rotate(struct wal_writer *writer)
 /**
  * Make sure there's enough disk space to write @len bytes
  * of data to the current WAL.
+ *
+ * If fallocate() fails with ENOSPC, delete old WAL files
+ * that are not needed for recovery and retry.
  */
 static int
 wal_fallocate(struct wal_writer *writer, size_t len)
 {
-	if (xlog_fallocate(&writer->current_wal, len) < 0) {
-		diag_log();
-		return -1;
+	bool warn_no_space = true;
+retry:
+	if (xlog_fallocate(&writer->current_wal, len) >= 0) {
+		diag_clear(diag_get());
+		return 0;
 	}
-	return 0;
+	if (errno != ENOSPC)
+		goto error;
+
+	if (warn_no_space) {
+		say_crit("ran out of disk space, try to delete old WAL files");
+		warn_no_space = false;
+	}
+
+	/* Keep the original error. */
+	struct diag diag;
+	diag_create(&diag);
+	diag_move(diag_get(), &diag);
+	int rc = xdir_collect_garbage(&writer->wal_dir, writer->checkpoint_lsn,
+				      1, false);
+	diag_move(&diag, diag_get());
+	diag_destroy(&diag);
+	if (rc <= 0)
+		goto error;
+
+	wal_notify_watchers(writer, WAL_EVENT_GC);
+	goto retry;
+error:
+	diag_log();
+	return -1;
 }
 
 static void
diff --git a/src/box/wal.h b/src/box/wal.h
index 4867ec3b..6d5ee0a6 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -56,9 +56,9 @@ void
 wal_thread_start();
 
 int
-wal_init(enum wal_mode wal_mode, const char *wal_dirname,
-	 const struct tt_uuid *instance_uuid, struct vclock *vclock,
-	 int64_t wal_max_rows, int64_t wal_max_size);
+wal_init(enum wal_mode wal_mode, const char *wal_dirname, int64_t wal_max_rows,
+	 int64_t wal_max_size, const struct tt_uuid *instance_uuid,
+	 const struct vclock *vclock, int64_t first_checkpoint_lsn);
 
 void
 wal_thread_stop();
@@ -165,11 +165,14 @@ int
 wal_checkpoint(struct vclock *vclock, bool rotate);
 
 /**
- * Remove WAL files that are not needed to recover
- * from snapshot with @lsn or newer.
+ * Remove all WAL files whose signature is less than @wal_lsn.
+ * Update the oldest checkpoint signature with @checkpoint_lsn.
+ * WAL thread will delete WAL files that are not needed to
+ * recover from the oldest checkpoint if it runs out of disk
+ * space.
  */
 void
-wal_collect_garbage(int64_t lsn);
+wal_collect_garbage(int64_t wal_lsn, int64_t checkpoint_lsn);
 
 void
 wal_init_vy_log();
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 1a6ead7f..bc558593 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -1027,6 +1027,14 @@ xlog_write_error(struct xlog *log)
 ssize_t
 xlog_fallocate(struct xlog *log, size_t len)
 {
+	struct errinj *inj = errinj(ERRINJ_XLOG_FALLOCATE, ERRINJ_INT);
+	if (inj != NULL && inj->iparam > 0) {
+		inj->iparam--;
+		diag_set(ClientError, ER_INJECTION, "xlog fallocate");
+		errno = ENOSPC;
+		return -1;
+	}
+
 #ifdef HAVE_POSIX_FALLOCATE
 	if (log->alloc_len > len)
 		return log->alloc_len;
diff --git a/src/errinj.h b/src/errinj.h
index 84a1fbb5..19304f8e 100644
--- a/src/errinj.h
+++ b/src/errinj.h
@@ -102,6 +102,7 @@ struct errinj {
 	_(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_META, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_READ, ERRINJ_INT, {.iparam = -1}) \
+	_(ERRINJ_XLOG_FALLOCATE, ERRINJ_INT, {.iparam = 0}) \
 	_(ERRINJ_VYRUN_INDEX_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_VYRUN_DATA_READ, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_BUILD_INDEX, ERRINJ_INT, {.iparam = -1}) \
diff --git a/test/box/errinj.result b/test/box/errinj.result
index c4a1326c..f7140143 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -24,6 +24,8 @@ errinj.info()
     state: 0
   ERRINJ_VY_SCHED_TIMEOUT:
     state: 0
+  ERRINJ_XLOG_FALLOCATE:
+    state: 0
   ERRINJ_WAL_WRITE_PARTIAL:
     state: -1
   ERRINJ_VY_GC:
diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
new file mode 100644
index 00000000..a84ae2db
--- /dev/null
+++ b/test/replication/gc_no_space.result
@@ -0,0 +1,234 @@
+--
+-- This test checks that when the WAL thread runs out of disk
+-- space it automatically deletes old WAL files and notifies
+-- the TX thread so that the latter can shoot off WAL consumers
+-- that need them. See gh-3397.
+--
+test_run = require('test_run').new()
+---
+...
+engine = test_run:get_cfg('engine')
+---
+...
+fio = require('fio')
+---
+...
+errinj = box.error.injection
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function check_file_count(dir, glob, count)
+    local files = fio.glob(fio.pathjoin(dir, glob))
+    if #files == count then
+        return true
+    end
+    return false, files
+end;
+---
+...
+function check_wal_count(count)
+    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+end;
+---
+...
+function check_snap_count(count)
+    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+default_checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 2}
+---
+...
+test_run:cleanup_cluster()
+---
+...
+box.schema.user.grant('guest', 'replication')
+---
+...
+s = box.schema.space.create('test', {engine = engine})
+---
+...
+_ = s:create_index('pk')
+---
+...
+box.snapshot()
+---
+- ok
+...
+--
+-- Create a few dead replicas to pin WAL files.
+--
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+s:auto_increment{}
+---
+- [1]
+...
+box.snapshot()
+---
+- ok
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+s:auto_increment{}
+---
+- [2]
+...
+box.snapshot()
+---
+- ok
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+test_run:cmd("delete server replica")
+---
+- true
+...
+--
+-- Make a few checkpoints and check that old WAL files are not
+-- deleted.
+--
+s:auto_increment{}
+---
+- [3]
+...
+box.snapshot()
+---
+- ok
+...
+s:auto_increment{}
+---
+- [4]
+...
+box.snapshot()
+---
+- ok
+...
+s:auto_increment{}
+---
+- [5]
+...
+check_wal_count(7)
+---
+- true
+...
+check_snap_count(2)
+---
+- true
+...
+#box.info.gc().consumers -- 3
+---
+- 3
+...
+--
+-- Inject a ENOSPC error and check that the WAL thread deletes
+-- old WAL files to prevent the user from seeing the error.
+--
+errinj.set('ERRINJ_XLOG_FALLOCATE', 3)
+---
+- ok
+...
+s:auto_increment{} -- success
+---
+- [6]
+...
+errinj.info()['ERRINJ_XLOG_FALLOCATE'].state -- 0
+---
+- 0
+...
+check_wal_count(3)
+---
+- true
+...
+check_snap_count(2)
+---
+- true
+...
+#box.info.gc().consumers -- 1
+---
+- 1
+...
+--
+-- Check that the WAL thread never deletes WAL files that are
+-- needed for recovery from a checkpoint.
+--
+errinj.set('ERRINJ_XLOG_FALLOCATE', 2)
+---
+- ok
+...
+s:auto_increment{} -- failure
+---
+- error: Failed to write to disk
+...
+errinj.info()['ERRINJ_XLOG_FALLOCATE'].state -- 0
+---
+- 0
+...
+check_wal_count(2)
+---
+- true
+...
+check_snap_count(2)
+---
+- true
+...
+#box.info.gc().consumers -- 0
+---
+- 0
+...
+s:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
+test_run:cleanup_cluster()
+---
+...
+box.cfg{checkpoint_count = default_checkpoint_count}
+---
+...
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
new file mode 100644
index 00000000..32ad18f0
--- /dev/null
+++ b/test/replication/gc_no_space.test.lua
@@ -0,0 +1,103 @@
+--
+-- This test checks that when the WAL thread runs out of disk
+-- space it automatically deletes old WAL files and notifies
+-- the TX thread so that the latter can shoot off WAL consumers
+-- that need them. See gh-3397.
+--
+test_run = require('test_run').new()
+engine = test_run:get_cfg('engine')
+
+fio = require('fio')
+errinj = box.error.injection
+
+test_run:cmd("setopt delimiter ';'")
+function check_file_count(dir, glob, count)
+    local files = fio.glob(fio.pathjoin(dir, glob))
+    if #files == count then
+        return true
+    end
+    return false, files
+end;
+function check_wal_count(count)
+    return check_file_count(box.cfg.wal_dir, '*.xlog', count)
+end;
+function check_snap_count(count)
+    return check_file_count(box.cfg.memtx_dir, '*.snap', count)
+end;
+test_run:cmd("setopt delimiter ''");
+
+default_checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 2}
+
+test_run:cleanup_cluster()
+box.schema.user.grant('guest', 'replication')
+s = box.schema.space.create('test', {engine = engine})
+_ = s:create_index('pk')
+box.snapshot()
+
+--
+-- Create a few dead replicas to pin WAL files.
+--
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+
+s:auto_increment{}
+box.snapshot()
+
+test_run:cmd("start server replica")
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+
+s:auto_increment{}
+box.snapshot()
+
+test_run:cmd("start server replica")
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+test_run:cmd("delete server replica")
+
+--
+-- Make a few checkpoints and check that old WAL files are not
+-- deleted.
+--
+s:auto_increment{}
+box.snapshot()
+s:auto_increment{}
+box.snapshot()
+s:auto_increment{}
+
+check_wal_count(7)
+check_snap_count(2)
+#box.info.gc().consumers -- 3
+
+--
+-- Inject a ENOSPC error and check that the WAL thread deletes
+-- old WAL files to prevent the user from seeing the error.
+--
+errinj.set('ERRINJ_XLOG_FALLOCATE', 3)
+s:auto_increment{} -- success
+errinj.info()['ERRINJ_XLOG_FALLOCATE'].state -- 0
+
+check_wal_count(3)
+check_snap_count(2)
+#box.info.gc().consumers -- 1
+
+--
+-- Check that the WAL thread never deletes WAL files that are
+-- needed for recovery from a checkpoint.
+--
+errinj.set('ERRINJ_XLOG_FALLOCATE', 2)
+s:auto_increment{} -- failure
+errinj.info()['ERRINJ_XLOG_FALLOCATE'].state -- 0
+
+check_wal_count(2)
+check_snap_count(2)
+#box.info.gc().consumers -- 0
+
+s:drop()
+box.schema.user.revoke('guest', 'replication')
+test_run:cleanup_cluster()
+
+box.cfg{checkpoint_count = default_checkpoint_count}
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index f4abc7af..569c9048 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 long_run = prune.test.lua
-- 
2.11.0


* Re: [PATCH 1/5] xlog: fix filename in error messages
  2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
@ 2018-10-12  8:19   ` Vladimir Davydov
  2018-10-16 19:07   ` [tarantool-patches] " Konstantin Osipov
  1 sibling, 0 replies; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-12  8:19 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed this one to 1.10 as trivial. The rest of the series still needs
a review.


* [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  2018-10-07 20:27 [PATCH 0/5] Delete old WAL files if running out of disk space Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-10-07 20:27 ` [PATCH 5/5] wal: delete old wal files when running out of disk space Vladimir Davydov
@ 2018-10-16 19:05 ` Konstantin Osipov
  2018-10-17  8:20   ` Vladimir Davydov
  5 siblings, 1 reply; 13+ messages in thread
From: Konstantin Osipov @ 2018-10-16 19:05 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]:
> If a replica permanently stops working for some reason, it will pin WAL
> files it would need to resume until it is deleted from the _cluster
> system space or the master is restarted. This happens in production when
> an admin drops a replica and forgets to remove it from the master, and
> this is quite annoying, because it may result in ENOSPC errors on the
> master.

I started benching this patch to check whether fallocate() introduces a
performance regression and discovered that there is a general 45% regression
between 1.6 and 1.10.

I hope that now that both A. Lyapunov and I have pointed it out, it will
finally be addressed.

In any case we need to measure fallocate() impact very carefully
before adding it. It seems we make things unnecessarily
complicated, all in order to spare the user from spurious ENOSPC.
In my opinion it's a non-goal.

If we get rid of this requirement, we don't need fallocate(), and
the patch could be made simpler in a couple more dimensions.

Please consider making a trivial patch which follows the steps of
the patch by @belyak

Thanks,

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


* [tarantool-patches] Re: [PATCH 1/5] xlog: fix filename in error messages
  2018-10-07 20:27 ` [PATCH 1/5] xlog: fix filename in error messages Vladimir Davydov
  2018-10-12  8:19   ` Vladimir Davydov
@ 2018-10-16 19:07   ` Konstantin Osipov
  1 sibling, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2018-10-16 19:07 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]:
>  - xlog_rename() doesn't strip xlog->filename of inprogress suffix so
>    write errors will mistakenly report the filename as inprogress.
>  - xlog_create() uses a name without inprogress suffix for error
>    reporting while it actually creates an inprogress file.

This fragment is trivial and is OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


* [tarantool-patches] Re: [PATCH 2/5] wal: preallocate disk space before writing rows
  2018-10-07 20:27 ` [PATCH 2/5] wal: preallocate disk space before writing rows Vladimir Davydov
@ 2018-10-16 19:09   ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2018-10-16 19:09 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]:
> This patch introduces a new xlog method xlog_fallocate() that makes
> sure that the requested amount of disk space is available at the current
> write position. It does that with posix_fallocate(). The new method is
> called before writing anything to WAL. In order not to invoke a system
> call too often, xlog_fallocate() allocates more than requested.
> 
> The primary reason why I'm doing this is that I want to have a single
> and clearly defined point in the code to handle ENOSPC errors, where I
> could delete old WALs and retry (this is what #3397 is about). I could
> probably handle ENOSPC returned by xlog_tx_commit(), but that would look
> suspicious, because this function can write half a transaction before it
> hits ENOSPC, after which it truncates the file back. It's unclear what
> happens if, for instance, a replication thread reads that transient
> data.
> 
> Anyway, preallocating disk space in big chunks is a worthwhile feature
> by itself, because it should reduce the number of writes to the inode
> table.

This patch needs to be measured carefully, or better yet let's not
do it at all (or find a way to use fallocate() to speed things up,
not potentially slow them down).

Before we measure this patch we need to measure the general performance
regression that is currently observable in 1.9+ releases.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


* Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  2018-10-16 19:05 ` [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if " Konstantin Osipov
@ 2018-10-17  8:20   ` Vladimir Davydov
  2018-10-23  8:37     ` Vladimir Davydov
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-17  8:20 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Oct 16, 2018 at 10:05:22PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 13:52]:
> > If a replica permanently stops working for some reason, it will pin WAL
> > files it would need to resume until it is deleted from the _cluster
> > system space or the master is restarted. This happens in production when
> > an admin drops a replica and forgets to remove it from the master, and
> > this is quite annoying, because it may result in ENOSPC errors on the
> > master.
> 
> I started benching this patch to check whether fallocate() introduces a
> performance regression and discovered that there is a general 45% regression
> between 1.6 and 1.10.
> 
> I hope that now that both A. Lyapunov and I have pointed it out, it will
> finally be addressed.
> 
> In any case we need to measure fallocate() impact very carefully
> before adding it. It seems we make things unnecessarily
> complicated, all in order to spare the user from spurious ENOSPC.
> In my opinion it's a non-goal.
> 
> If we get rid of this requirement, we don't need fallocate(), and
> the patch could be made simpler in a couple more dimensions.

This requirement is a must IMO. What's the point of returning ENOSPC and
alerting the user if we can avoid that? I can foresee users complaining
about it and opening issues, like "spurious ENOSPC when there's enough
disk space" (because there will be enough disk space once gc has removed
stale replicas).

> 
> Please consider making a trivial patch which follows the steps of
> the patch by @belyak

It's not trivial and it's ugly. I wonder why you fail to see that.
Ping-ponging messages from WAL to TX in order to remove files?
Introducing yet another pipe for that. What for? WAL thread already has
all the necessary information about xlog files. It just needs to be told
what's the oldest WAL row it has to preserve in any case.

Moreover, shooting off consumers before deleting WAL files is
semantically incorrect, because the garbage collector knows nothing
about WAL files. For GC there's a continuous range of WAL rows it
tracks. Dividing those rows into files is the business of the WAL thread.
So how's it going to work if we make the TX thread responsible for
triggering WAL file deletion? WAL sends an ENOSPC signal to TX. TX shoots
off a consumer. WAL retries and gets ENOSPC again, because no file was
deleted! It sends ENOSPC to TX again, and so forth. Do you really want this?!

The design proposed in this patch is simple and clear. When invoked, GC
lets WAL know about rows that can be pruned right now and rows that can
be pruned in case of emergency. When hitting ENOSPC, WAL deletes old
WALs on its own based on this information and notifies TX via the
existing notification subsystem (wal_watcher) so that the latter can
shoot off replicas that would need those files.
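
The flow described above can be sketched as a small standalone model. This
is only an illustration under stated assumptions, not the code from the
patch: struct wal_model and the model_* functions are hypothetical names, a
WAL file is represented only by the first LSN it contains, and a file is
treated as deletable when the next file starts at or before the oldest
checkpoint signature.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the WAL writer state; not the Tarantool structs. */
struct wal_model {
	int64_t files[16];      /* first LSN of each WAL file, ascending */
	size_t file_count;
	int64_t checkpoint_lsn; /* signature of the oldest checkpoint */
	int enospc_left;        /* allocations that will still fail */
	int gc_events;          /* notifications "sent" to TX */
};

/* Stand-in for the allocation step: fails while the "disk" is full. */
static int
model_fallocate(struct wal_model *w)
{
	if (w->enospc_left > 0) {
		w->enospc_left--;
		errno = ENOSPC;
		return -1;
	}
	return 0;
}

/*
 * Delete the oldest file unless it is needed to recover from the
 * oldest checkpoint. Returns 1 if a file was deleted, 0 otherwise.
 */
static int
model_delete_oldest(struct wal_model *w)
{
	if (w->file_count < 2 || w->files[1] > w->checkpoint_lsn)
		return 0;
	for (size_t i = 1; i < w->file_count; i++)
		w->files[i - 1] = w->files[i];
	w->file_count--;
	return 1;
}

/* WAL side: retry the allocation, deleting one old file per attempt. */
static int
model_wal_fallocate(struct wal_model *w)
{
	while (model_fallocate(w) != 0) {
		if (errno != ENOSPC)
			return -1;
		if (model_delete_oldest(w) == 0) {
			errno = ENOSPC; /* nothing left to delete */
			return -1;
		}
		w->gc_events++; /* TX learns about the removal afterwards */
	}
	return 0;
}

/* TX side: count consumers that fell behind the oldest remaining file. */
static size_t
model_count_stale(const struct wal_model *w, const int64_t *consumer_lsn,
		  size_t n)
{
	assert(w->file_count > 0);
	size_t stale = 0;
	for (size_t i = 0; i < n; i++) {
		if (consumer_lsn[i] < w->files[0])
			stale++;
	}
	return stale;
}
```

The point of the model is that the WAL side decides which files go, one at
a time, and never crosses the checkpoint boundary, while the TX side only
reacts to what has already been deleted.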

Regarding usage of fallocate: I could implement this patch without it,
but it would be a bit more difficult, because there wouldn't be a clear
point of ENOSPC failure. Besides, what would happen if we wrote half of
a transaction to disk? How would replication work then? BTW, triggering
WAL deletion on behalf of TX suffers from the very same problem. That is,
we risk not only returning ENOSPC to the user, but also breaking
replication in a peculiar way.

Anyway, you seem to be unaware of the fact that one of the fallocate use
cases is speeding up writes by reducing the number of file size updates
(which require a write to the inode table). I wrote a simple test that
demonstrates that, see below.

vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test
vlad@esperanza test$ ./fallocate_test
Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size>
filename - test file
write_count - number of writes (append)
write_size - write(2) size
alloc_size - fallocate(2) size
returns time in seconds
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.548161
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
1.505698
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.195223
vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
1.137996

I have stock ext4 over an HDD on my laptop, configured by Debian, no
tuning. The code is right below. That is, for an append-only workload
similar to WAL, fallocate yields a ~25% gain.

True, fallocate might need some tuning (how much to allocate for
different write sizes), but it's something we definitely want to have on
board.

vlad@esperanza test$ cat fallocate_test.c
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <time.h>

double gettime(void)
{
        struct timespec ts;
        if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
                perror("clock_gettime");
                exit(EXIT_FAILURE);
        }
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        if (argc != 5) {
                fprintf(stderr, "Usage: %s <filename> <write_count> "
                        "<write_size> <alloc_size>\n"
                        "filename - test file\n"
                        "write_count - number of writes (append)\n"
                        "write_size - write(2) size\n"
                        "alloc_size - fallocate(2) size\n"
                        "returns time in seconds\n",
                        argv[0]);
                return -1;
        }

        const char *filename = argv[1];
        int write_count = atoi(argv[2]);
        int write_size = atoi(argv[3]);
        int alloc_size = atoi(argv[4]);

        char *buf = malloc(write_size);
        if (buf == NULL) {
                perror("malloc");
                exit(EXIT_FAILURE);
        }
        memset(buf, 1, write_size);

        int fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
        if (fd < 0) {
                perror("open");
                exit(EXIT_FAILURE);
        }

        double t1 = gettime();

        off_t offset = 0;
        int prealloced = 0;
        for (int i = 0; i < write_count; i++) {
                if (alloc_size > 0 && prealloced < write_size) {
                        errno = posix_fallocate(fd, offset, alloc_size);
                        if (errno != 0) {
                                perror("posix_fallocate");
                                exit(EXIT_FAILURE);
                        }
                        prealloced += alloc_size;
                }
                ssize_t written = write(fd, buf, write_size);
                if (written < 0) {
                        perror("write");
                        exit(EXIT_FAILURE);
                }
                offset += written;
                prealloced -= written;
                if (prealloced < 0)
                        prealloced = 0;
        }

        double t2 = gettime();

        close(fd);
        unlink(filename);
        free(buf);

        printf("%f\n", t2 - t1);
        return 0;
}


* Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  2018-10-17  8:20   ` Vladimir Davydov
@ 2018-10-23  8:37     ` Vladimir Davydov
  2018-10-23  8:46       ` Konstantin Osipov
  0 siblings, 1 reply; 13+ messages in thread
From: Vladimir Davydov @ 2018-10-23  8:37 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Oct 17, 2018 at 11:20:58AM +0300, Vladimir Davydov wrote:
> Anyway, you seem to be unaware of the fact that one of the fallocate use
> cases is speeding up writes by reducing the number of file size updates
> (which require a write to the inode table). I wrote a simple test that
> demonstrates that, see below.
> 
> vlad@esperanza test$ gcc -O2 fallocate_test.c -o fallocate_test
> vlad@esperanza test$ ./fallocate_test
> Usage: ./fallocate_test <filename> <write_count> <write_size> <alloc_size>
> filename - test file
> write_count - number of writes (append)
> write_size - write(2) size
> alloc_size - fallocate(2) size
> returns time in seconds
> vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
> 1.548161
> vlad@esperanza test$ ./fallocate_test dummy 1000000 100 0
> 1.505698
> vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
> 1.195223
> vlad@esperanza test$ ./fallocate_test dummy 1000000 100 100000
> 1.137996
> 
> I have stock ext4 over hdd on my laptop, configured by Debian, no
> tuning. The code is right below. That is, for an append-only workload
> similar to WAL, fallocate yields a ~25% gain.

Alas, I was wrong. Even though the synthetic test shows a great
improvement when fallocate() is used, there's practically no difference
in case of nosqlbench. So the feature doesn't seem to be worthwhile.
I guess I'll reimplement WAL auto-removal without fallocate().
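
A minimal sketch of what auto-removal without fallocate() could look
like: try the write, and on ENOSPC delete the oldest file and retry.
This is an illustration only, not code from the patch set. The names
wal_files, wal_files_remove_oldest and append_with_gc are hypothetical,
and the sketch ignores replicas that may still need the deleted files.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical list of old WAL files, oldest first. */
struct wal_files {
        const char **names;
        int count;
        int oldest;     /* index of the oldest file not yet removed */
};

/* Remove the oldest remaining file. Returns -1 if nothing is left. */
int
wal_files_remove_oldest(struct wal_files *files)
{
        if (files->oldest >= files->count)
                return -1;
        /* Advance even if unlink() fails, e.g. the file is already gone. */
        (void)unlink(files->names[files->oldest++]);
        return 0;
}

/*
 * Append len bytes to path. On ENOSPC, delete old files one by one and
 * retry until the write goes through or there is nothing left to delete.
 * Returns the result of the last write(); a short count is possible and
 * is left to the caller.
 */
ssize_t
append_with_gc(const char *path, const void *buf, size_t len,
               struct wal_files *files)
{
        assert(path != NULL && buf != NULL && files != NULL);
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
                return -1;
        ssize_t rc;
        int saved_errno;
        for (;;) {
                rc = write(fd, buf, len);
                saved_errno = errno;
                if (rc >= 0 || saved_errno != ENOSPC)
                        break;
                if (wal_files_remove_oldest(files) != 0)
                        break;  /* nothing left to free, give up */
        }
        close(fd);
        errno = saved_errno;    /* close() must not mask the write error */
        return rc;
}
```

The trade-off compared to preallocation is that ENOSPC is only detected
at write time, so a failed write has to be retried after the cleanup
rather than prevented up front.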


* Re: [tarantool-patches] Re: [PATCH 0/5] Delete old WAL files if running out of disk space
  2018-10-23  8:37     ` Vladimir Davydov
@ 2018-10-23  8:46       ` Konstantin Osipov
  0 siblings, 0 replies; 13+ messages in thread
From: Konstantin Osipov @ 2018-10-23  8:46 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/23 11:42]:
> Alas, I was wrong. Even though the synthetic test shows a great
> improvement when fallocate() is used, there's practically no difference
> in case of nosqlbench. So the feature doesn't seem to be worthwhile.
> I guess I'll reimplement WAL auto-removal without fallocate().

Well, wait. This is the same thing I observed on my ssd machine.
The good thing is that it doesn't make things worse. 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov


