[PATCH 00/10] vinyl: don't yield in DDL on

Tarantool development patches archive
 help / color / mirror / Atom feed

* [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers
@ 2019-05-17 14:52 Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 01/10] box: zap atfork callback Vladimir Davydov
                   ` (11 more replies)
  0 siblings, 12 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

DDL on_commit triggers may yield on vylog write, even though they don't
really need to wait for the transaction to be flushed to disk, as we can
recover it after restart. This blocks implementation of transactional
DDL (see #4083), because even an empty index creation may yield and
hence can't be executed transactionally along with other lightweight
operations, such as space creation.

This patch reworks vylog implementation so that we don't need to yield
in case the vylog transaction is non-discardable (i.e. stays in the
memory buffer in case of failure). Since on_commit triggers issue only
non-discardable transactions, this makes them non-yielding, as they
should be.

https://github.com/tarantool/tarantool/issues/4218
https://github.com/tarantool/tarantool/commits/dv/gh-4218-vy-remove-yields-from-ddl-commit

Vladimir Davydov (10):
  box: zap atfork callback
  vinyl: add a separate thread for vylog
  vinyl: move vylog recovery to vylog thread
  vinyl: rework vylog transaction backlog implementation
  vinyl: don't purge deleted runs from vylog on compaction
  vinyl: lock out compaction while checkpointing is in progress
  vinyl: don't access last vylog signature outside vylog implementation
  vinyl: zap ERRINJ_VY_LOG_FLUSH_DELAY
  key_def: pass alloc callback to key_def_dump_parts
  vinyl: get rid of the latch protecting vylog buffer

 src/box/box.cc             |  10 -
 src/box/box.h              |   6 -
 src/box/key_def.c          |   6 +-
 src/box/key_def.h          |   4 +-
 src/box/sql/select.c       |   2 +-
 src/box/vinyl.c            |   2 +-
 src/box/vy_log.c           | 618 ++++++++++++++++++++++++++++-----------------
 src/box/vy_log.h           |  23 +-
 src/box/vy_scheduler.c     |  44 +++-
 src/box/wal.c              |  95 -------
 src/box/wal.h              |  18 --
 src/box/xlog.c             |  28 +-
 src/box/xlog.h             |  38 ++-
 src/lib/core/errinj.h      |   1 -
 src/main.cc                |   1 -
 test/box/errinj.result     |  44 ++--
 test/vinyl/errinj.result   |  38 ++-
 test/vinyl/errinj.test.lua |  20 +-
 18 files changed, 519 insertions(+), 479 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 01/10] box: zap atfork callback
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-18 18:37   ` [tarantool-patches] " Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 02/10] vinyl: add a separate thread for vylog Vladimir Davydov
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

box_atfork calls wal_atfork which in turn calls xlog_atfork for the wal
and vylog files. A comment to xlog_atfork says that it's necessary to
prevent atexit handlers in a child from closing xlog files again, but we
don't use atexit for that anymore. A comment to box_atfork says that
box.coredump forks to write a core, but there's no box.coredump anymore.
There's also a comment mentioning box.cfg.background, but when we fork
that early there's no xlog file open.

To sum it up, atfork looks like a piece of legacy code. Let's get rid of
it now so as not to bother patching it later.
---
 src/box/box.cc | 10 ----------
 src/box/box.h  |  6 ------
 src/box/wal.c  | 16 ----------------
 src/box/wal.h  |  3 ---
 src/box/xlog.c | 16 ----------------
 src/box/xlog.h |  7 -------
 src/main.cc    |  1 -
 7 files changed, 59 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 7828f575..bb8aca36 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2167,16 +2167,6 @@ box_cfg(void)
 	}
 }
 
-/**
- * box.coredump() forks to save a core. The entire
- * server forks in box.cfg{} if background=true.
- */
-void
-box_atfork()
-{
-	wal_atfork();
-}
-
 int
 box_checkpoint()
 {
diff --git a/src/box/box.h b/src/box/box.h
index 53d88ab7..53b5eb45 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -93,12 +93,6 @@ box_cfg(void);
 bool
 box_is_configured(void);
 
-/**
- * A pthread_atfork() callback for box
- */
-void
-box_atfork(void);
-
 void
 box_set_ro(bool ro);
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 0ea15a43..25bceb68 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1383,19 +1383,3 @@ wal_notify_watchers(struct wal_writer *writer, unsigned events)
 	rlist_foreach_entry(watcher, &writer->watchers, next)
 		wal_watcher_notify(watcher, events);
 }
-
-
-/**
- * After fork, the WAL writer thread disappears.
- * Make sure that atexit() handlers in the child do
- * not try to stop a non-existent thread or write
- * a second EOF marker to an open file.
- */
-void
-wal_atfork()
-{
-	if (xlog_is_open(&wal_writer_singleton.current_wal))
-		xlog_atfork(&wal_writer_singleton.current_wal);
-	if (xlog_is_open(&vy_log_writer.xlog))
-		xlog_atfork(&vy_log_writer.xlog);
-}
diff --git a/src/box/wal.h b/src/box/wal.h
index 4e500d2a..a88a3f23 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -164,9 +164,6 @@ void
 wal_clear_watcher(struct wal_watcher *watcher,
 		  void (*process_cb)(struct cbus_endpoint *));
 
-void
-wal_atfork();
-
 enum wal_mode
 wal_mode();
 
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 8254cce2..82588682 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -1472,22 +1472,6 @@ xlog_close(struct xlog *l, bool reuse_fd)
 	return rc;
 }
 
-/**
- * Free xlog memory and destroy it cleanly, without side
- * effects (for use in the atfork handler).
- */
-void
-xlog_atfork(struct xlog *xlog)
-{
-	/*
-	 * Close the file descriptor STDIO buffer does not
-	 * make its way into the respective file in
-	 * fclose().
-	 */
-	close(xlog->fd);
-	xlog->fd = -1;
-}
-
 /* }}} */
 
 /* {{{ struct xlog_cursor */
diff --git a/src/box/xlog.h b/src/box/xlog.h
index a48b05fc..964d303e 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -550,13 +550,6 @@ xlog_sync(struct xlog *l);
 int
 xlog_close(struct xlog *l, bool reuse_fd);
 
-/**
- * atfork() handler function to close the log pointed
- * at by xlog in the child.
- */
-void
-xlog_atfork(struct xlog *xlog);
-
 /* {{{ xlog_tx_cursor - iterate over rows in xlog transaction */
 
 /**
diff --git a/src/main.cc b/src/main.cc
index 606c64c1..68c67773 100644
--- a/src/main.cc
+++ b/src/main.cc
@@ -341,7 +341,6 @@ static void
 tarantool_atfork()
 {
 	signal_reset();
-	box_atfork();
 }
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 02/10] vinyl: add a separate thread for vylog
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 01/10] box: zap atfork callback Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-18 18:39   ` [tarantool-patches] " Konstantin Osipov
  2019-06-01  8:26   ` Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 03/10] vinyl: move vylog recovery to vylog thread Vladimir Davydov
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

Historically, we use the WAL thread for writing vylog files, because,
I guess, we didn't want to introduce a separate thread. However, that
design decision turned out to be quite dubious:

 - vy_log (vy_log.c) calls vy_log_writer (wal.c) while vy_log_writer
   calls back to vy_log. That is we have a piece of logic split crudely
   between two files, which makes the code difficult to follow and just
   looks ugly.

 - We can't make vy_log part of vy_env because of this relationship.
   In fact, vy_log is the last singleton in the whole vy implementation:
   everything else is wrapped neatly (well, not quite, but still) in
   vy_env struct.

 - We can't kill the infamous vy_log.latch, which is a prerequisite for
   transactional DDL. The latch is needed to sync between vy_log readers
   and writers. You see, currently we can't read vy_log in the same
   thread where we write it, which would eliminate the need for any kind
   of synchronization, because vy_log read is quite a heavy operation -
   it may stall WAL writes and thus badly affect latency. So we have to
   do it from a coio thread.

That being said, it's time to move vy_log writer back to where it
belongs - vy_log.c, separate thread. This patch does just that. It
doesn't patch the implementation to fix the flaws enumerated above -
this will be done by following patches.
---
 src/box/vy_log.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 src/box/vy_log.h |   6 ---
 src/box/wal.c    |  79 --------------------------------------
 src/box/wal.h    |  15 --------
 4 files changed, 105 insertions(+), 108 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index edd61b33..25ab73fd 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -46,19 +46,20 @@
 #include <small/rlist.h>
 
 #include "assoc.h"
+#include "cbus.h"
 #include "coio_task.h"
 #include "diag.h"
 #include "errcode.h"
 #include "errinj.h"
 #include "fiber.h"
 #include "iproto_constants.h" /* IPROTO_INSERT */
+#include "journal.h"
 #include "key_def.h"
 #include "latch.h"
 #include "replication.h" /* INSTANCE_UUID */
 #include "salad/stailq.h"
 #include "say.h"
 #include "tt_static.h"
-#include "wal.h"
 #include "vclock.h"
 #include "xlog.h"
 #include "xrow.h"
@@ -178,6 +179,14 @@ struct vy_log {
 	 * only relevant if @tx_failed is set.
 	 */
 	struct diag tx_diag;
+	/** Vylog file. */
+	struct xlog xlog;
+	/** Vylog IO thread. */
+	struct cord cord;
+	/** Pipe to the vylog thread. */
+	struct cpipe pipe;
+	/** Returning pipe from the vylog thread back to tx. */
+	struct cpipe tx_pipe;
 };
 static struct vy_log vy_log;
 
@@ -189,6 +198,9 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 			   const struct vy_log_record *record);
 
 static int
+vy_log_open(void);
+
+static int
 vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
 
 int
@@ -731,6 +743,26 @@ err:
 	return NULL;
 }
 
+/** Vylog thread main loop. */
+static int
+vy_log_thread_f(va_list ap)
+{
+	(void)ap;
+
+	struct cbus_endpoint endpoint;
+	cbus_endpoint_create(&endpoint, "vylog", fiber_schedule_cb, fiber());
+
+	cpipe_create(&vy_log.tx_pipe, "tx");
+
+	cbus_loop(&endpoint);
+
+	if (xlog_is_open(&vy_log.xlog))
+		xlog_close(&vy_log.xlog, false);
+
+	cpipe_destroy(&vy_log.tx_pipe);
+	return 0;
+}
+
 void
 vy_log_init(const char *dir)
 {
@@ -740,7 +772,45 @@ vy_log_init(const char *dir)
 	region_create(&vy_log.pool, cord_slab_cache());
 	stailq_create(&vy_log.tx);
 	diag_create(&vy_log.tx_diag);
-	wal_init_vy_log();
+	xlog_clear(&vy_log.xlog);
+
+	if (cord_costart(&vy_log.cord, "vinyl.log", vy_log_thread_f, NULL) != 0)
+		panic_syserror("failed to start vinyl log thread");
+
+	cpipe_create(&vy_log.pipe, "vylog");
+}
+
+struct vy_log_flush_msg {
+	struct cbus_call_msg base;
+	struct journal_entry *entry;
+};
+
+static int
+vy_log_flush_f(struct cbus_call_msg *base)
+{
+	struct vy_log_flush_msg *msg = (struct vy_log_flush_msg *)base;
+	struct journal_entry *entry = msg->entry;
+	struct xlog *xlog = &vy_log.xlog;
+
+	if (!xlog_is_open(xlog)) {
+		if (vy_log_open() < 0)
+			return -1;
+	}
+
+	xlog_tx_begin(xlog);
+	for (int i = 0; i < entry->n_rows; ++i) {
+		entry->rows[i]->tm = ev_now(loop());
+		if (xlog_write_row(xlog, entry->rows[i]) < 0) {
+			xlog_tx_rollback(xlog);
+			return -1;
+		}
+	}
+
+	if (xlog_tx_commit(xlog) < 0 ||
+	    xlog_flush(xlog) < 0)
+		return -1;
+
+	return 0;
 }
 
 /**
@@ -798,10 +868,16 @@ vy_log_flush(void)
 	assert(i == tx_size);
 
 	/*
-	 * Do actual disk writes on behalf of the WAL
+	 * Do actual disk writes on behalf of the vylog thread
 	 * so as not to block the tx thread.
 	 */
-	if (wal_write_vy_log(entry) != 0)
+	struct vy_log_flush_msg msg;
+	msg.entry = entry;
+	bool cancellable = fiber_set_cancellable(false);
+	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
+			   vy_log_flush_f, NULL, TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+	if (rc != 0)
 		goto err;
 
 	/* Success. Free flushed records. */
@@ -817,14 +893,21 @@ err:
 void
 vy_log_free(void)
 {
+	cbus_stop_loop(&vy_log.pipe);
+	if (cord_join(&vy_log.cord) != 0)
+		panic_syserror("failed to join vinyl log thread");
 	xdir_destroy(&vy_log.dir);
 	region_destroy(&vy_log.pool);
 	diag_destroy(&vy_log.tx_diag);
 }
 
-int
-vy_log_open(struct xlog *xlog)
+/**
+ * Open current vy_log file.
+ */
+static int
+vy_log_open(void)
 {
+	struct xlog *xlog = &vy_log.xlog;
 	/*
 	 * Open the current log file or create a new one
 	 * if it doesn't exist.
@@ -1022,6 +1105,15 @@ vy_log_rotate_f(va_list ap)
 	return vy_log_create(vclock, recovery);
 }
 
+static int
+vy_log_close_f(struct cbus_call_msg *msg)
+{
+	(void)msg;
+	if (xlog_is_open(&vy_log.xlog))
+		xlog_close(&vy_log.xlog, false);
+	return 0;
+}
+
 int
 vy_log_rotate(const struct vclock *vclock)
 {
@@ -1069,9 +1161,14 @@ vy_log_rotate(const struct vclock *vclock)
 
 	/*
 	 * Success. Close the old log. The new one will be opened
-	 * automatically on the first write (see wal_write_vy_log()).
+	 * automatically on the first write (see vy_log_flush_f()).
 	 */
-	wal_rotate_vy_log();
+	struct cbus_call_msg msg;
+	bool cancellable = fiber_set_cancellable(false);
+	cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg,
+		  vy_log_close_f, NULL, TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+
 	vclock_copy(&vy_log.last_checkpoint, vclock);
 
 	/* Add the new vclock to the xdir so that we can track it. */
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index ee38c193..81f62706 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -439,12 +439,6 @@ void
 vy_log_free(void);
 
 /**
- * Open current vy_log file.
- */
-int
-vy_log_open(struct xlog *xlog);
-
-/**
  * Rotate the metadata log. This function creates a new
  * xlog file in the log directory having vclock @vclock
  * and writes records required to recover active LSM trees.
diff --git a/src/box/wal.c b/src/box/wal.c
index 25bceb68..1f93e162 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -39,7 +39,6 @@
 
 #include "xlog.h"
 #include "xrow.h"
-#include "vy_log.h"
 #include "cbus.h"
 #include "coio_task.h"
 #include "replication.h"
@@ -173,15 +172,6 @@ struct wal_msg {
 	struct vclock vclock;
 };
 
-/**
- * Vinyl metadata log writer.
- */
-struct vy_log_writer {
-	/** The metadata log file. */
-	struct xlog xlog;
-};
-
-static struct vy_log_writer vy_log_writer;
 static struct wal_writer wal_writer_singleton;
 
 enum wal_mode
@@ -1115,9 +1105,6 @@ wal_writer_f(va_list ap)
 	if (xlog_is_open(&writer->current_wal))
 		xlog_close(&writer->current_wal, false);
 
-	if (xlog_is_open(&vy_log_writer.xlog))
-		xlog_close(&vy_log_writer.xlog, false);
-
 	cpipe_destroy(&writer->tx_prio_pipe);
 	return 0;
 }
@@ -1198,72 +1185,6 @@ wal_write_in_wal_mode_none(struct journal *journal,
 	return vclock_sum(&writer->vclock);
 }
 
-void
-wal_init_vy_log()
-{
-	xlog_clear(&vy_log_writer.xlog);
-}
-
-struct wal_write_vy_log_msg
-{
-	struct cbus_call_msg base;
-	struct journal_entry *entry;
-};
-
-static int
-wal_write_vy_log_f(struct cbus_call_msg *msg)
-{
-	struct journal_entry *entry =
-		((struct wal_write_vy_log_msg *)msg)->entry;
-
-	if (! xlog_is_open(&vy_log_writer.xlog)) {
-		if (vy_log_open(&vy_log_writer.xlog) < 0)
-			return -1;
-	}
-
-	if (xlog_write_entry(&vy_log_writer.xlog, entry) < 0)
-		return -1;
-
-	if (xlog_flush(&vy_log_writer.xlog) < 0)
-		return -1;
-
-	return 0;
-}
-
-int
-wal_write_vy_log(struct journal_entry *entry)
-{
-	struct wal_writer *writer = &wal_writer_singleton;
-	struct wal_write_vy_log_msg msg;
-	msg.entry= entry;
-	bool cancellable = fiber_set_cancellable(false);
-	int rc = cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
-			   &msg.base, wal_write_vy_log_f, NULL,
-			   TIMEOUT_INFINITY);
-	fiber_set_cancellable(cancellable);
-	return rc;
-}
-
-static int
-wal_rotate_vy_log_f(struct cbus_call_msg *msg)
-{
-	(void) msg;
-	if (xlog_is_open(&vy_log_writer.xlog))
-		xlog_close(&vy_log_writer.xlog, false);
-	return 0;
-}
-
-void
-wal_rotate_vy_log()
-{
-	struct wal_writer *writer = &wal_writer_singleton;
-	struct cbus_call_msg msg;
-	bool cancellable = fiber_set_cancellable(false);
-	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe, &msg,
-		  wal_rotate_vy_log_f, NULL, TIMEOUT_INFINITY);
-	fiber_set_cancellable(cancellable);
-}
-
 static void
 wal_watcher_notify(struct wal_watcher *watcher, unsigned events)
 {
diff --git a/src/box/wal.h b/src/box/wal.h
index a88a3f23..8448f6d4 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -222,21 +222,6 @@ wal_set_checkpoint_threshold(int64_t threshold);
 void
 wal_collect_garbage(const struct vclock *vclock);
 
-void
-wal_init_vy_log();
-
-/**
- * Write xrows to the vinyl metadata log.
- */
-int
-wal_write_vy_log(struct journal_entry *req);
-
-/**
- * Rotate the vinyl metadata log.
- */
-void
-wal_rotate_vy_log();
-
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 01/10] box: zap atfork callback Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 02/10] vinyl: add a separate thread for vylog Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-06-01  8:36   ` [tarantool-patches] " Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 04/10] vinyl: rework vylog transaction backlog implementation Vladimir Davydov
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

We used coio, because vylog was written from a WAL thread, which
shouldn't be used for such a heavy operation as vylog recovery.
Now, we can move it to the dedicated vylog thread. This allows
us to simplify rotation logic as well: now most of work is done
from the same function (vy_log_rotate_f) executed by vylog thread,
not scattered between coio and WAL, as it used to be.

This is a step toward removal of the vylog latch, which blocks
transaction DDL implementation.
---
 src/box/vy_log.c | 174 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 100 insertions(+), 74 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 25ab73fd..cf967595 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -47,7 +47,6 @@
 
 #include "assoc.h"
 #include "cbus.h"
-#include "coio_task.h"
 #include "diag.h"
 #include "errcode.h"
 #include "errinj.h"
@@ -191,7 +190,7 @@ struct vy_log {
 static struct vy_log vy_log;
 
 static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, int flags);
+vy_recovery_load(int64_t signature, int flags);
 
 static int
 vy_recovery_process_record(struct vy_recovery *recovery,
@@ -203,8 +202,8 @@ vy_log_open(void);
 static int
 vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
 
-int
-vy_log_rotate(const struct vclock *vclock);
+static int
+vy_log_flush(void);
 
 /**
  * Return the name of the vylog file that has the given signature.
@@ -1097,18 +1096,31 @@ vy_log_end_recovery(void)
 	return 0;
 }
 
-static ssize_t
-vy_log_rotate_f(va_list ap)
-{
-	struct vy_recovery *recovery = va_arg(ap, struct vy_recovery *);
-	const struct vclock *vclock = va_arg(ap, const struct vclock *);
-	return vy_log_create(vclock, recovery);
-}
+struct vy_log_rotate_msg {
+	struct cbus_call_msg base;
+	struct vclock vclock;
+};
 
 static int
-vy_log_close_f(struct cbus_call_msg *msg)
+vy_log_rotate_f(struct cbus_call_msg *base)
 {
-	(void)msg;
+	struct vy_log_rotate_msg *msg = (struct vy_log_rotate_msg *)base;
+	int64_t prev_signature = vclock_sum(&vy_log.last_checkpoint);
+
+	/* Load the last vylog into a recovery context. */
+	struct vy_recovery *recovery = vy_recovery_load(prev_signature, 0);
+	if (recovery == NULL)
+		return -1;
+
+	/* Write the contents of the recovery context to the new vylog. */
+	int rc = vy_log_create(&msg->vclock, recovery);
+	vy_recovery_delete(recovery);
+	if (rc != 0)
+		return -1;
+	/*
+	 * Success. Close the old log. The new one will be opened
+	 * automatically on the first write (see vy_log_flush_f()).
+	 */
 	if (xlog_is_open(&vy_log.xlog))
 		xlog_close(&vy_log.xlog, false);
 	return 0;
@@ -1145,30 +1157,24 @@ vy_log_rotate(const struct vclock *vclock)
 	 */
 	latch_lock(&vy_log.latch);
 
-	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(prev_signature, 0);
-	if (recovery == NULL)
+	if (vy_log_flush() != 0) {
+		diag_log();
+		say_error("failed to flush vylog for checkpoint");
 		goto fail;
+	}
 
 	/* Do actual work from coio so as not to stall tx thread. */
-	int rc = coio_call(vy_log_rotate_f, recovery, vclock);
-	vy_recovery_delete(recovery);
-	if (rc < 0) {
-		diag_log();
-		say_error("failed to write `%s'", vy_log_filename(signature));
-		goto fail;
-	}
+	struct vy_log_rotate_msg msg;
+	vclock_copy(&msg.vclock, vclock);
 
-	/*
-	 * Success. Close the old log. The new one will be opened
-	 * automatically on the first write (see vy_log_flush_f()).
-	 */
-	struct cbus_call_msg msg;
 	bool cancellable = fiber_set_cancellable(false);
-	cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg,
-		  vy_log_close_f, NULL, TIMEOUT_INFINITY);
+	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
+			   vy_log_rotate_f, NULL, TIMEOUT_INFINITY);
 	fiber_set_cancellable(cancellable);
 
+	if (rc != 0)
+		goto fail;
+
 	vclock_copy(&vy_log.last_checkpoint, vclock);
 
 	/* Add the new vclock to the xdir so that we can track it. */
@@ -2281,13 +2287,14 @@ vy_recovery_build_index_id_hash(struct vy_recovery *recovery)
 	return 0;
 }
 
-static ssize_t
-vy_recovery_new_f(va_list ap)
+/**
+ * Load recovery context from the vylog file identified by
+ * the given signature. This is a blocking operation, which
+ * is supposed to be called from the vylog thread.
+ */
+static struct vy_recovery *
+vy_recovery_load(int64_t signature, int flags)
 {
-	int64_t signature = va_arg(ap, int64_t);
-	int flags = va_arg(ap, int);
-	struct vy_recovery **p_recovery = va_arg(ap, struct vy_recovery **);
-
 	say_verbose("loading vylog %lld", (long long)signature);
 
 	struct vy_recovery *recovery = malloc(sizeof(*recovery));
@@ -2369,58 +2376,71 @@ vy_recovery_new_f(va_list ap)
 		goto fail_free;
 out:
 	say_verbose("done loading vylog");
-	*p_recovery = recovery;
-	return 0;
+	return recovery;
 
 fail_close:
 	xlog_cursor_close(&cursor, false);
 fail_free:
 	vy_recovery_delete(recovery);
 fail:
-	return -1;
+	diag_log();
+	say_error("failed to load `%s'", vy_log_filename(signature));
+	return NULL;
 }
 
-/**
- * Load the metadata log and return a recovery context.
- * Must be called with the log latch held.
- */
-static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, int flags)
-{
-	int rc;
+struct vy_recovery_msg {
+	struct cbus_call_msg base;
+	int signature;
+	int flags;
 	struct vy_recovery *recovery;
+};
+
+int
+vy_recovery_new_f(struct cbus_call_msg *base)
+{
+	struct vy_recovery_msg *msg = (struct vy_recovery_msg *)base;
+
+	msg->recovery = vy_recovery_load(msg->signature, msg->flags);
+	if (msg->recovery == NULL)
+		return -1;
+
+	return 0;
+}
+
+struct vy_recovery *
+vy_recovery_new(int64_t signature, int flags)
+{
+	/* Lock out concurrent writers while we are loading the log. */
+	latch_lock(&vy_log.latch);
 
-	assert(latch_owner(&vy_log.latch) == fiber());
 	/*
 	 * Before proceeding to log recovery, make sure that all
 	 * pending records have been flushed out.
 	 */
-	rc = vy_log_flush();
-	if (rc != 0) {
+	if (vy_log_flush() != 0) {
 		diag_log();
 		say_error("failed to flush vylog for recovery");
-		return NULL;
+		goto fail;
 	}
 
-	/* Load the log from coio so as not to stall tx thread. */
-	rc = coio_call(vy_recovery_new_f, signature, flags, &recovery);
-	if (rc != 0) {
-		diag_log();
-		say_error("failed to load `%s'", vy_log_filename(signature));
-		return NULL;
-	}
-	return recovery;
-}
+	struct vy_recovery_msg msg;
+	msg.signature = signature;
+	msg.flags = flags;
+	msg.recovery = NULL;
+
+	bool cancellable = fiber_set_cancellable(false);
+	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
+			   vy_recovery_new_f, NULL, TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
 
-struct vy_recovery *
-vy_recovery_new(int64_t signature, int flags)
-{
-	/* Lock out concurrent writers while we are loading the log. */
-	latch_lock(&vy_log.latch);
-	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(signature, flags);
+	if (rc != 0)
+		goto fail;
+
+	latch_unlock(&vy_log.latch);
+	return msg.recovery;
+fail:
 	latch_unlock(&vy_log.latch);
-	return recovery;
+	return NULL;
 }
 
 void
@@ -2563,6 +2583,8 @@ vy_log_append_lsm(struct xlog *xlog, struct vy_lsm_recovery_info *lsm)
 static int
 vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery)
 {
+	struct errinj *inj = errinj(ERRINJ_VY_LOG_FILE_RENAME, ERRINJ_BOOL);
+
 	say_verbose("saving vylog %lld", (long long)vclock_sum(vclock));
 
 	/*
@@ -2592,11 +2614,10 @@ vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery)
 	if (vy_log_append_record(&xlog, &record) != 0)
 		goto err_write_xlog;
 
-	ERROR_INJECT(ERRINJ_VY_LOG_FILE_RENAME, {
+	if (inj && inj->bparam) {
 		diag_set(ClientError, ER_INJECTION, "vinyl log file rename");
-		xlog_close(&xlog, false);
-		return -1;
-	});
+		goto err_write_xlog;
+	}
 
 	/* Finalize the new xlog. */
 	if (xlog_flush(&xlog) < 0 ||
@@ -2610,13 +2631,18 @@ done:
 	return 0;
 
 err_write_xlog:
-	/* Delete the unfinished xlog. */
+	/*
+	 * Delete the unfinished xlog unless ERRINJ_VY_LOG_FILE_RENAME
+	 * is set (we use it to test collection of .inprogress files).
+	 */
 	assert(xlog_is_open(&xlog));
-	if (unlink(xlog.filename) < 0)
+	if ((!inj || !inj->bparam) && unlink(xlog.filename) < 0)
 		say_syserror("failed to delete file '%s'",
 			     xlog.filename);
 	xlog_close(&xlog, false);
 
 err_create_xlog:
+	diag_log();
+	say_error("failed to write `%s'", vy_log_filename(vclock_sum(vclock)));
 	return -1;
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 04/10] vinyl: rework vylog transaction backlog implementation
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (2 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 03/10] vinyl: move vylog recovery to vylog thread Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-06-01  8:38   ` [tarantool-patches] " Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction Vladimir Davydov
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

Some vylog transactions don't tolerate failures. For example, it's too
late to fail in an on_commit trigger, after an index creation record was
written to WAL. So we keep an in-memory backlog for such transactions
and try to flush them to disk whenever we write anything else to the log.
This doesn't guarantee that the transaction will reach the vylog file
though, as the instance may be restarted before we manage to flush the
backlog. It's okay - we know how to handle this on recovery.

Currently, the backlog is implemented as follows: we have a list in the
tx thread to which we append all pending vylog records; if we fail to
flush a transaction to disk on commit, we delete records written by
the transaction from the list only if the transaction is discardable,
otherwise we leave the records in the list. The list is protected by
a global latch, which means that any vylog write may block. This blocks
implementation of transactional DDL.

Actually, it isn't necessary to store backlog in the list of pending
records. We can instead leave non-discardable records right in the xlog
buffer we failed to flush. Since the xlog is managed exclusively by the
vylog thread, this doesn't require any synchronization. Besides this
allows us to hand record encoding to the vylog thread, which is good.

Note, this patch doesn't remove the latch, not yet. We still use it to
sync access to the transaction buffer in tx. This will be reworked by
the following patches.
---
 src/box/vy_log.c | 185 ++++++++++++++++++++++++-------------------------------
 src/box/xlog.c   |  12 +++-
 src/box/xlog.h   |  31 ++++++++++
 3 files changed, 119 insertions(+), 109 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index cf967595..1bc35e82 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -52,7 +52,6 @@
 #include "errinj.h"
 #include "fiber.h"
 #include "iproto_constants.h" /* IPROTO_INSERT */
-#include "journal.h"
 #include "key_def.h"
 #include "latch.h"
 #include "replication.h" /* INSTANCE_UUID */
@@ -156,13 +155,6 @@ struct vy_log {
 	 * Linked by vy_log_record::in_tx;
 	 */
 	struct stailq tx;
-	/** Start of the current transaction in the pool, for rollback */
-	size_t tx_svp;
-	/**
-	 * Last record in the queue at the time when the current
-	 * transaction was started. Used for rollback.
-	 */
-	struct stailq_entry *tx_begin;
 	/**
 	 * Flag set if vy_log_write() failed.
 	 *
@@ -202,9 +194,6 @@ vy_log_open(void);
 static int
 vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
 
-static int
-vy_log_flush(void);
-
 /**
  * Return the name of the vylog file that has the given signature.
  */
@@ -781,35 +770,83 @@ vy_log_init(const char *dir)
 
 struct vy_log_flush_msg {
 	struct cbus_call_msg base;
-	struct journal_entry *entry;
+	bool no_discard;
 };
 
 static int
 vy_log_flush_f(struct cbus_call_msg *base)
 {
 	struct vy_log_flush_msg *msg = (struct vy_log_flush_msg *)base;
-	struct journal_entry *entry = msg->entry;
+	struct region *region = &fiber()->gc;
+	size_t region_svp = region_used(region);
 	struct xlog *xlog = &vy_log.xlog;
 
 	if (!xlog_is_open(xlog)) {
 		if (vy_log_open() < 0)
-			return -1;
+			goto fail;
 	}
 
+	/* Encode the transaction and write it to the xlog buffer. */
 	xlog_tx_begin(xlog);
-	for (int i = 0; i < entry->n_rows; ++i) {
-		entry->rows[i]->tm = ev_now(loop());
-		if (xlog_write_row(xlog, entry->rows[i]) < 0) {
+	struct vy_log_record *record;
+	stailq_foreach_entry(record, &vy_log.tx, in_tx) {
+		struct xrow_header row;
+		if (vy_log_record_encode(record, &row) < 0) {
 			xlog_tx_rollback(xlog);
-			return -1;
+			goto fail;
+		}
+		row.tm = ev_now(loop());
+		if (xlog_write_row(xlog, &row) < 0) {
+			xlog_tx_rollback(xlog);
+			goto fail;
 		}
 	}
 
-	if (xlog_tx_commit(xlog) < 0 ||
-	    xlog_flush(xlog) < 0)
-		return -1;
+	/* Flush the xlog buffer to disk. */
+	if (msg->no_discard)
+		xlog_tx_set_rollback_svp(xlog);
+
+	int rc = 0;
+	ERROR_INJECT(ERRINJ_VY_LOG_FLUSH, {
+		diag_set(ClientError, ER_INJECTION, "vinyl log flush");
+		xlog_tx_rollback(xlog);
+		rc = -1;
+	});
+	if (rc >= 0)
+		rc = xlog_tx_commit(xlog);
+	if (rc >= 0)
+		rc = xlog_flush(xlog);
 
-	return 0;
+	if (rc < 0 && msg->no_discard) {
+		/*
+		 * The message got appended to the xlog buffer so
+		 * it will get flushed along with the next message.
+		 * If it doesn't get flushed until restart, the
+		 * caller should be prepared for that so just warn
+		 * and go on.
+		 */
+		struct error *e = diag_last_error(diag_get());
+		say_warn("failed to flush vylog: %s", e->errmsg);
+		rc = 0;
+	}
+	region_truncate(region, region_svp);
+	return rc < 0 ? -1 : 0;
+fail:
+	if (msg->no_discard) {
+		/*
+		 * The caller doesn't tolerate failures so we have
+		 * no choice but panic. Good news is this shouldn't
+		 * normally happen. For example, we don't panic if
+		 * we failed to write the transaction to disk - we
+		 * leave it in the xlog buffer then (see above).
+		 * We get here on OOM, in which case there's no point
+		 * to continue anyway.
+		 */
+		diag_log();
+		panic("non-discardable vylog transaction failed");
+	}
+	region_truncate(region, region_svp);
+	return -1;
 }
 
 /**
@@ -820,73 +857,31 @@ vy_log_flush_f(struct cbus_call_msg *base)
  * buffered transactions, and want to avoid a partial write.
  */
 static int
-vy_log_flush(void)
+vy_log_flush(bool no_discard)
 {
 	if (stailq_empty(&vy_log.tx))
 		return 0; /* nothing to do */
 
-	ERROR_INJECT(ERRINJ_VY_LOG_FLUSH, {
-		diag_set(ClientError, ER_INJECTION, "vinyl log flush");
-		return -1;
-	});
 	struct errinj *delay = errinj(ERRINJ_VY_LOG_FLUSH_DELAY, ERRINJ_BOOL);
 	if (delay != NULL && delay->bparam) {
 		while (delay->bparam)
 			fiber_sleep(0.001);
 	}
 
-	int tx_size = 0;
-	struct vy_log_record *record;
-	stailq_foreach_entry(record, &vy_log.tx, in_tx)
-		tx_size++;
-
-	size_t used = region_used(&fiber()->gc);
-	struct journal_entry *entry = journal_entry_new(tx_size, &fiber()->gc);
-	if (entry == NULL)
-		goto err;
-
-	struct xrow_header *rows;
-	rows = region_aligned_alloc(&fiber()->gc,
-				    tx_size * sizeof(struct xrow_header),
-				    alignof(struct xrow_header));
-	if (rows == NULL)
-		goto err;
-
-	/*
-	 * Encode buffered records.
-	 */
-	int i = 0;
-	stailq_foreach_entry(record, &vy_log.tx, in_tx) {
-		assert(i < tx_size);
-		struct xrow_header *row = &rows[i];
-		if (vy_log_record_encode(record, row) < 0)
-			goto err;
-		entry->rows[i] = row;
-		i++;
-	}
-	assert(i == tx_size);
-
 	/*
 	 * Do actual disk writes on behalf of the vylog thread
 	 * so as not to block the tx thread.
 	 */
 	struct vy_log_flush_msg msg;
-	msg.entry = entry;
+	msg.no_discard = no_discard;
 	bool cancellable = fiber_set_cancellable(false);
 	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
 			   vy_log_flush_f, NULL, TIMEOUT_INFINITY);
 	fiber_set_cancellable(cancellable);
-	if (rc != 0)
-		goto err;
 
-	/* Success. Free flushed records. */
-	region_reset(&vy_log.pool);
+	region_free(&vy_log.pool);
 	stailq_create(&vy_log.tx);
-	region_truncate(&fiber()->gc, used);
-	return 0;
-err:
-	region_truncate(&fiber()->gc, used);
-	return -1;
+	return rc;
 }
 
 void
@@ -1085,7 +1080,7 @@ vy_log_end_recovery(void)
 		vy_recovery_process_record(vy_log.recovery, record);
 
 	/* Flush all pending records. */
-	if (vy_log_flush() < 0) {
+	if (vy_log_flush(false) != 0) {
 		diag_log();
 		say_error("failed to flush vylog after recovery");
 		return -1;
@@ -1157,12 +1152,6 @@ vy_log_rotate(const struct vclock *vclock)
 	 */
 	latch_lock(&vy_log.latch);
 
-	if (vy_log_flush() != 0) {
-		diag_log();
-		say_error("failed to flush vylog for checkpoint");
-		goto fail;
-	}
-
 	/* Do actual work from coio so as not to stall tx thread. */
 	struct vy_log_rotate_msg msg;
 	vclock_copy(&msg.vclock, vclock);
@@ -1227,8 +1216,6 @@ void
 vy_log_tx_begin(void)
 {
 	latch_lock(&vy_log.latch);
-	vy_log.tx_begin = stailq_last(&vy_log.tx);
-	vy_log.tx_svp = region_used(&vy_log.pool);
 	vy_log.tx_failed = false;
 	say_verbose("begin vylog transaction");
 }
@@ -1243,8 +1230,6 @@ vy_log_tx_begin(void)
 static int
 vy_log_tx_do_commit(bool no_discard)
 {
-	struct stailq rollback;
-
 	assert(latch_owner(&vy_log.latch) == fiber());
 
 	if (vy_log.tx_failed) {
@@ -1257,7 +1242,7 @@ vy_log_tx_do_commit(bool no_discard)
 			diag_log();
 			panic("non-discardable vylog transaction failed");
 		}
-		goto rollback;
+		goto fail;
 	}
 
 	/*
@@ -1269,26 +1254,13 @@ vy_log_tx_do_commit(bool no_discard)
 	if (vy_log.recovery != NULL)
 		goto done;
 
-	if (vy_log_flush() != 0) {
-		if (!no_discard)
-			goto rollback;
-		/*
-		 * We were told not to discard the transaction on
-		 * failure so just warn and leave it in the buffer.
-		 */
-		struct error *e = diag_last_error(diag_get());
-		say_warn("failed to flush vylog: %s", e->errmsg);
-	}
-
+	if (vy_log_flush(no_discard) != 0)
+		goto fail;
 done:
 	say_verbose("commit vylog transaction");
 	latch_unlock(&vy_log.latch);
 	return 0;
-
-rollback:
-	stailq_cut_tail(&vy_log.tx, vy_log.tx_begin, &rollback);
-	region_truncate(&vy_log.pool, vy_log.tx_svp);
-	vy_log.tx_svp = 0;
+fail:
 	say_verbose("rollback vylog transaction");
 	latch_unlock(&vy_log.latch);
 	return -1;
@@ -2295,6 +2267,17 @@ vy_recovery_build_index_id_hash(struct vy_recovery *recovery)
 static struct vy_recovery *
 vy_recovery_load(int64_t signature, int flags)
 {
+	/*
+	 * Before proceeding to log recovery, make sure that all
+	 * pending records have been flushed out.
+	 */
+	if (xlog_is_open(&vy_log.xlog) &&
+	    xlog_flush(&vy_log.xlog) < 0) {
+		diag_log();
+		say_error("failed to flush vylog for recovery");
+		return NULL;
+	}
+
 	say_verbose("loading vylog %lld", (long long)signature);
 
 	struct vy_recovery *recovery = malloc(sizeof(*recovery));
@@ -2413,16 +2396,6 @@ vy_recovery_new(int64_t signature, int flags)
 	/* Lock out concurrent writers while we are loading the log. */
 	latch_lock(&vy_log.latch);
 
-	/*
-	 * Before proceeding to log recovery, make sure that all
-	 * pending records have been flushed out.
-	 */
-	if (vy_log_flush() != 0) {
-		diag_log();
-		say_error("failed to flush vylog for recovery");
-		goto fail;
-	}
-
 	struct vy_recovery_msg msg;
 	msg.signature = signature;
 	msg.flags = flags;
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 82588682..8bddac38 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -770,6 +770,7 @@ xlog_init(struct xlog *xlog, const struct xlog_opts *opts)
 	xlog->is_autocommit = true;
 	obuf_create(&xlog->obuf, &cord()->slabc, XLOG_TX_AUTOCOMMIT_THRESHOLD);
 	obuf_create(&xlog->zbuf, &cord()->slabc, XLOG_TX_AUTOCOMMIT_THRESHOLD);
+	xlog->rollback_svp.obuf_svp = obuf_create_svp(&xlog->obuf);
 	if (!opts->no_compression) {
 		xlog->zctx = ZSTD_createCCtx();
 		if (xlog->zctx == NULL) {
@@ -1201,6 +1202,7 @@ xlog_tx_write(struct xlog *log)
 {
 	if (obuf_size(&log->obuf) == XLOG_FIXHEADER_SIZE)
 		return 0;
+
 	ssize_t written;
 
 	if (!log->opts.no_compression &&
@@ -1214,19 +1216,23 @@ xlog_tx_write(struct xlog *log)
 		written = -1;
 	});
 
-	obuf_reset(&log->obuf);
 	/*
 	 * Simplify recovery after a temporary write failure:
 	 * truncate the file to the best known good write
 	 * position.
 	 */
 	if (written < 0) {
+		log->tx_rows = log->rollback_svp.rows;
+		obuf_rollback_to_svp(&log->obuf, &log->rollback_svp.obuf_svp);
 		if (lseek(log->fd, log->offset, SEEK_SET) < 0 ||
 		    ftruncate(log->fd, log->offset) != 0)
 			panic_syserror("failed to truncate xlog after write error");
 		log->allocated = 0;
 		return -1;
 	}
+	obuf_reset(&log->obuf);
+	log->rollback_svp.rows = 0;
+	log->rollback_svp.obuf_svp = obuf_create_svp(&log->obuf);
 	if (log->allocated > (size_t)written)
 		log->allocated -= written;
 	else
@@ -1373,8 +1379,8 @@ void
 xlog_tx_rollback(struct xlog *log)
 {
 	log->is_autocommit = true;
-	log->tx_rows = 0;
-	obuf_reset(&log->obuf);
+	log->tx_rows = log->rollback_svp.rows;
+	obuf_rollback_to_svp(&log->obuf, &log->rollback_svp.obuf_svp);
 }
 
 /**
diff --git a/src/box/xlog.h b/src/box/xlog.h
index 964d303e..5dba4264 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -381,6 +381,16 @@ struct xlog {
 	 * compression.
 	 */
 	struct obuf obuf;
+	/**
+	 * Marks the beginning of the first row that may be discarded
+	 * on write error. See xlog_tx_set_rollback_svp().
+	 */
+	struct {
+		/** Number of non-discardable rows in the buffer. */
+		int rows;
+		/** Position of the first discardable row in the buffer. */
+		struct obuf_svp obuf_svp;
+	} rollback_svp;
 	/** The context of zstd compression */
 	ZSTD_CCtx *zctx;
 	/**
@@ -525,6 +535,27 @@ void
 xlog_tx_rollback(struct xlog *log);
 
 /**
+ * Sets the position in the write buffer to rollback to in case
+ * of error or voluntary rollback. Useful for transactions that
+ * cannot be discarded even on write error, e.g. certain types
+ * of vylog transactions. Everything written before this function
+ * is called will stay in the buffer until eventually gets flushed
+ * along with another transaction or by an explicit invocation of
+ * xlog_flush(). Note, however, this doesn't guarantee that the
+ * transaction will reach the disk - the instance may be restarted
+ * before anything is written to the xlog file, in which case the
+ * data written to the buffer will be lost. The caller must be
+ * prepared for this.
+ */
+static inline void
+xlog_tx_set_rollback_svp(struct xlog *log)
+{
+	assert(!log->is_autocommit);
+	log->rollback_svp.rows = log->tx_rows;
+	log->rollback_svp.obuf_svp = obuf_create_svp(&log->obuf);
+}
+
+/**
  * Flush buffered rows and sync file
  */
 ssize_t
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (3 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 04/10] vinyl: rework vylog transaction backlog implementation Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-18 18:47   ` [tarantool-patches] " Konstantin Osipov
  2019-06-01  8:39   ` Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 06/10] vinyl: lock out compaction while checkpointing is in progress Vladimir Davydov
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN),
then try to delete their files unless they are needed for recovery from
the checkpoint, and finally mark them as not needed in the vylog
(VY_LOG_FORGET_RUN). There's a potential race sitting here: the problem
is the garbage collector might kick in after files are dropped, but
before they are marked as not needed. If this happens, there will be
runs that have two VY_LOG_FORGET_RUN records, which will break recovery:

  Run XX is forgotten, but not registered

The following patches make the race more likely to happen so let's
eliminate it by making the garbage collector the only one who can mark
runs as not needed (i.e. write VY_LOG_FORGET_RUN record). There will
be no warnings, because the garbage collector silently ignores ENOENT
errors, see vy_gc().

Another good thing about this patch is that now we never yield inside
a vylog transaction, which makes it easier to remove the vylog latch
blocking implementation of transactional DDL.
---
 src/box/vy_scheduler.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index fabb4bb4..0180331e 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1523,16 +1523,17 @@ vy_task_compaction_complete(struct vy_task *task)
 	 * Remove compacted run files that were created after
 	 * the last checkpoint (and hence are not referenced
 	 * by any checkpoint) immediately to save disk space.
+	 *
+	 * Don't bother logging it to avoid a potential race
+	 * with a garbage collection task, which may be cleaning
+	 * up concurrently. The log will be cleaned up on the
+	 * next checkpoint.
 	 */
-	vy_log_tx_begin();
 	rlist_foreach_entry(run, &unused_runs, in_unused) {
-		if (run->dump_lsn > gc_lsn &&
-		    vy_run_remove_files(lsm->env->path, lsm->space_id,
-					lsm->index_id, run->id) == 0) {
-			vy_log_forget_run(run->id);
-		}
+		if (run->dump_lsn > gc_lsn)
+			vy_run_remove_files(lsm->env->path, lsm->space_id,
+					    lsm->index_id, run->id);
 	}
-	vy_log_tx_try_commit();
 
 	/*
 	 * Account the new run if it is not empty,
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 06/10] vinyl: lock out compaction while checkpointing is in progress
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (4 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 07/10] vinyl: don't access last vylog signature outside vylog implementation Vladimir Davydov
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

Upon compaction completion we delete run files that are not needed for
recovery from the last checkpoint. If we start compaction while
checkpointing is in progress, it might compact and delete a run file
dumped during checkpointing, which would make it impossible to backup
files corresponding to the last checkpoint. Commit b25e31685096 ("vinyl:
fix compaction vs checkpoint race resulting in invalid gc") claimed to
eliminate the races, but in fact it just minimized probability of it: no
matter whether we use the last checkpoint signature or vylog signature
when checking if a run file doesn't belong to the last checkpoint,
there's still a race window, which opens wider in case the more indexes
we have to dump. Let's fix it once and for all by locking out compaction
until checkpointing is complete.
---
 src/box/vy_scheduler.c     | 17 +++++++++++++++++
 test/vinyl/errinj.result   | 16 +++++++++++++---
 test/vinyl/errinj.test.lua |  9 ++++++---
 3 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 0180331e..2d1abc2e 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -741,6 +741,13 @@ vy_scheduler_end_checkpoint(struct vy_scheduler *scheduler)
 		 */
 		vy_scheduler_trigger_dump(scheduler);
 	}
+
+	/*
+	 * Checkpointing temporarily blocks compaction.
+	 * Wake up the scheduler to check if there are
+	 * pending compaction tasks.
+	 */
+	fiber_cond_signal(&scheduler->scheduler_cond);
 }
 
 /**
@@ -1896,6 +1903,16 @@ vy_scheduler_peek_compaction(struct vy_scheduler *scheduler,
 	struct vy_worker *worker = NULL;
 retry:
 	*ptask = NULL;
+	/*
+	 * Upon completion a compaction task removes compacted run
+	 * files unless they are needed for recovery from the last
+	 * checkpoint. If we start compaction while checkpointing
+	 * is in progress we might compact a run that belongs to
+	 * the new, to be created, checkpoint. To avoid that we
+	 * lock out compaction while checkpointing is in progress.
+	 */
+	if (scheduler->checkpoint_in_progress)
+		goto no_task;
 	struct vy_lsm *lsm = vy_compaction_heap_top(&scheduler->compaction_heap);
 	if (lsm == NULL)
 		goto no_task; /* nothing to do */
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index e8795143..e05a616a 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -966,8 +966,9 @@ box.snapshot()
 - ok
 ...
 -- Create another run file. This will trigger compaction
--- as run_count_per_level is set to 1. Due to the error
--- injection compaction will finish before snapshot.
+-- as run_count_per_level is set to 1. Delay checkpointing
+-- completion and check that compaction doesn't remove
+-- files that are still needed for backup.
 _ = s:replace{2}
 ---
 ...
@@ -981,8 +982,13 @@ c = fiber.channel(1)
 _ = fiber.create(function() box.snapshot() c:put(true) end)
 ---
 ...
-while s.index.pk:stat().disk.compaction.count == 0 do fiber.sleep(0.001) end
+test_run:wait_cond(function() return s.index.pk:stat().disk.compaction.queue.bytes > 0 end)
 ---
+- true
+...
+test_run:wait_cond(function() return box.stat.vinyl().scheduler.tasks_inprogress == 0 end)
+---
+- true
 ...
 errinj.set('ERRINJ_SNAP_COMMIT_DELAY', false)
 ---
@@ -992,6 +998,10 @@ c:get()
 ---
 - true
 ...
+test_run:wait_cond(function() return s.index.pk:stat().disk.compaction.count > 0 end)
+---
+- true
+...
 -- Check that all files corresponding to the last checkpoint
 -- are present.
 files = box.backup.start()
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 034ed34c..4317ccb8 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -346,15 +346,18 @@ _ = s:create_index('pk', {run_count_per_level = 1})
 _ = s:replace{1}
 box.snapshot()
 -- Create another run file. This will trigger compaction
--- as run_count_per_level is set to 1. Due to the error
--- injection compaction will finish before snapshot.
+-- as run_count_per_level is set to 1. Delay checkpointing
+-- completion and check that compaction doesn't remove
+-- files that are still needed for backup.
 _ = s:replace{2}
 errinj.set('ERRINJ_SNAP_COMMIT_DELAY', true)
 c = fiber.channel(1)
 _ = fiber.create(function() box.snapshot() c:put(true) end)
-while s.index.pk:stat().disk.compaction.count == 0 do fiber.sleep(0.001) end
+test_run:wait_cond(function() return s.index.pk:stat().disk.compaction.queue.bytes > 0 end)
+test_run:wait_cond(function() return box.stat.vinyl().scheduler.tasks_inprogress == 0 end)
 errinj.set('ERRINJ_SNAP_COMMIT_DELAY', false)
 c:get()
+test_run:wait_cond(function() return s.index.pk:stat().disk.compaction.count > 0 end)
 -- Check that all files corresponding to the last checkpoint
 -- are present.
 files = box.backup.start()
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 07/10] vinyl: don't access last vylog signature outside vylog implementation
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (5 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 06/10] vinyl: lock out compaction while checkpointing is in progress Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 08/10] vinyl: zap ERRINJ_VY_LOG_FLUSH_DELAY Vladimir Davydov
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

Currently, it's updated in the tx thread under the vylog latch. Once we
remove the vylog latch, we'll have to update it from the vylog thread,
which would make it unsafe to access this variable from the tx thread.
Actually, we don't really need to access it from tx. We use it in two
places outside vylog. First, to load the last vylog for garbage
collection, but we can simply pass a special signature value (-1) and
let vylog thread lookup the last vylog file for us. Second, we use it in
the scheduler to collect compacted runs, but there we can use the
signature provided by the general gc infrastructure.
---
 src/box/vinyl.c        |  2 +-
 src/box/vy_log.c       | 11 ++++-------
 src/box/vy_log.h       |  9 ++-------
 src/box/vy_scheduler.c | 12 +++++++++++-
 4 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d4929a37..36bdba36 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3464,7 +3464,7 @@ vinyl_engine_collect_garbage(struct engine *engine, const struct vclock *vclock)
 	vy_log_collect_garbage(vclock);
 
 	/* Cleanup run files. */
-	struct vy_recovery *recovery = vy_recovery_new(vy_log_signature(), 0);
+	struct vy_recovery *recovery = vy_recovery_new(-1, 0);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for garbage collection");
 		return;
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 1bc35e82..31bc5c63 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -1190,12 +1190,6 @@ vy_log_collect_garbage(const struct vclock *vclock)
 	xdir_collect_garbage(&vy_log.dir, vclock_sum(vclock), XDIR_GC_ASYNC);
 }
 
-int64_t
-vy_log_signature(void)
-{
-	return vclock_sum(&vy_log.last_checkpoint);
-}
-
 const char *
 vy_log_backup_path(const struct vclock *vclock)
 {
@@ -2185,7 +2179,7 @@ vy_recovery_commit_rebootstrap(struct vy_recovery *recovery)
 			 * The files will be removed when the current
 			 * checkpoint is purged by garbage collector.
 			 */
-			lsm->drop_lsn = vy_log_signature();
+			lsm->drop_lsn = vclock_sum(&vy_log.last_checkpoint);
 		}
 	}
 }
@@ -2383,6 +2377,9 @@ vy_recovery_new_f(struct cbus_call_msg *base)
 {
 	struct vy_recovery_msg *msg = (struct vy_recovery_msg *)base;
 
+	if (msg->signature < 0)
+		msg->signature = vclock_sum(&vy_log.last_checkpoint);
+
 	msg->recovery = vy_recovery_load(msg->signature, msg->flags);
 	if (msg->recovery == NULL)
 		return -1;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 81f62706..bd6a4a0d 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -459,12 +459,6 @@ void
 vy_log_collect_garbage(const struct vclock *vclock);
 
 /**
- * Return the signature of the newest vylog to the time.
- */
-int64_t
-vy_log_signature(void);
-
-/**
  * Return the path to the log file that needs to be backed up
  * in order to recover to checkpoint @vclock.
  */
@@ -569,7 +563,8 @@ enum vy_recovery_flag {
 
 /**
  * Create a recovery context from the metadata log created
- * by checkpoint with the given signature.
+ * by checkpoint with the given signature. Pass -1 to load
+ * the most recent log file.
  *
  * For valid values of @flags, see vy_recovery_flag.
  *
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 2d1abc2e..079438c7 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -50,6 +50,7 @@
 #include "txn.h"
 #include "space.h"
 #include "schema.h"
+#include "gc.h"
 #include "xrow.h"
 #include "vy_lsm.h"
 #include "vy_log.h"
@@ -198,6 +199,11 @@ struct vy_task {
 	 */
 	struct vy_slice *first_slice, *last_slice;
 	/**
+	 * For compaction tasks: signature of the last checkpoint
+	 * at the time of task creation.
+	 */
+	int64_t gc_lsn;
+	/**
 	 * Index options may be modified while a task is in
 	 * progress so we save them here to safely access them
 	 * from another thread.
@@ -1459,6 +1465,7 @@ vy_task_compaction_complete(struct vy_task *task)
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_range *range = task->range;
 	struct vy_run *new_run = task->new_run;
+	int64_t gc_lsn = task->gc_lsn;
 	double compaction_time = ev_monotonic_now(loop()) - task->start_time;
 	struct vy_disk_stmt_counter compaction_output = new_run->count;
 	struct vy_disk_stmt_counter compaction_input;
@@ -1510,7 +1517,6 @@ vy_task_compaction_complete(struct vy_task *task)
 		if (slice == last_slice)
 			break;
 	}
-	int64_t gc_lsn = vy_log_signature();
 	rlist_foreach_entry(run, &unused_runs, in_unused)
 		vy_log_drop_run(run->id, gc_lsn);
 	if (new_slice != NULL) {
@@ -1709,6 +1715,10 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 
 	range->needs_compaction = false;
 
+	struct gc_checkpoint *checkpoint = gc_last_checkpoint();
+	if (checkpoint != NULL)
+		task->gc_lsn = vclock_sum(&checkpoint->vclock);
+
 	task->range = range;
 	task->new_run = new_run;
 	task->wi = wi;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 08/10] vinyl: zap ERRINJ_VY_LOG_FLUSH_DELAY
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (6 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 07/10] vinyl: don't access last vylog signature outside vylog implementation Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-17 14:52 ` [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts Vladimir Davydov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

To make it vy_log_tx_try_commit() non-yiedling, which is required for
transactional DDL, we need to move ERRINJ_VY_LOG_FLUSH_DELAY out of this
function. We can't move it to vylog thread, because the only test case
that uses it tries to restart the server with this error injection
enabled, while we don't exit until vylog thread completes. So let's just
kill it - the problem it is checking will be gone once the vylog latch
is removed, anyway.
---
 src/box/vy_log.c           |  6 ------
 src/lib/core/errinj.h      |  1 -
 test/box/errinj.result     | 44 +++++++++++++++++++++-----------------------
 test/vinyl/errinj.result   | 22 ++--------------------
 test/vinyl/errinj.test.lua | 11 ++---------
 5 files changed, 25 insertions(+), 59 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 31bc5c63..67b8a763 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -862,12 +862,6 @@ vy_log_flush(bool no_discard)
 	if (stailq_empty(&vy_log.tx))
 		return 0; /* nothing to do */
 
-	struct errinj *delay = errinj(ERRINJ_VY_LOG_FLUSH_DELAY, ERRINJ_BOOL);
-	if (delay != NULL && delay->bparam) {
-		while (delay->bparam)
-			fiber_sleep(0.001);
-	}
-
 	/*
 	 * Do actual disk writes on behalf of the vylog thread
 	 * so as not to block the tx thread.
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 462c9846..a9748b1d 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -95,7 +95,6 @@ struct errinj {
 	_(ERRINJ_VY_SCHED_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_VY_GC, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_VY_LOG_FLUSH, ERRINJ_BOOL, {.bparam = false}) \
-	_(ERRINJ_VY_LOG_FLUSH_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_SEND_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
diff --git a/test/box/errinj.result b/test/box/errinj.result
index 061501c9..b2ba8bb2 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -40,26 +40,24 @@ errinj.info()
     state: false
   ERRINJ_COIO_SENDFILE_CHUNK:
     state: -1
-  ERRINJ_VY_INDEX_FILE_RENAME:
-    state: false
-  ERRINJ_VY_POINT_ITER_WAIT:
-    state: false
+  ERRINJ_RELAY_BREAK_LSN:
+    state: -1
   ERRINJ_VY_DELAY_PK_LOOKUP:
     state: false
   ERRINJ_VY_TASK_COMPLETE:
     state: false
   ERRINJ_PORT_DUMP:
     state: false
-  ERRINJ_WAL_BREAK_LSN:
-    state: -1
+  ERRINJ_RELAY_EXIT_DELAY:
+    state: 0
   ERRINJ_WAL_IO:
     state: false
   ERRINJ_WAL_FALLOCATE:
     state: 0
-  ERRINJ_LOG_ROTATE:
-    state: false
   ERRINJ_VY_DUMP_DELAY:
     state: false
+  ERRINJ_SNAP_COMMIT_DELAY:
+    state: false
   ERRINJ_TUPLE_FORMAT_COUNT:
     state: -1
   ERRINJ_TUPLE_ALLOC:
@@ -70,7 +68,7 @@ errinj.info()
     state: false
   ERRINJ_RELAY_REPORT_INTERVAL:
     state: 0
-  ERRINJ_RELAY_BREAK_LSN:
+  ERRINJ_WAL_BREAK_LSN:
     state: -1
   ERRINJ_VY_READ_PAGE_TIMEOUT:
     state: 0
@@ -78,20 +76,20 @@ errinj.info()
     state: false
   ERRINJ_SIO_READ_MAX:
     state: -1
-  ERRINJ_VY_RUN_FILE_RENAME:
+  ERRINJ_VY_INDEX_FILE_RENAME:
     state: false
   ERRINJ_WAL_WRITE_DISK:
     state: false
+  ERRINJ_VY_RUN_FILE_RENAME:
+    state: false
   ERRINJ_VY_LOG_FILE_RENAME:
     state: false
-  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
-    state: false
   ERRINJ_VY_RUN_WRITE:
     state: false
-  ERRINJ_SNAP_COMMIT_DELAY:
-    state: false
-  ERRINJ_VY_LOG_FLUSH_DELAY:
+  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
     state: false
+  ERRINJ_BUILD_INDEX:
+    state: -1
   ERRINJ_RELAY_FINAL_JOIN:
     state: false
   ERRINJ_VY_COMPACTION_DELAY:
@@ -102,10 +100,10 @@ errinj.info()
     state: false
   ERRINJ_WAL_ROTATE:
     state: false
-  ERRINJ_BUILD_INDEX:
-    state: -1
-  ERRINJ_RELAY_EXIT_DELAY:
-    state: 0
+  ERRINJ_LOG_ROTATE:
+    state: false
+  ERRINJ_VY_POINT_ITER_WAIT:
+    state: false
   ERRINJ_MEMTX_DELAY_GC:
     state: false
   ERRINJ_IPROTO_TX_DELAY:
@@ -122,14 +120,14 @@ errinj.info()
     state: false
   ERRINJ_TESTING:
     state: false
-  ERRINJ_RELAY_TIMEOUT:
-    state: 0
+  ERRINJ_RELAY_SEND_DELAY:
+    state: false
   ERRINJ_VY_SQUASH_TIMEOUT:
     state: 0
   ERRINJ_VY_LOG_FLUSH:
     state: false
-  ERRINJ_RELAY_SEND_DELAY:
-    state: false
+  ERRINJ_RELAY_TIMEOUT:
+    state: 0
 ...
 errinj.set("some-injection", true)
 ---
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index e05a616a..effb57a6 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -880,9 +880,8 @@ s:drop()
 ---
 ...
 --
--- gh-3412 - assertion failure at exit in case:
--- * there is a fiber waiting for quota
--- * there is a pending vylog write
+-- gh-3412 - assertion failure at exit in case there is a fiber
+-- waiting for quota.
 --
 test_run:cmd("create server low_quota with script='vinyl/low_quota.lua'")
 ---
@@ -918,23 +917,6 @@ _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end
 repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 ---
 ...
-test_run:cmd("restart server low_quota with args='1048576'")
-box.error.injection.set('ERRINJ_VY_LOG_FLUSH_DELAY', true)
----
-- ok
-...
-fiber = require('fiber')
----
-...
-pad = string.rep('x', 100 * 1024)
----
-...
-_ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
----
-...
-repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
----
-...
 test_run:cmd('switch default')
 ---
 - true
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 4317ccb8..4169490a 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -312,9 +312,8 @@ ret
 s:drop()
 
 --
--- gh-3412 - assertion failure at exit in case:
--- * there is a fiber waiting for quota
--- * there is a pending vylog write
+-- gh-3412 - assertion failure at exit in case there is a fiber
+-- waiting for quota.
 --
 test_run:cmd("create server low_quota with script='vinyl/low_quota.lua'")
 test_run:cmd("start server low_quota with args='1048576'")
@@ -326,12 +325,6 @@ fiber = require('fiber')
 pad = string.rep('x', 100 * 1024)
 _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
 repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
-test_run:cmd("restart server low_quota with args='1048576'")
-box.error.injection.set('ERRINJ_VY_LOG_FLUSH_DELAY', true)
-fiber = require('fiber')
-pad = string.rep('x', 100 * 1024)
-_ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
-repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 test_run:cmd('switch default')
 test_run:cmd("stop server low_quota")
 test_run:cmd("cleanup server low_quota")
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (7 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 08/10] vinyl: zap ERRINJ_VY_LOG_FLUSH_DELAY Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-05-18 18:52   ` [tarantool-patches] " Konstantin Osipov
  2019-05-17 14:52 ` [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer Vladimir Davydov
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

So that we can use an allocator different from the region in vylog.
---
 src/box/key_def.c    |  6 +++---
 src/box/key_def.h    |  4 ++--
 src/box/sql/select.c |  2 +-
 src/box/vy_log.c     | 22 ++++++++++------------
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/src/box/key_def.c b/src/box/key_def.c
index fc547570..46f9c4ef 100644
--- a/src/box/key_def.c
+++ b/src/box/key_def.c
@@ -286,7 +286,7 @@ key_def_new(const struct key_part_def *parts, uint32_t part_count)
 
 int
 key_def_dump_parts(const struct key_def *def, struct key_part_def *parts,
-		   struct region *region)
+		   void *(*alloc_cb)(void *, size_t), void *alloc_ctx)
 {
 	for (uint32_t i = 0; i < def->part_count; i++) {
 		const struct key_part *part = &def->parts[i];
@@ -297,7 +297,7 @@ key_def_dump_parts(const struct key_def *def, struct key_part_def *parts,
 		part_def->nullable_action = part->nullable_action;
 		part_def->coll_id = part->coll_id;
 		if (part->path != NULL) {
-			char *path = region_alloc(region, part->path_len + 1);
+			char *path = alloc_cb(alloc_ctx, part->path_len + 1);
 			if (path == NULL) {
 				diag_set(OutOfMemory, part->path_len + 1,
 					 "region", "part_def->path");
@@ -808,7 +808,7 @@ key_def_find_pk_in_cmp_def(const struct key_def *cmp_def,
 			 "region", "key def parts");
 		goto out;
 	}
-	if (key_def_dump_parts(pk_def, parts, region) != 0)
+	if (key_def_dump_parts(pk_def, parts, region_alloc_cb, region) != 0)
 		goto out;
 	/*
 	 * Second, update field numbers to match the primary key
diff --git a/src/box/key_def.h b/src/box/key_def.h
index f4a1a8fd..147064da 100644
--- a/src/box/key_def.h
+++ b/src/box/key_def.h
@@ -332,12 +332,12 @@ key_def_new(const struct key_part_def *parts, uint32_t part_count);
 
 /**
  * Dump part definitions of the given key def.
- * The region is used for allocating JSON paths, if any.
+ * The callback is used for allocating JSON paths, if any.
  * Return -1 on memory allocation error, 0 on success.
  */
 int
 key_def_dump_parts(const struct key_def *def, struct key_part_def *parts,
-		   struct region *region);
+		   void *(*alloc_cb)(void *, size_t), void *alloc_ctx);
 
 /**
  * Update 'has_optional_parts' of @a key_def with correspondence
diff --git a/src/box/sql/select.c b/src/box/sql/select.c
index d3472a92..5745fb4c 100644
--- a/src/box/sql/select.c
+++ b/src/box/sql/select.c
@@ -1415,7 +1415,7 @@ sql_key_info_new_from_key_def(sql *db, const struct key_def *key_def)
 	key_info->key_def = NULL;
 	key_info->refs = 1;
 	key_info->part_count = key_def->part_count;
-	key_def_dump_parts(key_def, key_info->parts, NULL);
+	key_def_dump_parts(key_def, key_info->parts, NULL, NULL);
 	return key_info;
 }
 
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 67b8a763..7157bbf3 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -675,11 +675,10 @@ fail:
  * are duplicated as well.
  */
 static struct vy_log_record *
-vy_log_record_dup(struct region *pool, const struct vy_log_record *src)
+vy_log_record_dup(const struct vy_log_record *src,
+		  void *(*alloc_cb)(void *, size_t), void *alloc_ctx)
 {
-	size_t used = region_used(pool);
-
-	struct vy_log_record *dst = region_alloc(pool, sizeof(*dst));
+	struct vy_log_record *dst = alloc_cb(alloc_ctx, sizeof(*dst));
 	if (dst == NULL) {
 		diag_set(OutOfMemory, sizeof(*dst),
 			 "region", "struct vy_log_record");
@@ -690,7 +689,7 @@ vy_log_record_dup(struct region *pool, const struct vy_log_record *src)
 		const char *data = src->begin;
 		mp_next(&data);
 		size_t size = data - src->begin;
-		dst->begin = region_alloc(pool, size);
+		dst->begin = alloc_cb(alloc_ctx, size);
 		if (dst->begin == NULL) {
 			diag_set(OutOfMemory, size, "region",
 				 "vy_log_record::begin");
@@ -702,7 +701,7 @@ vy_log_record_dup(struct region *pool, const struct vy_log_record *src)
 		const char *data = src->end;
 		mp_next(&data);
 		size_t size = data - src->end;
-		dst->end = region_alloc(pool, size);
+		dst->end = alloc_cb(alloc_ctx, size);
 		if (dst->end == NULL) {
 			diag_set(OutOfMemory, size, "region",
 				 "struct vy_log_record");
@@ -713,21 +712,20 @@ vy_log_record_dup(struct region *pool, const struct vy_log_record *src)
 	if (src->key_def != NULL) {
 		size_t size = src->key_def->part_count *
 				sizeof(struct key_part_def);
-		dst->key_parts = region_alloc(pool, size);
+		dst->key_parts = alloc_cb(alloc_ctx, size);
 		if (dst->key_parts == NULL) {
 			diag_set(OutOfMemory, size, "region",
 				 "struct key_part_def");
 			goto err;
 		}
-		if (key_def_dump_parts(src->key_def, dst->key_parts, pool) != 0)
+		if (key_def_dump_parts(src->key_def, dst->key_parts,
+				       alloc_cb, alloc_ctx) != 0)
 			goto err;
 		dst->key_part_count = src->key_def->part_count;
 		dst->key_def = NULL;
 	}
 	return dst;
-
 err:
-	region_truncate(pool, used);
 	return NULL;
 }
 
@@ -1272,8 +1270,8 @@ vy_log_write(const struct vy_log_record *record)
 {
 	assert(latch_owner(&vy_log.latch) == fiber());
 
-	struct vy_log_record *tx_record = vy_log_record_dup(&vy_log.pool,
-							    record);
+	struct vy_log_record *tx_record = vy_log_record_dup(record,
+					region_alloc_cb, &vy_log.pool);
 	if (tx_record == NULL) {
 		diag_move(diag_get(), &vy_log.tx_diag);
 		vy_log.tx_failed = true;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (8 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts Vladimir Davydov
@ 2019-05-17 14:52 ` Vladimir Davydov
  2019-06-01  8:44   ` [tarantool-patches] " Konstantin Osipov
  2019-05-18 18:35 ` [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Konstantin Osipov
  2019-06-01  8:09 ` Konstantin Osipov
  11 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-17 14:52 UTC (permalink / raw)
  To: tarantool-patches

The latch is needed solely to synchronize access to the transaction
write buffer, which is shared by all transactions. We don't need it to
sync vylog readers vs writers as everything is done from the same
thread. So to get rid of the latch, which is a prerequisite for
transactional DDL, as it makes even simply Vinyl DDL operations
yielding, we need to rework the buffer management.

To achieve that, this patch introduces a separate object for each vylog
transaction, which stores the list of records to be committed. Once a
transaction is ready to be committed, it is sent to the vylog thread,
which writes it to vylog and then wakes up the awaiting fiber. For
non-discardable transactions, it doesn't wakeup any fiber, the fiber
returns immediately and the transaction is freed automatically.

To allow that, transactions are now allocated on lsregion, which acts
as a queue: new transactions are appended to the tail, committed
transactions are popped from the head.
---
 src/box/vy_log.c | 261 ++++++++++++++++++++++++++++++++++---------------------
 src/box/vy_log.h |   8 +-
 2 files changed, 169 insertions(+), 100 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 7157bbf3..ca0f48c3 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -41,6 +41,7 @@
 #include <unistd.h>
 
 #include <msgpuck/msgpuck.h>
+#include <small/lsregion.h>
 #include <small/mempool.h>
 #include <small/region.h>
 #include <small/rlist.h>
@@ -53,7 +54,7 @@
 #include "fiber.h"
 #include "iproto_constants.h" /* IPROTO_INSERT */
 #include "key_def.h"
-#include "latch.h"
+#include "memory.h"
 #include "replication.h" /* INSTANCE_UUID */
 #include "salad/stailq.h"
 #include "say.h"
@@ -129,6 +130,31 @@ static const char *vy_log_type_name[] = {
 	[VY_LOG_ABORT_REBOOTSTRAP]	= "abort_rebootstrap",
 };
 
+/** Metadata transaction. */
+struct vy_log_tx {
+	/** Message sent to vylog thread via cbus. */
+	struct cmsg cmsg;
+	/**
+	 * Unique, monotonically growing identifier.
+	 * Initialized with vy_log::last_tx_id.
+	 */
+	int64_t id;
+	/** List of records, linked by vy_log_record::in_tx. */
+	struct stailq records;
+	/**
+	 * Fiber waiting for transaction completion or NULL
+	 * if the transaction is non-discardable and hence
+	 * there's no need to wait for it - it will stay in
+	 * the xlog buffer on failure and gets flushed along
+	 * on the next write.
+	 */
+	struct fiber *fiber;
+	/** Set to true if write failed. */
+	bool failed;
+	/** Write error information. */
+	struct diag diag;
+};
+
 /** Metadata log object. */
 struct vy_log {
 	/**
@@ -141,35 +167,20 @@ struct vy_log {
 	struct vclock last_checkpoint;
 	/** Recovery context. */
 	struct vy_recovery *recovery;
-	/** Latch protecting the log buffer. */
-	struct latch latch;
 	/**
 	 * Next ID to use for a vinyl object.
 	 * Used by vy_log_next_id().
 	 */
 	int64_t next_id;
-	/** A region of struct vy_log_record entries. */
-	struct region pool;
 	/**
-	 * Records awaiting to be written to disk.
-	 * Linked by vy_log_record::in_tx;
+	 * Memory pool for allocating transaction records.
+	 * Committed transactions are collected by id.
 	 */
-	struct stailq tx;
-	/**
-	 * Flag set if vy_log_write() failed.
-	 *
-	 * It indicates that that the current transaction must be
-	 * aborted on vy_log_commit(). Thanks to this flag, we don't
-	 * need to add error handling code after each invocation of
-	 * vy_log_write(), instead we only check vy_log_commit()
-	 * return code.
-	 */
-	bool tx_failed;
-	/**
-	 * Diagnostic area where vy_log_write() error is stored,
-	 * only relevant if @tx_failed is set.
-	 */
-	struct diag tx_diag;
+	struct lsregion tx_pool;
+	/** Current transaction or NULL. */
+	struct vy_log_tx *tx;
+	/** Identifier of the last started transaction. */
+	int64_t last_tx_id;
 	/** Vylog file. */
 	struct xlog xlog;
 	/** Vylog IO thread. */
@@ -754,10 +765,7 @@ vy_log_init(const char *dir)
 {
 	xdir_create(&vy_log.dir, dir, VYLOG, &INSTANCE_UUID,
 		    &xlog_opts_default);
-	latch_create(&vy_log.latch);
-	region_create(&vy_log.pool, cord_slab_cache());
-	stailq_create(&vy_log.tx);
-	diag_create(&vy_log.tx_diag);
+	lsregion_create(&vy_log.tx_pool, &runtime);
 	xlog_clear(&vy_log.xlog);
 
 	if (cord_costart(&vy_log.cord, "vinyl.log", vy_log_thread_f, NULL) != 0)
@@ -766,18 +774,14 @@ vy_log_init(const char *dir)
 	cpipe_create(&vy_log.pipe, "vylog");
 }
 
-struct vy_log_flush_msg {
-	struct cbus_call_msg base;
-	bool no_discard;
-};
-
-static int
-vy_log_flush_f(struct cbus_call_msg *base)
+static void
+vy_log_tx_write_f(struct cmsg *msg)
 {
-	struct vy_log_flush_msg *msg = (struct vy_log_flush_msg *)base;
+	struct vy_log_tx *tx = (struct vy_log_tx *)msg;
 	struct region *region = &fiber()->gc;
 	size_t region_svp = region_used(region);
 	struct xlog *xlog = &vy_log.xlog;
+	bool no_discard = (tx->fiber == NULL);
 
 	if (!xlog_is_open(xlog)) {
 		if (vy_log_open() < 0)
@@ -787,7 +791,7 @@ vy_log_flush_f(struct cbus_call_msg *base)
 	/* Encode the transaction and write it to the xlog buffer. */
 	xlog_tx_begin(xlog);
 	struct vy_log_record *record;
-	stailq_foreach_entry(record, &vy_log.tx, in_tx) {
+	stailq_foreach_entry(record, &tx->records, in_tx) {
 		struct xrow_header row;
 		if (vy_log_record_encode(record, &row) < 0) {
 			xlog_tx_rollback(xlog);
@@ -801,7 +805,7 @@ vy_log_flush_f(struct cbus_call_msg *base)
 	}
 
 	/* Flush the xlog buffer to disk. */
-	if (msg->no_discard)
+	if (no_discard)
 		xlog_tx_set_rollback_svp(xlog);
 
 	int rc = 0;
@@ -815,7 +819,7 @@ vy_log_flush_f(struct cbus_call_msg *base)
 	if (rc >= 0)
 		rc = xlog_flush(xlog);
 
-	if (rc < 0 && msg->no_discard) {
+	if (rc < 0 && no_discard) {
 		/*
 		 * The message got appended to the xlog buffer so
 		 * it will get flushed along with the next message.
@@ -827,10 +831,13 @@ vy_log_flush_f(struct cbus_call_msg *base)
 		say_warn("failed to flush vylog: %s", e->errmsg);
 		rc = 0;
 	}
+	if (rc < 0)
+		goto fail;
+out:
 	region_truncate(region, region_svp);
-	return rc < 0 ? -1 : 0;
+	return;
 fail:
-	if (msg->no_discard) {
+	if (no_discard) {
 		/*
 		 * The caller doesn't tolerate failures so we have
 		 * no choice but panic. Good news is this shouldn't
@@ -843,36 +850,69 @@ fail:
 		diag_log();
 		panic("non-discardable vylog transaction failed");
 	}
-	region_truncate(region, region_svp);
-	return -1;
+	tx->failed = true;
+	diag_move(diag_get(), &tx->diag);
+	goto out;
+}
+
+static void
+vy_log_tx_free(struct vy_log_tx *tx)
+{
+	diag_destroy(&tx->diag);
+	lsregion_gc(&vy_log.tx_pool, tx->id);
+}
+
+static void
+vy_log_tx_write_complete_f(struct cmsg *cmsg)
+{
+	struct vy_log_tx *tx = container_of(cmsg, struct vy_log_tx, cmsg);
+	if (tx->fiber == NULL) {
+		/*
+		 * Nobody's waiting for the transaction.
+		 * It's our responsibility to free memory.
+		 */
+		vy_log_tx_free(tx);
+	} else {
+		fiber_wakeup(tx->fiber);
+	}
 }
 
-/**
- * Try to flush the log buffer to disk.
- *
- * We always flush the entire vy_log buffer as a single xlog
- * transaction, since we do not track boundaries of @no_discard
- * buffered transactions, and want to avoid a partial write.
- */
 static int
-vy_log_flush(bool no_discard)
+vy_log_tx_write(bool no_discard)
 {
-	if (stailq_empty(&vy_log.tx))
-		return 0; /* nothing to do */
+	struct vy_log_tx *tx = vy_log.tx;
+	assert(tx != NULL);
+	vy_log.tx = NULL;
 
 	/*
 	 * Do actual disk writes on behalf of the vylog thread
 	 * so as not to block the tx thread.
 	 */
-	struct vy_log_flush_msg msg;
-	msg.no_discard = no_discard;
+	static const struct cmsg_hop route[2] = {
+		{ vy_log_tx_write_f, &vy_log.tx_pipe },
+		{ vy_log_tx_write_complete_f, NULL },
+	};
+	cmsg_init(&tx->cmsg, route);
+	tx->fiber = no_discard ? NULL : fiber();
+	cpipe_push(&vy_log.pipe, &tx->cmsg);
+
+	/*
+	 * Non-discardable transactions can't fail so no need to wait
+	 * for it to complete. Memory will be freed on completion.
+	 */
+	if (no_discard)
+		return 0;
+
 	bool cancellable = fiber_set_cancellable(false);
-	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
-			   vy_log_flush_f, NULL, TIMEOUT_INFINITY);
+	fiber_yield();
 	fiber_set_cancellable(cancellable);
 
-	region_free(&vy_log.pool);
-	stailq_create(&vy_log.tx);
+	int rc = 0;
+	if (tx->failed) {
+		diag_move(&tx->diag, diag_get());
+		rc = -1;
+	}
+	vy_log_tx_free(tx);
 	return rc;
 }
 
@@ -883,8 +923,7 @@ vy_log_free(void)
 	if (cord_join(&vy_log.cord) != 0)
 		panic_syserror("failed to join vinyl log thread");
 	xdir_destroy(&vy_log.dir);
-	region_destroy(&vy_log.pool);
-	diag_destroy(&vy_log.tx_diag);
+	lsregion_destroy(&vy_log.tx_pool);
 }
 
 /**
@@ -1058,26 +1097,34 @@ vy_log_begin_recovery(const struct vclock *vclock)
 	return recovery;
 }
 
-int
-vy_log_end_recovery(void)
+static int
+vy_log_flush_recovered_records(void)
 {
-	assert(vy_log.recovery != NULL);
+	if (vy_log.tx == NULL)
+		return 0;
 
 	/*
 	 * Update the recovery context with records written during
 	 * recovery - we will need them for garbage collection.
 	 */
 	struct vy_log_record *record;
-	stailq_foreach_entry(record, &vy_log.tx, in_tx)
+	stailq_foreach_entry(record, &vy_log.tx->records, in_tx)
 		vy_recovery_process_record(vy_log.recovery, record);
 
 	/* Flush all pending records. */
-	if (vy_log_flush(false) != 0) {
+	if (vy_log_tx_write(false) != 0) {
 		diag_log();
 		say_error("failed to flush vylog after recovery");
 		return -1;
 	}
+	return 0;
+}
 
+int
+vy_log_end_recovery(void)
+{
+	assert(vy_log.recovery != NULL);
+	vy_log_flush_recovered_records();
 	xdir_collect_inprogress(&vy_log.dir);
 	vy_log.recovery = NULL;
 	return 0;
@@ -1110,6 +1157,8 @@ vy_log_rotate_f(struct cbus_call_msg *base)
 	 */
 	if (xlog_is_open(&vy_log.xlog))
 		xlog_close(&vy_log.xlog, false);
+
+	vclock_copy(&vy_log.last_checkpoint, &msg->vclock);
 	return 0;
 }
 
@@ -1134,17 +1183,16 @@ vy_log_rotate(const struct vclock *vclock)
 		    (long long)prev_signature, (long long)signature);
 
 	/*
-	 * Lock out all concurrent log writers while we are rotating it.
-	 * This effectively stalls the vinyl scheduler for a while, but
+	 * Do actual work from vylog thread so as not to stall tx thread.
+	 * Note, both rotation and writes are done by the same thread so
+	 * this effectively stalls the vinyl scheduler for a while, but
 	 * this is acceptable, because (1) the log file is small and
 	 * hence can be rotated fairly quickly so the stall isn't going
 	 * to take too long and (2) dumps/compactions, which are scheduled
 	 * by the scheduler, are rare events so there shouldn't be too
 	 * many of them piling up due to log rotation.
 	 */
-	latch_lock(&vy_log.latch);
 
-	/* Do actual work from coio so as not to stall tx thread. */
 	struct vy_log_rotate_msg msg;
 	vclock_copy(&msg.vclock, vclock);
 
@@ -1154,19 +1202,13 @@ vy_log_rotate(const struct vclock *vclock)
 	fiber_set_cancellable(cancellable);
 
 	if (rc != 0)
-		goto fail;
-
-	vclock_copy(&vy_log.last_checkpoint, vclock);
+		return -1;
 
 	/* Add the new vclock to the xdir so that we can track it. */
 	xdir_add_vclock(&vy_log.dir, vclock);
 
-	latch_unlock(&vy_log.latch);
 	say_verbose("done rotating vylog");
 	return 0;
-fail:
-	latch_unlock(&vy_log.latch);
-	return -1;
 }
 
 void
@@ -1201,8 +1243,26 @@ vy_log_backup_path(const struct vclock *vclock)
 void
 vy_log_tx_begin(void)
 {
-	latch_lock(&vy_log.latch);
-	vy_log.tx_failed = false;
+	if (vy_log.tx != NULL) {
+		/*
+		 * We don't commit vylog transactions until recovery
+		 * is complete, see vy_log_tx_do_commit().
+		 */
+		assert(vy_log.recovery != NULL);
+		goto done;
+	}
+	struct vy_log_tx *tx = lsregion_alloc(&vy_log.tx_pool, sizeof(*tx),
+					      ++vy_log.last_tx_id);
+	if (tx == NULL)
+		panic("failed to allocate memory for vylog transaction");
+
+	tx->id = vy_log.last_tx_id;
+	stailq_create(&tx->records);
+	diag_create(&tx->diag);
+	tx->failed = false;
+	tx->fiber = NULL;
+	vy_log.tx = tx;
+done:
 	say_verbose("begin vylog transaction");
 }
 
@@ -1216,18 +1276,21 @@ vy_log_tx_begin(void)
 static int
 vy_log_tx_do_commit(bool no_discard)
 {
-	assert(latch_owner(&vy_log.latch) == fiber());
+	struct vy_log_tx *tx = vy_log.tx;
+	assert(tx != NULL);
 
-	if (vy_log.tx_failed) {
+	if (tx->failed) {
 		/*
 		 * vy_log_write() failed to append a record to tx.
 		 * @no_discard transactions can't handle this.
 		 */
-		diag_move(&vy_log.tx_diag, diag_get());
+		diag_move(&tx->diag, diag_get());
 		if (no_discard) {
 			diag_log();
 			panic("non-discardable vylog transaction failed");
 		}
+		vy_log_tx_free(tx);
+		vy_log.tx = NULL;
 		goto fail;
 	}
 
@@ -1240,15 +1303,13 @@ vy_log_tx_do_commit(bool no_discard)
 	if (vy_log.recovery != NULL)
 		goto done;
 
-	if (vy_log_flush(no_discard) != 0)
+	if (vy_log_tx_write(no_discard) != 0)
 		goto fail;
 done:
 	say_verbose("commit vylog transaction");
-	latch_unlock(&vy_log.latch);
 	return 0;
 fail:
 	say_verbose("rollback vylog transaction");
-	latch_unlock(&vy_log.latch);
 	return -1;
 }
 
@@ -1265,21 +1326,29 @@ vy_log_tx_try_commit(void)
 		unreachable();
 }
 
+static void *
+vy_log_tx_alloc_cb(void *ctx, size_t size)
+{
+	(void)ctx;
+	return lsregion_alloc(&vy_log.tx_pool, size, vy_log.last_tx_id);
+}
+
 void
 vy_log_write(const struct vy_log_record *record)
 {
-	assert(latch_owner(&vy_log.latch) == fiber());
+	struct vy_log_tx *tx = vy_log.tx;
+	assert(tx != NULL);
 
 	struct vy_log_record *tx_record = vy_log_record_dup(record,
-					region_alloc_cb, &vy_log.pool);
+					vy_log_tx_alloc_cb, NULL);
 	if (tx_record == NULL) {
-		diag_move(diag_get(), &vy_log.tx_diag);
-		vy_log.tx_failed = true;
+		diag_move(diag_get(), &tx->diag);
+		tx->failed = true;
 		return;
 	}
 
 	say_verbose("write vylog record: %s", vy_log_record_str(tx_record));
-	stailq_add_tail_entry(&vy_log.tx, tx_record, in_tx);
+	stailq_add_tail_entry(&tx->records, tx_record, in_tx);
 }
 
 /**
@@ -2382,9 +2451,11 @@ vy_recovery_new_f(struct cbus_call_msg *base)
 struct vy_recovery *
 vy_recovery_new(int64_t signature, int flags)
 {
-	/* Lock out concurrent writers while we are loading the log. */
-	latch_lock(&vy_log.latch);
-
+	/*
+	 * No need to lock out concurrent writers as loading is
+	 * done by vylog thread, the same thread that processes
+	 * writes.
+	 */
 	struct vy_recovery_msg msg;
 	msg.signature = signature;
 	msg.flags = flags;
@@ -2396,13 +2467,9 @@ vy_recovery_new(int64_t signature, int flags)
 	fiber_set_cancellable(cancellable);
 
 	if (rc != 0)
-		goto fail;
+		return NULL;
 
-	latch_unlock(&vy_log.latch);
 	return msg.recovery;
-fail:
-	latch_unlock(&vy_log.latch);
-	return NULL;
 }
 
 void
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index bd6a4a0d..4fb6da33 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -273,7 +273,7 @@ struct vy_log_record {
 	int64_t gc_lsn;
 	/** For runs: number of dumps it took to create the run. */
 	uint32_t dump_count;
-	/** Link in vy_log::tx. */
+	/** Link in vy_log_tx::records. */
 	struct stailq_entry in_tx;
 };
 
@@ -482,7 +482,8 @@ vy_log_tx_begin(void);
  * Commit a transaction started with vy_log_tx_begin().
  *
  * This function flushes all buffered records to disk. If it fails,
- * all records of the current transaction are discarded.
+ * all records of the current transaction are discarded. Note, it
+ * waits for the transaction to be written, i.e. yields.
  *
  * See also vy_log_tx_try_commit().
  *
@@ -497,7 +498,8 @@ vy_log_tx_commit(void);
  * Similarly to vy_log_tx_commit(), this function tries to write all
  * buffered records to disk, but in case of failure pending records
  * are not expunged from the buffer, so that the next transaction
- * will retry to flush them.
+ * will retry to flush them. In contrast to vy_log_tx_commit(),
+ * this function doesn't yield.
  */
 void
 vy_log_tx_try_commit(void);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (9 preceding siblings ...)
  2019-05-17 14:52 ` [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer Vladimir Davydov
@ 2019-05-18 18:35 ` Konstantin Osipov
  2019-05-20  8:09   ` Vladimir Davydov
  2019-06-01  8:09 ` Konstantin Osipov
  11 siblings, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-05-18 18:35 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> DDL on_commit triggers may yield on vylog write, even though they don't
> really need to wait for the transaction to be flushed to disk, as we can
> recover it after restart. This blocks implementation of transactional
> DDL (see #4083), because even an empty index creation may yield and
> hence can't be executed transactionally along with other lightweight
> operations, such as space creation.
> 
> This patch reworks vylog implementation so that we don't need to yield
> in case the vylog transaction is non-discardable (i.e. stays in the
> memory buffer in case of failure). Since on_commit triggers issue only
> non-discardable transactions, this makes them non-yielding, as they
> should be.

why do you need to have an own thread for vylog?

> 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 01/10] box: zap atfork callback
  2019-05-17 14:52 ` [PATCH 01/10] box: zap atfork callback Vladimir Davydov
@ 2019-05-18 18:37   ` Konstantin Osipov
  2019-05-20  8:13     ` Vladimir Davydov
  2019-06-01  8:16     ` Konstantin Osipov
  0 siblings, 2 replies; 40+ messages in thread
From: Konstantin Osipov @ 2019-05-18 18:37 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> box_atfork calls wal_atfork which in turn calls xlog_atfork for the wal
> and vylog files. A comment to xlog_atfork says that it's necessary to
> prevent atexit handlers in a child from closing xlog files again, but we
> don't use atexit for that anymore. A comment to box_atfork says that
> box.coredump forks to write a core, but there's no box.coredump anymore.
> There's also a comment mentioning box.cfg.background, but when we fork
> that early there's no xlog file open.
> 
> To sum it up, atfork looks like a piece of legacy code. Let's get rid of
> it now so as not to bother patching it later.

If anyone adds a fork any time in the future, xlogs will break
silently, and it would be hard to catch in a test. 

If you wish to remove the dead code, please keep the atfork, but panic in it.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 02/10] vinyl: add a separate thread for vylog
  2019-05-17 14:52 ` [PATCH 02/10] vinyl: add a separate thread for vylog Vladimir Davydov
@ 2019-05-18 18:39   ` Konstantin Osipov
  2019-05-20  8:17     ` Vladimir Davydov
  2019-06-01  8:26   ` Konstantin Osipov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-05-18 18:39 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:

We need an infrastructure to run an opaque piece of code in any
OS thread, not create a new thread any time we have an itch for
refactoring.

If you wish to run vylog in an own thread, fine, WAL thread has
plenty of capacity to host 10 vylog threads, please use it.

> Historically, we use the WAL thread for writing vylog files, because,
> I guess, we didn't want to introduce a separate thread. However, that
> design decision turned out to be quite dubious:
> 
>  - vy_log (vy_log.c) calls vy_log_writer (wal.c) while vy_log_writer
>    calls back to vy_log. That is we have a piece of logic split crudely
>    between two files, which makes the code difficult to follow and just
>    looks ugly.
> 
>  - We can't make vy_log part of vy_env because of this relationship.
>    In fact, vy_log is the last singleton in the whole vy implementation:
>    everything else is wrapped neatly (well, not quite, but still) in
>    vy_env struct.
> 
>  - We can't kill the infamous vy_log.latch, which is a prerequisite for
>    transactional DDL. The latch is needed to sync between vy_log readers
>    and writers. You see, currently we can't read vy_log in the same
>    thread where we write it, which would eliminate the need for any kind
>    of synchronization, because vy_log read is quite a heavy operation -
>    it may stall WAL writes and thus badly affect latency. So we have to
>    do it from a coio thread.
> 
> That being said, it's time to move vy_log writer back to where it
> belongs - vy_log.c, separate thread. This patch does just that. It
> doesn't patch the implementation to fix the flaws enumerated above -
> this will be done by following patches.
> ---
>  src/box/vy_log.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++----
>  src/box/vy_log.h |   6 ---
>  src/box/wal.c    |  79 --------------------------------------
>  src/box/wal.h    |  15 --------
>  4 files changed, 105 insertions(+), 108 deletions(-)
> 
> diff --git a/src/box/vy_log.c b/src/box/vy_log.c
> index edd61b33..25ab73fd 100644
> --- a/src/box/vy_log.c
> +++ b/src/box/vy_log.c
> @@ -46,19 +46,20 @@
>  #include <small/rlist.h>
>  
>  #include "assoc.h"
> +#include "cbus.h"
>  #include "coio_task.h"
>  #include "diag.h"
>  #include "errcode.h"
>  #include "errinj.h"
>  #include "fiber.h"
>  #include "iproto_constants.h" /* IPROTO_INSERT */
> +#include "journal.h"
>  #include "key_def.h"
>  #include "latch.h"
>  #include "replication.h" /* INSTANCE_UUID */
>  #include "salad/stailq.h"
>  #include "say.h"
>  #include "tt_static.h"
> -#include "wal.h"
>  #include "vclock.h"
>  #include "xlog.h"
>  #include "xrow.h"
> @@ -178,6 +179,14 @@ struct vy_log {
>  	 * only relevant if @tx_failed is set.
>  	 */
>  	struct diag tx_diag;
> +	/** Vylog file. */
> +	struct xlog xlog;
> +	/** Vylog IO thread. */
> +	struct cord cord;
> +	/** Pipe to the vylog thread. */
> +	struct cpipe pipe;
> +	/** Returning pipe from the vylog thread back to tx. */
> +	struct cpipe tx_pipe;
>  };
>  static struct vy_log vy_log;
>  
> @@ -189,6 +198,9 @@ vy_recovery_process_record(struct vy_recovery *recovery,
>  			   const struct vy_log_record *record);
>  
>  static int
> +vy_log_open(void);
> +
> +static int
>  vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
>  
>  int
> @@ -731,6 +743,26 @@ err:
>  	return NULL;
>  }
>  
> +/** Vylog thread main loop. */
> +static int
> +vy_log_thread_f(va_list ap)
> +{
> +	(void)ap;
> +
> +	struct cbus_endpoint endpoint;
> +	cbus_endpoint_create(&endpoint, "vylog", fiber_schedule_cb, fiber());
> +
> +	cpipe_create(&vy_log.tx_pipe, "tx");
> +
> +	cbus_loop(&endpoint);
> +
> +	if (xlog_is_open(&vy_log.xlog))
> +		xlog_close(&vy_log.xlog, false);
> +
> +	cpipe_destroy(&vy_log.tx_pipe);
> +	return 0;
> +}
> +
>  void
>  vy_log_init(const char *dir)
>  {
> @@ -740,7 +772,45 @@ vy_log_init(const char *dir)
>  	region_create(&vy_log.pool, cord_slab_cache());
>  	stailq_create(&vy_log.tx);
>  	diag_create(&vy_log.tx_diag);
> -	wal_init_vy_log();
> +	xlog_clear(&vy_log.xlog);
> +
> +	if (cord_costart(&vy_log.cord, "vinyl.log", vy_log_thread_f, NULL) != 0)
> +		panic_syserror("failed to start vinyl log thread");
> +
> +	cpipe_create(&vy_log.pipe, "vylog");
> +}
> +
> +struct vy_log_flush_msg {
> +	struct cbus_call_msg base;
> +	struct journal_entry *entry;
> +};
> +
> +static int
> +vy_log_flush_f(struct cbus_call_msg *base)
> +{
> +	struct vy_log_flush_msg *msg = (struct vy_log_flush_msg *)base;
> +	struct journal_entry *entry = msg->entry;
> +	struct xlog *xlog = &vy_log.xlog;
> +
> +	if (!xlog_is_open(xlog)) {
> +		if (vy_log_open() < 0)
> +			return -1;
> +	}
> +
> +	xlog_tx_begin(xlog);
> +	for (int i = 0; i < entry->n_rows; ++i) {
> +		entry->rows[i]->tm = ev_now(loop());
> +		if (xlog_write_row(xlog, entry->rows[i]) < 0) {
> +			xlog_tx_rollback(xlog);
> +			return -1;
> +		}
> +	}
> +
> +	if (xlog_tx_commit(xlog) < 0 ||
> +	    xlog_flush(xlog) < 0)
> +		return -1;
> +
> +	return 0;
>  }
>  
>  /**
> @@ -798,10 +868,16 @@ vy_log_flush(void)
>  	assert(i == tx_size);
>  
>  	/*
> -	 * Do actual disk writes on behalf of the WAL
> +	 * Do actual disk writes on behalf of the vylog thread
>  	 * so as not to block the tx thread.
>  	 */
> -	if (wal_write_vy_log(entry) != 0)
> +	struct vy_log_flush_msg msg;
> +	msg.entry = entry;
> +	bool cancellable = fiber_set_cancellable(false);
> +	int rc = cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg.base,
> +			   vy_log_flush_f, NULL, TIMEOUT_INFINITY);
> +	fiber_set_cancellable(cancellable);
> +	if (rc != 0)
>  		goto err;
>  
>  	/* Success. Free flushed records. */
> @@ -817,14 +893,21 @@ err:
>  void
>  vy_log_free(void)
>  {
> +	cbus_stop_loop(&vy_log.pipe);
> +	if (cord_join(&vy_log.cord) != 0)
> +		panic_syserror("failed to join vinyl log thread");
>  	xdir_destroy(&vy_log.dir);
>  	region_destroy(&vy_log.pool);
>  	diag_destroy(&vy_log.tx_diag);
>  }
>  
> -int
> -vy_log_open(struct xlog *xlog)
> +/**
> + * Open current vy_log file.
> + */
> +static int
> +vy_log_open(void)
>  {
> +	struct xlog *xlog = &vy_log.xlog;
>  	/*
>  	 * Open the current log file or create a new one
>  	 * if it doesn't exist.
> @@ -1022,6 +1105,15 @@ vy_log_rotate_f(va_list ap)
>  	return vy_log_create(vclock, recovery);
>  }
>  
> +static int
> +vy_log_close_f(struct cbus_call_msg *msg)
> +{
> +	(void)msg;
> +	if (xlog_is_open(&vy_log.xlog))
> +		xlog_close(&vy_log.xlog, false);
> +	return 0;
> +}
> +
>  int
>  vy_log_rotate(const struct vclock *vclock)
>  {
> @@ -1069,9 +1161,14 @@ vy_log_rotate(const struct vclock *vclock)
>  
>  	/*
>  	 * Success. Close the old log. The new one will be opened
> -	 * automatically on the first write (see wal_write_vy_log()).
> +	 * automatically on the first write (see vy_log_flush_f()).
>  	 */
> -	wal_rotate_vy_log();
> +	struct cbus_call_msg msg;
> +	bool cancellable = fiber_set_cancellable(false);
> +	cbus_call(&vy_log.pipe, &vy_log.tx_pipe, &msg,
> +		  vy_log_close_f, NULL, TIMEOUT_INFINITY);
> +	fiber_set_cancellable(cancellable);
> +
>  	vclock_copy(&vy_log.last_checkpoint, vclock);
>  
>  	/* Add the new vclock to the xdir so that we can track it. */
> diff --git a/src/box/vy_log.h b/src/box/vy_log.h
> index ee38c193..81f62706 100644
> --- a/src/box/vy_log.h
> +++ b/src/box/vy_log.h
> @@ -439,12 +439,6 @@ void
>  vy_log_free(void);
>  
>  /**
> - * Open current vy_log file.
> - */
> -int
> -vy_log_open(struct xlog *xlog);
> -
> -/**
>   * Rotate the metadata log. This function creates a new
>   * xlog file in the log directory having vclock @vclock
>   * and writes records required to recover active LSM trees.
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 25bceb68..1f93e162 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -39,7 +39,6 @@
>  
>  #include "xlog.h"
>  #include "xrow.h"
> -#include "vy_log.h"
>  #include "cbus.h"
>  #include "coio_task.h"
>  #include "replication.h"
> @@ -173,15 +172,6 @@ struct wal_msg {
>  	struct vclock vclock;
>  };
>  
> -/**
> - * Vinyl metadata log writer.
> - */
> -struct vy_log_writer {
> -	/** The metadata log file. */
> -	struct xlog xlog;
> -};
> -
> -static struct vy_log_writer vy_log_writer;
>  static struct wal_writer wal_writer_singleton;
>  
>  enum wal_mode
> @@ -1115,9 +1105,6 @@ wal_writer_f(va_list ap)
>  	if (xlog_is_open(&writer->current_wal))
>  		xlog_close(&writer->current_wal, false);
>  
> -	if (xlog_is_open(&vy_log_writer.xlog))
> -		xlog_close(&vy_log_writer.xlog, false);
> -
>  	cpipe_destroy(&writer->tx_prio_pipe);
>  	return 0;
>  }
> @@ -1198,72 +1185,6 @@ wal_write_in_wal_mode_none(struct journal *journal,
>  	return vclock_sum(&writer->vclock);
>  }
>  
> -void
> -wal_init_vy_log()
> -{
> -	xlog_clear(&vy_log_writer.xlog);
> -}
> -
> -struct wal_write_vy_log_msg
> -{
> -	struct cbus_call_msg base;
> -	struct journal_entry *entry;
> -};
> -
> -static int
> -wal_write_vy_log_f(struct cbus_call_msg *msg)
> -{
> -	struct journal_entry *entry =
> -		((struct wal_write_vy_log_msg *)msg)->entry;
> -
> -	if (! xlog_is_open(&vy_log_writer.xlog)) {
> -		if (vy_log_open(&vy_log_writer.xlog) < 0)
> -			return -1;
> -	}
> -
> -	if (xlog_write_entry(&vy_log_writer.xlog, entry) < 0)
> -		return -1;
> -
> -	if (xlog_flush(&vy_log_writer.xlog) < 0)
> -		return -1;
> -
> -	return 0;
> -}
> -
> -int
> -wal_write_vy_log(struct journal_entry *entry)
> -{
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	struct wal_write_vy_log_msg msg;
> -	msg.entry= entry;
> -	bool cancellable = fiber_set_cancellable(false);
> -	int rc = cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
> -			   &msg.base, wal_write_vy_log_f, NULL,
> -			   TIMEOUT_INFINITY);
> -	fiber_set_cancellable(cancellable);
> -	return rc;
> -}
> -
> -static int
> -wal_rotate_vy_log_f(struct cbus_call_msg *msg)
> -{
> -	(void) msg;
> -	if (xlog_is_open(&vy_log_writer.xlog))
> -		xlog_close(&vy_log_writer.xlog, false);
> -	return 0;
> -}
> -
> -void
> -wal_rotate_vy_log()
> -{
> -	struct wal_writer *writer = &wal_writer_singleton;
> -	struct cbus_call_msg msg;
> -	bool cancellable = fiber_set_cancellable(false);
> -	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe, &msg,
> -		  wal_rotate_vy_log_f, NULL, TIMEOUT_INFINITY);
> -	fiber_set_cancellable(cancellable);
> -}
> -
>  static void
>  wal_watcher_notify(struct wal_watcher *watcher, unsigned events)
>  {
> diff --git a/src/box/wal.h b/src/box/wal.h
> index a88a3f23..8448f6d4 100644
> --- a/src/box/wal.h
> +++ b/src/box/wal.h
> @@ -222,21 +222,6 @@ wal_set_checkpoint_threshold(int64_t threshold);
>  void
>  wal_collect_garbage(const struct vclock *vclock);
>  
> -void
> -wal_init_vy_log();
> -
> -/**
> - * Write xrows to the vinyl metadata log.
> - */
> -int
> -wal_write_vy_log(struct journal_entry *req);
> -
> -/**
> - * Rotate the vinyl metadata log.
> - */
> -void
> -wal_rotate_vy_log();
> -
>  #if defined(__cplusplus)
>  } /* extern "C" */
>  #endif /* defined(__cplusplus) */
> -- 
> 2.11.0
> 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction
  2019-05-17 14:52 ` [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction Vladimir Davydov
@ 2019-05-18 18:47   ` Konstantin Osipov
  2019-05-20  8:27     ` Vladimir Davydov
  2019-06-01  8:39   ` Konstantin Osipov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-05-18 18:47 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN),
> then try to delete their files unless they are needed for recovery from
> the checkpoint, and finally mark them as not needed in the vylog
> (VY_LOG_FORGET_RUN). There's a potential race sitting here: the problem
> is the garbage collector might kick in after files are dropped, but
> before they are marked as not needed. If this happens, there will be
> runs that have two VY_LOG_FORGET_RUN records, which will break recovery:
> 
>   Run XX is forgotten, but not registered
> 
> The following patches make the race more likely to happen so let's
> eliminate it by making the garbage collector the only one who can mark
> runs as not needed (i.e. write VY_LOG_FORGET_RUN record). There will
> be no warnings, because the garbage collector silently ignores ENOENT
> errors, see vy_gc().
> 
> Another good thing about this patch is that now we never yield inside
> a vylog transaction, which makes it easier to remove the vylog latch
> blocking implementation of transactional DDL.

Should be a separate commit with a test?


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-05-17 14:52 ` [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts Vladimir Davydov
@ 2019-05-18 18:52   ` Konstantin Osipov
  2019-05-20  8:34     ` Vladimir Davydov
  2019-06-01  8:41     ` Konstantin Osipov
  0 siblings, 2 replies; 40+ messages in thread
From: Konstantin Osipov @ 2019-05-18 18:52 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> So that we can use an allocator different from the region in vylog.

Please, let's simply always use malloc.

btw, sql code is shit. not a single comment for entire call trace
of key_def_dump_parts() from there. I failed to understand whether
this is a hot path or not.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers
  2019-05-18 18:35 ` [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Konstantin Osipov
@ 2019-05-20  8:09   ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-20  8:09 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, May 18, 2019 at 09:35:38PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > DDL on_commit triggers may yield on vylog write, even though they don't
> > really need to wait for the transaction to be flushed to disk, as we can
> > recover it after restart. This blocks implementation of transactional
> > DDL (see #4083), because even an empty index creation may yield and
> > hence can't be executed transactionally along with other lightweight
> > operations, such as space creation.
> > 
> > This patch reworks vylog implementation so that we don't need to yield
> > in case the vylog transaction is non-discardable (i.e. stays in the
> > memory buffer in case of failure). Since on_commit triggers issue only
> > non-discardable transactions, this makes them non-yielding, as they
> > should be.
> 
> why do you need to have an own thread for vylog?

Because this simplifies things a lot - we don't need to synchronize
between vylog readers and writers as everything is done from the same
thread. Currently, we read vylog from coio and write it from WAL using a
latch to synchronize between them. This makes any write to vylog, even
non-discardable one, which lands in the memory buffer on disk error,
yielding. Besides, all those calls from vylog to WAL and back look ugly.

Anyway, what was the reason to write vylog on behalf of WAL in the first
place?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 01/10] box: zap atfork callback
  2019-05-18 18:37   ` [tarantool-patches] " Konstantin Osipov
@ 2019-05-20  8:13     ` Vladimir Davydov
  2019-06-01  8:16     ` Konstantin Osipov
  1 sibling, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-20  8:13 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, May 18, 2019 at 09:37:19PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > box_atfork calls wal_atfork which in turn calls xlog_atfork for the wal
> > and vylog files. A comment to xlog_atfork says that it's necessary to
> > prevent atexit handlers in a child from closing xlog files again, but we
> > don't use atexit for that anymore. A comment to box_atfork says that
> > box.coredump forks to write a core, but there's no box.coredump anymore.
> > There's also a comment mentioning box.cfg.background, but when we fork
> > that early there's no xlog file open.
> > 
> > To sum it up, atfork looks like a piece of legacy code. Let's get rid of
> > it now so as not to bother patching it later.
> 
> If anyone adds a fork any time in the future, xlogs will break
> silently,

Why is that?

> and it would be hard to catch in a test. 
> 
> If you wish to remove the dead code, please keep the atfork, but panic in it.

There's no point in keeping dead code around.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 02/10] vinyl: add a separate thread for vylog
  2019-05-18 18:39   ` [tarantool-patches] " Konstantin Osipov
@ 2019-05-20  8:17     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-20  8:17 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, May 18, 2019 at 09:39:58PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> 
> We need an infrastructure to run an opaque piece of code in any
> OS thread, not create a new thread any time we have an itch for
> refactoring.
> 
> If you wish to run vylog in an own thread, fine, WAL thread has
> plenty of capacity to host 10 vylog threads, please use it.

It doesn't have capacity to load vylog and I wrote that in the comment
below.

We have threads for vinyl writers and readers and that's okay, but one
more thread for vylog thread is suddenly a no-go. Is there a *technical*
reason why you're reluctant to add a new thread.

> 
> > Historically, we use the WAL thread for writing vylog files, because,
> > I guess, we didn't want to introduce a separate thread. However, that
> > design decision turned out to be quite dubious:
> > 
> >  - vy_log (vy_log.c) calls vy_log_writer (wal.c) while vy_log_writer
> >    calls back to vy_log. That is we have a piece of logic split crudely
> >    between two files, which makes the code difficult to follow and just
> >    looks ugly.
> > 
> >  - We can't make vy_log part of vy_env because of this relationship.
> >    In fact, vy_log is the last singleton in the whole vy implementation:
> >    everything else is wrapped neatly (well, not quite, but still) in
> >    vy_env struct.
> > 
> >  - We can't kill the infamous vy_log.latch, which is a prerequisite for
> >    transactional DDL. The latch is needed to sync between vy_log readers
> >    and writers. You see, currently we can't read vy_log in the same
> >    thread where we write it, which would eliminate the need for any kind
> >    of synchronization, because vy_log read is quite a heavy operation -
> >    it may stall WAL writes and thus badly affect latency. So we have to
> >    do it from a coio thread.
> > 
> > That being said, it's time to move vy_log writer back to where it
> > belongs - vy_log.c, separate thread. This patch does just that. It
> > doesn't patch the implementation to fix the flaws enumerated above -
> > this will be done by following patches.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction
  2019-05-18 18:47   ` [tarantool-patches] " Konstantin Osipov
@ 2019-05-20  8:27     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-20  8:27 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, May 18, 2019 at 09:47:15PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN),
> > then try to delete their files unless they are needed for recovery from
> > the checkpoint, and finally mark them as not needed in the vylog
> > (VY_LOG_FORGET_RUN). There's a potential race sitting here: the problem
> > is the garbage collector might kick in after files are dropped, but
> > before they are marked as not needed. If this happens, there will be
> > runs that have two VY_LOG_FORGET_RUN records, which will break recovery:
> > 
> >   Run XX is forgotten, but not registered
> > 
> > The following patches make the race more likely to happen so let's
> > eliminate it by making the garbage collector the only one who can mark
> > runs as not needed (i.e. write VY_LOG_FORGET_RUN record). There will
> > be no warnings, because the garbage collector silently ignores ENOENT
> > errors, see vy_gc().
> > 
> > Another good thing about this patch is that now we never yield inside
> > a vylog transaction, which makes it easier to remove the vylog latch
> > blocking implementation of transactional DDL.
> 
> Should be a separate commit with a test?

By removing VY_LOG_FORGET_RUN from compaction, we get this for free.
I don't see how this could be factored out into a separate patch.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-05-18 18:52   ` [tarantool-patches] " Konstantin Osipov
@ 2019-05-20  8:34     ` Vladimir Davydov
  2019-06-01  8:41     ` Konstantin Osipov
  1 sibling, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-05-20  8:34 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, May 18, 2019 at 09:52:31PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > So that we can use an allocator different from the region in vylog.
> 
> Please, let's simply always use malloc.

I did it this way so as to use lsregion for allocating vylog records in
the next patch. Are you suggesting to use malloc() instead? I guess
that's possible and that'd simplify things quite a bit. I just wanted to
use lock-free allocator for this. Not sure if it's really necessary
though, because vy_log_tx_write isn't a hot path.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers
  2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
                   ` (10 preceding siblings ...)
  2019-05-18 18:35 ` [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Konstantin Osipov
@ 2019-06-01  8:09 ` Konstantin Osipov
  11 siblings, 0 replies; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:09 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> DDL on_commit triggers may yield on vylog write, even though they don't
> really need to wait for the transaction to be flushed to disk, as we can
> recover it after restart. This blocks implementation of transactional
> DDL (see #4083), because even an empty index creation may yield and
> hence can't be executed transactionally along with other lightweight
> operations, such as space creation.
> 
> This patch reworks vylog implementation so that we don't need to yield
> in case the vylog transaction is non-discardable (i.e. stays in the
> memory buffer in case of failure). Since on_commit triggers issue only
> non-discardable transactions, this makes them non-yielding, as they
> should be.

Given we agreed on a better fix, and you could start working on it
soon (I mean really soon), I could review this patch stack as an
interim solution. It contains some good changes which I don't want
to go astray.

Please let me know if you still need a review. Some of the
comments follow on individual commits.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 01/10] box: zap atfork callback
  2019-05-18 18:37   ` [tarantool-patches] " Konstantin Osipov
  2019-05-20  8:13     ` Vladimir Davydov
@ 2019-06-01  8:16     ` Konstantin Osipov
  2019-06-06 10:04       ` Vladimir Davydov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:16 UTC (permalink / raw)
  To: tarantool-patches

* Konstantin Osipov <kostja@tarantool.org> [19/05/18 21:41]:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > box_atfork calls wal_atfork which in turn calls xlog_atfork for the wal
> > and vylog files. A comment to xlog_atfork says that it's necessary to
> > prevent atexit handlers in a child from closing xlog files again, but we
> > don't use atexit for that anymore. A comment to box_atfork says that
> > box.coredump forks to write a core, but there's no box.coredump anymore.
> > There's also a comment mentioning box.cfg.background, but when we fork
> > that early there's no xlog file open.
> > 
> > To sum it up, atfork looks like a piece of legacy code. Let's get rid of
> > it now so as not to bother patching it later.
> 
> If anyone adds a fork any time in the future, xlogs will break
> silently, and it would be hard to catch in a test. 
> 
> If you wish to remove the dead code, please keep the atfork, but panic in it.

As discussed in the channel, you can't panic in atfork handler
since there is an upcoming popen() patch. This all still looks
very fragile.
I read from vfork() man page that NPTL doesn't call
pthread_atfork() handlers in vfork,
while older implementation may. What is the behaviour on freebsd? 
If both systems don't invoke the handlers, I would keep the
panic() call, otherwise remove.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 02/10] vinyl: add a separate thread for vylog
  2019-05-17 14:52 ` [PATCH 02/10] vinyl: add a separate thread for vylog Vladimir Davydov
  2019-05-18 18:39   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-01  8:26   ` Konstantin Osipov
  2019-06-06 10:20     ` Vladimir Davydov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:26 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> Historically, we use the WAL thread for writing vylog files, because,
> I guess, we didn't want to introduce a separate thread. However, that
> design decision turned out to be quite dubious:

>  - vy_log (vy_log.c) calls vy_log_writer (wal.c) while vy_log_writer
>    calls back to vy_log. That is we have a piece of logic split crudely
>    between two files, which makes the code difficult to follow and just
>    looks ugly.

This is because we can not ship arbitrary logic into a thread
without the thread becoming aware of it. This can be easily fixed
with run_in_cord() function, which would take a cord pointer and a 
fiber function pointer with context, and run the fiber in a given
cord. A mature implementation would take more than just a function
pointer and a context, but some sort of runnable object, which
responds to cancel, exit, suspend and resume. This would be very
useful in a bunch of other places, not just for vy_log.

>  - We can't make vy_log part of vy_env because of this relationship.
>    In fact, vy_log is the last singleton in the whole vy implementation:
>    everything else is wrapped neatly (well, not quite, but still) in
>    vy_env struct.

See above. The wal thread doesn't have to know anything about
struct vy_log or struct wal, for that matter, it's just a
container of runnable objects. We should also be able to move any
runnable object along with its context from a thread to thread
by suspending it first and then resuming at another thread, as I
described.

> 
>  - We can't kill the infamous vy_log.latch, which is a prerequisite for
>    transactional DDL. The latch is needed to sync between vy_log readers
>    and writers. You see, currently we can't read vy_log in the same
>    thread where we write it, which would eliminate the need for any kind
>    of synchronization, because vy_log read is quite a heavy operation -
>    it may stall WAL writes and thus badly affect latency. So we have to
>    do it from a coio thread.

This would be the main reason for the change, however, I would
rather ensure vy_log readers and writers use entirely different
objects as contexts. We already have vy_recovery as vy log reader
context, what prevents you from populating it using an own
xlog object?

As discussed verbally, there is a way to shift entire vy_log
contents to the existing WAL log which would make our future life
much easier. 

Please feel free to accept the above comments and complete this
patch or switch to an entirely new approach (which may well
require the same housekeeping changes I am pushing for in this
review).

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-05-17 14:52 ` [PATCH 03/10] vinyl: move vylog recovery to vylog thread Vladimir Davydov
@ 2019-06-01  8:36   ` Konstantin Osipov
  2019-06-06 10:23     ` Vladimir Davydov
  0 siblings, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:36 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> We used coio, because vylog was written from a WAL thread, which
> shouldn't be used for such a heavy operation as vylog recovery.
> Now, we can move it to the dedicated vylog thread. This allows
> us to simplify rotation logic as well: now most of work is done
> from the same function (vy_log_rotate_f) executed by vylog thread,
> not scattered between coio and WAL, as it used to be.

Why do we need to lock out the scheduler while rotating the log in
the first place? 

This is not anywhere in the comments, and vy_log was renamed from
xctl so I got lost in the commit history.

It's a pity it's not reflected in the comment.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 04/10] vinyl: rework vylog transaction backlog implementation
  2019-05-17 14:52 ` [PATCH 04/10] vinyl: rework vylog transaction backlog implementation Vladimir Davydov
@ 2019-06-01  8:38   ` Konstantin Osipov
  2019-06-06 11:58     ` Vladimir Davydov
  0 siblings, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:38 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:

I like this approach, I would push it regardless of other patches.
The patch is LGTM.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction
  2019-05-17 14:52 ` [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction Vladimir Davydov
  2019-05-18 18:47   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-01  8:39   ` Konstantin Osipov
  2019-06-06 12:40     ` Vladimir Davydov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:39 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN),
> then try to delete their files unless they are needed for recovery from
> the checkpoint, and finally mark them as not needed in the vylog
> (VY_LOG_FORGET_RUN). There's a potential race sitting here: the problem
> is the garbage collector might kick in after files are dropped, but
> before they are marked as not needed. If this happens, there will be
> runs that have two VY_LOG_FORGET_RUN records, which will break recovery:
> 
>   Run XX is forgotten, but not registered
> 
> The following patches make the race more likely to happen so let's
> eliminate it by making the garbage collector the only one who can mark
> runs as not needed (i.e. write VY_LOG_FORGET_RUN record). There will
> be no warnings, because the garbage collector silently ignores ENOENT
> errors, see vy_gc().
> 
> Another good thing about this patch is that now we never yield inside
> a vylog transaction, which makes it easier to remove the vylog latch
> blocking implementation of transactional DDL.
> ---

This is also a good one, LGTM.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-05-18 18:52   ` [tarantool-patches] " Konstantin Osipov
  2019-05-20  8:34     ` Vladimir Davydov
@ 2019-06-01  8:41     ` Konstantin Osipov
  2019-06-10 15:28       ` Vladimir Davydov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:41 UTC (permalink / raw)
  To: tarantool-patches

* Konstantin Osipov <kostja@tarantool.org> [19/05/18 21:56]:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > So that we can use an allocator different from the region in vylog.
> 
> Please, let's simply always use malloc.
> 
> btw, sql code is shit. not a single comment for entire call trace
> of key_def_dump_parts() from there. I failed to understand whether
> this is a hot path or not.

I stand by this request.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [tarantool-patches] Re: [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer
  2019-05-17 14:52 ` [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer Vladimir Davydov
@ 2019-06-01  8:44   ` Konstantin Osipov
  2019-06-06 13:15     ` Vladimir Davydov
  0 siblings, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-01  8:44 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> The latch is needed solely to synchronize access to the transaction
> write buffer, which is shared by all transactions. We don't need it to
> sync vylog readers vs writers as everything is done from the same
> thread. So to get rid of the latch, which is a prerequisite for
> transactional DDL, as it makes even simply Vinyl DDL operations
> yielding, we need to rework the buffer management.
> 
> To achieve that, this patch introduces a separate object for each vylog
> transaction, which stores the list of records to be committed. Once a
> transaction is ready to be committed, it is sent to the vylog thread,
> which writes it to vylog and then wakes up the awaiting fiber. For
> non-discardable transactions, it doesn't wakeup any fiber, the fiber
> returns immediately and the transaction is freed automatically.
> 
> To allow that, transactions are now allocated on lsregion, which acts
> as a queue: new transactions are appended to the tail, committed
> transactions are popped from the head.

The same could be achieved with xlog savepoints that you
introduced earlier, if you allow multiple "savepoints" to be present 
in xlog tx, each of which could be rolled back independently (the
tail of the buffer would have to be moved on rollback).

I mean, honestly, if we're getting rid of the latch, there is no
point of having adding a separate thread on the way, it just takes
a little extra effort. And better yet, we could kill entire vy_log
altogether and store everything in xlog.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 01/10] box: zap atfork callback
  2019-06-01  8:16     ` Konstantin Osipov
@ 2019-06-06 10:04       ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 10:04 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:16:08AM +0300, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja@tarantool.org> [19/05/18 21:41]:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > > box_atfork calls wal_atfork which in turn calls xlog_atfork for the wal
> > > and vylog files. A comment to xlog_atfork says that it's necessary to
> > > prevent atexit handlers in a child from closing xlog files again, but we
> > > don't use atexit for that anymore. A comment to box_atfork says that
> > > box.coredump forks to write a core, but there's no box.coredump anymore.
> > > There's also a comment mentioning box.cfg.background, but when we fork
> > > that early there's no xlog file open.
> > > 
> > > To sum it up, atfork looks like a piece of legacy code. Let's get rid of
> > > it now so as not to bother patching it later.
> > 
> > If anyone adds a fork any time in the future, xlogs will break
> > silently, and it would be hard to catch in a test. 
> > 
> > If you wish to remove the dead code, please keep the atfork, but panic in it.
> 
> As discussed in the channel, you can't panic in atfork handler
> since there is an upcoming popen() patch. This all still looks
> very fragile.

Note, I didn't remove atfork handlers completely. I only removed the box
part, which closes xlog file descriptors. This doesn't make much sense
anyway, because there are lots of other file descriptors that need to be
closed (think of vinyl) so closing xlog and vylog only is just a half
measure. We should probably use O_CLOEXEC instead.

That's why I don't understand why you're so opposed to this particular
patch.

> I read from vfork() man page that NPTL doesn't call
> pthread_atfork() handlers in vfork,
> while older implementation may. What is the behaviour on freebsd? 
> If both systems don't invoke the handlers, I would keep the
> panic() call, otherwise remove.

Hmm, turns out we use fork() to spawn a pipe logger thread. And it may
be called after box.cfg from a C module to create a custom logger. So we
definitely can't panic() there.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 02/10] vinyl: add a separate thread for vylog
  2019-06-01  8:26   ` Konstantin Osipov
@ 2019-06-06 10:20     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 10:20 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:26:08AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > Historically, we use the WAL thread for writing vylog files, because,
> > I guess, we didn't want to introduce a separate thread. However, that
> > design decision turned out to be quite dubious:
> 
> >  - vy_log (vy_log.c) calls vy_log_writer (wal.c) while vy_log_writer
> >    calls back to vy_log. That is we have a piece of logic split crudely
> >    between two files, which makes the code difficult to follow and just
> >    looks ugly.
> 
> This is because we can not ship arbitrary logic into a thread
> without the thread becoming aware of it. This can be easily fixed
> with run_in_cord() function, which would take a cord pointer and a 
> fiber function pointer with context, and run the fiber in a given
> cord. A mature implementation would take more than just a function
> pointer and a context, but some sort of runnable object, which
> responds to cancel, exit, suspend and resume. This would be very
> useful in a bunch of other places, not just for vy_log.
> 
> >  - We can't make vy_log part of vy_env because of this relationship.
> >    In fact, vy_log is the last singleton in the whole vy implementation:
> >    everything else is wrapped neatly (well, not quite, but still) in
> >    vy_env struct.
> 
> See above. The wal thread doesn't have to know anything about
> struct vy_log or struct wal, for that matter, it's just a
> container of runnable objects. We should also be able to move any
> runnable object along with its context from a thread to thread
> by suspending it first and then resuming at another thread, as I
> described.
> 
> > 
> >  - We can't kill the infamous vy_log.latch, which is a prerequisite for
> >    transactional DDL. The latch is needed to sync between vy_log readers
> >    and writers. You see, currently we can't read vy_log in the same
> >    thread where we write it, which would eliminate the need for any kind
> >    of synchronization, because vy_log read is quite a heavy operation -
> >    it may stall WAL writes and thus badly affect latency. So we have to
> >    do it from a coio thread.
> 
> This would be the main reason for the change, however, I would
> rather ensure vy_log readers and writers use entirely different
> objects as contexts. We already have vy_recovery as vy log reader
> context, what prevents you from populating it using an own
> xlog object?

In order to make a snapshot of a vylog (when we rotate it), we need to
lock out all writers. Otherwise we might get an inconsistent view.
If vylog was a part of xlog, we would use memtx read-view, but since it
isn't, we use a latch. For vylog it's okay as writers tolerate stalls by
design. Using the same thread would make the synchronization much
easier.

> 
> As discussed verbally, there is a way to shift entire vy_log
> contents to the existing WAL log which would make our future life
> much easier. 

I'm looking into this. The main problem is backward compatibility as
we'll have to support both xlog and vylog. Also, it looks like we'll
need to scan xlog twice on recovery - first to restore vinyl index
structure, then to apply rows missing in the memory level (as we don't
want to replay rows that have already been dumped to disk).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-06-01  8:36   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-06 10:23     ` Vladimir Davydov
  2019-06-07 13:39       ` Konstantin Osipov
  2019-06-07 13:40       ` Konstantin Osipov
  0 siblings, 2 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 10:23 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:36:00AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > We used coio, because vylog was written from a WAL thread, which
> > shouldn't be used for such a heavy operation as vylog recovery.
> > Now, we can move it to the dedicated vylog thread. This allows
> > us to simplify rotation logic as well: now most of work is done
> > from the same function (vy_log_rotate_f) executed by vylog thread,
> > not scattered between coio and WAL, as it used to be.
> 
> Why do we need to lock out the scheduler while rotating the log in
> the first place? 

We rotate vylog by first reading the old vylog and forming a recovery
context, then dumping this recovery context to the new vylog. If a new
record appears in the old vylog in between, it will be missing in the
new vylog. That's why we lock out writers.

> 
> This is not anywhere in the comments, and vy_log was renamed from
> xctl so I got lost in the commit history.
> 
> It's a pity it's not reflected in the comment.

I can update it.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 04/10] vinyl: rework vylog transaction backlog implementation
  2019-06-01  8:38   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-06 11:58     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 11:58 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:38:38AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> 
> I like this approach, I would push it regardless of other patches.
> The patch is LGTM.

I'm afraid I can't push it without the previous patch. You see, I need
to flush the xlog buffer before loading vylog. I can't call xlog_flush
from vy_recovery_new unless it's running in the same thread as vylog
writer, because xlog struct shouldn't be used from different threads.
But without the previous patch vy_recovery_new runs in coio, not WAL.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction
  2019-06-01  8:39   ` Konstantin Osipov
@ 2019-06-06 12:40     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 12:40 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:39:24AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > After compacting runs, we first mark them as dropped (VY_LOG_DROP_RUN),
> > then try to delete their files unless they are needed for recovery from
> > the checkpoint, and finally mark them as not needed in the vylog
> > (VY_LOG_FORGET_RUN). There's a potential race sitting here: the problem
> > is the garbage collector might kick in after files are dropped, but
> > before they are marked as not needed. If this happens, there will be
> > runs that have two VY_LOG_FORGET_RUN records, which will break recovery:
> > 
> >   Run XX is forgotten, but not registered
> > 
> > The following patches make the race more likely to happen so let's
> > eliminate it by making the garbage collector the only one who can mark
> > runs as not needed (i.e. write VY_LOG_FORGET_RUN record). There will
> > be no warnings, because the garbage collector silently ignores ENOENT
> > errors, see vy_gc().
> > 
> > Another good thing about this patch is that now we never yield inside
> > a vylog transaction, which makes it easier to remove the vylog latch
> > blocking implementation of transactional DDL.
> > ---
> 
> This is also a good one, LGTM.

Pushed to master, 2.1, 1.10.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer
  2019-06-01  8:44   ` [tarantool-patches] " Konstantin Osipov
@ 2019-06-06 13:15     ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-06 13:15 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:44:36AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > The latch is needed solely to synchronize access to the transaction
> > write buffer, which is shared by all transactions. We don't need it to
> > sync vylog readers vs writers as everything is done from the same
> > thread. So to get rid of the latch, which is a prerequisite for
> > transactional DDL, as it makes even simply Vinyl DDL operations
> > yielding, we need to rework the buffer management.
> > 
> > To achieve that, this patch introduces a separate object for each vylog
> > transaction, which stores the list of records to be committed. Once a
> > transaction is ready to be committed, it is sent to the vylog thread,
> > which writes it to vylog and then wakes up the awaiting fiber. For
> > non-discardable transactions, it doesn't wakeup any fiber, the fiber
> > returns immediately and the transaction is freed automatically.
> > 
> > To allow that, transactions are now allocated on lsregion, which acts
> > as a queue: new transactions are appended to the tail, committed
> > transactions are popped from the head.
> 
> The same could be achieved with xlog savepoints that you
> introduced earlier, if you allow multiple "savepoints" to be present 
> in xlog tx, each of which could be rolled back independently (the
> tail of the buffer would have to be moved on rollback).

Before writing a transaction to the xlog buffer, we need to accumulate
its statements in a buffer in the tx thread. We can either malloc() them
and then free(), but I considered it to have too much overhead. So I
decided to allocate all vylog statements using the same allocator -
lsregion - acting as a FIFO. Possibly, I'm wrong about it, and using
malloc() is just fine.

> 
> I mean, honestly, if we're getting rid of the latch, there is no
> point of having adding a separate thread on the way, it just takes
> a little extra effort. And better yet, we could kill entire vy_log
> altogether and store everything in xlog.

We need to discuss that f2f, I think.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-06-06 10:23     ` Vladimir Davydov
@ 2019-06-07 13:39       ` Konstantin Osipov
  2019-06-10 15:24         ` Vladimir Davydov
  2019-06-07 13:40       ` Konstantin Osipov
  1 sibling, 1 reply; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-07 13:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/06/06 13:24]:
> > > We used coio, because vylog was written from a WAL thread, which
> > > shouldn't be used for such a heavy operation as vylog recovery.
> > > Now, we can move it to the dedicated vylog thread. This allows
> > > us to simplify rotation logic as well: now most of work is done
> > > from the same function (vy_log_rotate_f) executed by vylog thread,
> > > not scattered between coio and WAL, as it used to be.
> > 
> > Why do we need to lock out the scheduler while rotating the log in
> > the first place? 
> 
> We rotate vylog by first reading the old vylog and forming a recovery
> context, then dumping this recovery context to the new vylog. If a new
> record appears in the old vylog in between, it will be missing in the
> new vylog. That's why we lock out writers.

We have two layers of abstractions intermixed here. During
snapshotting, when we really rotate the vylog, no DDL can happen,
it's locked out. So no one can take the problematic latch
anyway. So there is, strictly speaking, no problem at all. But
since we're using a low level latch, and not a centralized
mechanism to lock out writers, we wouldn't know.

One option could be to append the writes to vylog which happen
during checkpointing to the vylog buffer, and not flush them to
the vylog file which is about-to-become-obsolete.

Anyway, I keep thinking that if you want to kill a latch, there is
a dozen of ways of killing it, not an own thread.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-06-06 10:23     ` Vladimir Davydov
  2019-06-07 13:39       ` Konstantin Osipov
@ 2019-06-07 13:40       ` Konstantin Osipov
  1 sibling, 0 replies; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-07 13:40 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/06/06 13:24]:
> > This is not anywhere in the comments, and vy_log was renamed from
> > xctl so I got lost in the commit history.
> > 
> > It's a pity it's not reflected in the comment.
> 
> I can update it.

Please.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 03/10] vinyl: move vylog recovery to vylog thread
  2019-06-07 13:39       ` Konstantin Osipov
@ 2019-06-10 15:24         ` Vladimir Davydov
  0 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-10 15:24 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Fri, Jun 07, 2019 at 04:39:54PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/06/06 13:24]:
> > > > We used coio, because vylog was written from a WAL thread, which
> > > > shouldn't be used for such a heavy operation as vylog recovery.
> > > > Now, we can move it to the dedicated vylog thread. This allows
> > > > us to simplify rotation logic as well: now most of work is done
> > > > from the same function (vy_log_rotate_f) executed by vylog thread,
> > > > not scattered between coio and WAL, as it used to be.
> > > 
> > > Why do we need to lock out the scheduler while rotating the log in
> > > the first place? 
> > 
> > We rotate vylog by first reading the old vylog and forming a recovery
> > context, then dumping this recovery context to the new vylog. If a new
> > record appears in the old vylog in between, it will be missing in the
> > new vylog. That's why we lock out writers.
> 
> We have two layers of abstractions intermixed here. During
> snapshotting, when we really rotate the vylog, no DDL can happen,
> it's locked out. So no one can take the problematic latch
> anyway.

Except compaction, which isn't locked out by checkpointing.

> So there is, strictly speaking, no problem at all. But
> since we're using a low level latch, and not a centralized
> mechanism to lock out writers, we wouldn't know.
> 
> One option could be to append the writes to vylog which happen
> during checkpointing to the vylog buffer, and not flush them to
> the vylog file which is about-to-become-obsolete.

We must flush those records to disk, otherwise we risk loosing data.

> 
> Anyway, I keep thinking that if you want to kill a latch, there is
> a dozen of ways of killing it, not an own thread.

What's so wrong about the new thread? Could you please give some insight
why we should avoid introducing a separate thread for vylog at all costs
at this point?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-06-01  8:41     ` Konstantin Osipov
@ 2019-06-10 15:28       ` Vladimir Davydov
  2019-06-16 14:57         ` Konstantin Osipov
  0 siblings, 1 reply; 40+ messages in thread
From: Vladimir Davydov @ 2019-06-10 15:28 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Jun 01, 2019 at 11:41:42AM +0300, Konstantin Osipov wrote:
> * Konstantin Osipov <kostja@tarantool.org> [19/05/18 21:56]:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > > So that we can use an allocator different from the region in vylog.
> > 
> > Please, let's simply always use malloc.
> > 
> > btw, sql code is shit. not a single comment for entire call trace
> > of key_def_dump_parts() from there. I failed to understand whether
> > this is a hot path or not.
> 
> I stand by this request.

Please take a look at my reply I sent you earlier. Quoting it here:

} I did it this way so as to use lsregion for allocating vylog records in
} the next patch. Are you suggesting to use malloc() instead? I guess
} that's possible and that'd simplify things quite a bit. I just wanted to
} use lock-free allocator for this. Not sure if it's really necessary
} though, because vy_log_tx_write isn't a hot path.

https://www.freelists.org/post/tarantool-patches/PATCH-0910-key-def-pass-alloc-callback-to-key-def-dump-parts,2

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts
  2019-06-10 15:28       ` Vladimir Davydov
@ 2019-06-16 14:57         ` Konstantin Osipov
  0 siblings, 0 replies; 40+ messages in thread
From: Konstantin Osipov @ 2019-06-16 14:57 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/06/10 18:33]:
> On Sat, Jun 01, 2019 at 11:41:42AM +0300, Konstantin Osipov wrote:
> > * Konstantin Osipov <kostja@tarantool.org> [19/05/18 21:56]:
> > > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/05/17 17:54]:
> > > > So that we can use an allocator different from the region in vylog.
> > > 
> > > Please, let's simply always use malloc.
> > > 
> > > btw, sql code is shit. not a single comment for entire call trace
> > > of key_def_dump_parts() from there. I failed to understand whether
> > > this is a hot path or not.
> > 
> > I stand by this request.
> 
> Please take a look at my reply I sent you earlier. Quoting it here:
> 
> } I did it this way so as to use lsregion for allocating vylog records in
> } the next patch. Are you suggesting to use malloc() instead? I guess
> } that's possible and that'd simplify things quite a bit. I just wanted to
> } use lock-free allocator for this. Not sure if it's really necessary
> } though, because vy_log_tx_write isn't a hot path.

It's totally OK to use malloc() for vy_log writes.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-06-16 14:57 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-17 14:52 [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Vladimir Davydov
2019-05-17 14:52 ` [PATCH 01/10] box: zap atfork callback Vladimir Davydov
2019-05-18 18:37   ` [tarantool-patches] " Konstantin Osipov
2019-05-20  8:13     ` Vladimir Davydov
2019-06-01  8:16     ` Konstantin Osipov
2019-06-06 10:04       ` Vladimir Davydov
2019-05-17 14:52 ` [PATCH 02/10] vinyl: add a separate thread for vylog Vladimir Davydov
2019-05-18 18:39   ` [tarantool-patches] " Konstantin Osipov
2019-05-20  8:17     ` Vladimir Davydov
2019-06-01  8:26   ` Konstantin Osipov
2019-06-06 10:20     ` Vladimir Davydov
2019-05-17 14:52 ` [PATCH 03/10] vinyl: move vylog recovery to vylog thread Vladimir Davydov
2019-06-01  8:36   ` [tarantool-patches] " Konstantin Osipov
2019-06-06 10:23     ` Vladimir Davydov
2019-06-07 13:39       ` Konstantin Osipov
2019-06-10 15:24         ` Vladimir Davydov
2019-06-07 13:40       ` Konstantin Osipov
2019-05-17 14:52 ` [PATCH 04/10] vinyl: rework vylog transaction backlog implementation Vladimir Davydov
2019-06-01  8:38   ` [tarantool-patches] " Konstantin Osipov
2019-06-06 11:58     ` Vladimir Davydov
2019-05-17 14:52 ` [PATCH 05/10] vinyl: don't purge deleted runs from vylog on compaction Vladimir Davydov
2019-05-18 18:47   ` [tarantool-patches] " Konstantin Osipov
2019-05-20  8:27     ` Vladimir Davydov
2019-06-01  8:39   ` Konstantin Osipov
2019-06-06 12:40     ` Vladimir Davydov
2019-05-17 14:52 ` [PATCH 06/10] vinyl: lock out compaction while checkpointing is in progress Vladimir Davydov
2019-05-17 14:52 ` [PATCH 07/10] vinyl: don't access last vylog signature outside vylog implementation Vladimir Davydov
2019-05-17 14:52 ` [PATCH 08/10] vinyl: zap ERRINJ_VY_LOG_FLUSH_DELAY Vladimir Davydov
2019-05-17 14:52 ` [PATCH 09/10] key_def: pass alloc callback to key_def_dump_parts Vladimir Davydov
2019-05-18 18:52   ` [tarantool-patches] " Konstantin Osipov
2019-05-20  8:34     ` Vladimir Davydov
2019-06-01  8:41     ` Konstantin Osipov
2019-06-10 15:28       ` Vladimir Davydov
2019-06-16 14:57         ` Konstantin Osipov
2019-05-17 14:52 ` [PATCH 10/10] vinyl: get rid of the latch protecting vylog buffer Vladimir Davydov
2019-06-01  8:44   ` [tarantool-patches] " Konstantin Osipov
2019-06-06 13:15     ` Vladimir Davydov
2019-05-18 18:35 ` [tarantool-patches] Re: [PATCH 00/10] vinyl: don't yield in DDL on_commit triggers Konstantin Osipov
2019-05-20  8:09   ` Vladimir Davydov
2019-06-01  8:09 ` Konstantin Osipov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox