Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH v2 00/11] vinyl: transaction throttling infrastructure
@ 2018-09-28 17:39 Vladimir Davydov
  2018-09-28 17:39 ` [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup Vladimir Davydov
                   ` (11 more replies)
  0 siblings, 12 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:39 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch set contains preparatory patches necessary for implementing
transaction throttling. It does some refactoring, implements basic
rate limiting based on dump bandwidth, and introduces quota consumer
priorities, which is necessary for compaction based throttling. The only
thing left is invent such a throttling policy that guarantees that
compaction always keeps up with dumps.

https://github.com/tarantool/tarantool/issues/1862
https://github.com/tarantool/tarantool/tree/dv/gh-1862-vy-throttling

Changes in v2:
 - Factor out load regulator logic out of quota and do some renames,
   as suggested by Kostja.

v1: https://www.freelists.org/post/tarantool-patches/PATCH-0011-vinyl-prepare-for-transaction-throttling

Vladimir Davydov (11):
  vinyl: add helper to start scheduler and enable quota on startup
  vinyl: factor load regulator out of quota
  vinyl: minor refactoring of quota methods
  vinyl: move transaction size sanity check to quota
  vinyl: implement quota wait queue without fiber_cond
  vinyl: enable quota upon recovery completion explicitly
  vinyl: zap vy_env::memory, read_threads, and write_threads
  vinyl: do not try to trigger dump in regulator if already in progress
  vinyl: do not account zero dump bandwidth
  vinyl: implement basic transaction throttling
  vinyl: introduce quota consumer priorities

 src/box/CMakeLists.txt             |   1 +
 src/box/vinyl.c                    | 133 +++++++-------
 src/box/vy_quota.c                 | 357 +++++++++++++++++++++----------------
 src/box/vy_quota.h                 | 231 ++++++++++++++++++------
 src/box/vy_regulator.c             | 268 ++++++++++++++++++++++++++++
 src/box/vy_regulator.h             | 153 ++++++++++++++++
 src/box/vy_run.c                   |  12 +-
 src/box/vy_run.h                   |  17 +-
 test/unit/vy_point_lookup.c        |   2 +-
 test/vinyl/errinj.result           |   4 +-
 test/vinyl/errinj.test.lua         |   4 +-
 test/vinyl/info.result             |  14 +-
 test/vinyl/info.test.lua           |   8 +-
 test/vinyl/quota.result            |  26 +--
 test/vinyl/quota.test.lua          |  26 +--
 test/vinyl/quota_timeout.result    |   6 +-
 test/vinyl/quota_timeout.test.lua  |   6 +-
 test/vinyl/recovery_quota.result   |   6 +-
 test/vinyl/recovery_quota.test.lua |   6 +-
 test/vinyl/suite.ini               |   2 +-
 test/vinyl/throttle.result         | 102 +++++++++++
 test/vinyl/throttle.test.lua       |  54 ++++++
 22 files changed, 1093 insertions(+), 345 deletions(-)
 create mode 100644 src/box/vy_regulator.c
 create mode 100644 src/box/vy_regulator.h
 create mode 100644 test/vinyl/throttle.result
 create mode 100644 test/vinyl/throttle.test.lua

-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
@ 2018-09-28 17:39 ` Vladimir Davydov
  2018-09-29  4:37   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:39 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

There are three places where we start the scheduler fiber and enable the
configured memory quota limit: local bootstrap, remote bootstrap, and
local recovery completion. I'm planning to add more code there so let's
factor it out now.
---
 src/box/vinyl.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 803f32d5..67b9c589 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2565,6 +2565,17 @@ vy_env_delete(struct vy_env *e)
 	free(e);
 }
 
+/**
+ * Called upon local recovery completion to enable memory quota
+ * and start the scheduler.
+ */
+static void
+vy_env_complete_recovery(struct vy_env *env)
+{
+	vy_scheduler_start(&env->scheduler);
+	vy_quota_set_limit(&env->quota, env->memory);
+}
+
 struct vinyl_engine *
 vinyl_engine_new(const char *dir, size_t memory,
 		 int read_threads, int write_threads, bool force_recovery)
@@ -2722,8 +2733,7 @@ vinyl_engine_bootstrap(struct engine *engine)
 	assert(e->status == VINYL_OFFLINE);
 	if (vy_log_bootstrap() != 0)
 		return -1;
-	vy_scheduler_start(&e->scheduler);
-	vy_quota_set_limit(&e->quota, e->memory);
+	vy_env_complete_recovery(e);
 	e->status = VINYL_ONLINE;
 	return 0;
 }
@@ -2757,8 +2767,7 @@ vinyl_engine_begin_initial_recovery(struct engine *engine,
 	} else {
 		if (vy_log_bootstrap() != 0)
 			return -1;
-		vy_scheduler_start(&e->scheduler);
-		vy_quota_set_limit(&e->quota, e->memory);
+		vy_env_complete_recovery(e);
 		e->status = VINYL_INITIAL_RECOVERY_REMOTE;
 	}
 	return 0;
@@ -2801,8 +2810,7 @@ vinyl_engine_end_recovery(struct engine *engine)
 		vy_recovery_delete(e->recovery);
 		e->recovery = NULL;
 		e->recovery_vclock = NULL;
-		vy_scheduler_start(&e->scheduler);
-		vy_quota_set_limit(&e->quota, e->memory);
+		vy_env_complete_recovery(e);
 		break;
 	case VINYL_FINAL_RECOVERY_REMOTE:
 		break;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
  2018-09-28 17:39 ` [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:00   ` [tarantool-patches] " Konstantin Osipov
                     ` (2 more replies)
  2018-09-28 17:40 ` [PATCH v2 03/11] vinyl: minor refactoring of quota methods Vladimir Davydov
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Turned out that throttling isn't going to be as simple as maintaining
the write rate below the estimated dump bandwidth, because we also need
to take into account whether compaction keeps up with dumps. Tracking
compaction progress isn't a trivial task and mixing it in a module
responsible for resource limiting, which vy_quota is, doesn't seem to be
a good idea. Let's factor out the related code into a separate module
and call it vy_regulator. Currently, the new module only keeps track of
the write rate and the dump bandwidth and sets the memory watermark
accordingly, but soon we will extend it to configure throttling as well.

Since write rate and dump bandwidth are now a part of the regulator
subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
because memory usage is reported under 'memory.level0' while the limit
can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
'write_rate', because the latter seems to be a more appropriate name.

Needed for #1862
---
 src/box/CMakeLists.txt             |   1 +
 src/box/vinyl.c                    |  48 ++++++---
 src/box/vy_quota.c                 | 133 +-----------------------
 src/box/vy_quota.h                 |  59 +----------
 src/box/vy_regulator.c             | 206 +++++++++++++++++++++++++++++++++++++
 src/box/vy_regulator.h             | 146 ++++++++++++++++++++++++++
 test/vinyl/errinj.result           |   4 +-
 test/vinyl/errinj.test.lua         |   4 +-
 test/vinyl/info.result             |  14 +--
 test/vinyl/info.test.lua           |   8 +-
 test/vinyl/quota.result            |  26 ++---
 test/vinyl/quota.test.lua          |  26 ++---
 test/vinyl/quota_timeout.result    |   6 +-
 test/vinyl/quota_timeout.test.lua  |   6 +-
 test/vinyl/recovery_quota.result   |   6 +-
 test/vinyl/recovery_quota.test.lua |   6 +-
 16 files changed, 438 insertions(+), 261 deletions(-)
 create mode 100644 src/box/vy_regulator.c
 create mode 100644 src/box/vy_regulator.h

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index cab6a227..67750898 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -86,6 +86,7 @@ add_library(box STATIC
     vy_history.c
     vy_read_set.c
     vy_scheduler.c
+    vy_regulator.c
     vy_quota.c
     request.c
     space.c
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 67b9c589..1c0192cb 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -43,6 +43,7 @@
 #include "vy_point_lookup.h"
 #include "vy_quota.h"
 #include "vy_scheduler.h"
+#include "vy_regulator.h"
 #include "vy_stat.h"
 
 #include <stdbool.h>
@@ -115,6 +116,8 @@ struct vy_env {
 	struct vy_mem_env mem_env;
 	/** Scheduler */
 	struct vy_scheduler scheduler;
+	/** Load regulator. */
+	struct vy_regulator regulator;
 	/** Local recovery context. */
 	struct vy_recovery *recovery;
 	/** Local recovery vclock. */
@@ -249,17 +252,15 @@ static struct trigger on_replace_vinyl_deferred_delete;
 /** {{{ Introspection */
 
 static void
-vy_info_append_quota(struct vy_env *env, struct info_handler *h)
+vy_info_append_regulator(struct vy_env *env, struct info_handler *h)
 {
-	struct vy_quota *q = &env->quota;
+	struct vy_regulator *r = &env->regulator;
 
-	info_table_begin(h, "quota");
-	info_append_int(h, "used", q->used);
-	info_append_int(h, "limit", q->limit);
-	info_append_int(h, "watermark", q->watermark);
-	info_append_int(h, "use_rate", q->use_rate);
-	info_append_int(h, "dump_bandwidth", q->dump_bw);
-	info_table_end(h); /* quota */
+	info_table_begin(h, "regulator");
+	info_append_int(h, "watermark", r->watermark);
+	info_append_int(h, "write_rate", r->write_rate);
+	info_append_int(h, "dump_bandwidth", r->dump_bw);
+	info_table_end(h); /* regulator */
 }
 
 static void
@@ -328,10 +329,10 @@ vinyl_engine_stat(struct vinyl_engine *vinyl, struct info_handler *h)
 	struct vy_env *env = vinyl->env;
 
 	info_begin(h);
-	vy_info_append_quota(env, h);
 	vy_info_append_tx(env, h);
 	vy_info_append_memory(env, h);
 	vy_info_append_disk(env, h);
+	vy_info_append_regulator(env, h);
 	info_end(h);
 }
 
@@ -2366,6 +2367,7 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	assert(mem_used_after >= mem_used_before);
 	vy_quota_adjust(&env->quota, tx->write_size,
 			mem_used_after - mem_used_before);
+	vy_regulator_check_watermark(&env->regulator);
 	return rc;
 }
 
@@ -2390,6 +2392,7 @@ vinyl_engine_commit(struct engine *engine, struct txn *txn)
 	assert(mem_used_after >= mem_used_before);
 	/* We can't abort the transaction at this point, use force. */
 	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_regulator_check_watermark(&env->regulator);
 
 	txn->engine_tx = NULL;
 	if (!txn->is_autocommit)
@@ -2440,6 +2443,13 @@ static void
 vy_env_quota_exceeded_cb(struct vy_quota *quota)
 {
 	struct vy_env *env = container_of(quota, struct vy_env, quota);
+	vy_regulator_no_memory(&env->regulator);
+}
+
+static void
+vy_env_trigger_dump_cb(struct vy_regulator *regulator)
+{
+	struct vy_env *env = container_of(regulator, struct vy_env, regulator);
 
 	if (lsregion_used(&env->mem_env.allocator) == 0) {
 		/*
@@ -2468,7 +2478,7 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	assert(mem_used_after <= mem_used_before);
 	size_t mem_dumped = mem_used_before - mem_used_after;
 	vy_quota_release(quota, mem_dumped);
-	vy_quota_update_dump_bandwidth(quota, mem_dumped, dump_duration);
+	vy_regulator_dump_complete(&env->regulator, mem_dumped, dump_duration);
 	say_info("dumped %zu bytes in %.1f sec", mem_dumped, dump_duration);
 }
 
@@ -2520,8 +2530,9 @@ vy_env_new(const char *path, size_t memory,
 			      vy_squash_schedule, e) != 0)
 		goto error_lsm_env;
 
-	if (vy_quota_create(&e->quota, vy_env_quota_exceeded_cb) != 0)
-		goto error_quota;
+	vy_quota_create(&e->quota, vy_env_quota_exceeded_cb);
+	vy_regulator_create(&e->regulator, &e->quota,
+			    vy_env_trigger_dump_cb);
 
 	struct slab_cache *slab_cache = cord_slab_cache();
 	mempool_create(&e->iterator_pool, slab_cache,
@@ -2530,8 +2541,7 @@ vy_env_new(const char *path, size_t memory,
 	vy_run_env_create(&e->run_env);
 	vy_log_init(e->path);
 	return e;
-error_quota:
-	vy_lsm_env_destroy(&e->lsm_env);
+
 error_lsm_env:
 	vy_mem_env_destroy(&e->mem_env);
 	vy_scheduler_destroy(&e->scheduler);
@@ -2548,6 +2558,7 @@ error_path:
 static void
 vy_env_delete(struct vy_env *e)
 {
+	vy_regulator_destroy(&e->regulator);
 	vy_scheduler_destroy(&e->scheduler);
 	vy_squash_queue_delete(e->squash_queue);
 	tx_manager_delete(e->xm);
@@ -2574,6 +2585,7 @@ vy_env_complete_recovery(struct vy_env *env)
 {
 	vy_scheduler_start(&env->scheduler);
 	vy_quota_set_limit(&env->quota, env->memory);
+	vy_regulator_start(&env->regulator);
 }
 
 struct vinyl_engine *
@@ -2651,7 +2663,7 @@ vinyl_engine_set_snap_io_rate_limit(struct vinyl_engine *vinyl, double limit)
 {
 	int64_t limit_in_bytes = limit * 1024 * 1024;
 	vinyl->env->run_env.snap_io_rate_limit = limit_in_bytes;
-	vy_quota_reset_dump_bandwidth(&vinyl->env->quota, limit_in_bytes);
+	vy_regulator_reset_dump_bw(&vinyl->env->regulator, limit_in_bytes);
 }
 
 /** }}} Environment */
@@ -3194,6 +3206,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
 	vy_quota_adjust(&env->quota, reserved, used);
+	vy_regulator_check_watermark(&env->regulator);
 	return rc;
 }
 
@@ -3515,6 +3528,7 @@ vy_squash_process(struct vy_squash *squash)
 		vy_mem_commit_stmt(mem, region_stmt);
 		vy_quota_force_use(&env->quota,
 				   mem_used_after - mem_used_before);
+		vy_regulator_check_watermark(&env->regulator);
 	}
 	return rc;
 }
@@ -3989,6 +4003,7 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_regulator_check_watermark(&env->regulator);
 	vy_quota_wait(&env->quota);
 	return rc;
 }
@@ -4419,6 +4434,7 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_regulator_check_watermark(&env->regulator);
 
 	tuple_unref(delete);
 	if (rc != 0)
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 51f0ba71..e6d22348 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -32,42 +32,13 @@
 
 #include <assert.h>
 #include <stddef.h>
-#include <stdint.h>
-#include <math.h>
 #include <tarantool_ev.h>
 
-#include "diag.h"
 #include "fiber.h"
 #include "fiber_cond.h"
 #include "say.h"
-#include "histogram.h"
 #include "trivia/util.h"
 
-enum {
-	/**
-	 * Time interval between successive updates of
-	 * quota watermark and use rate, in seconds.
-	 */
-	VY_QUOTA_UPDATE_INTERVAL = 1,
-	/**
-	 * Period of time over which the quota use rate
-	 * is averaged, in seconds.
-	 */
-	VY_QUOTA_RATE_AVG_PERIOD = 5,
-};
-
-/*
- * Until we dump anything, assume bandwidth to be 10 MB/s,
- * which should be fine for initial guess.
- */
-static const size_t VY_DEFAULT_DUMP_BANDWIDTH = 10 * 1024 * 1024;
-
-/**
- * Histogram percentile used for estimating dump bandwidth.
- * For details see the comment to vy_quota::dump_bw_hist.
- */
-enum { VY_DUMP_BANDWIDTH_PCT = 10 };
-
 /**
  * Returns true if the quota limit is exceeded and so consumers
  * have to wait.
@@ -78,84 +49,19 @@ vy_quota_is_exceeded(struct vy_quota *q)
 	return q->used > q->limit;
 }
 
-static void
-vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
-{
-	(void)loop;
-	(void)events;
-
-	struct vy_quota *q = timer->data;
-
-	/*
-	 * Update the quota use rate with the new measurement.
-	 */
-	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
-				      (double)VY_QUOTA_RATE_AVG_PERIOD);
-	q->use_rate = (1 - weight) * q->use_rate +
-		weight * q->use_curr / VY_QUOTA_UPDATE_INTERVAL;
-	q->use_curr = 0;
-
-	/*
-	 * Due to log structured nature of the lsregion allocator,
-	 * which is used for allocating statements, we cannot free
-	 * memory in chunks, only all at once. Therefore we should
-	 * configure the watermark so that by the time we hit the
-	 * limit, all memory have been dumped, i.e.
-	 *
-	 *   limit - watermark      watermark
-	 *   ----------------- = --------------
-	 *        use_rate       dump_bandwidth
-	 */
-	q->watermark = ((double)q->limit * q->dump_bw /
-			(q->dump_bw + q->use_rate + 1));
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
-}
-
-int
+void
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 {
-	enum { KB = 1024, MB = KB * KB };
-	static int64_t dump_bandwidth_buckets[] = {
-		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB, 600 * KB,
-		700 * KB, 800 * KB, 900 * KB,   1 * MB,   2 * MB,   3 * MB,
-		  4 * MB,   5 * MB,   6 * MB,   7 * MB,   8 * MB,   9 * MB,
-		 10 * MB,  15 * MB,  20 * MB,  25 * MB,  30 * MB,  35 * MB,
-		 40 * MB,  45 * MB,  50 * MB,  55 * MB,  60 * MB,  65 * MB,
-		 70 * MB,  75 * MB,  80 * MB,  85 * MB,  90 * MB,  95 * MB,
-		100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
-		700 * MB, 800 * MB, 900 * MB,
-	};
-
-	q->dump_bw_hist = histogram_new(dump_bandwidth_buckets,
-					lengthof(dump_bandwidth_buckets));
-	if (q->dump_bw_hist == NULL) {
-		diag_set(OutOfMemory, 0, "histogram_new",
-			 "dump bandwidth histogram");
-		return -1;
-	}
-
 	q->limit = SIZE_MAX;
-	q->watermark = SIZE_MAX;
 	q->used = 0;
-	q->use_curr = 0;
-	q->use_rate = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
-	q->dump_bw = VY_DEFAULT_DUMP_BANDWIDTH;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	fiber_cond_create(&q->cond);
-	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
-		      VY_QUOTA_UPDATE_INTERVAL);
-	q->timer.data = q;
-	ev_timer_start(loop(), &q->timer);
-	return 0;
 }
 
 void
 vy_quota_destroy(struct vy_quota *q)
 {
-	ev_timer_stop(loop(), &q->timer);
-	histogram_delete(q->dump_bw_hist);
 	fiber_cond_broadcast(&q->cond);
 	fiber_cond_destroy(&q->cond);
 }
@@ -163,41 +69,17 @@ vy_quota_destroy(struct vy_quota *q)
 void
 vy_quota_set_limit(struct vy_quota *q, size_t limit)
 {
-	q->limit = q->watermark = limit;
+	q->limit = limit;
 	if (q->used >= limit)
 		q->quota_exceeded_cb(q);
 	fiber_cond_signal(&q->cond);
 }
 
 void
-vy_quota_update_dump_bandwidth(struct vy_quota *q, size_t size,
-			       double duration)
-{
-	if (duration > 0) {
-		histogram_collect(q->dump_bw_hist, size / duration);
-		/*
-		 * To avoid unpredictably long stalls, we need to
-		 * know the worst (smallest) dump bandwidth so use
-		 * a lower-bound percentile estimate.
-		 */
-		q->dump_bw = histogram_percentile_lower(q->dump_bw_hist,
-							VY_DUMP_BANDWIDTH_PCT);
-	}
-}
-
-void
-vy_quota_reset_dump_bandwidth(struct vy_quota *q, size_t max)
-{
-	histogram_reset(q->dump_bw_hist);
-	q->dump_bw = MIN(VY_DEFAULT_DUMP_BANDWIDTH, max);
-}
-
-void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
-	q->use_curr += size;
-	if (q->used >= q->watermark)
+	if (q->used >= q->limit)
 		q->quota_exceeded_cb(q);
 }
 
@@ -213,7 +95,6 @@ int
 vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 {
 	q->used += size;
-	q->use_curr += size;
 	if (vy_quota_is_exceeded(q)) {
 		/* Wait for quota. */
 		double start_time = ev_monotonic_now(loop());
@@ -222,11 +103,9 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		do {
 			q->quota_exceeded_cb(q);
 			q->used -= size;
-			q->use_curr -= size;
 			if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
 				return -1; /* timed out */
 			q->used += size;
-			q->use_curr += size;
 		} while (vy_quota_is_exceeded(q));
 
 		double wait_time = ev_monotonic_now(loop()) - start_time;
@@ -240,8 +119,6 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		 */
 		fiber_cond_signal(&q->cond);
 	}
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
 	return 0;
 }
 
@@ -252,10 +129,6 @@ vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
 		size_t excess = reserved - used;
 		assert(q->used >= excess);
 		q->used -= excess;
-		if (q->use_curr >= excess)
-			q->use_curr -= excess;
-		else /* was reset by timeout */
-			q->use_curr = 0;
 		fiber_cond_signal(&q->cond);
 	}
 	if (reserved < used)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 9ce53fc4..59fe075f 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -40,7 +40,6 @@ extern "C" {
 #endif /* defined(__cplusplus) */
 
 struct vy_quota;
-struct histogram;
 
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
@@ -55,12 +54,6 @@ struct vy_quota {
 	 * throttled until memory is reclaimed.
 	 */
 	size_t limit;
-	/**
-	 * Memory watermark. Exceeding it does not result in
-	 * throttling new transactions, but it does trigger
-	 * background memory reclaim.
-	 */
-	size_t watermark;
 	/** Current memory consumption. */
 	size_t used;
 	/**
@@ -74,40 +67,13 @@ struct vy_quota {
 	 */
 	struct fiber_cond cond;
 	/**
-	 * Called when quota is consumed if used >= watermark.
+	 * Called if the limit is hit when quota is consumed.
 	 * It is supposed to trigger memory reclaim.
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
-	/** Timer for updating quota watermark. */
-	ev_timer timer;
-	/**
-	 * Amount of quota used since the last
-	 * invocation of the quota timer callback.
-	 */
-	size_t use_curr;
-	/**
-	 * Quota use rate, in bytes per second.
-	 * Calculated as exponentially weighted
-	 * moving average of use_curr.
-	 */
-	size_t use_rate;
-	/** Current dump bandwidth estimate. */
-	size_t dump_bw;
-	/**
-	 * Dump bandwidth is needed for calculating the quota watermark.
-	 * The higher the bandwidth, the later we can start dumping w/o
-	 * suffering from transaction throttling. So we want to be very
-	 * conservative about estimating the bandwidth.
-	 *
-	 * To make sure we don't overestimate it, we maintain a
-	 * histogram of all observed measurements and assume the
-	 * bandwidth to be equal to the 10th percentile, i.e. the
-	 * best result among 10% worst measurements.
-	 */
-	struct histogram *dump_bw_hist;
 };
 
-int
+void
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
 
 void
@@ -120,27 +86,6 @@ vy_quota_destroy(struct vy_quota *q);
 void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
-/** Return dump bandwidth. */
-size_t
-vy_quota_dump_bandwidth(struct vy_quota *q);
-
-/**
- * Update dump bandwidth.
- *
- * @size: size of dumped memory.
- * @duration: how long memory dump took.
- */
-void
-vy_quota_update_dump_bandwidth(struct vy_quota *q, size_t size,
-			       double duration);
-
-/**
- * Reset dump bandwidth histogram and update initial estimate.
- * Called when box.cfg.snap_io_rate_limit is updated.
- */
-void
-vy_quota_reset_dump_bandwidth(struct vy_quota *q, size_t max);
-
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
new file mode 100644
index 00000000..3bd4fee4
--- /dev/null
+++ b/src/box/vy_regulator.c
@@ -0,0 +1,206 @@
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "vy_regulator.h"
+
+#include <math.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <tarantool_ev.h>
+
+#include "fiber.h"
+#include "histogram.h"
+#include "say.h"
+#include "trivia/util.h"
+
+#include "vy_quota.h"
+
+/**
+ * Regulator timer period, in seconds.
+ */
+static const double VY_REGULATOR_TIMER_PERIOD = 1;
+
+/**
+ * Time window over which the write rate is averaged,
+ * in seconds.
+ */
+static const double VY_WRITE_RATE_AVG_WIN = 5;
+
+/**
+ * Histogram percentile used for estimating dump bandwidth.
+ * For details see the comment to vy_regulator::dump_bw_hist.
+ */
+static const int VY_DUMP_BW_PCT = 10;
+
+/*
+ * Until we dump anything, assume bandwidth to be 10 MB/s,
+ * which should be fine for initial guess.
+ */
+static const size_t VY_DUMP_BW_DEFAULT = 10 * 1024 * 1024;
+
+static void
+vy_regulator_update_write_rate(struct vy_regulator *regulator)
+{
+	size_t used_curr = regulator->quota->used;
+	size_t used_last = regulator->used_last;
+
+	/*
+	 * Memory can be dumped between two subsequent timer
+	 * callback invocations, in which case memory usage
+	 * will decrease. Ignore such observations - it's not
+	 * a big deal, because dump is a rare event.
+	 */
+	if (used_curr < used_last)
+		return;
+
+	size_t rate_avg = regulator->write_rate;
+	size_t rate_curr = (used_curr - used_last) / VY_REGULATOR_TIMER_PERIOD;
+
+	double weight = 1 - exp(-VY_REGULATOR_TIMER_PERIOD /
+				VY_WRITE_RATE_AVG_WIN);
+	rate_avg = (1 - weight) * rate_avg + weight * rate_curr;
+
+	regulator->write_rate = rate_avg;
+	regulator->used_last = used_curr;
+}
+
+static void
+vy_regulator_update_watermark(struct vy_regulator *regulator)
+{
+	struct vy_quota *quota = regulator->quota;
+
+	/*
+	 * Due to log structured nature of the lsregion allocator,
+	 * which is used for allocating statements, we cannot free
+	 * memory in chunks, only all at once. Therefore we should
+	 * configure the watermark so that by the time we hit the
+	 * limit, all memory have been dumped, i.e.
+	 *
+	 *   limit - watermark      watermark
+	 *   ----------------- = --------------
+	 *       write_rate      dump_bandwidth
+	 */
+	regulator->watermark = (double)quota->limit * regulator->dump_bw /
+			(regulator->dump_bw + regulator->write_rate + 1);
+}
+
+static void
+vy_regulator_timer_cb(ev_loop *loop, ev_timer *timer, int events)
+{
+	(void)loop;
+	(void)events;
+
+	struct vy_regulator *regulator = timer->data;
+
+	vy_regulator_update_write_rate(regulator);
+	vy_regulator_update_watermark(regulator);
+	vy_regulator_check_watermark(regulator);
+}
+
+void
+vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
+		    vy_trigger_dump_f trigger_dump_cb)
+{
+	enum { KB = 1024, MB = KB * KB };
+	static int64_t dump_bw_buckets[] = {
+		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB, 600 * KB,
+		700 * KB, 800 * KB, 900 * KB,   1 * MB,   2 * MB,   3 * MB,
+		  4 * MB,   5 * MB,   6 * MB,   7 * MB,   8 * MB,   9 * MB,
+		 10 * MB,  15 * MB,  20 * MB,  25 * MB,  30 * MB,  35 * MB,
+		 40 * MB,  45 * MB,  50 * MB,  55 * MB,  60 * MB,  65 * MB,
+		 70 * MB,  75 * MB,  80 * MB,  85 * MB,  90 * MB,  95 * MB,
+		100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
+		700 * MB, 800 * MB, 900 * MB,
+	};
+	regulator->dump_bw_hist = histogram_new(dump_bw_buckets,
+						lengthof(dump_bw_buckets));
+	if (regulator->dump_bw_hist == NULL)
+		panic("failed to allocate dump bandwidth histogram");
+
+	regulator->quota = quota;
+	regulator->trigger_dump_cb = trigger_dump_cb;
+	ev_timer_init(&regulator->timer, vy_regulator_timer_cb, 0,
+		      VY_REGULATOR_TIMER_PERIOD);
+	regulator->timer.data = regulator;
+	regulator->watermark = SIZE_MAX;
+	regulator->write_rate = 0;
+	regulator->dump_bw = VY_DUMP_BW_DEFAULT;
+}
+
+void
+vy_regulator_start(struct vy_regulator *regulator)
+{
+	ev_timer_start(loop(), &regulator->timer);
+}
+
+void
+vy_regulator_destroy(struct vy_regulator *regulator)
+{
+	ev_timer_stop(loop(), &regulator->timer);
+	histogram_delete(regulator->dump_bw_hist);
+}
+
+void
+vy_regulator_no_memory(struct vy_regulator *regulator)
+{
+	regulator->trigger_dump_cb(regulator);
+}
+
+void
+vy_regulator_check_watermark(struct vy_regulator *regulator)
+{
+	if (regulator->quota->used >= regulator->watermark)
+		regulator->trigger_dump_cb(regulator);
+}
+
+void
+vy_regulator_dump_complete(struct vy_regulator *regulator,
+			   size_t mem_dumped, double dump_duration)
+{
+	if (dump_duration > 0) {
+		histogram_collect(regulator->dump_bw_hist,
+				  mem_dumped / dump_duration);
+		/*
+		 * To avoid unpredictably long stalls caused by
+		 * mispredicting dump time duration, we need to
+		 * know the worst (smallest) dump bandwidth so
+		 * use a lower-bound percentile estimate.
+		 */
+		regulator->dump_bw = histogram_percentile_lower(
+			regulator->dump_bw_hist, VY_DUMP_BW_PCT);
+	}
+}
+
+void
+vy_regulator_reset_dump_bw(struct vy_regulator *regulator, size_t max)
+{
+	histogram_reset(regulator->dump_bw_hist);
+	regulator->dump_bw = MIN(VY_DUMP_BW_DEFAULT, max);
+}
diff --git a/src/box/vy_regulator.h b/src/box/vy_regulator.h
new file mode 100644
index 00000000..9336adc9
--- /dev/null
+++ b/src/box/vy_regulator.h
@@ -0,0 +1,146 @@
+#ifndef INCLUDES_TARANTOOL_BOX_VY_REGULATOR_H
+#define INCLUDES_TARANTOOL_BOX_VY_REGULATOR_H
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stddef.h>
+#include <tarantool_ev.h>
+
+#if defined(__cplusplus)
+extern "C" {
+#endif /* defined(__cplusplus) */
+
+struct histogram;
+struct vy_quota;
+struct vy_regulator;
+
+typedef void
+(*vy_trigger_dump_f)(struct vy_regulator *regulator);
+
+/**
+ * The regulator is supposed to keep track of vinyl memory usage
+ * and dump/compaction progress and adjust transaction write rate
+ * accordingly.
+ */
+struct vy_regulator {
+	/**
+	 * Pointer to a quota object that is used to control
+	 * memory usage.
+	 */
+	struct vy_quota *quota;
+	/**
+	 * Called when the regulator detects that memory usage
+	 * exceeds the computed watermark. Supposed to trigger
+	 * memory dump.
+	 */
+	vy_trigger_dump_f trigger_dump_cb;
+	/**
+	 * Periodic timer that updates the memory watermark
+	 * basing on accumulated statistics.
+	 */
+	ev_timer timer;
+	/**
+	 * Memory watermark. Exceeding it does not result in
+	 * throttling new transactions, but it does trigger
+	 * background memory reclaim.
+	 */
+	size_t watermark;
+	/**
+	 * Average rate at which transactions are writing to
+	 * the database, in bytes per second.
+	 */
+	size_t write_rate;
+	/**
+	 * Amount of memory that was used when the timer was
+	 * executed last time. Needed to update @write_rate.
+	 */
+	size_t used_last;
+	/**
+	 * Current dump bandwidth estimate, in bytes per second.
+	 * See @dump_bw_hist for more details.
+	 */
+	size_t dump_bw;
+	/**
+	 * Dump bandwidth is needed for calculating the watermark.
+	 * The higher the bandwidth, the later we can start dumping
+	 * w/o suffering from transaction throttling. So we want to
+	 * be very conservative about estimating the bandwidth.
+	 *
+	 * To make sure we don't overestimate it, we maintain a
+	 * histogram of all observed measurements and assume the
+	 * bandwidth to be equal to the 10th percentile, i.e. the
+	 * best result among 10% worst measurements.
+	 */
+	struct histogram *dump_bw_hist;
+};
+
+void
+vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
+		    vy_trigger_dump_f trigger_dump_cb);
+
+void
+vy_regulator_start(struct vy_regulator *regulator);
+
+void
+vy_regulator_destroy(struct vy_regulator *regulator);
+
+/**
+ * Check if memory usage is above the watermark and trigger
+ * memory dump if so.
+ */
+void
+vy_regulator_check_watermark(struct vy_regulator *regulator);
+
+/**
+ * Called when the memory limit is hit by a quota consumer.
+ */
+void
+vy_regulator_no_memory(struct vy_regulator *regulator);
+
+/**
+ * Notify the regulator about memory dump completion.
+ */
+void
+vy_regulator_dump_complete(struct vy_regulator *regulator,
+			   size_t mem_dumped, double dump_duration);
+
+/**
+ * Reset dump bandwidth histogram and update initial estimate.
+ * Called when box.cfg.snap_io_rate_limit is updated.
+ */
+void
+vy_regulator_reset_dump_bw(struct vy_regulator *regulator, size_t max);
+
+#if defined(__cplusplus)
+} /* extern "C" */
+#endif /* defined(__cplusplus) */
+
+#endif /* INCLUDES_TARANTOOL_BOX_VY_REGULATOR_H */
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index 8badb47a..b4dc5b69 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1359,7 +1359,7 @@ pad = string.rep('x', 100 * 1024)
 _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
 ---
 ...
-repeat fiber.sleep(0.001) q = box.info.vinyl().quota until q.limit - q.used < pad:len()
+repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 ---
 ...
 test_run:cmd("restart server low_quota with args='1048576'")
@@ -1376,7 +1376,7 @@ pad = string.rep('x', 100 * 1024)
 _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
 ---
 ...
-repeat fiber.sleep(0.001) q = box.info.vinyl().quota until q.limit - q.used < pad:len()
+repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 ---
 ...
 test_run:cmd('switch default')
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 5a47ecc4..e82b6aee 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -529,13 +529,13 @@ box.error.injection.set('ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT', 0.01)
 fiber = require('fiber')
 pad = string.rep('x', 100 * 1024)
 _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
-repeat fiber.sleep(0.001) q = box.info.vinyl().quota until q.limit - q.used < pad:len()
+repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 test_run:cmd("restart server low_quota with args='1048576'")
 box.error.injection.set('ERRINJ_VY_LOG_FLUSH_DELAY', true)
 fiber = require('fiber')
 pad = string.rep('x', 100 * 1024)
 _ = fiber.create(function() for i = 1, 11 do box.space.test:replace{i, pad} end end)
-repeat fiber.sleep(0.001) q = box.info.vinyl().quota until q.limit - q.used < pad:len()
+repeat fiber.sleep(0.001) until box.cfg.vinyl_memory - box.stat.vinyl().memory.level0 < pad:len()
 test_run:cmd('switch default')
 test_run:cmd("stop server low_quota")
 test_run:cmd("cleanup server low_quota")
diff --git a/test/vinyl/info.result b/test/vinyl/info.result
index 21945b9d..a47c9a1b 100644
--- a/test/vinyl/info.result
+++ b/test/vinyl/info.result
@@ -95,13 +95,11 @@ end;
 ...
 -- Return global statistics.
 --
--- Note, quota watermark checking is beyond the scope of this
--- test so we just filter out related statistics.
+-- Note, checking correctness of the load regulator logic is beyond
+-- the scope of this test so we just filter out related statistics.
 function gstat()
     local st = box.stat.vinyl()
-    st.quota.use_rate = nil
-    st.quota.dump_bandwidth = nil
-    st.quota.watermark = nil
+    st.regulator = nil
     return st
 end;
 ---
@@ -231,9 +229,6 @@ gstat()
       out: 0
     data: 0
     index: 0
-  quota:
-    limit: 134217728
-    used: 0
   memory:
     tuple_cache: 0
     tx: 0
@@ -1075,9 +1070,6 @@ gstat()
       out: 0
     data: 104300
     index: 1190
-  quota:
-    limit: 134217728
-    used: 262583
   memory:
     tuple_cache: 14313
     tx: 0
diff --git a/test/vinyl/info.test.lua b/test/vinyl/info.test.lua
index 5912320c..e5794a23 100644
--- a/test/vinyl/info.test.lua
+++ b/test/vinyl/info.test.lua
@@ -77,13 +77,11 @@ end;
 
 -- Return global statistics.
 --
--- Note, quota watermark checking is beyond the scope of this
--- test so we just filter out related statistics.
+-- Note, checking correctness of the load regulator logic is beyond
+-- the scope of this test so we just filter out related statistics.
 function gstat()
     local st = box.stat.vinyl()
-    st.quota.use_rate = nil
-    st.quota.dump_bandwidth = nil
-    st.quota.watermark = nil
+    st.regulator = nil
     return st
 end;
 
diff --git a/test/vinyl/quota.result b/test/vinyl/quota.result
index 48042185..1a0842f9 100644
--- a/test/vinyl/quota.result
+++ b/test/vinyl/quota.result
@@ -12,7 +12,7 @@ test_run:cmd('restart server default')
 --
 -- gh-1863 add BPS tree extents to memory quota
 --
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 0
 ...
@@ -29,7 +29,7 @@ space:insert({1, 1})
 ---
 - [1, 1]
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 98343
 ...
@@ -37,7 +37,7 @@ space:insert({1, 1})
 ---
 - error: Duplicate key exists in unique index 'pk' in space 'test'
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 98343
 ...
@@ -45,7 +45,7 @@ space:update({1}, {{'!', 1, 100}}) -- try to modify the primary key
 ---
 - error: Attempt to modify a tuple field which is part of index 'pk' in space 'test'
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 98343
 ...
@@ -61,7 +61,7 @@ space:insert({4, 4})
 ---
 - [4, 4]
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 98460
 ...
@@ -69,7 +69,7 @@ box.snapshot()
 ---
 - ok
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 0
 ...
@@ -80,14 +80,14 @@ space:select{}
   - [3, 3]
   - [4, 4]
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 0
 ...
 _ = space:replace{1, 1, string.rep('a', 1024 * 1024 * 5)}
 ---
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 5341267
 ...
@@ -121,14 +121,14 @@ count = 20
 pad = string.rep('x', 100 * 1024)
 ---
 ...
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 ---
 - 1048576
 ...
 for i = 1, count do s:replace{i, pad} end -- triggers dump
 ---
 ...
-box.stat.vinyl().quota.used < count * pad:len()
+box.stat.vinyl().memory.level0 < count * pad:len()
 ---
 - true
 ...
@@ -139,14 +139,14 @@ box.snapshot()
 box.cfg{vinyl_memory = 8 * 1024 * 1024}
 ---
 ...
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 ---
 - 8388608
 ...
 for i = 1, count do s:replace{i, pad} end -- does not trigger dump
 ---
 ...
-box.stat.vinyl().quota.used > count * pad:len()
+box.stat.vinyl().memory.level0 > count * pad:len()
 ---
 - true
 ...
@@ -155,7 +155,7 @@ box.cfg{vinyl_memory = 4 * 1024 * 1024} -- error: decreasing vinyl_memory is not
 - error: 'Incorrect value for option ''vinyl_memory'': cannot decrease memory size
     at runtime'
 ...
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 ---
 - 8388608
 ...
diff --git a/test/vinyl/quota.test.lua b/test/vinyl/quota.test.lua
index e67e5430..a2793a01 100644
--- a/test/vinyl/quota.test.lua
+++ b/test/vinyl/quota.test.lua
@@ -12,7 +12,7 @@ test_run:cmd('restart server default')
 -- gh-1863 add BPS tree extents to memory quota
 --
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space = box.schema.space.create('test', { engine = 'vinyl' })
 pk = space:create_index('pk')
@@ -20,33 +20,33 @@ sec = space:create_index('sec', { parts = {2, 'unsigned'} })
 
 space:insert({1, 1})
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space:insert({1, 1})
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space:update({1}, {{'!', 1, 100}}) -- try to modify the primary key
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space:insert({2, 2})
 space:insert({3, 3})
 space:insert({4, 4})
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 box.snapshot()
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space:select{}
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 _ = space:replace{1, 1, string.rep('a', 1024 * 1024 * 5)}
 
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 space:drop()
 
@@ -63,21 +63,21 @@ _ = s:create_index('pk')
 count = 20
 pad = string.rep('x', 100 * 1024)
 
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 
 for i = 1, count do s:replace{i, pad} end -- triggers dump
-box.stat.vinyl().quota.used < count * pad:len()
+box.stat.vinyl().memory.level0 < count * pad:len()
 
 box.snapshot()
 
 box.cfg{vinyl_memory = 8 * 1024 * 1024}
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 
 for i = 1, count do s:replace{i, pad} end -- does not trigger dump
-box.stat.vinyl().quota.used > count * pad:len()
+box.stat.vinyl().memory.level0 > count * pad:len()
 
 box.cfg{vinyl_memory = 4 * 1024 * 1024} -- error: decreasing vinyl_memory is not allowed
-box.stat.vinyl().quota.limit
+box.cfg.vinyl_memory
 
 test_run:cmd('switch default')
 test_run:cmd("stop server test")
diff --git a/test/vinyl/quota_timeout.result b/test/vinyl/quota_timeout.result
index fd8b0196..990d0a4c 100644
--- a/test/vinyl/quota_timeout.result
+++ b/test/vinyl/quota_timeout.result
@@ -47,7 +47,7 @@ s:count()
 ---
 - 1
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 748241
 ...
@@ -61,7 +61,7 @@ s:count()
 ---
 - 1
 ...
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 ---
 - 748241
 ...
@@ -148,7 +148,7 @@ box.snapshot()
 -- The following operation should fail instantly irrespective
 -- of the value of 'vinyl_timeout' (gh-3291).
 --
-box.stat.vinyl().quota.used == 0
+box.stat.vinyl().memory.level0 == 0
 ---
 - true
 ...
diff --git a/test/vinyl/quota_timeout.test.lua b/test/vinyl/quota_timeout.test.lua
index 41a864bb..6266d38b 100644
--- a/test/vinyl/quota_timeout.test.lua
+++ b/test/vinyl/quota_timeout.test.lua
@@ -21,13 +21,13 @@ _ = s:create_index('pk')
 pad = string.rep('x', 2 * box.cfg.vinyl_memory / 3)
 _ = s:auto_increment{pad}
 s:count()
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 -- Since the following operation requires more memory than configured
 -- and dump is disabled, it should fail with ER_VY_QUOTA_TIMEOUT.
 _ = s:auto_increment{pad}
 s:count()
-box.stat.vinyl().quota.used
+box.stat.vinyl().memory.level0
 
 --
 -- Check that increasing box.cfg.vinyl_memory wakes up fibers
@@ -72,7 +72,7 @@ box.snapshot()
 -- The following operation should fail instantly irrespective
 -- of the value of 'vinyl_timeout' (gh-3291).
 --
-box.stat.vinyl().quota.used == 0
+box.stat.vinyl().memory.level0 == 0
 box.cfg{vinyl_timeout = 9000}
 pad = string.rep('x', box.cfg.vinyl_memory)
 _ = s:auto_increment{pad}
diff --git a/test/vinyl/recovery_quota.result b/test/vinyl/recovery_quota.result
index 323b7967..151f280b 100644
--- a/test/vinyl/recovery_quota.result
+++ b/test/vinyl/recovery_quota.result
@@ -60,7 +60,7 @@ test_run:cmd('restart server test with args="2097152"')
 stat = box.stat.vinyl()
 ---
 ...
-stat.quota.used <= stat.quota.limit or {stat.quota.used, stat.quota.limit}
+stat.memory.level0 <= box.cfg.vinyl_memory or {stat.memory.level0, box.cfg.vinyl_memory}
 ---
 - true
 ...
@@ -109,7 +109,7 @@ for i = 1, box.cfg.vinyl_memory / pad_size do box.space.test:replace{i, pad} end
 ---
 - error: Timed out waiting for Vinyl memory quota
 ...
-box.stat.vinyl().quota.used > 1024 * 1024
+box.stat.vinyl().memory.level0 > 1024 * 1024
 ---
 - true
 ...
@@ -125,7 +125,7 @@ while box.space.test.index.pk:stat().disk.dump.count == 0 do fiber.sleep(0.001)
 stat = box.stat.vinyl()
 ---
 ...
-stat.quota.used <= stat.quota.limit or {stat.quota.used, stat.quota.limit}
+stat.memory.level0 <= box.cfg.vinyl_memory or {stat.memory.level0, box.cfg.vinyl_memory}
 ---
 - true
 ...
diff --git a/test/vinyl/recovery_quota.test.lua b/test/vinyl/recovery_quota.test.lua
index 44e35ba6..1e347667 100644
--- a/test/vinyl/recovery_quota.test.lua
+++ b/test/vinyl/recovery_quota.test.lua
@@ -27,7 +27,7 @@ _ = var:insert{'dump', stat.disk.dump.out.rows}
 test_run:cmd('restart server test with args="2097152"')
 -- Check that we do not exceed quota.
 stat = box.stat.vinyl()
-stat.quota.used <= stat.quota.limit or {stat.quota.used, stat.quota.limit}
+stat.memory.level0 <= box.cfg.vinyl_memory or {stat.memory.level0, box.cfg.vinyl_memory}
 -- Check that we did not replay statements dumped before restart.
 stat = box.space.test.index.pk:stat()
 var = box.space.var
@@ -43,14 +43,14 @@ box.cfg{vinyl_timeout=0.001}
 pad_size = 1000
 pad = string.rep('x', pad_size)
 for i = 1, box.cfg.vinyl_memory / pad_size do box.space.test:replace{i, pad} end
-box.stat.vinyl().quota.used > 1024 * 1024
+box.stat.vinyl().memory.level0 > 1024 * 1024
 -- Check that tarantool can recover with a smaller memory limit.
 test_run:cmd('restart server test with args="1048576"')
 fiber = require 'fiber'
 -- All memory above the limit must be dumped after recovery.
 while box.space.test.index.pk:stat().disk.dump.count == 0 do fiber.sleep(0.001) end
 stat = box.stat.vinyl()
-stat.quota.used <= stat.quota.limit or {stat.quota.used, stat.quota.limit}
+stat.memory.level0 <= box.cfg.vinyl_memory or {stat.memory.level0, box.cfg.vinyl_memory}
 _ = test_run:cmd('switch default')
 
 test_run:cmd('stop server test')
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 03/11] vinyl: minor refactoring of quota methods
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
  2018-09-28 17:39 ` [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup Vladimir Davydov
  2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:01   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 04/11] vinyl: move transaction size sanity check to quota Vladimir Davydov
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The refactoring is targeted at facilitating introduction of rate
limiting within the quota class. It moves code blocks around, factors
out some blocks in functions, and improves comments. No functional
changes.

Needed for #1862
---
 src/box/vy_quota.c | 131 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 91 insertions(+), 40 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index e6d22348..64ce56c0 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -31,6 +31,7 @@
 #include "vy_quota.h"
 
 #include <assert.h>
+#include <stdbool.h>
 #include <stddef.h>
 #include <tarantool_ev.h>
 
@@ -40,13 +41,55 @@
 #include "trivia/util.h"
 
 /**
- * Returns true if the quota limit is exceeded and so consumers
- * have to wait.
+ * Return true if the requested amount of memory may be consumed
+ * right now, false if consumers have to wait.
  */
 static inline bool
-vy_quota_is_exceeded(struct vy_quota *q)
+vy_quota_may_use(struct vy_quota *q, size_t size)
 {
-	return q->used > q->limit;
+	if (q->used + size > q->limit)
+		return false;
+	return true;
+}
+
+/**
+ * Consume the given amount of memory without checking the limit.
+ */
+static inline void
+vy_quota_do_use(struct vy_quota *q, size_t size)
+{
+	q->used += size;
+}
+
+/**
+ * Return the given amount of memory without waking blocked fibers.
+ * This function is an exact opposite of vy_quota_do_use().
+ */
+static inline void
+vy_quota_do_unuse(struct vy_quota *q, size_t size)
+{
+	assert(q->used >= size);
+	q->used -= size;
+}
+
+/**
+ * Invoke the registered callback in case memory usage exceeds
+ * the configured limit.
+ */
+static inline void
+vy_quota_check_limit(struct vy_quota *q)
+{
+	if (q->used > q->limit)
+		q->quota_exceeded_cb(q);
+}
+
+/**
+ * Wake up the first consumer in the line waiting for quota.
+ */
+static void
+vy_quota_signal(struct vy_quota *q)
+{
+	fiber_cond_signal(&q->cond);
 }
 
 void
@@ -70,55 +113,63 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit)
 {
 	q->limit = limit;
-	if (q->used >= limit)
-		q->quota_exceeded_cb(q);
-	fiber_cond_signal(&q->cond);
+	vy_quota_check_limit(q);
+	vy_quota_signal(q);
 }
 
 void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
-	q->used += size;
-	if (q->used >= q->limit)
-		q->quota_exceeded_cb(q);
+	vy_quota_do_use(q, size);
+	vy_quota_check_limit(q);
 }
 
 void
 vy_quota_release(struct vy_quota *q, size_t size)
 {
-	assert(q->used >= size);
-	q->used -= size;
-	fiber_cond_signal(&q->cond);
+	vy_quota_do_unuse(q, size);
+	vy_quota_signal(q);
 }
 
 int
 vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 {
-	q->used += size;
-	if (vy_quota_is_exceeded(q)) {
-		/* Wait for quota. */
-		double start_time = ev_monotonic_now(loop());
-		double deadline = start_time + timeout;
+	if (vy_quota_may_use(q, size)) {
+		vy_quota_do_use(q, size);
+		return 0;
+	}
 
-		do {
-			q->quota_exceeded_cb(q);
-			q->used -= size;
-			if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
-				return -1; /* timed out */
-			q->used += size;
-		} while (vy_quota_is_exceeded(q));
-
-		double wait_time = ev_monotonic_now(loop()) - start_time;
-		if (wait_time > q->too_long_threshold) {
-			say_warn("waited for %zu bytes of vinyl memory quota "
-				 "for too long: %.3f sec", size, wait_time);
-		}
+	/* Wait for quota. */
+	double wait_start = ev_monotonic_now(loop());
+	double deadline = wait_start + timeout;
+
+	do {
 		/*
-		 * Wake up the next fiber in the line waiting
-		 * for quota.
+		 * If the requested amount of memory cannot be
+		 * consumed due to the configured limit, notify
+		 * the caller before going to sleep so that it
+		 * can start memory reclaim immediately.
 		 */
-		fiber_cond_signal(&q->cond);
+		if (q->used + size > q->limit)
+			q->quota_exceeded_cb(q);
+		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
+			return -1; /* timed out */
+	} while (!vy_quota_may_use(q, size));
+
+	double wait_time = ev_monotonic_now(loop()) - wait_start;
+	if (wait_time > q->too_long_threshold) {
+		say_warn("waited for %zu bytes of vinyl memory quota "
+			 "for too long: %.3f sec", size, wait_time);
 	}
+
+	vy_quota_do_use(q, size);
+	/*
+	 * Blocked consumers are awaken one by one to preserve
+	 * the order they were put to sleep. It's a responsibility
+	 * of a consumer that managed to acquire the requested
+	 * amount of quota to wake up the next one in the line.
+	 */
+	vy_quota_signal(q);
 	return 0;
 }
 
@@ -126,11 +177,11 @@ void
 vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
 {
 	if (reserved > used) {
-		size_t excess = reserved - used;
-		assert(q->used >= excess);
-		q->used -= excess;
-		fiber_cond_signal(&q->cond);
+		vy_quota_do_unuse(q, reserved - used);
+		vy_quota_signal(q);
+	}
+	if (reserved < used) {
+		vy_quota_do_use(q, used - reserved);
+		vy_quota_check_limit(q);
 	}
-	if (reserved < used)
-		vy_quota_force_use(q, used - reserved);
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 04/11] vinyl: move transaction size sanity check to quota
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 03/11] vinyl: minor refactoring of quota methods Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:02   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond Vladimir Davydov
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

There's a sanity check in vinyl_engine_prepare, which checks if the
transaction size is less than the configured limit and fails without
waiting for quota if it isn't. Let's move this check to vy_quota_use,
because it's really a business of the quota object. This implies that
vy_quota_use has to set diag to differentiate this error from timeout.
---
 src/box/vinyl.c    | 14 +-------------
 src/box/vy_quota.c | 18 ++++++++++++++++--
 2 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 1c0192cb..4ca2953b 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2331,16 +2331,6 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 		return -1;
 
 	/*
-	 * The configured memory limit will never allow us to commit
-	 * this transaction. Fail early.
-	 */
-	if (tx->write_size > env->quota.limit) {
-		diag_set(OutOfMemory, tx->write_size,
-			 "lsregion", "vinyl transaction");
-		return -1;
-	}
-
-	/*
 	 * Do not abort join/subscribe on quota timeout - replication
 	 * is asynchronous anyway and there's box.info.replication
 	 * available for the admin to track the lag so let the applier
@@ -2354,10 +2344,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	 * the transaction to be sent to read view or aborted, we call
 	 * it before checking for conflicts.
 	 */
-	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0) {
-		diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
+	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0)
 		return -1;
-	}
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
 
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 64ce56c0..2afed6b7 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -35,6 +35,9 @@
 #include <stddef.h>
 #include <tarantool_ev.h>
 
+#include "diag.h"
+#include "error.h"
+#include "errcode.h"
 #include "fiber.h"
 #include "fiber_cond.h"
 #include "say.h"
@@ -139,6 +142,15 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		return 0;
 	}
 
+	/*
+	 * Fail early if the configured memory limit never allows
+	 * us to commit the transaction.
+	 */
+	if (size > q->limit) {
+		diag_set(OutOfMemory, size, "lsregion", "vinyl transaction");
+		return -1;
+	}
+
 	/* Wait for quota. */
 	double wait_start = ev_monotonic_now(loop());
 	double deadline = wait_start + timeout;
@@ -152,8 +164,10 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		 */
 		if (q->used + size > q->limit)
 			q->quota_exceeded_cb(q);
-		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
-			return -1; /* timed out */
+		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0) {
+			diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
+			return -1;
+		}
 	} while (!vy_quota_may_use(q, size));
 
 	double wait_time = ev_monotonic_now(loop()) - wait_start;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 04/11] vinyl: move transaction size sanity check to quota Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:05   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly Vladimir Davydov
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Using fiber_cond as a wait queue isn't very convenient, because:
 - It doesn't allow us to put a spuriously woken up fiber back to the
   same position in the queue where it was, thus violating fairness.
 - It doesn't allow us to check whether we actually need to wake up a
   fiber or it will have to go back to sleep anyway as it needs more
   memory than currently available.
 - It doesn't allow us to implement a multi-queue approach where fibers
   that have different priorities are put to different queues.

So let's rewrite the wait queue with plain rlist and fiber_yield.

Needed for #1862
---
 src/box/vy_quota.c | 33 +++++++++++++++++++++++++++------
 src/box/vy_quota.h | 23 +++++++++++++++++------
 2 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 2afed6b7..99811ae9 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -39,7 +39,6 @@
 #include "error.h"
 #include "errcode.h"
 #include "fiber.h"
-#include "fiber_cond.h"
 #include "say.h"
 #include "trivia/util.h"
 
@@ -92,7 +91,17 @@ vy_quota_check_limit(struct vy_quota *q)
 static void
 vy_quota_signal(struct vy_quota *q)
 {
-	fiber_cond_signal(&q->cond);
+	if (!rlist_empty(&q->wait_queue)) {
+		struct vy_quota_wait_node *n;
+		n = rlist_first_entry(&q->wait_queue,
+				      struct vy_quota_wait_node, in_wait_queue);
+		/*
+		 * No need in waking up a consumer if it will have
+		 * to go back to sleep immediately.
+		 */
+		if (vy_quota_may_use(q, n->size))
+			fiber_wakeup(n->fiber);
+	}
 }
 
 void
@@ -102,14 +111,13 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	q->used = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
-	fiber_cond_create(&q->cond);
+	rlist_create(&q->wait_queue);
 }
 
 void
 vy_quota_destroy(struct vy_quota *q)
 {
-	fiber_cond_broadcast(&q->cond);
-	fiber_cond_destroy(&q->cond);
+	(void)q;
 }
 
 void
@@ -155,6 +163,12 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 	double wait_start = ev_monotonic_now(loop());
 	double deadline = wait_start + timeout;
 
+	struct vy_quota_wait_node wait_node = {
+		.fiber = fiber(),
+		.size = size,
+	};
+	rlist_add_tail_entry(&q->wait_queue, &wait_node, in_wait_queue);
+
 	do {
 		/*
 		 * If the requested amount of memory cannot be
@@ -164,12 +178,19 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		 */
 		if (q->used + size > q->limit)
 			q->quota_exceeded_cb(q);
-		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0) {
+
+		double now = ev_monotonic_now(loop());
+		fiber_yield_timeout(deadline - now);
+
+		if (now >= deadline) {
+			rlist_del_entry(&wait_node, in_wait_queue);
 			diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
 			return -1;
 		}
 	} while (!vy_quota_may_use(q, size));
 
+	rlist_del_entry(&wait_node, in_wait_queue);
+
 	double wait_time = ev_monotonic_now(loop()) - wait_start;
 	if (wait_time > q->too_long_threshold) {
 		say_warn("waited for %zu bytes of vinyl memory quota "
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 59fe075f..d46b6748 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -32,18 +32,28 @@
  */
 
 #include <stddef.h>
+#include <small/rlist.h>
 #include <tarantool_ev.h>
-#include "fiber_cond.h"
 
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
 
+struct fiber;
 struct vy_quota;
 
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
+struct vy_quota_wait_node {
+	/** Link in vy_quota::wait_queue. */
+	struct rlist in_wait_queue;
+	/** Fiber waiting for quota. */
+	struct fiber *fiber;
+	/** Amount of requested memory. */
+	size_t size;
+};
+
 /**
  * Quota used for accounting and limiting memory consumption
  * in the vinyl engine. It is NOT multi-threading safe.
@@ -62,15 +72,16 @@ struct vy_quota {
 	 */
 	double too_long_threshold;
 	/**
-	 * Condition variable used for throttling consumers when
-	 * there is no quota left.
-	 */
-	struct fiber_cond cond;
-	/**
 	 * Called if the limit is hit when quota is consumed.
 	 * It is supposed to trigger memory reclaim.
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
+	/**
+	 * Queue of consumers waiting for quota, linked by
+	 * vy_quota_wait_node::state. Newcomers are added
+	 * to the tail.
+	 */
+	struct rlist wait_queue;
 };
 
 void
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:06   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads Vladimir Davydov
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we create a quota object with the limit maximized, and only
set the configured limit when local recovery is complete, so as to make
sure that no dump is triggered during recovery. As a result, we have to
store the configured limit in vy_env::memory, which looks ugly, because
this member is never used afterwards. Let's introduce a new method
vy_quota_enable to enable quota so that we can set the limit right on
quota object construction. This implies that we add a boolean flag to
vy_quota and only check the limit if it is set.

There's another reason to add such a method. Soon we will implement
quota consumption rate limiting. Rate limiting requires a periodic timer
that would replenish quota. It only makes sense to start such a timer
upon recovery completion, which again leads us to an explicit method for
enabling quota.

vy_env::memory will be removed by the following patch along with a few
other pointless members of vy_env.

Needed for #1862
---
 src/box/vinyl.c    |  4 ++--
 src/box/vy_quota.c | 18 +++++++++++++++---
 src/box/vy_quota.h | 21 ++++++++++++++++++++-
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 4ca2953b..2ed14c94 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2518,7 +2518,7 @@ vy_env_new(const char *path, size_t memory,
 			      vy_squash_schedule, e) != 0)
 		goto error_lsm_env;
 
-	vy_quota_create(&e->quota, vy_env_quota_exceeded_cb);
+	vy_quota_create(&e->quota, memory, vy_env_quota_exceeded_cb);
 	vy_regulator_create(&e->regulator, &e->quota,
 			    vy_env_trigger_dump_cb);
 
@@ -2572,7 +2572,7 @@ static void
 vy_env_complete_recovery(struct vy_env *env)
 {
 	vy_scheduler_start(&env->scheduler);
-	vy_quota_set_limit(&env->quota, env->memory);
+	vy_quota_enable(&env->quota);
 	vy_regulator_start(&env->regulator);
 }
 
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 99811ae9..4b3527b4 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -49,6 +49,8 @@
 static inline bool
 vy_quota_may_use(struct vy_quota *q, size_t size)
 {
+	if (!q->is_enabled)
+		return true;
 	if (q->used + size > q->limit)
 		return false;
 	return true;
@@ -81,7 +83,7 @@ vy_quota_do_unuse(struct vy_quota *q, size_t size)
 static inline void
 vy_quota_check_limit(struct vy_quota *q)
 {
-	if (q->used > q->limit)
+	if (q->is_enabled && q->used > q->limit)
 		q->quota_exceeded_cb(q);
 }
 
@@ -105,9 +107,11 @@ vy_quota_signal(struct vy_quota *q)
 }
 
 void
-vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
+vy_quota_create(struct vy_quota *q, size_t limit,
+		vy_quota_exceeded_f quota_exceeded_cb)
 {
-	q->limit = SIZE_MAX;
+	q->is_enabled = false;
+	q->limit = limit;
 	q->used = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
@@ -115,6 +119,14 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 }
 
 void
+vy_quota_enable(struct vy_quota *q)
+{
+	assert(!q->is_enabled);
+	q->is_enabled = true;
+	vy_quota_check_limit(q);
+}
+
+void
 vy_quota_destroy(struct vy_quota *q)
 {
 	(void)q;
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d46b6748..f249512b 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -31,6 +31,7 @@
  * SUCH DAMAGE.
  */
 
+#include <stdbool.h>
 #include <stddef.h>
 #include <small/rlist.h>
 #include <tarantool_ev.h>
@@ -59,6 +60,8 @@ struct vy_quota_wait_node {
  * in the vinyl engine. It is NOT multi-threading safe.
  */
 struct vy_quota {
+	/** Set if the quota was enabled. */
+	bool is_enabled;
 	/**
 	 * Memory limit. Once hit, new transactions are
 	 * throttled until memory is reclaimed.
@@ -84,9 +87,25 @@ struct vy_quota {
 	struct rlist wait_queue;
 };
 
+/**
+ * Initialize a quota object.
+ *
+ * Note, the limit won't be imposed until vy_quota_enable()
+ * is called.
+ */
 void
-vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
+vy_quota_create(struct vy_quota *q, size_t limit,
+		vy_quota_exceeded_f quota_exceeded_cb);
 
+/**
+ * Enable the configured limit for a quota object.
+ */
+void
+vy_quota_enable(struct vy_quota *q);
+
+/**
+ * Destroy a quota object.
+ */
 void
 vy_quota_destroy(struct vy_quota *q);
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (5 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-29  5:06   ` [tarantool-patches] " Konstantin Osipov
  2018-09-28 17:40 ` [PATCH v2 08/11] vinyl: do not try to trigger dump in regulator if already in progress Vladimir Davydov
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

They are only used to set corresponding members of vy_quota, vy_run_env,
and vy_scheduler when vy_env is created. No point in keeping them around
all the time.
---
 src/box/vinyl.c             | 20 +++++---------------
 src/box/vy_run.c            | 12 ++++++------
 src/box/vy_run.h            | 17 ++++++++++++-----
 test/unit/vy_point_lookup.c |  2 +-
 4 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 2ed14c94..25ba92fe 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -135,14 +135,8 @@ struct vy_env {
 	int64_t join_lsn;
 	/** Path to the data directory. */
 	char *path;
-	/** Max size of the memory level. */
-	size_t memory;
 	/** Max time a transaction may wait for memory. */
 	double timeout;
-	/** Max number of threads used for reading. */
-	int read_threads;
-	/** Max number of threads used for writing. */
-	int write_threads;
 	/** Try to recover corrupted data if set. */
 	bool force_recovery;
 };
@@ -760,8 +754,7 @@ vinyl_index_open(struct index *index)
 		rc = vy_lsm_create(lsm);
 		if (rc == 0) {
 			/* Make sure reader threads are up and running. */
-			vy_run_env_enable_coio(&env->run_env,
-					       env->read_threads);
+			vy_run_env_enable_coio(&env->run_env);
 		}
 		break;
 	case VINYL_INITIAL_RECOVERY_REMOTE:
@@ -2489,10 +2482,7 @@ vy_env_new(const char *path, size_t memory,
 	}
 	memset(e, 0, sizeof(*e));
 	e->status = VINYL_OFFLINE;
-	e->memory = memory;
 	e->timeout = TIMEOUT_INFINITY;
-	e->read_threads = read_threads;
-	e->write_threads = write_threads;
 	e->force_recovery = force_recovery;
 	e->path = strdup(path);
 	if (e->path == NULL) {
@@ -2508,8 +2498,8 @@ vy_env_new(const char *path, size_t memory,
 	if (e->squash_queue == NULL)
 		goto error_squash_queue;
 
-	vy_mem_env_create(&e->mem_env, e->memory);
-	vy_scheduler_create(&e->scheduler, e->write_threads,
+	vy_mem_env_create(&e->mem_env, memory);
+	vy_scheduler_create(&e->scheduler, write_threads,
 			    vy_env_dump_complete_cb,
 			    &e->run_env, &e->xm->read_views);
 
@@ -2526,7 +2516,7 @@ vy_env_new(const char *path, size_t memory,
 	mempool_create(&e->iterator_pool, slab_cache,
 	               sizeof(struct vinyl_iterator));
 	vy_cache_env_create(&e->cache_env, slab_cache);
-	vy_run_env_create(&e->run_env);
+	vy_run_env_create(&e->run_env, read_threads);
 	vy_log_init(e->path);
 	return e;
 
@@ -2823,7 +2813,7 @@ vinyl_engine_end_recovery(struct engine *engine)
 	 * creation, see vinyl_index_open().
 	 */
 	if (e->lsm_env.lsm_count > 0)
-		vy_run_env_enable_coio(&e->run_env, e->read_threads);
+		vy_run_env_enable_coio(&e->run_env);
 
 	e->status = VINYL_ONLINE;
 	return 0;
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index 7485d97a..bb5baf2c 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -123,12 +123,11 @@ vy_run_reader_f(va_list ap)
 
 /** Start run reader threads. */
 static void
-vy_run_env_start_readers(struct vy_run_env *env, int threads)
+vy_run_env_start_readers(struct vy_run_env *env)
 {
-	assert(threads > 0);
 	assert(env->reader_pool == NULL);
+	assert(env->reader_pool_size > 0);
 
-	env->reader_pool_size = threads;
 	env->reader_pool = calloc(env->reader_pool_size,
 				  sizeof(*env->reader_pool));
 	if (env->reader_pool == NULL)
@@ -166,9 +165,10 @@ vy_run_env_stop_readers(struct vy_run_env *env)
  * Initialize vinyl run environment
  */
 void
-vy_run_env_create(struct vy_run_env *env)
+vy_run_env_create(struct vy_run_env *env, int read_threads)
 {
 	memset(env, 0, sizeof(*env));
+	env->reader_pool_size = read_threads;
 	tt_pthread_key_create(&env->zdctx_key, vy_free_zdctx);
 	mempool_create(&env->read_task_pool, cord_slab_cache(),
 		       sizeof(struct vy_page_read_task));
@@ -190,11 +190,11 @@ vy_run_env_destroy(struct vy_run_env *env)
  * Enable coio reads for a vinyl run environment.
  */
 void
-vy_run_env_enable_coio(struct vy_run_env *env, int threads)
+vy_run_env_enable_coio(struct vy_run_env *env)
 {
 	if (env->reader_pool != NULL)
 		return; /* already enabled */
-	vy_run_env_start_readers(env, threads);
+	vy_run_env_start_readers(env);
 }
 
 /**
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index d74f216c..4b919a8f 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -281,9 +281,13 @@ struct vy_page {
 
 /**
  * Initialize vinyl run environment
+ *
+ * @param read_threads - max number of background threads to
+ * use for disk reads; note background threads are not used
+ * until vy_run_env_enable_coio() is called.
  */
 void
-vy_run_env_create(struct vy_run_env *env);
+vy_run_env_create(struct vy_run_env *env, int read_threads);
 
 /**
  * Destroy vinyl run environment
@@ -294,14 +298,17 @@ vy_run_env_destroy(struct vy_run_env *env);
 /**
  * Enable coio reads for a vinyl run environment.
  *
- * This function starts @threads reader threads and makes
- * the run iterator hand disk reads over to them rather than
- * read run files directly blocking the current fiber.
+ * This function starts background reader threads and makes
+ * the run iterator hand disk reads over to them rather
+ * than read run files directly blocking the current fiber.
+ *
+ * The number of background reader threads is configured when
+ * the environment is created, see vy_run_env_create().
  *
  * Subsequent calls to this function will silently return.
  */
 void
-vy_run_env_enable_coio(struct vy_run_env *env, int threads);
+vy_run_env_enable_coio(struct vy_run_env *env);
 
 /**
  * Return the size of a run bloom filter.
diff --git a/test/unit/vy_point_lookup.c b/test/unit/vy_point_lookup.c
index 86877d7d..dd33bbec 100644
--- a/test/unit/vy_point_lookup.c
+++ b/test/unit/vy_point_lookup.c
@@ -70,7 +70,7 @@ test_basic()
 	is(rc, 0, "vy_lsm_env_create");
 
 	struct vy_run_env run_env;
-	vy_run_env_create(&run_env);
+	vy_run_env_create(&run_env, 0);
 
 	struct vy_cache_env cache_env;
 	vy_cache_env_create(&cache_env, slab_cache);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 08/11] vinyl: do not try to trigger dump in regulator if already in progress
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (6 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-28 17:40 ` [PATCH v2 09/11] vinyl: do not account zero dump bandwidth Vladimir Davydov
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This is pointless since trigger_dump_cb callback will return right away
in such a case. Let's wrap trigger_dump_cb in vy_regulator_trigger_dump
method, which will actulally invoke the callback only if the previous
dump has already completed (i.e. vy_regulator_dump_complete was called).

This also gives us a definite place in code where we can adjust the rate
limit so as to guarantee that a triggered memory dump will finish before
we hit the hard memory limit (this will be done later).

Needed for #1862
---
 src/box/vinyl.c        |  5 +++--
 src/box/vy_regulator.c | 20 ++++++++++++++++++--
 src/box/vy_regulator.h | 11 +++++++++--
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 25ba92fe..c3d95777 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2427,7 +2427,7 @@ vy_env_quota_exceeded_cb(struct vy_quota *quota)
 	vy_regulator_no_memory(&env->regulator);
 }
 
-static void
+static int
 vy_env_trigger_dump_cb(struct vy_regulator *regulator)
 {
 	struct vy_env *env = container_of(regulator, struct vy_env, regulator);
@@ -2439,9 +2439,10 @@ vy_env_trigger_dump_cb(struct vy_regulator *regulator)
 		 * quota has been consumed by pending transactions.
 		 * There's nothing we can do about that.
 		 */
-		return;
+		return -1;
 	}
 	vy_scheduler_trigger_dump(&env->scheduler);
+	return 0;
 }
 
 static void
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 3bd4fee4..5ec5629f 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -31,6 +31,7 @@
 #include "vy_regulator.h"
 
 #include <math.h>
+#include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>
 #include <tarantool_ev.h>
@@ -66,6 +67,18 @@ static const int VY_DUMP_BW_PCT = 10;
 static const size_t VY_DUMP_BW_DEFAULT = 10 * 1024 * 1024;
 
 static void
+vy_regulator_trigger_dump(struct vy_regulator *regulator)
+{
+	if (regulator->dump_in_progress)
+		return;
+
+	if (regulator->trigger_dump_cb(regulator) != 0)
+		return;
+
+	regulator->dump_in_progress = true;
+}
+
+static void
 vy_regulator_update_write_rate(struct vy_regulator *regulator)
 {
 	size_t used_curr = regulator->quota->used;
@@ -152,6 +165,7 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
 	regulator->watermark = SIZE_MAX;
 	regulator->write_rate = 0;
 	regulator->dump_bw = VY_DUMP_BW_DEFAULT;
+	regulator->dump_in_progress = false;
 }
 
 void
@@ -170,20 +184,22 @@ vy_regulator_destroy(struct vy_regulator *regulator)
 void
 vy_regulator_no_memory(struct vy_regulator *regulator)
 {
-	regulator->trigger_dump_cb(regulator);
+	vy_regulator_trigger_dump(regulator);
 }
 
 void
 vy_regulator_check_watermark(struct vy_regulator *regulator)
 {
 	if (regulator->quota->used >= regulator->watermark)
-		regulator->trigger_dump_cb(regulator);
+		vy_regulator_trigger_dump(regulator);
 }
 
 void
 vy_regulator_dump_complete(struct vy_regulator *regulator,
 			   size_t mem_dumped, double dump_duration)
 {
+	regulator->dump_in_progress = false;
+
 	if (dump_duration > 0) {
 		histogram_collect(regulator->dump_bw_hist,
 				  mem_dumped / dump_duration);
diff --git a/src/box/vy_regulator.h b/src/box/vy_regulator.h
index 9336adc9..0ea708d6 100644
--- a/src/box/vy_regulator.h
+++ b/src/box/vy_regulator.h
@@ -31,6 +31,7 @@
  * SUCH DAMAGE.
  */
 
+#include <stdbool.h>
 #include <stddef.h>
 #include <tarantool_ev.h>
 
@@ -42,7 +43,7 @@ struct histogram;
 struct vy_quota;
 struct vy_regulator;
 
-typedef void
+typedef int
 (*vy_trigger_dump_f)(struct vy_regulator *regulator);
 
 /**
@@ -59,7 +60,7 @@ struct vy_regulator {
 	/**
 	 * Called when the regulator detects that memory usage
 	 * exceeds the computed watermark. Supposed to trigger
-	 * memory dump.
+	 * memory dump and return 0 on success, -1 on failure.
 	 */
 	vy_trigger_dump_f trigger_dump_cb;
 	/**
@@ -100,6 +101,12 @@ struct vy_regulator {
 	 * best result among 10% worst measurements.
 	 */
 	struct histogram *dump_bw_hist;
+	/**
+	 * Set if the last triggered memory dump hasn't completed
+	 * yet, i.e. trigger_dump_cb() was successfully invoked,
+	 * but vy_regulator_dump_complete() hasn't been called yet.
+	 */
+	bool dump_in_progress;
 };
 
 void
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 09/11] vinyl: do not account zero dump bandwidth
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (7 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 08/11] vinyl: do not try to trigger dump in regulator if already in progress Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-10-12 13:27   ` Vladimir Davydov
  2018-09-28 17:40 ` [PATCH v2 10/11] vinyl: implement basic transaction throttling Vladimir Davydov
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Since we free memory in 16 MB blocks (see SLAB_SIZE), it may occur that
we dump almost all data stored in a block but still have to leave it be,
because it contains data of a newer generation. If the memory limit is
small (as it is typically in tests), this may result in dumping 0 bytes.
In order not to disrupt statistics and throttling transactions in vain,
let's simply ignore such results. Normally, the memory limit should be
large enough for such granularity not to affect the measurements
(hundreds of megabytes) so this problem isn't worth putting more efforts
into.

Needed for #1862
---
 src/box/vy_regulator.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 5ec5629f..682777fc 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -200,7 +200,20 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 {
 	regulator->dump_in_progress = false;
 
-	if (dump_duration > 0) {
+	/*
+	 * Update dump bandwidth.
+	 *
+	 * Note, since we free memory in 16 MB blocks (see SLAB_SIZE),
+	 * it may occur that we dump almost all data stored in a block
+	 * but still have to leave it be, because it contains data of
+	 * a newer generation. If the memory limit is small, this may
+	 * result in dumping 0 bytes. In order not to disrupt statistics
+	 * let's simply ignore such results. Normally, the memory limit
+	 * should be large enough for such granularity not to affect the
+	 * measurements (hundreds of megabytes) so this problem isn't
+	 * worth putting more efforts into.
+	 */
+	if (mem_dumped > 0 && dump_duration > 0) {
 		histogram_collect(regulator->dump_bw_hist,
 				  mem_dumped / dump_duration);
 		/*
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 10/11] vinyl: implement basic transaction throttling
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (8 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 09/11] vinyl: do not account zero dump bandwidth Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-09-28 17:40 ` [PATCH v2 11/11] vinyl: introduce quota consumer priorities Vladimir Davydov
  2018-10-03  9:06 ` [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
  11 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If the rate at which transactions are ready to write to the database is
greater than the dump bandwidth, memory will get depleted before the
previously scheduled dump is complete and all newer transactions will
have to wait, which may take seconds or even minutes:

  W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec

This patch set implements basic transaction throttling that is supposed
to help avoid unpredictably long stalls. Now the transaction write rate
is always capped by the observed dump bandwidth, because it doesn't make
sense to consume memory at a greater rate than it can be freed. On top
of that, when a dump begins, we estimate the amount of time it is going
to take and limit the transaction write rate accordingly.

Note, this patch doesn't take into account compaction when setting the
rate limit so compaction threads may still fail to keep up with dumps,
increasing the read amplification. It will be addressed later.

Part of #1862
---
 src/box/vy_quota.c           |  44 ++++++++++++++++++-
 src/box/vy_quota.h           |  78 +++++++++++++++++++++++++++++++++
 src/box/vy_regulator.c       |  30 +++++++++++++
 test/vinyl/suite.ini         |   2 +-
 test/vinyl/throttle.result   | 102 +++++++++++++++++++++++++++++++++++++++++++
 test/vinyl/throttle.test.lua |  54 +++++++++++++++++++++++
 6 files changed, 307 insertions(+), 3 deletions(-)
 create mode 100644 test/vinyl/throttle.result
 create mode 100644 test/vinyl/throttle.test.lua

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 4b3527b4..ceac4878 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -43,6 +43,15 @@
 #include "trivia/util.h"
 
 /**
+ * Quota timer period, in seconds.
+ *
+ * The timer is used for replenishing the rate limit value so
+ * its period defines how long throttled transactions will wait.
+ * Therefore use a relatively small period.
+ */
+static const double VY_QUOTA_TIMER_PERIOD = 0.1;
+
+/**
  * Return true if the requested amount of memory may be consumed
  * right now, false if consumers have to wait.
  */
@@ -53,6 +62,8 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
 		return true;
 	if (q->used + size > q->limit)
 		return false;
+	if (!vy_rate_limit_may_use(&q->rate_limit))
+		return false;
 	return true;
 }
 
@@ -63,6 +74,7 @@ static inline void
 vy_quota_do_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
+	vy_rate_limit_use(&q->rate_limit, size);
 }
 
 /**
@@ -74,6 +86,7 @@ vy_quota_do_unuse(struct vy_quota *q, size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
+	vy_rate_limit_unuse(&q->rate_limit, size);
 }
 
 /**
@@ -106,6 +119,18 @@ vy_quota_signal(struct vy_quota *q)
 	}
 }
 
+static void
+vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
+{
+	(void)loop;
+	(void)events;
+
+	struct vy_quota *q = timer->data;
+
+	vy_rate_limit_refill(&q->rate_limit, VY_QUOTA_TIMER_PERIOD);
+	vy_quota_signal(q);
+}
+
 void
 vy_quota_create(struct vy_quota *q, size_t limit,
 		vy_quota_exceeded_f quota_exceeded_cb)
@@ -116,6 +141,9 @@ vy_quota_create(struct vy_quota *q, size_t limit,
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	rlist_create(&q->wait_queue);
+	vy_rate_limit_create(&q->rate_limit);
+	ev_timer_init(&q->timer, vy_quota_timer_cb, 0, VY_QUOTA_TIMER_PERIOD);
+	q->timer.data = q;
 }
 
 void
@@ -123,13 +151,14 @@ vy_quota_enable(struct vy_quota *q)
 {
 	assert(!q->is_enabled);
 	q->is_enabled = true;
+	ev_timer_start(loop(), &q->timer);
 	vy_quota_check_limit(q);
 }
 
 void
 vy_quota_destroy(struct vy_quota *q)
 {
-	(void)q;
+	ev_timer_stop(loop(), &q->timer);
 }
 
 void
@@ -141,6 +170,12 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
+vy_quota_set_rate_limit(struct vy_quota *q, size_t rate)
+{
+	vy_rate_limit_set(&q->rate_limit, rate);
+}
+
+void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	vy_quota_do_use(q, size);
@@ -150,7 +185,12 @@ vy_quota_force_use(struct vy_quota *q, size_t size)
 void
 vy_quota_release(struct vy_quota *q, size_t size)
 {
-	vy_quota_do_unuse(q, size);
+	/*
+	 * Don't use vy_quota_do_unuse(), because it affects
+	 * the rate limit state.
+	 */
+	assert(q->used >= size);
+	q->used -= size;
 	vy_quota_signal(q);
 }
 
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index f249512b..79755e89 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -31,11 +31,14 @@
  * SUCH DAMAGE.
  */
 
+#include <limits.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <small/rlist.h>
 #include <tarantool_ev.h>
 
+#include "trivia/util.h"
+
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
@@ -43,6 +46,67 @@ extern "C" {
 struct fiber;
 struct vy_quota;
 
+/** Rate limit state. */
+struct vy_rate_limit {
+	/** Max allowed rate, per second. */
+	size_t rate;
+	/** Current quota. */
+	ssize_t value;
+};
+
+/** Initialize a rate limit state. */
+static inline void
+vy_rate_limit_create(struct vy_rate_limit *rl)
+{
+	rl->rate = SIZE_MAX;
+	rl->value = SSIZE_MAX;
+}
+
+/** Set rate limit. */
+static inline void
+vy_rate_limit_set(struct vy_rate_limit *rl, size_t rate)
+{
+	rl->rate = rate;
+}
+
+/**
+ * Return true if quota may be consumed without exceeding
+ * the configured rate limit.
+ */
+static inline bool
+vy_rate_limit_may_use(struct vy_rate_limit *rl)
+{
+	return rl->value > 0;
+}
+
+/** Consume the given amount of quota. */
+static inline void
+vy_rate_limit_use(struct vy_rate_limit *rl, size_t size)
+{
+	rl->value -= size;
+}
+
+/** Release the given amount of quota. */
+static inline void
+vy_rate_limit_unuse(struct vy_rate_limit *rl, size_t size)
+{
+	rl->value += size;
+}
+
+/**
+ * Replenish quota by the amount accumulated for the given
+ * time interval.
+ */
+static inline void
+vy_rate_limit_refill(struct vy_rate_limit *rl, double time)
+{
+	double size = rl->rate * time;
+	double value = rl->value + size;
+	/* Allow bursts up to 2x rate. */
+	value = MIN(value, size * 2);
+	rl->value = MIN(value, SSIZE_MAX);
+}
+
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
@@ -85,6 +149,13 @@ struct vy_quota {
 	 * to the tail.
 	 */
 	struct rlist wait_queue;
+	/** Rate limit state. */
+	struct vy_rate_limit rate_limit;
+	/**
+	 * Periodic timer that is used for refilling the rate
+	 * limit value.
+	 */
+	ev_timer timer;
 };
 
 /**
@@ -117,6 +188,13 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
+ * Set the max rate at which quota may be consumed,
+ * in bytes per second.
+ */
+void
+vy_quota_set_rate_limit(struct vy_quota *q, size_t rate);
+
+/**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 682777fc..1e106fc4 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -76,6 +76,24 @@ vy_regulator_trigger_dump(struct vy_regulator *regulator)
 		return;
 
 	regulator->dump_in_progress = true;
+
+	/*
+	 * To avoid unpredictably long stalls, we must limit
+	 * the write rate when a dump is in progress so that
+	 * we don't hit the hard limit before the dump has
+	 * completed, i.e.
+	 *
+	 *    mem_left        mem_used
+	 *   ---------- >= --------------
+	 *   write_rate    dump_bandwidth
+	 */
+	struct vy_quota *quota = regulator->quota;
+	size_t mem_left = (quota->used < quota->limit ?
+			   quota->limit - quota->used : 0);
+	size_t mem_used = quota->used;
+	size_t max_write_rate = mem_left * regulator->dump_bw / (mem_used + 1);
+	max_write_rate = MIN(max_write_rate, regulator->dump_bw);
+	vy_quota_set_rate_limit(quota, max_write_rate);
 }
 
 static void
@@ -172,6 +190,7 @@ void
 vy_regulator_start(struct vy_regulator *regulator)
 {
 	ev_timer_start(loop(), &regulator->timer);
+	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
 }
 
 void
@@ -225,6 +244,16 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 		regulator->dump_bw = histogram_percentile_lower(
 			regulator->dump_bw_hist, VY_DUMP_BW_PCT);
 	}
+
+	/*
+	 * Reset the rate limit.
+	 *
+	 * It doesn't make sense to allow to consume memory at
+	 * a higher rate than it can be dumped so we set the rate
+	 * limit to the dump bandwidth rather than disabling it
+	 * completely.
+	 */
+	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
 }
 
 void
@@ -232,4 +261,5 @@ vy_regulator_reset_dump_bw(struct vy_regulator *regulator, size_t max)
 {
 	histogram_reset(regulator->dump_bw_hist);
 	regulator->dump_bw = MIN(VY_DUMP_BW_DEFAULT, max);
+	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
 }
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index b9dae380..785bc63d 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -6,5 +6,5 @@ release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua part
 config = suite.cfg
 lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua
 use_unix_sockets = True
-long_run = stress.test.lua large.test.lua write_iterator_rand.test.lua dump_stress.test.lua select_consistency.test.lua
+long_run = stress.test.lua large.test.lua write_iterator_rand.test.lua dump_stress.test.lua select_consistency.test.lua throttle.test.lua
 is_parallel = False
diff --git a/test/vinyl/throttle.result b/test/vinyl/throttle.result
new file mode 100644
index 00000000..628ee8a2
--- /dev/null
+++ b/test/vinyl/throttle.result
@@ -0,0 +1,102 @@
+--
+-- Basic test for transaction throttling.
+--
+-- It checks that write transactions aren't stalled for long
+-- due to hitting the memory limit, but instead are throttled
+-- in advance.
+--
+test_run = require('test_run').new()
+---
+...
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+---
+- true
+...
+test_run:cmd(string.format("start server test with args='%d'", 32 * 1024 * 1024))
+---
+- true
+...
+test_run:cmd('switch test')
+---
+- true
+...
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+box.cfg{snap_io_rate_limit = 8}
+---
+...
+FIBER_COUNT = 5
+---
+...
+TUPLE_SIZE = 1000
+---
+...
+TX_TUPLE_COUNT = 10
+---
+...
+TX_SIZE = TUPLE_SIZE * TX_TUPLE_COUNT
+---
+...
+TX_COUNT = math.ceil(box.cfg.vinyl_memory / (TX_SIZE * FIBER_COUNT))
+---
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+_ = s:create_index('primary', {parts = {1, 'unsigned', 2, 'unsigned', 3, 'unsigned'}})
+---
+...
+latency = 0
+---
+...
+c = fiber.channel(FIBER_COUNT)
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for i = 1, FIBER_COUNT do
+    fiber.create(function()
+        for j = 1, TX_COUNT do
+            local t1 = fiber.time()
+            box.begin()
+            for k = 1, TX_TUPLE_COUNT do
+                s:replace{i, j, k, digest.urandom(TUPLE_SIZE)}
+            end
+            box.commit()
+            local t2 = fiber.time()
+            latency = math.max(latency, t2 - t1)
+        end
+        c:put(true)
+    end)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+for i = 1, FIBER_COUNT do c:get() end
+---
+...
+latency < 0.2 or latency
+---
+- true
+...
+test_run:cmd('switch default')
+---
+- true
+...
+test_run:cmd("stop server test")
+---
+- true
+...
+test_run:cmd("cleanup server test")
+---
+- true
+...
diff --git a/test/vinyl/throttle.test.lua b/test/vinyl/throttle.test.lua
new file mode 100644
index 00000000..06548a08
--- /dev/null
+++ b/test/vinyl/throttle.test.lua
@@ -0,0 +1,54 @@
+--
+-- Basic test for transaction throttling.
+--
+-- It checks that write transactions aren't stalled for long
+-- due to hitting the memory limit, but instead are throttled
+-- in advance.
+--
+test_run = require('test_run').new()
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+test_run:cmd(string.format("start server test with args='%d'", 32 * 1024 * 1024))
+test_run:cmd('switch test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+box.cfg{snap_io_rate_limit = 8}
+
+FIBER_COUNT = 5
+TUPLE_SIZE = 1000
+TX_TUPLE_COUNT = 10
+TX_SIZE = TUPLE_SIZE * TX_TUPLE_COUNT
+TX_COUNT = math.ceil(box.cfg.vinyl_memory / (TX_SIZE * FIBER_COUNT))
+
+s = box.schema.space.create('test', {engine = 'vinyl'})
+_ = s:create_index('primary', {parts = {1, 'unsigned', 2, 'unsigned', 3, 'unsigned'}})
+
+latency = 0
+c = fiber.channel(FIBER_COUNT)
+
+test_run:cmd("setopt delimiter ';'")
+for i = 1, FIBER_COUNT do
+    fiber.create(function()
+        for j = 1, TX_COUNT do
+            local t1 = fiber.time()
+            box.begin()
+            for k = 1, TX_TUPLE_COUNT do
+                s:replace{i, j, k, digest.urandom(TUPLE_SIZE)}
+            end
+            box.commit()
+            local t2 = fiber.time()
+            latency = math.max(latency, t2 - t1)
+        end
+        c:put(true)
+    end)
+end;
+test_run:cmd("setopt delimiter ''");
+
+for i = 1, FIBER_COUNT do c:get() end
+
+latency < 0.2 or latency
+
+test_run:cmd('switch default')
+test_run:cmd("stop server test")
+test_run:cmd("cleanup server test")
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (9 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 10/11] vinyl: implement basic transaction throttling Vladimir Davydov
@ 2018-09-28 17:40 ` Vladimir Davydov
  2018-10-06 13:24   ` Konstantin Osipov
  2018-10-03  9:06 ` [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
  11 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-28 17:40 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we only limit quota consumption rate so that writers won't
hit the hard limit before memory dump is complete. However, it isn't
enough, because we also need to consider compaction: if it doesn't keep
up with dumps, read and space amplification will grow uncontrollably.

The problem is compaction may be a quota consumer by itself, as it may
generate deferred DELETE statements for secondary indexes. We can't
ignore quota completely there, because if we do, we may hit the memory
limit and stall all writers, which is unacceptable, but we do want to
ignore the rate limit imposed to make sure that compaction keeps up with
dumps, otherwise compaction won't benefit from such a throttling.

To tackle this problem, this patch introduces quota consumer priorities.
Now a quota object maintains one rate limit and one wait queue per each
transaction priority. Rate limit i is used by a consumer with priority
prio if and only if prio <= i. Whenever a consumer has to be throttled,
it is put to sleep to the wait queue corresponding to its priority.
When quota is replenished, we pick the oldest consumer among all that
may be woken up. This ensures fairness.

For now, there are only two priorities - one for transactions and one
for compaction jobs. Transactions have the lowest priority (0) and are
throttled for two purposes: first, to ensure that the hard memory limit
won't be hit before memory dump is complete (memory-based throttling),
and second, to keep compaction progress in sync with dumps (disk-based
throttling). Compaction jobs have a higher priority (1) and ignore
disk-based throttling, but still respect memory-based throttling.

Note, the patch doesn't implement the logic of disk-based throttling in
the regulator module. It is still left for future work.

Part of #1862
---
 src/box/vinyl.c        | 30 +++++++++-------
 src/box/vy_quota.c     | 96 +++++++++++++++++++++++++++++++++++---------------
 src/box/vy_quota.h     | 88 +++++++++++++++++++++++++++++++++++++--------
 src/box/vy_regulator.c | 11 +++---
 4 files changed, 167 insertions(+), 58 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index c3d95777..615321ee 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2337,7 +2337,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	 * the transaction to be sent to read view or aborted, we call
 	 * it before checking for conflicts.
 	 */
-	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 tx->write_size, timeout) != 0)
 		return -1;
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -2346,8 +2347,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_adjust(&env->quota, tx->write_size,
-			mem_used_after - mem_used_before);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX,
+			tx->write_size, mem_used_after - mem_used_before);
 	vy_regulator_check_watermark(&env->regulator);
 	return rc;
 }
@@ -2372,7 +2373,8 @@ vinyl_engine_commit(struct engine *engine, struct txn *txn)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	/* We can't abort the transaction at this point, use force. */
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_watermark(&env->regulator);
 
 	txn->engine_tx = NULL;
@@ -3165,7 +3167,8 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	 * quota accounting.
 	 */
 	size_t reserved = tx->write_size;
-	if (vy_quota_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 reserved, TIMEOUT_INFINITY) != 0)
 		unreachable();
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -3184,7 +3187,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
-	vy_quota_adjust(&env->quota, reserved, used);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX, reserved, used);
 	vy_regulator_check_watermark(&env->regulator);
 	return rc;
 }
@@ -3505,7 +3508,7 @@ vy_squash_process(struct vy_squash *squash)
 		 * so there's no need in invalidating the cache.
 		 */
 		vy_mem_commit_stmt(mem, region_stmt);
-		vy_quota_force_use(&env->quota,
+		vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
 				   mem_used_after - mem_used_before);
 		vy_regulator_check_watermark(&env->regulator);
 	}
@@ -3981,9 +3984,10 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
 	/* Consume memory quota. Throttle if it is exceeded. */
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_watermark(&env->regulator);
-	vy_quota_wait(&env->quota);
+	vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_TX);
 	return rc;
 }
 
@@ -4109,7 +4113,8 @@ vy_build_recover(struct vy_env *env, struct vy_lsm *lsm, struct vy_lsm *pk)
 
 	mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	return rc;
 }
 
@@ -4325,7 +4330,7 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	 */
 	struct vy_env *env = vy_env(space->engine);
 	if (is_first_statement)
-		vy_quota_wait(&env->quota);
+		vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_COMPACTION);
 
 	/* Create the deferred DELETE statement. */
 	struct vy_lsm *pk = vy_lsm(space->index[0]);
@@ -4412,7 +4417,8 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	}
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_COMPACTION,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_watermark(&env->regulator);
 
 	tuple_unref(delete);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index ceac4878..1588b3dc 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -52,18 +52,29 @@
 static const double VY_QUOTA_TIMER_PERIOD = 0.1;
 
 /**
+ * Iterate over rate limit states that are enforced for a consumer
+ * with the given priority.
+ */
+#define vy_quota_for_each_rate_limit(quota, rl, prio) \
+	for (struct vy_rate_limit *rl = &(quota)->rate_limit[prio]; \
+	     rl - (quota)->rate_limit < vy_quota_consumer_prio_MAX; rl++)
+
+/**
  * Return true if the requested amount of memory may be consumed
  * right now, false if consumers have to wait.
  */
 static inline bool
-vy_quota_may_use(struct vy_quota *q, size_t size)
+vy_quota_may_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		 size_t size)
 {
 	if (!q->is_enabled)
 		return true;
 	if (q->used + size > q->limit)
 		return false;
-	if (!vy_rate_limit_may_use(&q->rate_limit))
-		return false;
+	vy_quota_for_each_rate_limit(q, rl, prio) {
+		if (!vy_rate_limit_may_use(rl))
+			return false;
+	}
 	return true;
 }
 
@@ -71,10 +82,12 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
  * Consume the given amount of memory without checking the limit.
  */
 static inline void
-vy_quota_do_use(struct vy_quota *q, size_t size)
+vy_quota_do_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		size_t size)
 {
 	q->used += size;
-	vy_rate_limit_use(&q->rate_limit, size);
+	vy_quota_for_each_rate_limit(q, rl, prio)
+		vy_rate_limit_use(rl, size);
 }
 
 /**
@@ -82,11 +95,13 @@ vy_quota_do_use(struct vy_quota *q, size_t size)
  * This function is an exact opposite of vy_quota_do_use().
  */
 static inline void
-vy_quota_do_unuse(struct vy_quota *q, size_t size)
+vy_quota_do_unuse(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		  size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	vy_rate_limit_unuse(&q->rate_limit, size);
+	vy_quota_for_each_rate_limit(q, rl, prio)
+		vy_rate_limit_unuse(rl, size);
 }
 
 /**
@@ -106,17 +121,33 @@ vy_quota_check_limit(struct vy_quota *q)
 static void
 vy_quota_signal(struct vy_quota *q)
 {
-	if (!rlist_empty(&q->wait_queue)) {
+	/*
+	 * Wake up a consumer that has waited most no matter
+	 * whether it's high or low priority. This assures that
+	 * high priority consumers don't uncontrollably throttle
+	 * low priority ones.
+	 */
+	struct vy_quota_wait_node *oldest = NULL;
+	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
+		struct rlist *wq = &q->wait_queue[i];
+		if (rlist_empty(wq))
+			continue;
+
 		struct vy_quota_wait_node *n;
-		n = rlist_first_entry(&q->wait_queue,
-				      struct vy_quota_wait_node, in_wait_queue);
+		n = rlist_first_entry(wq, struct vy_quota_wait_node,
+				      in_wait_queue);
 		/*
 		 * No need in waking up a consumer if it will have
 		 * to go back to sleep immediately.
 		 */
-		if (vy_quota_may_use(q, n->size))
-			fiber_wakeup(n->fiber);
+		if (!vy_quota_may_use(q, i, n->size))
+			continue;
+
+		if (oldest == NULL || oldest->timestamp > n->timestamp)
+			oldest = n;
 	}
+	if (oldest != NULL)
+		fiber_wakeup(oldest->fiber);
 }
 
 static void
@@ -127,7 +158,8 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 
 	struct vy_quota *q = timer->data;
 
-	vy_rate_limit_refill(&q->rate_limit, VY_QUOTA_TIMER_PERIOD);
+	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++)
+		vy_rate_limit_refill(&q->rate_limit[i], VY_QUOTA_TIMER_PERIOD);
 	vy_quota_signal(q);
 }
 
@@ -140,8 +172,11 @@ vy_quota_create(struct vy_quota *q, size_t limit,
 	q->used = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
-	rlist_create(&q->wait_queue);
-	vy_rate_limit_create(&q->rate_limit);
+	q->wait_timestamp = 0;
+	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
+		rlist_create(&q->wait_queue[i]);
+		vy_rate_limit_create(&q->rate_limit[i]);
+	}
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0, VY_QUOTA_TIMER_PERIOD);
 	q->timer.data = q;
 }
@@ -170,15 +205,17 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate)
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+			size_t rate)
 {
-	vy_rate_limit_set(&q->rate_limit, rate);
+	vy_rate_limit_set(&q->rate_limit[prio], rate);
 }
 
 void
-vy_quota_force_use(struct vy_quota *q, size_t size)
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		   size_t size)
 {
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, prio, size);
 	vy_quota_check_limit(q);
 }
 
@@ -195,10 +232,11 @@ vy_quota_release(struct vy_quota *q, size_t size)
 }
 
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout)
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+	     size_t size, double timeout)
 {
-	if (vy_quota_may_use(q, size)) {
-		vy_quota_do_use(q, size);
+	if (vy_quota_may_use(q, prio, size)) {
+		vy_quota_do_use(q, prio, size);
 		return 0;
 	}
 
@@ -218,8 +256,9 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 	struct vy_quota_wait_node wait_node = {
 		.fiber = fiber(),
 		.size = size,
+		.timestamp = ++q->wait_timestamp,
 	};
-	rlist_add_tail_entry(&q->wait_queue, &wait_node, in_wait_queue);
+	rlist_add_tail_entry(&q->wait_queue[prio], &wait_node, in_wait_queue);
 
 	do {
 		/*
@@ -239,7 +278,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 			diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
 			return -1;
 		}
-	} while (!vy_quota_may_use(q, size));
+	} while (!vy_quota_may_use(q, prio, size));
 
 	rlist_del_entry(&wait_node, in_wait_queue);
 
@@ -249,7 +288,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 			 "for too long: %.3f sec", size, wait_time);
 	}
 
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, prio, size);
 	/*
 	 * Blocked consumers are awaken one by one to preserve
 	 * the order they were put to sleep. It's a responsibility
@@ -261,14 +300,15 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 }
 
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		size_t reserved, size_t used)
 {
 	if (reserved > used) {
-		vy_quota_do_unuse(q, reserved - used);
+		vy_quota_do_unuse(q, prio, reserved - used);
 		vy_quota_signal(q);
 	}
 	if (reserved < used) {
-		vy_quota_do_use(q, used - reserved);
+		vy_quota_do_use(q, prio, used - reserved);
 		vy_quota_check_limit(q);
 	}
 }
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 79755e89..f1c3fb90 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -110,6 +110,43 @@ vy_rate_limit_refill(struct vy_rate_limit *rl, double time)
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
+/**
+ * Quota consumer priority. Determines how a consumer will be
+ * rate limited. See also vy_quota::rate_limit.
+ */
+enum vy_quota_consumer_prio {
+	/**
+	 * Transaction processor priority.
+	 *
+	 * Transaction throttling pursues two goals. First, it is
+	 * capping memory consumption rate so that the hard memory
+	 * limit will not be hit before memory dump has completed
+	 * (memory-based throttling). Second, we must make sure
+	 * that compaction jobs keep up with dumps to keep the read
+	 * amplification within bounds (disk-based throttling).
+	 * Transactions ought to respect them both.
+	 */
+	VY_QUOTA_CONSUMER_TX = 0,
+	/**
+	 * Compaction job priority.
+	 *
+	 * Compaction jobs may need some quota too, because they
+	 * may generate deferred DELETEs for secondary indexes.
+	 * Apparently, we must not impose the rate limit that
+	 * is supposed to speed up compaction on them, however
+	 * they still have to respect memory-based throttling to
+	 * avoid long stalls.
+	 */
+	VY_QUOTA_CONSUMER_COMPACTION = 1,
+	/**
+	 * A convenience shortcut for setting the rate limit for
+	 * all kinds of consumers.
+	 */
+	VY_QUOTA_CONSUMER_ALL = VY_QUOTA_CONSUMER_COMPACTION,
+
+	vy_quota_consumer_prio_MAX,
+};
+
 struct vy_quota_wait_node {
 	/** Link in vy_quota::wait_queue. */
 	struct rlist in_wait_queue;
@@ -117,6 +154,11 @@ struct vy_quota_wait_node {
 	struct fiber *fiber;
 	/** Amount of requested memory. */
 	size_t size;
+	/**
+	 * Timestamp assigned to this fiber when it was put to
+	 * sleep, see vy_quota::wait_timestamp for more details.
+	 */
+	int64_t timestamp;
 };
 
 /**
@@ -144,13 +186,27 @@ struct vy_quota {
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
 	/**
-	 * Queue of consumers waiting for quota, linked by
-	 * vy_quota_wait_node::state. Newcomers are added
-	 * to the tail.
+	 * Monotonically growing timestamp assigned to consumers
+	 * waiting for quota. It is used for balancing wakeups
+	 * among wait queues: if two fibers from different wait
+	 * queues may proceed, the one with the lowest timestamp
+	 * will be picked.
+	 *
+	 * See also vy_quota_wait_node::timestamp.
+	 */
+	int64_t wait_timestamp;
+	/**
+	 * Queue of consumers waiting for quota, one per each
+	 * consumer priority, linked by vy_quota_wait_node::state.
+	 * Newcomers are added to the tail.
+	 */
+	struct rlist wait_queue[vy_quota_consumer_prio_MAX];
+	/**
+	 * Rate limit state, one per each consumer priority.
+	 * A rate limit is enforced if and only if the consumer
+	 * priority is less than or equal to its index.
 	 */
-	struct rlist wait_queue;
-	/** Rate limit state. */
-	struct vy_rate_limit rate_limit;
+	struct vy_rate_limit rate_limit[vy_quota_consumer_prio_MAX];
 	/**
 	 * Periodic timer that is used for refilling the rate
 	 * limit value.
@@ -188,18 +244,20 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
- * Set the max rate at which quota may be consumed,
- * in bytes per second.
+ * Set the rate limit for consumers with priority less than or
+ * equal to @prio, in bytes per second.
  */
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate);
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+			size_t rate);
 
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
 void
-vy_quota_force_use(struct vy_quota *q, size_t size);
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		   size_t size);
 
 /**
  * Release @size bytes of memory.
@@ -242,7 +300,8 @@ vy_quota_release(struct vy_quota *q, size_t size);
  * account while estimating the size of a memory allocation.
  */
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout);
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+	     size_t size, double timeout);
 
 /**
  * Adjust quota after allocating memory.
@@ -253,15 +312,16 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout);
  * See also vy_quota_use().
  */
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used);
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_prio prio,
+		size_t reserved, size_t used);
 
 /**
  * Block the caller until the quota is not exceeded.
  */
 static inline void
-vy_quota_wait(struct vy_quota *q)
+vy_quota_wait(struct vy_quota *q, enum vy_quota_consumer_prio prio)
 {
-	vy_quota_use(q, 0, TIMEOUT_INFINITY);
+	vy_quota_use(q, prio, 0, TIMEOUT_INFINITY);
 }
 
 #if defined(__cplusplus)
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 1e106fc4..da35cc74 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -93,7 +93,7 @@ vy_regulator_trigger_dump(struct vy_regulator *regulator)
 	size_t mem_used = quota->used;
 	size_t max_write_rate = mem_left * regulator->dump_bw / (mem_used + 1);
 	max_write_rate = MIN(max_write_rate, regulator->dump_bw);
-	vy_quota_set_rate_limit(quota, max_write_rate);
+	vy_quota_set_rate_limit(quota, VY_QUOTA_CONSUMER_ALL, max_write_rate);
 }
 
 static void
@@ -190,7 +190,8 @@ void
 vy_regulator_start(struct vy_regulator *regulator)
 {
 	ev_timer_start(loop(), &regulator->timer);
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_CONSUMER_ALL,
+				regulator->dump_bw);
 }
 
 void
@@ -253,7 +254,8 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 	 * limit to the dump bandwidth rather than disabling it
 	 * completely.
 	 */
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_CONSUMER_ALL,
+				regulator->dump_bw);
 }
 
 void
@@ -261,5 +263,6 @@ vy_regulator_reset_dump_bw(struct vy_regulator *regulator, size_t max)
 {
 	histogram_reset(regulator->dump_bw_hist);
 	regulator->dump_bw = MIN(VY_DUMP_BW_DEFAULT, max);
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bw);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_CONSUMER_ALL,
+				regulator->dump_bw);
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup
  2018-09-28 17:39 ` [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup Vladimir Davydov
@ 2018-09-29  4:37   ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  4:37 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> There are three places where we start the scheduler fiber and enable the
> configured memory quota limit: local bootstrap, remote bootstrap, and
> local recovery completion. I'm planning to add more code there so let's
> factor it out now.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
@ 2018-09-29  5:00   ` Konstantin Osipov
  2018-09-29 11:36     ` Vladimir Davydov
  2018-10-01 10:31   ` Vladimir Davydov
  2018-10-02 18:16   ` [tarantool-patches] " Konstantin Osipov
  2 siblings, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:00 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> Turned out that throttling isn't going to be as simple as maintaining
> the write rate below the estimated dump bandwidth, because we also need
> to take into account whether compaction keeps up with dumps. Tracking
> compaction progress isn't a trivial task and mixing it in a module
> responsible for resource limiting, which vy_quota is, doesn't seem to be
> a good idea. Let's factor out the related code into a separate module
> and call it vy_regulator. Currently, the new module only keeps track of
> the write rate and the dump bandwidth and sets the memory watermark
> accordingly, but soon we will extend it to configure throttling as well.
> 
> Since write rate and dump bandwidth are now a part of the regulator
> subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
> 'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
> because memory usage is reported under 'memory.level0' while the limit
> can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
> 'write_rate', because the latter seems to be a more appropriate name.
> 
> Needed for #1862

It's all good except the name. The name is a bit more academic,
generic and long than we both prefer. Other names to consider:
vy_pressure_valve, vy_back_pressure, or simply vy_valve, vy_load_valve,
vy_load_control, vy_rate_control, vy_bandwidth, (I think if we add
more names to the table we will be able to find a good one).

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 03/11] vinyl: minor refactoring of quota methods
  2018-09-28 17:40 ` [PATCH v2 03/11] vinyl: minor refactoring of quota methods Vladimir Davydov
@ 2018-09-29  5:01   ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:01 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> The refactoring is targeted at facilitating introduction of rate
> limiting within the quota class. It moves code blocks around, factors
> out some blocks in functions, and improves comments. No functional
> changes.
> 
> Needed for #1862

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 04/11] vinyl: move transaction size sanity check to quota
  2018-09-28 17:40 ` [PATCH v2 04/11] vinyl: move transaction size sanity check to quota Vladimir Davydov
@ 2018-09-29  5:02   ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:02 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> There's a sanity check in vinyl_engine_prepare, which checks if the
> transaction size is less than the configured limit and fails without
> waiting for quota if it isn't. Let's move this check to vy_quota_use,
> because it's really a business of the quota object. This implies that
> vy_quota_use has to set diag to differentiate this error from timeout.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond
  2018-09-28 17:40 ` [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond Vladimir Davydov
@ 2018-09-29  5:05   ` Konstantin Osipov
  2018-09-29 11:44     ` Vladimir Davydov
  0 siblings, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:05 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> Using fiber_cond as a wait queue isn't very convenient, because:
>  - It doesn't allow us to put a spuriously woken up fiber back to the
>    same position in the queue where it was, thus violating fairness.
>  - It doesn't allow us to check whether we actually need to wake up a
>    fiber or it will have to go back to sleep anyway as it needs more
>    memory than currently available.
>  - It doesn't allow us to implement a multi-queue approach where fibers
>    that have different priorities are put to different queues.
> 
> So let's rewrite the wait queue with plain rlist and fiber_yield.

Maybe you could factor out this change into an abstraction in fiber.[hc]?
Or fix fiber_cond to be fair?

>  
> @@ -92,7 +91,17 @@ vy_quota_check_limit(struct vy_quota *q)
>  static void
>  vy_quota_signal(struct vy_quota *q)
>  {
> -	fiber_cond_signal(&q->cond);
> +	if (!rlist_empty(&q->wait_queue)) {
> +		struct vy_quota_wait_node *n;
> +		n = rlist_first_entry(&q->wait_queue,
> +				      struct vy_quota_wait_node, in_wait_queue);
> +		/*
> +		 * No need in waking up a consumer if it will have
> +		 * to go back to sleep immediately.
> +		 */
> +		if (vy_quota_may_use(q, n->size))
> +			fiber_wakeup(n->fiber);
> +	}
>  }

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly
  2018-09-28 17:40 ` [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly Vladimir Davydov
@ 2018-09-29  5:06   ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:06 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> Currently, we create a quota object with the limit maximized, and only
> set the configured limit when local recovery is complete, so as to make
> sure that no dump is triggered during recovery. As a result, we have to
> store the configured limit in vy_env::memory, which looks ugly, because
> this member is never used afterwards. Let's introduce a new method
> vy_quota_enable to enable quota so that we can set the limit right on
> quota object construction. This implies that we add a boolean flag to
> vy_quota and only check the limit if it is set.
> 
> There's another reason to add such a method. Soon we will implement
> quota consumption rate limiting. Rate limiting requires a periodic timer
> that would replenish quota. It only makes sense to start such a timer
> upon recovery completion, which again leads us to an explicit method for
> enabling quota.
> 
> vy_env::memory will be removed by the following patch along with a few
> other pointless members of vy_env.
> ac

OK to push. 
> Needed for #1862

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads
  2018-09-28 17:40 ` [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads Vladimir Davydov
@ 2018-09-29  5:06   ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-09-29  5:06 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> They are only used to set corresponding members of vy_quota, vy_run_env,
> and vy_scheduler when vy_env is created. No point in keeping them around
> all the time.

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-09-29  5:00   ` [tarantool-patches] " Konstantin Osipov
@ 2018-09-29 11:36     ` Vladimir Davydov
       [not found]       ` <20180929114308.GA19162@chai>
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-29 11:36 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Sep 29, 2018 at 08:00:46AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> > Turned out that throttling isn't going to be as simple as maintaining
> > the write rate below the estimated dump bandwidth, because we also need
> > to take into account whether compaction keeps up with dumps. Tracking
> > compaction progress isn't a trivial task and mixing it in a module
> > responsible for resource limiting, which vy_quota is, doesn't seem to be
> > a good idea. Let's factor out the related code into a separate module
> > and call it vy_regulator. Currently, the new module only keeps track of
> > the write rate and the dump bandwidth and sets the memory watermark
> > accordingly, but soon we will extend it to configure throttling as well.
> > 
> > Since write rate and dump bandwidth are now a part of the regulator
> > subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
> > 'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
> > because memory usage is reported under 'memory.level0' while the limit
> > can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
> > 'write_rate', because the latter seems to be a more appropriate name.
> > 
> > Needed for #1862
> 
> It's all good except the name. The name is a bit more academic,
> generic and long than we both prefer.

Not longer, more generic or academic than 'scheduler'. Moreover, the two
names match perfectly IMO.

Why would you say it's academic BTW? 'Regulator' is a common engineering
term.

> Other names to consider:
> vy_pressure_valve, vy_back_pressure, or simply vy_valve, vy_load_valve,
> vy_load_control, vy_rate_control, vy_bandwidth, (I think if we add
> more names to the table we will be able to find a good one).

vy_valve - interesting, but I've never heard the term 'valve' used for
anything but air/fluid pressure control. Besides, vy_valve_no_memory
(the event triggered when quota hits the limit) sounds ridiculous.
Anyway, the term 'regulator' is sometimes used instead of 'valve' while
at the same time it may also denote other things that are supposed to
control something (there are things called 'linear voltage regulator'
and 'current regulator').

vy_load_valve, vy_pressure_valve - ditto plus too long.

vy_back_pressure - it's not pressure, actually, it's an object with
complex logic, one of tasks of which is to exert pressure.

vy_bandwidth - bandwidth->dump_bandwidth, oh my...

vy_rate_control, vy_load_control - these two are more or less acceptable
by me, but too long - I'd prefer a one-word name, because it's easier to
name a variable without causing confusing then (think of possible var
names: ctl, lctl, ldctl, load_control sound either weird or are just too
long). vy_control or vy_controller may be? But 'controller' is primarily
associated with game consoles nowadays, while 'control' reminds me of a
remote control I use to switch channels.

I also considered 'monitor', but since this thing not just monitors the
load, but also shapes it, I ruled it out.

That said, I'm all for vy_regulator, and I don't share your concern.
I suggest you take a look at what Google gives you when you search for
valve, regulator, controller pictures:

Valve:
https://www.google.ru/search?newwindow=1&biw=1366&bih=671&tbm=isch&sa=1&ei=dWGvW_2hKsilsAHqlYqQCA&q=valve&oq=valve&gs_l=img.3...0.0.0.67630.0.0.0.0.0.0.0.0..0.0....0...1c..64.img..0.0.0....0.3Zf0u1mmIk0

Controller:
https://www.google.ru/search?newwindow=1&biw=1366&bih=671&tbm=isch&sa=1&ei=zGGvW-rzN8mYsAHDvIXQAQ&q=controller&oq=controller&gs_l=img.3...0.0.0.2506.0.0.0.0.0.0.0.0..0.0....0...1c..64.img..0.0.0....0.MN_IjxK65_E

Control:
https://www.google.ru/search?newwindow=1&biw=1366&bih=671&tbm=isch&sa=1&ei=2WGvW8nsC8bVsAHv35OwBA&q=control&oq=control&gs_l=img.3...5597.6148.0.6310.0.0.0.0.0.0.0.0..0.0....0...1c.1.64.img..0.0.0....0.lmYqRhCCRmo

Regulator:
https://www.google.ru/search?newwindow=1&biw=1366&bih=671&tbm=isch&sa=1&ei=4GGvW6q9FMaisAHbx7CQAw&q=regulator&oq=regulator&gs_l=img.3...31737.32507.0.32629.0.0.0.0.0.0.0.0..0.0....0...1c.1.64.img..0.0.0....0.kgus6kvKzJc

'Regulator' looks exactly like what we want to implement to me :-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond
  2018-09-29  5:05   ` [tarantool-patches] " Konstantin Osipov
@ 2018-09-29 11:44     ` Vladimir Davydov
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-09-29 11:44 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Sep 29, 2018 at 08:05:31AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> > Using fiber_cond as a wait queue isn't very convenient, because:
> >  - It doesn't allow us to put a spuriously woken up fiber back to the
> >    same position in the queue where it was, thus violating fairness.
> >  - It doesn't allow us to check whether we actually need to wake up a
> >    fiber or it will have to go back to sleep anyway as it needs more
> >    memory than currently available.
> >  - It doesn't allow us to implement a multi-queue approach where fibers
> >    that have different priorities are put to different queues.
> > 
> > So let's rewrite the wait queue with plain rlist and fiber_yield.
> 
> Maybe you could factor out this change into an abstraction in fiber.[hc]?
> Or fix fiber_cond to be fair?

I don't see any point in abstracting such a tiny and simple thing as a
wait queue, at least not right now, when we need it only in vy_quota.
Note, what we need in vy_quota isn't as simple as waking up fibers in
the order they were put to sleep, because we don't want to wake up a
fiber if it will have to go back to sleep immediately. Besides, take a
look at the last patch in the series where I turn the queue into a
multi-queue in order to introduce quota consumer priorities. I don't
rule out that it may be possible to make this code abstract, but IMO it
isn't worth the effort right now.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
       [not found]       ` <20180929114308.GA19162@chai>
@ 2018-10-01 10:27         ` Vladimir Davydov
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-01 10:27 UTC (permalink / raw)
  To: tarantool-patches; +Cc: Konstantin Osipov

[ Kostja accidentally dropped the mailing list from Cc.
  Sending for the record. ]

On Sat, Sep 29, 2018 at 02:43:08PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/29 14:39]:
> > On Sat, Sep 29, 2018 at 08:00:46AM +0300, Konstantin Osipov wrote:
> > > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> > > > Turned out that throttling isn't going to be as simple as maintaining
> > > > the write rate below the estimated dump bandwidth, because we also need
> > > > to take into account whether compaction keeps up with dumps. Tracking
> > > > compaction progress isn't a trivial task and mixing it in a module
> > > > responsible for resource limiting, which vy_quota is, doesn't seem to be
> > > > a good idea. Let's factor out the related code into a separate module
> > > > and call it vy_regulator. Currently, the new module only keeps track of
> > > > the write rate and the dump bandwidth and sets the memory watermark
> > > > accordingly, but soon we will extend it to configure throttling as well.
> > > > 
> > > > Since write rate and dump bandwidth are now a part of the regulator
> > > > subsystem, this patch renames 'quota' entry of box.stat.vinyl() to
> > > > 'regulator'. It also removes 'quota.usage' and 'quota.limit' altogether,
> > > > because memory usage is reported under 'memory.level0' while the limit
> > > > can be read from box.cfg.vinyl_memory, and renames 'use_rate' to
> > > > 'write_rate', because the latter seems to be a more appropriate name.
> > > > 
> > > > Needed for #1862
> 
> 
> > Regulator:
> > https://www.google.ru/search?newwindow=1&biw=1366&bih=671&tbm=isch&sa=1&ei=4GGvW6q9FMaisAHbx7CQAw&q=regulator&oq=regulator&gs_l=img.3...31737.32507.0.32629.0.0.0.0.0.0.0.0..0.0....0...1c.1.64.img..0.0.0....0.kgus6kvKzJc
> > 
> > 'Regulator' looks exactly like what we want to implement to me :-)
> 
> OK :)
> -- 
> Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
> http://tarantool.io - www.twitter.com/kostja_osipov
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
  2018-09-29  5:00   ` [tarantool-patches] " Konstantin Osipov
@ 2018-10-01 10:31   ` Vladimir Davydov
  2018-10-02 18:16   ` [tarantool-patches] " Konstantin Osipov
  2 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-01 10:31 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

On Fri, Sep 28, 2018 at 08:40:00PM +0300, Vladimir Davydov wrote:
> +static void
> +vy_regulator_update_write_rate(struct vy_regulator *regulator)
> +{
> +	size_t used_curr = regulator->quota->used;
> +	size_t used_last = regulator->used_last;
> +
> +	/*
> +	 * Memory can be dumped between two subsequent timer
> +	 * callback invocations, in which case memory usage
> +	 * will decrease. Ignore such observations - it's not
> +	 * a big deal, because dump is a rare event.
> +	 */
> +	if (used_curr < used_last)
> +		return;

There's a bug here: we must update regulator->used_last in this case,
otherwise we may stop updating the rate at all if used_last happens to
be close to the limit. The fix is below. I updated the commit on the
branch.

> +
> +	size_t rate_avg = regulator->write_rate;
> +	size_t rate_curr = (used_curr - used_last) / VY_REGULATOR_TIMER_PERIOD;
> +
> +	double weight = 1 - exp(-VY_REGULATOR_TIMER_PERIOD /
> +				VY_WRITE_RATE_AVG_WIN);
> +	rate_avg = (1 - weight) * rate_avg + weight * rate_curr;
> +
> +	regulator->write_rate = rate_avg;
> +	regulator->used_last = used_curr;
> +}

diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index da35cc74..0d633abe 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -108,8 +108,10 @@ vy_regulator_update_write_rate(struct vy_regulator *regulator)
 	 * will decrease. Ignore such observations - it's not
 	 * a big deal, because dump is a rare event.
 	 */
-	if (used_curr < used_last)
+	if (used_curr < used_last) {
+		regulator->used_last = used_curr;
 		return;
+	}
 
 	size_t rate_avg = regulator->write_rate;
 	size_t rate_curr = (used_curr - used_last) / VY_REGULATOR_TIMER_PERIOD;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
  2018-09-29  5:00   ` [tarantool-patches] " Konstantin Osipov
  2018-10-01 10:31   ` Vladimir Davydov
@ 2018-10-02 18:16   ` Konstantin Osipov
  2018-10-03  8:49     ` Vladimir Davydov
  2 siblings, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-10-02 18:16 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> Turned out that throttling isn't going to be as simple as maintaining
> the write rate below the estimated dump bandwidth, because we also need
> to take into account whether compaction keeps up with dumps. Tracking
> compaction progress isn't a trivial task and mixing it in a module
> responsible for resource limiting, which vy_quota is, doesn't seem to be
> a good idea. Let's factor out the related code into a separate module
> and call it vy_regulator. Currently, the new module only keeps track of
> the write rate and the dump bandwidth and sets the memory watermark
> accordingly, but soon we will extend it to configure throttling as well.

OK, if we keep regulator in place there is a few comments about
the contents of the patch (i.e. the renames).

> -	struct vy_quota *q = &env->quota;
> +	struct vy_regulator *r = &env->regulator;
>  
> -	info_table_begin(h, "quota");
> -	info_append_int(h, "used", q->used);
> -	info_append_int(h, "limit", q->limit);
> -	info_append_int(h, "watermark", q->watermark);
> -	info_append_int(h, "use_rate", q->use_rate);
> -	info_append_int(h, "dump_bandwidth", q->dump_bw);
> -	info_table_end(h); /* quota */
> +	info_table_begin(h, "regulator");
> +	info_append_int(h, "watermark", r->watermark);
> +	info_append_int(h, "write_rate", r->write_rate);
> +	info_append_int(h, "dump_bandwidth", r->dump_bw);

I think "watermark" should be now "dump_watermark", sicne simply
"watermark" is not specific enough. Is it regulator watermark? A
regulator watermark doesn't make any sense to me.

The rename should take place across the entire code base imho.

>  	assert(mem_used_after >= mem_used_before);
>  	vy_quota_adjust(&env->quota, tx->write_size,
>  			mem_used_after - mem_used_before);
> +	vy_regulator_check_watermark(&env->regulator);

Like check_dump_watermark.

>  vy_env_quota_exceeded_cb(struct vy_quota *quota)
>  {
>  	struct vy_env *env = container_of(quota, struct vy_env, quota);
> +	vy_regulator_no_memory(&env->regulator);

IMHO vy_regulator_quota_exceeded is a better name.
-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 02/11] vinyl: factor load regulator out of quota
  2018-10-02 18:16   ` [tarantool-patches] " Konstantin Osipov
@ 2018-10-03  8:49     ` Vladimir Davydov
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-03  8:49 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Oct 02, 2018 at 09:16:56PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> > Turned out that throttling isn't going to be as simple as maintaining
> > the write rate below the estimated dump bandwidth, because we also need
> > to take into account whether compaction keeps up with dumps. Tracking
> > compaction progress isn't a trivial task and mixing it in a module
> > responsible for resource limiting, which vy_quota is, doesn't seem to be
> > a good idea. Let's factor out the related code into a separate module
> > and call it vy_regulator. Currently, the new module only keeps track of
> > the write rate and the dump bandwidth and sets the memory watermark
> > accordingly, but soon we will extend it to configure throttling as well.
> 
> OK, if we keep regulator in place there is a few comments about
> the contents of the patch (i.e. the renames).
> 
> > -	struct vy_quota *q = &env->quota;
> > +	struct vy_regulator *r = &env->regulator;
> >  
> > -	info_table_begin(h, "quota");
> > -	info_append_int(h, "used", q->used);
> > -	info_append_int(h, "limit", q->limit);
> > -	info_append_int(h, "watermark", q->watermark);
> > -	info_append_int(h, "use_rate", q->use_rate);
> > -	info_append_int(h, "dump_bandwidth", q->dump_bw);
> > -	info_table_end(h); /* quota */
> > +	info_table_begin(h, "regulator");
> > +	info_append_int(h, "watermark", r->watermark);
> > +	info_append_int(h, "write_rate", r->write_rate);
> > +	info_append_int(h, "dump_bandwidth", r->dump_bw);
> 
> I think "watermark" should be now "dump_watermark", sicne simply
> "watermark" is not specific enough. Is it regulator watermark? A
> regulator watermark doesn't make any sense to me.
> 
> The rename should take place across the entire code base imho.
> 
> >  	assert(mem_used_after >= mem_used_before);
> >  	vy_quota_adjust(&env->quota, tx->write_size,
> >  			mem_used_after - mem_used_before);
> > +	vy_regulator_check_watermark(&env->regulator);
> 
> Like check_dump_watermark.
> 
> >  vy_env_quota_exceeded_cb(struct vy_quota *quota)
> >  {
> >  	struct vy_env *env = container_of(quota, struct vy_env, quota);
> > +	vy_regulator_no_memory(&env->regulator);
> 
> IMHO vy_regulator_quota_exceeded is a better name.

Makes sense to me. I also renamed dump_bw to dump_bandwidth, because
we typically don't use abbreviations in member names, and used_last to
quota_used_last and pushed it to 1.10.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 00/11] vinyl: transaction throttling infrastructure
  2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
                   ` (10 preceding siblings ...)
  2018-09-28 17:40 ` [PATCH v2 11/11] vinyl: introduce quota consumer priorities Vladimir Davydov
@ 2018-10-03  9:06 ` Vladimir Davydov
  11 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-03  9:06 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Patches 1-7 have been approved by Kostja and pushed to 1.10.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-09-28 17:40 ` [PATCH v2 11/11] vinyl: introduce quota consumer priorities Vladimir Davydov
@ 2018-10-06 13:24   ` Konstantin Osipov
  2018-10-08 11:10     ` Vladimir Davydov
  0 siblings, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-10-06 13:24 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:

Since we agreed to go with the implemented approach, please see a
few implementation comments below.

> Currently, we only limit quota consumption rate so that writers won't
> hit the hard limit before memory dump is complete. However, it isn't
> enough, because we also need to consider compaction: if it doesn't keep
> up with dumps, read and space amplification will grow uncontrollably.
> 
> The problem is compaction may be a quota consumer by itself, as it may
> generate deferred DELETE statements for secondary indexes. We can't
> ignore quota completely there, because if we do, we may hit the memory
> limit and stall all writers, which is unacceptable, but we do want to
> ignore the rate limit imposed to make sure that compaction keeps up with
> dumps, otherwise compaction won't benefit from such a throttling.
> 
> To tackle this problem, this patch introduces quota consumer priorities.
> Now a quota object maintains one rate limit and one wait queue per each
> transaction priority. Rate limit i is used by a consumer with priority
> prio if and only if prio <= i. Whenever a consumer has to be throttled,
> it is put to sleep to the wait queue corresponding to its priority.
> When quota is replenished, we pick the oldest consumer among all that
> may be woken up. This ensures fairness.

As discusses, these are not priorities, these are resource types.
Please change the patch to pass a bit mask of resource types for
which we request the quota.

Quota refill interval should be varying - please schedule this
work in a separate patch.


>  static void
>  vy_quota_signal(struct vy_quota *q)
>  {
> -	if (!rlist_empty(&q->wait_queue)) {
> +	/*
> +	 * Wake up a consumer that has waited most no matter
> +	 * whether it's high or low priority. This assures that
> +	 * high priority consumers don't uncontrollably throttle
> +	 * low priority ones.
> +	 */
> +	struct vy_quota_wait_node *oldest = NULL;
> +	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
> +		struct rlist *wq = &q->wait_queue[i];
> +		if (rlist_empty(wq))
> +			continue;

I still do not understand why you need a second queue and
timestapms. If you need to ensure total fairness, a single queue
should be enough. 
> +
>  		struct vy_quota_wait_node *n;
> -		n = rlist_first_entry(&q->wait_queue,
> -				      struct vy_quota_wait_node, in_wait_queue);
> +		n = rlist_first_entry(wq, struct vy_quota_wait_node,
> +				      in_wait_queue);
>  		/*
>  		 * No need in waking up a consumer if it will have
>  		 * to go back to sleep immediately.
>  		 */
> -		if (vy_quota_may_use(q, n->size))
> -			fiber_wakeup(n->fiber);
> +		if (!vy_quota_may_use(q, i, n->size))
> +			continue;
> +
> +		if (oldest == NULL || oldest->timestamp > n->timestamp)
> +			oldest = n;
>  	}
> +	if (oldest != NULL)
> +		fiber_wakeup(oldest->fiber);

This is some kind of black magic to me. Imagine you have another
resource type, say, CPU? Will you add the third queue? How this
will piece of code look then?

I'm OK if you go with a single queue in which you register all
requests, regardless of whether they are for disk, memory, or both
- and satisfy all requests in vy_quota_signal(). But I don't
immediately  understand the magic with two wait queues - and
frankly still don't see any need for it. 
> +	/**
> +	 * Timestamp assigned to this fiber when it was put to
> +	 * sleep, see vy_quota::wait_timestamp for more details.
> +	 */
> +	int64_t timestamp;

If you have a single queue, you don't need this. The name is to
general in any case.

>  };
>  
>  /**
> @@ -144,13 +186,27 @@ struct vy_quota {
>  	 */
>  	vy_quota_exceeded_f quota_exceeded_cb;
>  	/**
> -	 * Queue of consumers waiting for quota, linked by
> -	 * vy_quota_wait_node::state. Newcomers are added
> -	 * to the tail.
> +	 * Monotonically growing timestamp assigned to consumers
> +	 * waiting for quota. It is used for balancing wakeups
> +	 * among wait queues: if two fibers from different wait
> +	 * queues may proceed, the one with the lowest timestamp
> +	 * will be picked.
> +	 *
> +	 * See also vy_quota_wait_node::timestamp.
> +	 */
> +	int64_t wait_timestamp;

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-10-06 13:24   ` Konstantin Osipov
@ 2018-10-08 11:10     ` Vladimir Davydov
  2018-10-09 13:25       ` Vladimir Davydov
  2018-10-11  7:02       ` Konstantin Osipov
  0 siblings, 2 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-08 11:10 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sat, Oct 06, 2018 at 04:24:05PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> 
> Since we agreed to go with the implemented approach, please see a
> few implementation comments below.
> 
> > Currently, we only limit quota consumption rate so that writers won't
> > hit the hard limit before memory dump is complete. However, it isn't
> > enough, because we also need to consider compaction: if it doesn't keep
> > up with dumps, read and space amplification will grow uncontrollably.
> > 
> > The problem is compaction may be a quota consumer by itself, as it may
> > generate deferred DELETE statements for secondary indexes. We can't
> > ignore quota completely there, because if we do, we may hit the memory
> > limit and stall all writers, which is unacceptable, but we do want to
> > ignore the rate limit imposed to make sure that compaction keeps up with
> > dumps, otherwise compaction won't benefit from such a throttling.
> > 
> > To tackle this problem, this patch introduces quota consumer priorities.
> > Now a quota object maintains one rate limit and one wait queue per each
> > transaction priority. Rate limit i is used by a consumer with priority
> > prio if and only if prio <= i. Whenever a consumer has to be throttled,
> > it is put to sleep to the wait queue corresponding to its priority.
> > When quota is replenished, we pick the oldest consumer among all that
> > may be woken up. This ensures fairness.
> 
> As discusses, these are not priorities, these are resource types.
> Please change the patch to pass a bit mask of resource types for
> which we request the quota.

Actually, there are resource types and there are consumer types.
I admit the fact that I mixed them may look confusing at the first
glance. We may introduce a seprate enum for resource types with a
mapping between them.

enum vy_quota_consumer_type {
	VY_QUOTA_CONSUMER_TX = 0,
	VY_QUOTA_CONSUMER_COMPACTION = 1,
	vy_quota_consumer_max,
};

enum vy_quota_resource_type {
	VY_QUOTA_RESOURCE_DISK = 0,
	VY_QUOTA_RESOURCE_MEMORY = 1,
	vy_quota_resource_max,
};

static unsigned vy_quota_consumer_mask[] = {
	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
				 (1 << VY_QUOTA_RESOURCE_MEMORY),
	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
};

struct vy_quota {
	...
	struct rlist wait_queue[vy_quota_consumer_max];
	struct vy_rate_limit rate_limit[vy_quota_resource_max];
};

This would make the code more bulky though. Do we really want to
complicate?

Also, quite frankly I don't quite dig the concept of 'resources' and
the corresponding constant names, because 'memory' rate limit may be
confused with memory usage, which is what vy_quota is about in the first
place.

> 
> Quota refill interval should be varying - please schedule this
> work in a separate patch.
> 
> 
> >  static void
> >  vy_quota_signal(struct vy_quota *q)
> >  {
> > -	if (!rlist_empty(&q->wait_queue)) {
> > +	/*
> > +	 * Wake up a consumer that has waited most no matter
> > +	 * whether it's high or low priority. This assures that
> > +	 * high priority consumers don't uncontrollably throttle
> > +	 * low priority ones.
> > +	 */
> > +	struct vy_quota_wait_node *oldest = NULL;
> > +	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
> > +		struct rlist *wq = &q->wait_queue[i];
> > +		if (rlist_empty(wq))
> > +			continue;
> 
> I still do not understand why you need a second queue and
> timestapms. If you need to ensure total fairness, a single queue
> should be enough. 

I need to maintain one queue per consumer type, because it's possible
that we may wake up consumers of one type, but not of the other. If we
used a single queue, it could occur that consumers that could be woken
up landed in the middle of the queue so that without scanning the queue
we wouldn't be able to find them. Scanning a queue looks ugly.

Think of the multi-queue design like this: there's one queue per each
consumer type. When a resource is replenished, we check only those
queues that store consumers that may proceed and choose the one that has
waited most.

> > +
> >  		struct vy_quota_wait_node *n;
> > -		n = rlist_first_entry(&q->wait_queue,
> > -				      struct vy_quota_wait_node, in_wait_queue);
> > +		n = rlist_first_entry(wq, struct vy_quota_wait_node,
> > +				      in_wait_queue);
> >  		/*
> >  		 * No need in waking up a consumer if it will have
> >  		 * to go back to sleep immediately.
> >  		 */
> > -		if (vy_quota_may_use(q, n->size))
> > -			fiber_wakeup(n->fiber);
> > +		if (!vy_quota_may_use(q, i, n->size))
> > +			continue;
> > +
> > +		if (oldest == NULL || oldest->timestamp > n->timestamp)
> > +			oldest = n;
> >  	}
> > +	if (oldest != NULL)
> > +		fiber_wakeup(oldest->fiber);
> 
> This is some kind of black magic to me. Imagine you have another
> resource type, say, CPU? Will you add the third queue? How this
> will piece of code look then?

Exactly the same. The only thing that would change is vy_quota_may_use:
we would have to add the new resource type there then.

We would have to add a new queue only if a new consumer type had to be
introduced, e.g. a consumer that only consumes CPU and nothing else, but
even then this code would stay the same.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-10-08 11:10     ` Vladimir Davydov
@ 2018-10-09 13:25       ` Vladimir Davydov
  2018-10-11  7:02       ` Konstantin Osipov
  1 sibling, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-09 13:25 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Oct 08, 2018 at 02:10:10PM +0300, Vladimir Davydov wrote:
> On Sat, Oct 06, 2018 at 04:24:05PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/09/28 21:00]:
> > 
> > Since we agreed to go with the implemented approach, please see a
> > few implementation comments below.
> > 
> > > Currently, we only limit quota consumption rate so that writers won't
> > > hit the hard limit before memory dump is complete. However, it isn't
> > > enough, because we also need to consider compaction: if it doesn't keep
> > > up with dumps, read and space amplification will grow uncontrollably.
> > > 
> > > The problem is compaction may be a quota consumer by itself, as it may
> > > generate deferred DELETE statements for secondary indexes. We can't
> > > ignore quota completely there, because if we do, we may hit the memory
> > > limit and stall all writers, which is unacceptable, but we do want to
> > > ignore the rate limit imposed to make sure that compaction keeps up with
> > > dumps, otherwise compaction won't benefit from such a throttling.
> > > 
> > > To tackle this problem, this patch introduces quota consumer priorities.
> > > Now a quota object maintains one rate limit and one wait queue per each
> > > transaction priority. Rate limit i is used by a consumer with priority
> > > prio if and only if prio <= i. Whenever a consumer has to be throttled,
> > > it is put to sleep to the wait queue corresponding to its priority.
> > > When quota is replenished, we pick the oldest consumer among all that
> > > may be woken up. This ensures fairness.
> > 
> > As discusses, these are not priorities, these are resource types.
> > Please change the patch to pass a bit mask of resource types for
> > which we request the quota.
> 
> Actually, there are resource types and there are consumer types.
> I admit the fact that I mixed them may look confusing at the first
> glance. We may introduce a seprate enum for resource types with a
> mapping between them.

I replaced the concept of consumer priority with resources and consumer
types. Hope it clears up your questions. The patch goes below. I also
updated the branch:

https://github.com/tarantool/tarantool/commits/dv/gh-1862-vy-throttling

From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Tue, 9 Oct 2018 15:59:03 +0300
Subject: [PATCH] vinyl: introduce quota consumer types

Currently, we only limit quota consumption rate so that writers won't
hit the hard limit before memory dump is complete. However, it isn't
enough, because we also need to consider compaction: if it doesn't keep
up with dumps, read and space amplification will grow uncontrollably.

The problem is compaction may be a quota consumer by itself, as it may
generate deferred DELETE statements for secondary indexes. We can't
ignore quota completely there, because if we do, we may hit the memory
limit and stall all writers, which is unacceptable, but we do want to
ignore the rate limit imposed to make sure that compaction keeps up with
dumps, otherwise compaction won't benefit from such a throttling.

To tackle this problem, this patch introduces the concept of quota
consumer types and resources. Now vy_quota maintains one rate limit per
each resource and one wait queue per each consumer type. There are two
types of consumers, compaction jobs and usual transactions, and there
are two resources managed by vy_quota, disk and memory. Memory-based
rate limit ensures that transactions won't hit the hard memory limit and
stall before memory dump is complete. It is respected by all types of
consumers. Disk-based rate limit is supposed to be set when compaction
doesn't keep up with dumps. It is only used by usual transactions and
ignored by compaction jobs.

Since now there are two wait queues, we need to balance wakeups between
them in case consumers in both queues are ready to proceed. To ensure
there's no starvation, we maintain a monotonically growing counter and
assign its value to each consumer put to slip (ticket). We use it to
wake up the consumer that has waited most when both queues are ready.

Note, the patch doesn't implement the logic of disk-based throttling in
the regulator module. It is still left for future work.

Needed for #3721

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d526547d..e43a85fb 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2336,7 +2336,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	 * the transaction to be sent to read view or aborted, we call
 	 * it before checking for conflicts.
 	 */
-	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 tx->write_size, timeout) != 0)
 		return -1;
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -2345,8 +2346,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_adjust(&env->quota, tx->write_size,
-			mem_used_after - mem_used_before);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX,
+			tx->write_size, mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 	return rc;
 }
@@ -2371,7 +2372,8 @@ vinyl_engine_commit(struct engine *engine, struct txn *txn)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	/* We can't abort the transaction at this point, use force. */
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 
 	txn->engine_tx = NULL;
@@ -3165,7 +3167,8 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	 * quota accounting.
 	 */
 	size_t reserved = tx->write_size;
-	if (vy_quota_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 reserved, TIMEOUT_INFINITY) != 0)
 		unreachable();
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -3184,7 +3187,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
-	vy_quota_adjust(&env->quota, reserved, used);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX, reserved, used);
 	vy_regulator_check_dump_watermark(&env->regulator);
 	return rc;
 }
@@ -3507,7 +3510,7 @@ vy_squash_process(struct vy_squash *squash)
 		 * so there's no need in invalidating the cache.
 		 */
 		vy_mem_commit_stmt(mem, region_stmt);
-		vy_quota_force_use(&env->quota,
+		vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
 				   mem_used_after - mem_used_before);
 		vy_regulator_check_dump_watermark(&env->regulator);
 	}
@@ -3983,9 +3986,10 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
 	/* Consume memory quota. Throttle if it is exceeded. */
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
-	vy_quota_wait(&env->quota);
+	vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_TX);
 	return rc;
 }
 
@@ -4111,7 +4115,8 @@ vy_build_recover(struct vy_env *env, struct vy_lsm *lsm, struct vy_lsm *pk)
 
 	mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	return rc;
 }
 
@@ -4327,7 +4332,7 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	 */
 	struct vy_env *env = vy_env(space->engine);
 	if (is_first_statement)
-		vy_quota_wait(&env->quota);
+		vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_COMPACTION);
 
 	/* Create the deferred DELETE statement. */
 	struct vy_lsm *pk = vy_lsm(space->index[0]);
@@ -4414,7 +4419,8 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	}
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_COMPACTION,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 
 	tuple_unref(delete);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index ceac4878..64741da9 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -52,18 +52,58 @@
 static const double VY_QUOTA_TIMER_PERIOD = 0.1;
 
 /**
+ * Bit mask of resources used by a particular consumer type.
+ */
+static unsigned
+vy_quota_consumer_resource_map[] = {
+	/**
+	 * Transaction throttling pursues two goals. First, it is
+	 * capping memory consumption rate so that the hard memory
+	 * limit will not be hit before memory dump has completed
+	 * (memory-based throttling). Second, we must make sure
+	 * that compaction jobs keep up with dumps to keep read and
+	 * space amplification within bounds (disk-based throttling).
+	 * Transactions ought to respect them both.
+	 */
+	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
+				 (1 << VY_QUOTA_RESOURCE_MEMORY),
+	/**
+	 * Compaction jobs may need some quota too, because they
+	 * may generate deferred DELETEs for secondary indexes.
+	 * Apparently, we must not impose the rate limit that is
+	 * supposed to speed up compaction on them (disk-based),
+	 * however they still have to respect memory-based throttling
+	 * to avoid long stalls.
+	 */
+	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
+};
+
+/**
+ * Iterate over rate limit states that are enforced for a consumer
+ * of the given type.
+ */
+#define vy_quota_consumer_for_each_rate_limit(quota, type, rl) \
+	for (struct vy_rate_limit *rl = (quota)->rate_limit; \
+	     rl - (quota)->rate_limit < vy_quota_resource_type_MAX; rl++) \
+		if (vy_quota_consumer_resource_map[type] & \
+		    (1 << (rl - (quota)->rate_limit)))
+
+/**
  * Return true if the requested amount of memory may be consumed
  * right now, false if consumers have to wait.
  */
 static inline bool
-vy_quota_may_use(struct vy_quota *q, size_t size)
+vy_quota_may_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		 size_t size)
 {
 	if (!q->is_enabled)
 		return true;
 	if (q->used + size > q->limit)
 		return false;
-	if (!vy_rate_limit_may_use(&q->rate_limit))
-		return false;
+	vy_quota_consumer_for_each_rate_limit(q, type, rl) {
+		if (!vy_rate_limit_may_use(rl))
+			return false;
+	}
 	return true;
 }
 
@@ -71,10 +111,12 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
  * Consume the given amount of memory without checking the limit.
  */
 static inline void
-vy_quota_do_use(struct vy_quota *q, size_t size)
+vy_quota_do_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t size)
 {
 	q->used += size;
-	vy_rate_limit_use(&q->rate_limit, size);
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		vy_rate_limit_use(rl, size);
 }
 
 /**
@@ -82,11 +124,13 @@ vy_quota_do_use(struct vy_quota *q, size_t size)
  * This function is an exact opposite of vy_quota_do_use().
  */
 static inline void
-vy_quota_do_unuse(struct vy_quota *q, size_t size)
+vy_quota_do_unuse(struct vy_quota *q, enum vy_quota_consumer_type type,
+		  size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	vy_rate_limit_unuse(&q->rate_limit, size);
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		vy_rate_limit_unuse(rl, size);
 }
 
 /**
@@ -106,17 +150,31 @@ vy_quota_check_limit(struct vy_quota *q)
 static void
 vy_quota_signal(struct vy_quota *q)
 {
-	if (!rlist_empty(&q->wait_queue)) {
+	/*
+	 * To prevent starvation, wake up a consumer that has
+	 * waited most irrespective of its type.
+	 */
+	struct vy_quota_wait_node *oldest = NULL;
+	for (int i = 0; i < vy_quota_consumer_type_MAX; i++) {
+		struct rlist *wq = &q->wait_queue[i];
+		if (rlist_empty(wq))
+			continue;
+
 		struct vy_quota_wait_node *n;
-		n = rlist_first_entry(&q->wait_queue,
-				      struct vy_quota_wait_node, in_wait_queue);
+		n = rlist_first_entry(wq, struct vy_quota_wait_node,
+				      in_wait_queue);
 		/*
 		 * No need in waking up a consumer if it will have
 		 * to go back to sleep immediately.
 		 */
-		if (vy_quota_may_use(q, n->size))
-			fiber_wakeup(n->fiber);
+		if (!vy_quota_may_use(q, i, n->size))
+			continue;
+
+		if (oldest == NULL || oldest->ticket > n->ticket)
+			oldest = n;
 	}
+	if (oldest != NULL)
+		fiber_wakeup(oldest->fiber);
 }
 
 static void
@@ -127,7 +185,8 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 
 	struct vy_quota *q = timer->data;
 
-	vy_rate_limit_refill(&q->rate_limit, VY_QUOTA_TIMER_PERIOD);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+		vy_rate_limit_refill(&q->rate_limit[i], VY_QUOTA_TIMER_PERIOD);
 	vy_quota_signal(q);
 }
 
@@ -140,8 +199,11 @@ vy_quota_create(struct vy_quota *q, size_t limit,
 	q->used = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
-	rlist_create(&q->wait_queue);
-	vy_rate_limit_create(&q->rate_limit);
+	q->wait_ticket = 0;
+	for (int i = 0; i < vy_quota_consumer_type_MAX; i++)
+		rlist_create(&q->wait_queue[i]);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+		vy_rate_limit_create(&q->rate_limit[i]);
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0, VY_QUOTA_TIMER_PERIOD);
 	q->timer.data = q;
 }
@@ -170,15 +232,17 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate)
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+			size_t rate)
 {
-	vy_rate_limit_set(&q->rate_limit, rate);
+	vy_rate_limit_set(&q->rate_limit[type], rate);
 }
 
 void
-vy_quota_force_use(struct vy_quota *q, size_t size)
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		   size_t size)
 {
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, type, size);
 	vy_quota_check_limit(q);
 }
 
@@ -195,10 +259,11 @@ vy_quota_release(struct vy_quota *q, size_t size)
 }
 
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout)
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+	     size_t size, double timeout)
 {
-	if (vy_quota_may_use(q, size)) {
-		vy_quota_do_use(q, size);
+	if (vy_quota_may_use(q, type, size)) {
+		vy_quota_do_use(q, type, size);
 		return 0;
 	}
 
@@ -218,8 +283,9 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 	struct vy_quota_wait_node wait_node = {
 		.fiber = fiber(),
 		.size = size,
+		.ticket = ++q->wait_ticket,
 	};
-	rlist_add_tail_entry(&q->wait_queue, &wait_node, in_wait_queue);
+	rlist_add_tail_entry(&q->wait_queue[type], &wait_node, in_wait_queue);
 
 	do {
 		/*
@@ -239,7 +305,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 			diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
 			return -1;
 		}
-	} while (!vy_quota_may_use(q, size));
+	} while (!vy_quota_may_use(q, type, size));
 
 	rlist_del_entry(&wait_node, in_wait_queue);
 
@@ -249,7 +315,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 			 "for too long: %.3f sec", size, wait_time);
 	}
 
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, type, size);
 	/*
 	 * Blocked consumers are awaken one by one to preserve
 	 * the order they were put to sleep. It's a responsibility
@@ -261,14 +327,15 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 }
 
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t reserved, size_t used)
 {
 	if (reserved > used) {
-		vy_quota_do_unuse(q, reserved - used);
+		vy_quota_do_unuse(q, type, reserved - used);
 		vy_quota_signal(q);
 	}
 	if (reserved < used) {
-		vy_quota_do_use(q, used - reserved);
+		vy_quota_do_use(q, type, used - reserved);
 		vy_quota_check_limit(q);
 	}
 }
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 79755e89..d90922b2 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -110,6 +110,50 @@ vy_rate_limit_refill(struct vy_rate_limit *rl, double time)
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
+/**
+ * Apart from memory usage accounting and limiting, vy_quota is
+ * responsible for consumption rate limiting (aka throttling).
+ * There are multiple rate limits, each of which is associated
+ * with a particular resource type. Different kinds of consumers
+ * respect different limits. The following enumeration defines
+ * the resource types for which vy_quota enables throttling.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_resource_type {
+	/**
+	 * The goal of disk-based throttling is to keep LSM trees
+	 * in a good shape so that read and space amplification
+	 * stay within bounds. It is enabled when compaction does
+	 * not keep up with dumps.
+	 */
+	VY_QUOTA_RESOURCE_DISK = 0,
+	/**
+	 * Memory-based throttling is needed to avoid long stalls
+	 * caused by hitting the hard memory limit. It is set so
+	 * that by the time the hard limit is hit, the last memory
+	 * dump will have completed.
+	 */
+	VY_QUOTA_RESOURCE_MEMORY = 1,
+
+	vy_quota_resource_type_MAX,
+};
+
+/**
+ * Quota consumer type determines how a quota consumer will be
+ * rate limited.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_consumer_type {
+	/** Transaction processor. */
+	VY_QUOTA_CONSUMER_TX = 0,
+	/** Compaction job. */
+	VY_QUOTA_CONSUMER_COMPACTION = 1,
+
+	vy_quota_consumer_type_MAX,
+};
+
 struct vy_quota_wait_node {
 	/** Link in vy_quota::wait_queue. */
 	struct rlist in_wait_queue;
@@ -117,6 +161,11 @@ struct vy_quota_wait_node {
 	struct fiber *fiber;
 	/** Amount of requested memory. */
 	size_t size;
+	/**
+	 * Ticket assigned to this fiber when it was put to
+	 * sleep, see vy_quota::wait_ticket for more details.
+	 */
+	int64_t ticket;
 };
 
 /**
@@ -144,13 +193,23 @@ struct vy_quota {
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
 	/**
-	 * Queue of consumers waiting for quota, linked by
-	 * vy_quota_wait_node::state. Newcomers are added
-	 * to the tail.
+	 * Monotonically growing counter assigned to consumers
+	 * waiting for quota. It is used for balancing wakeups
+	 * among wait queues: if two fibers from different wait
+	 * queues may proceed, the one with the lowest ticket
+	 * will be picked.
+	 *
+	 * See also vy_quota_wait_node::ticket.
+	 */
+	int64_t wait_ticket;
+	/**
+	 * Queue of consumers waiting for quota, one per each
+	 * consumer type, linked by vy_quota_wait_node::state.
+	 * Newcomers are added to the tail.
 	 */
-	struct rlist wait_queue;
-	/** Rate limit state. */
-	struct vy_rate_limit rate_limit;
+	struct rlist wait_queue[vy_quota_consumer_type_MAX];
+	/** Rate limit state, one per each resource type. */
+	struct vy_rate_limit rate_limit[vy_quota_resource_type_MAX];
 	/**
 	 * Periodic timer that is used for refilling the rate
 	 * limit value.
@@ -188,18 +247,20 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
- * Set the max rate at which quota may be consumed,
- * in bytes per second.
+ * Set the rate limit corresponding to the resource of the given
+ * type. The rate limit is given in bytes per second.
  */
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate);
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+			size_t rate);
 
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
 void
-vy_quota_force_use(struct vy_quota *q, size_t size);
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		   size_t size);
 
 /**
  * Release @size bytes of memory.
@@ -242,7 +303,8 @@ vy_quota_release(struct vy_quota *q, size_t size);
  * account while estimating the size of a memory allocation.
  */
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout);
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+	     size_t size, double timeout);
 
 /**
  * Adjust quota after allocating memory.
@@ -253,15 +315,16 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout);
  * See also vy_quota_use().
  */
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used);
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t reserved, size_t used);
 
 /**
  * Block the caller until the quota is not exceeded.
  */
 static inline void
-vy_quota_wait(struct vy_quota *q)
+vy_quota_wait(struct vy_quota *q, enum vy_quota_consumer_type type)
 {
-	vy_quota_use(q, 0, TIMEOUT_INFINITY);
+	vy_quota_use(q, type, 0, TIMEOUT_INFINITY);
 }
 
 #if defined(__cplusplus)
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index de0136f7..d62ec4a4 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -94,7 +94,8 @@ vy_regulator_trigger_dump(struct vy_regulator *regulator)
 	size_t max_write_rate = (double)mem_left / (mem_used + 1) *
 					regulator->dump_bandwidth;
 	max_write_rate = MIN(max_write_rate, regulator->dump_bandwidth);
-	vy_quota_set_rate_limit(quota, max_write_rate);
+	vy_quota_set_rate_limit(quota, VY_QUOTA_RESOURCE_MEMORY,
+				max_write_rate);
 }
 
 static void
@@ -195,7 +196,8 @@ void
 vy_regulator_start(struct vy_regulator *regulator)
 {
 	regulator->quota_used_last = regulator->quota->used;
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 	ev_timer_start(loop(), &regulator->timer);
 }
 
@@ -259,7 +261,8 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 	 * limit to the dump bandwidth rather than disabling it
 	 * completely.
 	 */
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 }
 
 void
@@ -269,5 +272,6 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
 	regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
 	if (max > 0 && regulator->dump_bandwidth > max)
 		regulator->dump_bandwidth = max;
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-10-08 11:10     ` Vladimir Davydov
  2018-10-09 13:25       ` Vladimir Davydov
@ 2018-10-11  7:02       ` Konstantin Osipov
  2018-10-11  8:29         ` Vladimir Davydov
  1 sibling, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-10-11  7:02 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 14:12]:
> 
> Actually, there are resource types and there are consumer types.
> I admit the fact that I mixed them may look confusing at the first
> glance. We may introduce a seprate enum for resource types with a
> mapping between them.
> 
> enum vy_quota_consumer_type {
> 	VY_QUOTA_CONSUMER_TX = 0,
> 	VY_QUOTA_CONSUMER_COMPACTION = 1,
> 	vy_quota_consumer_max,
> };
> 
> enum vy_quota_resource_type {
> 	VY_QUOTA_RESOURCE_DISK = 0,
> 	VY_QUOTA_RESOURCE_MEMORY = 1,
> 	vy_quota_resource_max,
> };
> 
> static unsigned vy_quota_consumer_mask[] = {
> 	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
> 				 (1 << VY_QUOTA_RESOURCE_MEMORY),
> 	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
> };
> 
> struct vy_quota {
> 	...
> 	struct rlist wait_queue[vy_quota_consumer_max];
> 	struct vy_rate_limit rate_limit[vy_quota_resource_max];
> };


> 
> This would make the code more bulky though. Do we really want to
> complicate?

I like it.
> 
> Also, quite frankly I don't quite dig the concept of 'resources' and
> the corresponding constant names, because 'memory' rate limit may be
> confused with memory usage, which is what vy_quota is about in the first
> place.

Yes, vy_quota is originally about usage. I don't know why you
insist on having rate limits in it ;) I'm ok if we put rate limits
into a separate object, now that we have vy_regulator it can
manage all of them.

> > Quota refill interval should be varying - please schedule this
> > work in a separate patch.
> > 
> > 
> > >  static void
> > >  vy_quota_signal(struct vy_quota *q)
> > >  {
> > > -	if (!rlist_empty(&q->wait_queue)) {
> > > +	/*
> > > +	 * Wake up a consumer that has waited most no matter
> > > +	 * whether it's high or low priority. This assures that
> > > +	 * high priority consumers don't uncontrollably throttle
> > > +	 * low priority ones.
> > > +	 */
> > > +	struct vy_quota_wait_node *oldest = NULL;
> > > +	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
> > > +		struct rlist *wq = &q->wait_queue[i];
> > > +		if (rlist_empty(wq))
> > > +			continue;
> > 
> > I still do not understand why you need a second queue and
> > timestapms. If you need to ensure total fairness, a single queue
> > should be enough. 
> 
> I need to maintain one queue per consumer type, because it's possible
> that we may wake up consumers of one type, but not of the other.

If you do this, your fairness is strictly speaking broken, unless,
then again, a type implies some kind of priority. By waking up a
consumer out of (strict) order you give away a resource which may
be needed for another consumer which (in strict single-queue
order) is ahead of it.

Frankly I still don't see why you did it this way except for the
personal preference.

> If we used a single queue, it could occur that consumers that
> could be woken up landed in the middle of the queue so that
> without scanning the queue we wouldn't be able to find them.
> Scanning a queue looks ugly.

Don't scan the queue. If you have no resources for the first
consumer in the queue, everyone else has to wait. Make it simple
and stupid.

> Think of the multi-queue design like this: there's one queue per each
> consumer type. When a resource is replenished, we check only those
> queues that store consumers that may proceed and choose the one that has
> waited most.

How is that possible that some consumer can not proceed? The
replenishment is instantaneous for all kinds of resources. So if
there is at all a chance that the first (strictly first) consumer
has insufficient resources to proceed, it means it requires more
than is at all possible within this rate limit interval. It's
better to prohibit this altogether.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 11/11] vinyl: introduce quota consumer priorities
  2018-10-11  7:02       ` Konstantin Osipov
@ 2018-10-11  8:29         ` Vladimir Davydov
  0 siblings, 0 replies; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-11  8:29 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Oct 11, 2018 at 10:02:50AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/08 14:12]:
> > 
> > Actually, there are resource types and there are consumer types.
> > I admit the fact that I mixed them may look confusing at the first
> > glance. We may introduce a seprate enum for resource types with a
> > mapping between them.
> > 
> > enum vy_quota_consumer_type {
> > 	VY_QUOTA_CONSUMER_TX = 0,
> > 	VY_QUOTA_CONSUMER_COMPACTION = 1,
> > 	vy_quota_consumer_max,
> > };
> > 
> > enum vy_quota_resource_type {
> > 	VY_QUOTA_RESOURCE_DISK = 0,
> > 	VY_QUOTA_RESOURCE_MEMORY = 1,
> > 	vy_quota_resource_max,
> > };
> > 
> > static unsigned vy_quota_consumer_mask[] = {
> > 	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
> > 				 (1 << VY_QUOTA_RESOURCE_MEMORY),
> > 	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
> > };
> > 
> > struct vy_quota {
> > 	...
> > 	struct rlist wait_queue[vy_quota_consumer_max];
> > 	struct vy_rate_limit rate_limit[vy_quota_resource_max];
> > };
> 
> 
> > 
> > This would make the code more bulky though. Do we really want to
> > complicate?
> 
> I like it.

Then please see the patch:

https://www.freelists.org/post/tarantool-patches/PATCH-v2-1111-vinyl-introduce-quota-consumer-priorities,3

> > 
> > Also, quite frankly I don't quite dig the concept of 'resources' and
> > the corresponding constant names, because 'memory' rate limit may be
> > confused with memory usage, which is what vy_quota is about in the first
> > place.
> 
> Yes, vy_quota is originally about usage. I don't know why you
> insist on having rate limits in it ;) I'm ok if we put rate limits
> into a separate object, now that we have vy_regulator it can
> manage all of them.

Because I want to reuse the wait queue which was initially introduced
for putting consumers to sleep when they hit the memory limit.

Besides, rate limit value is consumed only along with quota so I'm
convinced they should be together.

> 
> > > Quota refill interval should be varying - please schedule this
> > > work in a separate patch.
> > > 
> > > 
> > > >  static void
> > > >  vy_quota_signal(struct vy_quota *q)
> > > >  {
> > > > -	if (!rlist_empty(&q->wait_queue)) {
> > > > +	/*
> > > > +	 * Wake up a consumer that has waited most no matter
> > > > +	 * whether it's high or low priority. This assures that
> > > > +	 * high priority consumers don't uncontrollably throttle
> > > > +	 * low priority ones.
> > > > +	 */
> > > > +	struct vy_quota_wait_node *oldest = NULL;
> > > > +	for (int i = 0; i < vy_quota_consumer_prio_MAX; i++) {
> > > > +		struct rlist *wq = &q->wait_queue[i];
> > > > +		if (rlist_empty(wq))
> > > > +			continue;
> > > 
> > > I still do not understand why you need a second queue and
> > > timestapms. If you need to ensure total fairness, a single queue
> > > should be enough. 
> > 
> > I need to maintain one queue per consumer type, because it's possible
> > that we may wake up consumers of one type, but not of the other.
> 
> If you do this, your fairness is strictly speaking broken, unless,
> then again, a type implies some kind of priority. By waking up a
> consumer out of (strict) order you give away a resource which may
> be needed for another consumer which (in strict single-queue
> order) is ahead of it.

I only wake up consumers out of (strict) order in case some consumers
can't proceed in the *current* interval, i.e. I only give the *surplus*
of the resource that wouldn't be consumed anyway in *this* interval.
Note, the resource is replenished on a timely basis so by dispensing the
surplus I don't deprive throttled consumers of any resource.

> 
> Frankly I still don't see why you did it this way except for the
> personal preference.
> 
> > If we used a single queue, it could occur that consumers that
> > could be woken up landed in the middle of the queue so that
> > without scanning the queue we wouldn't be able to find them.
> > Scanning a queue looks ugly.
> 
> Don't scan the queue. If you have no resources for the first
> consumer in the queue, everyone else has to wait. Make it simple
> and stupid.

Then compaction threads could occasionally be throttled by transactions
that stalled due to the disk-based rate limit.

> 
> > Think of the multi-queue design like this: there's one queue per each
> > consumer type. When a resource is replenished, we check only those
> > queues that store consumers that may proceed and choose the one that has
> > waited most.
> 
> How is that possible that some consumer can not proceed? The

Transactions waiting for disk-based quota since the last interval may
consume all quota that was added for the current interval and get
throttled. Then we wouldn't be able to wake up compaction threads that
were queued after them.

Even though you might argue that this is unlikely and this should never
happen once the load is stabilized, ignoring such a possibility makes me
feel uncomfortable. Using one wait queue per consumer with ticketing for
fairness is a simple and clear solution that assures this can't happen.
It's clear both to me and Kirill. I fail to see why you find it
difficult for understanding. You haven't given a single technical
argument against it. You just seem to not like it for some reason,
that's all...

> replenishment is instantaneous for all kinds of resources. So if
> there is at all a chance that the first (strictly first) consumer
> has insufficient resources to proceed, it means it requires more
> than is at all possible within this rate limit interval. It's
> better to prohibit this altogether.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v2 09/11] vinyl: do not account zero dump bandwidth
  2018-09-28 17:40 ` [PATCH v2 09/11] vinyl: do not account zero dump bandwidth Vladimir Davydov
@ 2018-10-12 13:27   ` Vladimir Davydov
  2018-10-16 18:25     ` [tarantool-patches] " Konstantin Osipov
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-12 13:27 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

On Fri, Sep 28, 2018 at 08:40:07PM +0300, Vladimir Davydov wrote:
> Since we free memory in 16 MB blocks (see SLAB_SIZE), it may occur that
> we dump almost all data stored in a block but still have to leave it be,
> because it contains data of a newer generation. If the memory limit is
> small (as it is typically in tests), this may result in dumping 0 bytes.
> In order not to disrupt statistics and throttling transactions in vain,
> let's simply ignore such results. Normally, the memory limit should be
> large enough for such granularity not to affect the measurements
> (hundreds of megabytes) so this problem isn't worth putting more efforts
> into.
> 
> Needed for #1862
> ---
>  src/box/vy_regulator.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
> index 5ec5629f..682777fc 100644
> --- a/src/box/vy_regulator.c
> +++ b/src/box/vy_regulator.c
> @@ -200,7 +200,20 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
>  {
>  	regulator->dump_in_progress = false;
>  
> -	if (dump_duration > 0) {
> +	/*
> +	 * Update dump bandwidth.
> +	 *
> +	 * Note, since we free memory in 16 MB blocks (see SLAB_SIZE),
> +	 * it may occur that we dump almost all data stored in a block
> +	 * but still have to leave it be, because it contains data of
> +	 * a newer generation. If the memory limit is small, this may
> +	 * result in dumping 0 bytes. In order not to disrupt statistics
> +	 * let's simply ignore such results. Normally, the memory limit
> +	 * should be large enough for such granularity not to affect the
> +	 * measurements (hundreds of megabytes) so this problem isn't
> +	 * worth putting more efforts into.
> +	 */
> +	if (mem_dumped > 0 && dump_duration > 0) {
>  		histogram_collect(regulator->dump_bw_hist,
>  				  mem_dumped / dump_duration);
>  		/*

Turns out this isn't enough. We'd better not account too small dumps,
because such dumps have too high overhead associated with file creation.
Our tests create a lot of small dumps using box.snapshot(). Taking them
into account may slow down the overall test execution time or even break
some tests. Let's ignore all dumps of size less than 1 MB for bandwidth
estimation. The new patch is below.

From 8d5858835c61ed290e753241e146cee50c41e6db Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Fri, 12 Oct 2018 16:12:06 +0300
Subject: [PATCH] vinyl: do not account small dumps for bandwidth estimation

Small dumps (e.g. triggered by box.snapshot) have too high overhead
associated with file creation so taking them into account for bandwidth
estimation may result in erroneous transaction throttling. Let's ignore
dumps of size less than 1 MB.

Needed for #1862

diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 6ecb5aa6..c6a56905 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -66,6 +66,13 @@ static const int VY_DUMP_BANDWIDTH_PCT = 10;
  */
 static const size_t VY_DUMP_BANDWIDTH_DEFAULT = 10 * 1024 * 1024;
 
+/**
+ * Do not take into account small dumps when estimating dump
+ * bandwidth, because they have too high overhead associated
+ * with file creation.
+ */
+static const size_t VY_DUMP_SIZE_ACCT_MIN = 1024 * 1024;
+
 static void
 vy_regulator_trigger_dump(struct vy_regulator *regulator)
 {
@@ -205,7 +212,7 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 {
 	regulator->dump_in_progress = false;
 
-	if (dump_duration > 0) {
+	if (mem_dumped >= VY_DUMP_SIZE_ACCT_MIN && dump_duration > 0) {
 		histogram_collect(regulator->dump_bandwidth_hist,
 				  mem_dumped / dump_duration);
 		/*

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 09/11] vinyl: do not account zero dump bandwidth
  2018-10-12 13:27   ` Vladimir Davydov
@ 2018-10-16 18:25     ` Konstantin Osipov
  2018-10-17  8:44       ` Vladimir Davydov
  0 siblings, 1 reply; 35+ messages in thread
From: Konstantin Osipov @ 2018-10-16 18:25 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/12 21:23]:
> Turns out this isn't enough. We'd better not account too small dumps,
> because such dumps have too high overhead associated with file creation.
> Our tests create a lot of small dumps using box.snapshot(). Taking them
> into account may slow down the overall test execution time or even break
> some tests. Let's ignore all dumps of size less than 1 MB for bandwidth
> estimation. The new patch is below.

Let's talk face to face, but this entire clutch seems to be
becoming too huge with this patch.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 09/11] vinyl: do not account zero dump bandwidth
  2018-10-16 18:25     ` [tarantool-patches] " Konstantin Osipov
@ 2018-10-17  8:44       ` Vladimir Davydov
  2018-10-23  7:02         ` Konstantin Osipov
  0 siblings, 1 reply; 35+ messages in thread
From: Vladimir Davydov @ 2018-10-17  8:44 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Oct 16, 2018 at 09:25:18PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/12 21:23]:
> > Turns out this isn't enough. We'd better not account too small dumps,
> > because such dumps have too high overhead associated with file creation.
> > Our tests create a lot of small dumps using box.snapshot(). Taking them
> > into account may slow down the overall test execution time or even break
> > some tests. Let's ignore all dumps of size less than 1 MB for bandwidth
> > estimation. The new patch is below.
> 
> Let's talk face to face, but this entire clutch seems to be
> becoming too huge with this patch.

What's wrong with it? If you create a 100 MB file then the time it takes
to allocate an inode is negligible comparing to the write time. This is
a typical use case. However, if you create a 1 KB file, then you'll
spend much more time on inode bookkeeping than actually writing
anything. This is unlikely in practice (who'd write such small files?),
but this is done very often by our tests. This results in enabling
throttling and slowing down test execution time.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [tarantool-patches] Re: [PATCH v2 09/11] vinyl: do not account zero dump bandwidth
  2018-10-17  8:44       ` Vladimir Davydov
@ 2018-10-23  7:02         ` Konstantin Osipov
  0 siblings, 0 replies; 35+ messages in thread
From: Konstantin Osipov @ 2018-10-23  7:02 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/17 12:14]:
> On Tue, Oct 16, 2018 at 09:25:18PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/12 21:23]:
> > > Turns out this isn't enough. We'd better not account too small dumps,
> > > because such dumps have too high overhead associated with file creation.
> > > Our tests create a lot of small dumps using box.snapshot(). Taking them
> > > into account may slow down the overall test execution time or even break
> > > some tests. Let's ignore all dumps of size less than 1 MB for bandwidth
> > > estimation. The new patch is below.
> > 
> > Let's talk face to face, but this entire clutch seems to be
> > becoming too huge with this patch.
> 
> What's wrong with it? If you create a 100 MB file then the time it takes
> to allocate an inode is negligible comparing to the write time. This is
> a typical use case. However, if you create a 1 KB file, then you'll
> spend much more time on inode bookkeeping than actually writing
> anything. This is unlikely in practice (who'd write such small files?),
> but this is done very often by our tests. This results in enabling
> throttling and slowing down test execution time.

Simply put, we need to disable throttling because things get worse
when it's enabled. 

And we simply disable throttling considering this situation an
edge case rather than try to understand what's wrong with our
model that it makes things worse on edges.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-10-23  7:12 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-28 17:39 [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov
2018-09-28 17:39 ` [PATCH v2 01/11] vinyl: add helper to start scheduler and enable quota on startup Vladimir Davydov
2018-09-29  4:37   ` [tarantool-patches] " Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 02/11] vinyl: factor load regulator out of quota Vladimir Davydov
2018-09-29  5:00   ` [tarantool-patches] " Konstantin Osipov
2018-09-29 11:36     ` Vladimir Davydov
     [not found]       ` <20180929114308.GA19162@chai>
2018-10-01 10:27         ` Vladimir Davydov
2018-10-01 10:31   ` Vladimir Davydov
2018-10-02 18:16   ` [tarantool-patches] " Konstantin Osipov
2018-10-03  8:49     ` Vladimir Davydov
2018-09-28 17:40 ` [PATCH v2 03/11] vinyl: minor refactoring of quota methods Vladimir Davydov
2018-09-29  5:01   ` [tarantool-patches] " Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 04/11] vinyl: move transaction size sanity check to quota Vladimir Davydov
2018-09-29  5:02   ` [tarantool-patches] " Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 05/11] vinyl: implement quota wait queue without fiber_cond Vladimir Davydov
2018-09-29  5:05   ` [tarantool-patches] " Konstantin Osipov
2018-09-29 11:44     ` Vladimir Davydov
2018-09-28 17:40 ` [PATCH v2 06/11] vinyl: enable quota upon recovery completion explicitly Vladimir Davydov
2018-09-29  5:06   ` [tarantool-patches] " Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 07/11] vinyl: zap vy_env::memory, read_threads, and write_threads Vladimir Davydov
2018-09-29  5:06   ` [tarantool-patches] " Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 08/11] vinyl: do not try to trigger dump in regulator if already in progress Vladimir Davydov
2018-09-28 17:40 ` [PATCH v2 09/11] vinyl: do not account zero dump bandwidth Vladimir Davydov
2018-10-12 13:27   ` Vladimir Davydov
2018-10-16 18:25     ` [tarantool-patches] " Konstantin Osipov
2018-10-17  8:44       ` Vladimir Davydov
2018-10-23  7:02         ` Konstantin Osipov
2018-09-28 17:40 ` [PATCH v2 10/11] vinyl: implement basic transaction throttling Vladimir Davydov
2018-09-28 17:40 ` [PATCH v2 11/11] vinyl: introduce quota consumer priorities Vladimir Davydov
2018-10-06 13:24   ` Konstantin Osipov
2018-10-08 11:10     ` Vladimir Davydov
2018-10-09 13:25       ` Vladimir Davydov
2018-10-11  7:02       ` Konstantin Osipov
2018-10-11  8:29         ` Vladimir Davydov
2018-10-03  9:06 ` [PATCH v2 00/11] vinyl: transaction throttling infrastructure Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox