Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH 00/18] Implement write throttling for vinyl
@ 2018-08-16 16:11 Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
                   ` (17 more replies)
  0 siblings, 18 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If vinyl doesn't keep up with dumps, transactions can consume all
available memory and stall until the last dump is complete, which may
take quite a while (seconds, even minutes sometimes). To avoid
unpredictably long stalls, this patch set introduces transaction
throttling. Now, once a dump is started, vinyl will limit the rate at
which transactions can write to the database so that they won't hit the
limit before the dump is complete.

https://github.com/tarantool/tarantool/issues/1862
https://github.com/tarantool/tarantool/commits/dv/gh-1862-vy-throttling

Vladimir Davydov (18):
  vinyl: rework internal quota API
  vinyl: move quota methods implementation to vy_quota.c
  vinyl: move quota related methods and variables from vy_env to
    vy_quota
  vinyl: implement vy_quota_wait using vy_quota_try_use
  vinyl: wake up fibers waiting for quota one by one
  vinyl: do not wake up fibers waiting for quota if quota is unavailable
  vinyl: tune dump bandwidth histogram buckets
  vinyl: rename vy_quota::dump_bw to dump_bw_hist
  vinyl: cache dump bandwidth for timer invocation
  vinyl: do not add initial guess to dump bandwidth histogram
  vinyl: use snap_io_rate_limit for initial dump bandwidth estimate
  histogram: add function for computing lower bound percentile estimate
  vinyl: use lower bound percentile estimate for dump bandwidth
  vinyl: do not try to trigger dump if it is already in progress
  vinyl: improve dump start/stop logging
  vinyl: confine quota watermark within sane value range
  vinyl: set quota timer period to 100 ms
  vinyl: throttle tx rate if dump does not catch up

 src/box/CMakeLists.txt       |   1 +
 src/box/vinyl.c              | 180 +++------------------
 src/box/vy_quota.c           | 362 +++++++++++++++++++++++++++++++++++++++++++
 src/box/vy_quota.h           | 193 +++++++++++++----------
 src/histogram.c              |  13 ++
 src/histogram.h              |   7 +
 test/unit/histogram.c        |   4 +
 test/vinyl/suite.ini         |   2 +-
 test/vinyl/throttle.result   |  95 ++++++++++++
 test/vinyl/throttle.test.lua |  47 ++++++
 10 files changed, 664 insertions(+), 240 deletions(-)
 create mode 100644 src/box/vy_quota.c
 create mode 100644 test/vinyl/throttle.result
 create mode 100644 test/vinyl/throttle.test.lua

-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 01/18] vinyl: rework internal quota API
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
@ 2018-08-16 16:11 ` Vladimir Davydov
  2018-08-20 11:07   ` Konstantin Osipov
  2018-08-27 18:29   ` Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c Vladimir Davydov
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The API is too generic now. It would be rather difficult to introduce
throttling on top of it. Let's rework it to reflect vinyl algorithms.
---
 src/box/vinyl.c    | 28 +++++--------------------
 src/box/vy_quota.h | 61 +++++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 61 insertions(+), 28 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 7f779634..d0e822bf 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2329,7 +2329,7 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	 * the transaction to be sent to read view or aborted, we call
 	 * it before checking for conflicts.
 	 */
-	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0) {
+	if (vy_quota_try_use(&env->quota, tx->write_size, timeout) != 0) {
 		diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
 		return -1;
 	}
@@ -2341,21 +2341,7 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t write_size = mem_used_after - mem_used_before;
-	/*
-	 * Insertion of a statement into an in-memory tree can trigger
-	 * an allocation of a new tree block. This should not normally
-	 * result in a noticeable excess of the memory limit, because
-	 * most memory is occupied by statements anyway, but we need to
-	 * adjust the quota accordingly in this case.
-	 *
-	 * The actual allocation size can also be less than reservation
-	 * if a statement is allocated from an lsregion slab allocated
-	 * by a previous transaction. Take this into account, too.
-	 */
-	if (write_size >= tx->write_size)
-		vy_quota_force_use(&env->quota, write_size - tx->write_size);
-	else
-		vy_quota_release(&env->quota, tx->write_size - write_size);
+	vy_quota_commit_use(&env->quota, tx->write_size, write_size);
 
 	if (rc != 0)
 		return -1;
@@ -2512,7 +2498,7 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	size_t mem_used_after = lsregion_used(allocator);
 	assert(mem_used_after <= mem_used_before);
 	size_t mem_dumped = mem_used_before - mem_used_after;
-	vy_quota_release(quota, mem_dumped);
+	vy_quota_dump(quota, mem_dumped);
 
 	say_info("dumped %zu bytes in %.1f sec", mem_dumped, dump_duration);
 
@@ -3214,7 +3200,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	 * quota accounting.
 	 */
 	size_t reserved = tx->write_size;
-	if (vy_quota_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
+	if (vy_quota_try_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
 		unreachable();
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -3233,11 +3219,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
-	if (used >= reserved)
-		vy_quota_force_use(&env->quota, used - reserved);
-	else
-		vy_quota_release(&env->quota, reserved - used);
-
+	vy_quota_commit_use(&env->quota, reserved, used);
 	return rc;
 }
 
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d741c34a..fd1004da 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -67,7 +67,7 @@ struct vy_quota {
 	/** Current memory consumption. */
 	size_t used;
 	/**
-	 * If vy_quota_use() takes longer than the given
+	 * If vy_quota_try_use() takes longer than the given
 	 * value, warn about it in the log.
 	 */
 	double too_long_threshold;
@@ -127,7 +127,7 @@ vy_quota_set_watermark(struct vy_quota *q, size_t watermark)
 }
 
 /**
- * Consume @size bytes of memory. In contrast to vy_quota_use()
+ * Consume @size bytes of memory. In contrast to vy_quota_try_use()
  * this function does not throttle the caller.
  */
 static inline void
@@ -139,10 +139,11 @@ vy_quota_force_use(struct vy_quota *q, size_t size)
 }
 
 /**
- * Release @size bytes of memory.
+ * Function called on dump completion to release quota after
+ * freeing memory.
  */
 static inline void
-vy_quota_release(struct vy_quota *q, size_t size)
+vy_quota_dump(struct vy_quota *q, size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
@@ -153,9 +154,38 @@ vy_quota_release(struct vy_quota *q, size_t size)
  * Try to consume @size bytes of memory, throttle the caller
  * if the limit is exceeded. @timeout specifies the maximal
  * time to wait. Return 0 on success, -1 on timeout.
+ *
+ * Usage pattern:
+ *
+ *   size_t reserved = <estimate>;
+ *   if (vy_quota_try_use(q, reserved, timeout) != 0)
+ *           return -1;
+ *   <allocate memory>
+ *   size_t used = <actually allocated>;
+ *   vy_quota_commit_use(q, reserved, used);
+ *
+ * We use two-step quota allocation strategy (reserve-consume),
+ * because we may not yield after we start inserting statements
+ * into a space so we estimate the allocation size and wait for
+ * quota before committing statements. At the same time, we
+ * cannot precisely estimate the size of memory we are going to
+ * consume so we adjust the quota after the allocation.
+ *
+ * The size of memory allocated while committing a transaction
+ * may be greater than an estimate, because insertion of a
+ * statement into an in-memory index can trigger allocation
+ * of a new index extent. This should not normally result in a
+ * noticeable breach in the memory limit, because most memory
+ * is occupied by statements, but we need to adjust the quota
+ * accordingly after the allocation in this case.
+ *
+ * The actual memory allocation size may also be less than an
+ * estimate if the space has multiple indexes, because statements
+ * are stored in the common memory level, which isn't taken into
+ * account while estimating the size of a memory allocation.
  */
 static inline int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout)
+vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 {
 	double start_time = ev_monotonic_now(loop());
 	double deadline = start_time + timeout;
@@ -178,6 +208,27 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 }
 
 /**
+ * Adjust quota after allocating memory.
+ *
+ * @reserved: size of quota reserved by vy_quota_try_use().
+ * @used: size of memory actually allocated.
+ *
+ * See also vy_quota_try_use().
+ */
+static inline void
+vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
+{
+	if (reserved > used) {
+		size_t excess = reserved - used;
+		assert(q->used >= excess);
+		q->used -= excess;
+		fiber_cond_broadcast(&q->cond);
+	}
+	if (reserved < used)
+		vy_quota_force_use(q, used - reserved);
+}
+
+/**
  * Block the caller until the quota is not exceeded.
  */
 static inline void
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
@ 2018-08-16 16:11 ` Vladimir Davydov
  2018-08-20 11:07   ` Konstantin Osipov
  2018-08-27 18:30   ` Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota Vladimir Davydov
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

None of vy_quota methods is called from a hot path - even the most
frequently called ones, vy_quota_try_use and vy_quota_commit_use, are
only invoked once per a transactions. So there's no need to clog the
header with the methods implementation.
---
 src/box/CMakeLists.txt |   1 +
 src/box/vy_quota.c     | 133 +++++++++++++++++++++++++++++++++++++++++++++++++
 src/box/vy_quota.h     | 110 +++++++---------------------------------
 3 files changed, 153 insertions(+), 91 deletions(-)
 create mode 100644 src/box/vy_quota.c

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index ad544270..cab6a227 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -86,6 +86,7 @@ add_library(box STATIC
     vy_history.c
     vy_read_set.c
     vy_scheduler.c
+    vy_quota.c
     request.c
     space.c
     space_def.c
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
new file mode 100644
index 00000000..6e93d652
--- /dev/null
+++ b/src/box/vy_quota.c
@@ -0,0 +1,133 @@
+/*
+ * Copyright 2010-2018, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * AUTHORS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "vy_quota.h"
+
+#include <assert.h>
+#include <stddef.h>
+#include <tarantool_ev.h>
+
+#include "fiber.h"
+#include "fiber_cond.h"
+#include "say.h"
+
+void
+vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
+{
+	q->limit = SIZE_MAX;
+	q->watermark = SIZE_MAX;
+	q->used = 0;
+	q->too_long_threshold = TIMEOUT_INFINITY;
+	q->quota_exceeded_cb = quota_exceeded_cb;
+	fiber_cond_create(&q->cond);
+}
+
+void
+vy_quota_destroy(struct vy_quota *q)
+{
+	fiber_cond_broadcast(&q->cond);
+	fiber_cond_destroy(&q->cond);
+}
+
+void
+vy_quota_set_limit(struct vy_quota *q, size_t limit)
+{
+	q->limit = q->watermark = limit;
+	if (q->used >= limit)
+		q->quota_exceeded_cb(q);
+	fiber_cond_broadcast(&q->cond);
+}
+
+void
+vy_quota_set_watermark(struct vy_quota *q, size_t watermark)
+{
+	q->watermark = watermark;
+	if (q->used >= watermark)
+		q->quota_exceeded_cb(q);
+}
+
+void
+vy_quota_force_use(struct vy_quota *q, size_t size)
+{
+	q->used += size;
+	if (q->used >= q->watermark)
+		q->quota_exceeded_cb(q);
+}
+
+void
+vy_quota_dump(struct vy_quota *q, size_t size)
+{
+	assert(q->used >= size);
+	q->used -= size;
+	fiber_cond_broadcast(&q->cond);
+}
+
+int
+vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
+{
+	double start_time = ev_monotonic_now(loop());
+	double deadline = start_time + timeout;
+	while (q->used + size > q->limit && timeout > 0) {
+		q->quota_exceeded_cb(q);
+		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
+			break; /* timed out */
+	}
+	double wait_time = ev_monotonic_now(loop()) - start_time;
+	if (wait_time > q->too_long_threshold) {
+		say_warn("waited for %zu bytes of vinyl memory quota "
+			 "for too long: %.3f sec", size, wait_time);
+	}
+	if (q->used + size > q->limit)
+		return -1;
+	q->used += size;
+	if (q->used >= q->watermark)
+		q->quota_exceeded_cb(q);
+	return 0;
+}
+
+void
+vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
+{
+	if (reserved > used) {
+		size_t excess = reserved - used;
+		assert(q->used >= excess);
+		q->used -= excess;
+		fiber_cond_broadcast(&q->cond);
+	}
+	if (reserved < used)
+		vy_quota_force_use(q, used - reserved);
+}
+
+void
+vy_quota_wait(struct vy_quota *q)
+{
+	while (q->used > q->limit)
+		fiber_cond_wait(&q->cond);
+}
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index fd1004da..cf70b1ab 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -31,13 +31,8 @@
  * SUCH DAMAGE.
  */
 
-#include <stdbool.h>
 #include <stddef.h>
-#include <tarantool_ev.h>
-
-#include "fiber.h"
 #include "fiber_cond.h"
-#include "say.h"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -83,72 +78,39 @@ struct vy_quota {
 	vy_quota_exceeded_f quota_exceeded_cb;
 };
 
-static inline void
-vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
-{
-	q->limit = SIZE_MAX;
-	q->watermark = SIZE_MAX;
-	q->used = 0;
-	q->too_long_threshold = TIMEOUT_INFINITY;
-	q->quota_exceeded_cb = quota_exceeded_cb;
-	fiber_cond_create(&q->cond);
-}
-
-static inline void
-vy_quota_destroy(struct vy_quota *q)
-{
-	fiber_cond_broadcast(&q->cond);
-	fiber_cond_destroy(&q->cond);
-}
+void
+vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
+
+void
+vy_quota_destroy(struct vy_quota *q);
 
 /**
  * Set memory limit. If current memory usage exceeds
  * the new limit, invoke the callback.
  */
-static inline void
-vy_quota_set_limit(struct vy_quota *q, size_t limit)
-{
-	q->limit = q->watermark = limit;
-	if (q->used >= limit)
-		q->quota_exceeded_cb(q);
-	fiber_cond_broadcast(&q->cond);
-}
+void
+vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
  * Set memory watermark. If current memory usage exceeds
  * the new watermark, invoke the callback.
  */
-static inline void
-vy_quota_set_watermark(struct vy_quota *q, size_t watermark)
-{
-	q->watermark = watermark;
-	if (q->used >= watermark)
-		q->quota_exceeded_cb(q);
-}
+void
+vy_quota_set_watermark(struct vy_quota *q, size_t watermark);
 
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_try_use()
  * this function does not throttle the caller.
  */
-static inline void
-vy_quota_force_use(struct vy_quota *q, size_t size)
-{
-	q->used += size;
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
-}
+void
+vy_quota_force_use(struct vy_quota *q, size_t size);
 
 /**
  * Function called on dump completion to release quota after
  * freeing memory.
  */
-static inline void
-vy_quota_dump(struct vy_quota *q, size_t size)
-{
-	assert(q->used >= size);
-	q->used -= size;
-	fiber_cond_broadcast(&q->cond);
-}
+void
+vy_quota_dump(struct vy_quota *q, size_t size);
 
 /**
  * Try to consume @size bytes of memory, throttle the caller
@@ -184,28 +146,8 @@ vy_quota_dump(struct vy_quota *q, size_t size)
  * are stored in the common memory level, which isn't taken into
  * account while estimating the size of a memory allocation.
  */
-static inline int
-vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
-{
-	double start_time = ev_monotonic_now(loop());
-	double deadline = start_time + timeout;
-	while (q->used + size > q->limit && timeout > 0) {
-		q->quota_exceeded_cb(q);
-		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
-			break; /* timed out */
-	}
-	double wait_time = ev_monotonic_now(loop()) - start_time;
-	if (wait_time > q->too_long_threshold) {
-		say_warn("waited for %zu bytes of vinyl memory quota "
-			 "for too long: %.3f sec", size, wait_time);
-	}
-	if (q->used + size > q->limit)
-		return -1;
-	q->used += size;
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
-	return 0;
-}
+int
+vy_quota_try_use(struct vy_quota *q, size_t size, double timeout);
 
 /**
  * Adjust quota after allocating memory.
@@ -215,28 +157,14 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
  *
  * See also vy_quota_try_use().
  */
-static inline void
-vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
-{
-	if (reserved > used) {
-		size_t excess = reserved - used;
-		assert(q->used >= excess);
-		q->used -= excess;
-		fiber_cond_broadcast(&q->cond);
-	}
-	if (reserved < used)
-		vy_quota_force_use(q, used - reserved);
-}
+void
+vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used);
 
 /**
  * Block the caller until the quota is not exceeded.
  */
-static inline void
-vy_quota_wait(struct vy_quota *q)
-{
-	while (q->used > q->limit)
-		fiber_cond_wait(&q->cond);
-}
+void
+vy_quota_wait(struct vy_quota *q);
 
 #if defined(__cplusplus)
 } /* extern "C" */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c Vladimir Davydov
@ 2018-08-16 16:11 ` Vladimir Davydov
  2018-08-20 11:08   ` Konstantin Osipov
  2018-08-27 18:33   ` Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use Vladimir Davydov
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Watermark calculation is a private business of vy_quota. Let's move
related stuff from vy_env to vy_quota. This will also make it easier
to implement throttling opaque to the caller.
---
 src/box/vinyl.c    | 145 ++++-------------------------------------------------
 src/box/vy_quota.c | 119 +++++++++++++++++++++++++++++++++++++++----
 src/box/vy_quota.h |  45 +++++++++++++----
 3 files changed, 156 insertions(+), 153 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d0e822bf..a07f661f 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -45,7 +45,6 @@
 #include "vy_scheduler.h"
 #include "vy_stat.h"
 
-#include <math.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>
@@ -105,31 +104,6 @@ struct vy_env {
 	struct mempool iterator_pool;
 	/** Memory quota */
 	struct vy_quota     quota;
-	/** Timer for updating quota watermark. */
-	ev_timer            quota_timer;
-	/**
-	 * Amount of quota used since the last
-	 * invocation of the quota timer callback.
-	 */
-	size_t quota_use_curr;
-	/**
-	 * Quota use rate, in bytes per second.
-	 * Calculated as exponentially weighted
-	 * moving average of quota_use_curr.
-	 */
-	size_t quota_use_rate;
-	/**
-	 * Dump bandwidth is needed for calculating the quota watermark.
-	 * The higher the bandwidth, the later we can start dumping w/o
-	 * suffering from transaction throttling. So we want to be very
-	 * conservative about estimating the bandwidth.
-	 *
-	 * To make sure we don't overestimate it, we maintain a
-	 * histogram of all observed measurements and assume the
-	 * bandwidth to be equal to the 10th percentile, i.e. the
-	 * best result among 10% worst measurements.
-	 */
-	struct histogram *dump_bw;
 	/** Common LSM tree environment. */
 	struct vy_lsm_env lsm_env;
 	/** Environment for cache subsystem */
@@ -169,26 +143,6 @@ struct vy_env {
 	bool force_recovery;
 };
 
-enum {
-	/**
-	 * Time interval between successive updates of
-	 * quota watermark and use rate, in seconds.
-	 */
-	VY_QUOTA_UPDATE_INTERVAL = 1,
-	/**
-	 * Period of time over which the quota use rate
-	 * is averaged, in seconds.
-	 */
-	VY_QUOTA_RATE_AVG_PERIOD = 5,
-};
-
-static inline int64_t
-vy_dump_bandwidth(struct vy_env *env)
-{
-	/* See comment to vy_env::dump_bw. */
-	return histogram_percentile(env->dump_bw, 10);
-}
-
 struct vinyl_engine {
 	struct engine base;
 	/** Vinyl environment. */
@@ -300,8 +254,8 @@ vy_info_append_quota(struct vy_env *env, struct info_handler *h)
 	info_append_int(h, "used", q->used);
 	info_append_int(h, "limit", q->limit);
 	info_append_int(h, "watermark", q->watermark);
-	info_append_int(h, "use_rate", env->quota_use_rate);
-	info_append_int(h, "dump_bandwidth", vy_dump_bandwidth(env));
+	info_append_int(h, "use_rate", q->use_rate);
+	info_append_int(h, "dump_bandwidth", vy_quota_dump_bandwidth(q));
 	info_table_end(h);
 }
 
@@ -2340,14 +2294,9 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	size_t write_size = mem_used_after - mem_used_before;
-	vy_quota_commit_use(&env->quota, tx->write_size, write_size);
-
-	if (rc != 0)
-		return -1;
-
-	env->quota_use_curr += write_size;
-	return 0;
+	vy_quota_commit_use(&env->quota, tx->write_size,
+			    mem_used_after - mem_used_before);
+	return rc;
 }
 
 static void
@@ -2418,41 +2367,6 @@ vinyl_engine_rollback_statement(struct engine *engine, struct txn *txn,
 /** {{{ Environment */
 
 static void
-vy_env_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
-{
-	(void)loop;
-	(void)events;
-
-	struct vy_env *e = timer->data;
-
-	/*
-	 * Update the quota use rate with the new measurement.
-	 */
-	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
-				      (double)VY_QUOTA_RATE_AVG_PERIOD);
-	e->quota_use_rate = (1 - weight) * e->quota_use_rate +
-		weight * e->quota_use_curr / VY_QUOTA_UPDATE_INTERVAL;
-	e->quota_use_curr = 0;
-
-	/*
-	 * Due to log structured nature of the lsregion allocator,
-	 * which is used for allocating statements, we cannot free
-	 * memory in chunks, only all at once. Therefore we should
-	 * configure the watermark so that by the time we hit the
-	 * limit, all memory have been dumped, i.e.
-	 *
-	 *   limit - watermark      watermark
-	 *   ----------------- = --------------
-	 *     quota_use_rate    dump_bandwidth
-	 */
-	int64_t dump_bandwidth = vy_dump_bandwidth(e);
-	size_t watermark = ((double)e->quota.limit * dump_bandwidth /
-			    (dump_bandwidth + e->quota_use_rate + 1));
-
-	vy_quota_set_watermark(&e->quota, watermark);
-}
-
-static void
 vy_env_quota_exceeded_cb(struct vy_quota *quota)
 {
 	struct vy_env *env = container_of(quota, struct vy_env, quota);
@@ -2498,14 +2412,9 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	size_t mem_used_after = lsregion_used(allocator);
 	assert(mem_used_after <= mem_used_before);
 	size_t mem_dumped = mem_used_before - mem_used_after;
-	vy_quota_dump(quota, mem_dumped);
+	vy_quota_dump(quota, mem_dumped, dump_duration);
 
 	say_info("dumped %zu bytes in %.1f sec", mem_dumped, dump_duration);
-
-	/* Account dump bandwidth. */
-	if (dump_duration > 0)
-		histogram_collect(env->dump_bw,
-				  mem_dumped / dump_duration);
 }
 
 static struct vy_squash_queue *
@@ -2520,21 +2429,6 @@ static struct vy_env *
 vy_env_new(const char *path, size_t memory,
 	   int read_threads, int write_threads, bool force_recovery)
 {
-	enum { KB = 1000, MB = 1000 * 1000 };
-	static int64_t dump_bandwidth_buckets[] = {
-		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB,
-		  1 * MB,   2 * MB,   3 * MB,   4 * MB,   5 * MB,
-		 10 * MB,  20 * MB,  30 * MB,  40 * MB,  50 * MB,
-		 60 * MB,  70 * MB,  80 * MB,  90 * MB, 100 * MB,
-		110 * MB, 120 * MB, 130 * MB, 140 * MB, 150 * MB,
-		160 * MB, 170 * MB, 180 * MB, 190 * MB, 200 * MB,
-		220 * MB, 240 * MB, 260 * MB, 280 * MB, 300 * MB,
-		320 * MB, 340 * MB, 360 * MB, 380 * MB, 400 * MB,
-		450 * MB, 500 * MB, 550 * MB, 600 * MB, 650 * MB,
-		700 * MB, 750 * MB, 800 * MB, 850 * MB, 900 * MB,
-		950 * MB, 1000 * MB,
-	};
-
 	struct vy_env *e = malloc(sizeof(*e));
 	if (unlikely(e == NULL)) {
 		diag_set(OutOfMemory, sizeof(*e), "malloc", "struct vy_env");
@@ -2554,19 +2448,6 @@ vy_env_new(const char *path, size_t memory,
 		goto error_path;
 	}
 
-	e->dump_bw = histogram_new(dump_bandwidth_buckets,
-				   lengthof(dump_bandwidth_buckets));
-	if (e->dump_bw == NULL) {
-		diag_set(OutOfMemory, 0, "histogram_new",
-			 "dump bandwidth histogram");
-		goto error_dump_bw;
-	}
-	/*
-	 * Until we dump anything, assume bandwidth to be 10 MB/s,
-	 * which should be fine for initial guess.
-	 */
-	histogram_collect(e->dump_bw, 10 * MB);
-
 	e->xm = tx_manager_new();
 	if (e->xm == NULL)
 		goto error_xm;
@@ -2584,18 +2465,18 @@ vy_env_new(const char *path, size_t memory,
 			      vy_squash_schedule, e) != 0)
 		goto error_lsm_env;
 
+	if (vy_quota_create(&e->quota, vy_env_quota_exceeded_cb) != 0)
+		goto error_quota;
+
 	struct slab_cache *slab_cache = cord_slab_cache();
 	mempool_create(&e->iterator_pool, slab_cache,
 	               sizeof(struct vinyl_iterator));
-	vy_quota_create(&e->quota, vy_env_quota_exceeded_cb);
-	ev_timer_init(&e->quota_timer, vy_env_quota_timer_cb, 0,
-		      VY_QUOTA_UPDATE_INTERVAL);
-	e->quota_timer.data = e;
-	ev_timer_start(loop(), &e->quota_timer);
 	vy_cache_env_create(&e->cache_env, slab_cache);
 	vy_run_env_create(&e->run_env);
 	vy_log_init(e->path);
 	return e;
+error_quota:
+	vy_lsm_env_destroy(&e->lsm_env);
 error_lsm_env:
 	vy_mem_env_destroy(&e->mem_env);
 	vy_scheduler_destroy(&e->scheduler);
@@ -2603,8 +2484,6 @@ error_lsm_env:
 error_squash_queue:
 	tx_manager_delete(e->xm);
 error_xm:
-	histogram_delete(e->dump_bw);
-error_dump_bw:
 	free(e->path);
 error_path:
 	free(e);
@@ -2614,12 +2493,10 @@ error_path:
 static void
 vy_env_delete(struct vy_env *e)
 {
-	ev_timer_stop(loop(), &e->quota_timer);
 	vy_scheduler_destroy(&e->scheduler);
 	vy_squash_queue_delete(e->squash_queue);
 	tx_manager_delete(e->xm);
 	free(e->path);
-	histogram_delete(e->dump_bw);
 	mempool_destroy(&e->iterator_pool);
 	vy_run_env_destroy(&e->run_env);
 	vy_lsm_env_destroy(&e->lsm_env);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 6e93d652..c8177c69 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -32,30 +32,127 @@
 
 #include <assert.h>
 #include <stddef.h>
+#include <stdint.h>
+#include <math.h>
 #include <tarantool_ev.h>
 
+#include "diag.h"
 #include "fiber.h"
 #include "fiber_cond.h"
 #include "say.h"
+#include "histogram.h"
+#include "trivia/util.h"
 
-void
+enum {
+	/**
+	 * Time interval between successive updates of
+	 * quota watermark and use rate, in seconds.
+	 */
+	VY_QUOTA_UPDATE_INTERVAL = 1,
+	/**
+	 * Period of time over which the quota use rate
+	 * is averaged, in seconds.
+	 */
+	VY_QUOTA_RATE_AVG_PERIOD = 5,
+};
+
+static void
+vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
+{
+	(void)loop;
+	(void)events;
+
+	struct vy_quota *q = timer->data;
+
+	/*
+	 * Update the quota use rate with the new measurement.
+	 */
+	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
+				      (double)VY_QUOTA_RATE_AVG_PERIOD);
+	q->use_rate = (1 - weight) * q->use_rate +
+		weight * q->use_curr / VY_QUOTA_UPDATE_INTERVAL;
+	q->use_curr = 0;
+
+	/*
+	 * Due to log structured nature of the lsregion allocator,
+	 * which is used for allocating statements, we cannot free
+	 * memory in chunks, only all at once. Therefore we should
+	 * configure the watermark so that by the time we hit the
+	 * limit, all memory have been dumped, i.e.
+	 *
+	 *   limit - watermark      watermark
+	 *   ----------------- = --------------
+	 *        use_rate       dump_bandwidth
+	 */
+	size_t dump_bandwidth = vy_quota_dump_bandwidth(q);
+	q->watermark = ((double)q->limit * dump_bandwidth /
+			(dump_bandwidth + q->use_rate + 1));
+	if (q->used >= q->watermark)
+		q->quota_exceeded_cb(q);
+}
+
+int
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 {
+	enum { KB = 1000, MB = 1000 * 1000 };
+	static int64_t dump_bandwidth_buckets[] = {
+		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB,
+		  1 * MB,   2 * MB,   3 * MB,   4 * MB,   5 * MB,
+		 10 * MB,  20 * MB,  30 * MB,  40 * MB,  50 * MB,
+		 60 * MB,  70 * MB,  80 * MB,  90 * MB, 100 * MB,
+		110 * MB, 120 * MB, 130 * MB, 140 * MB, 150 * MB,
+		160 * MB, 170 * MB, 180 * MB, 190 * MB, 200 * MB,
+		220 * MB, 240 * MB, 260 * MB, 280 * MB, 300 * MB,
+		320 * MB, 340 * MB, 360 * MB, 380 * MB, 400 * MB,
+		450 * MB, 500 * MB, 550 * MB, 600 * MB, 650 * MB,
+		700 * MB, 750 * MB, 800 * MB, 850 * MB, 900 * MB,
+		950 * MB, 1000 * MB,
+	};
+
+	q->dump_bw = histogram_new(dump_bandwidth_buckets,
+				   lengthof(dump_bandwidth_buckets));
+	if (q->dump_bw == NULL) {
+		diag_set(OutOfMemory, 0, "histogram_new",
+			 "dump bandwidth histogram");
+		return -1;
+	}
+	/*
+	 * Until we dump anything, assume bandwidth to be 10 MB/s,
+	 * which should be fine for initial guess.
+	 */
+	histogram_collect(q->dump_bw, 10 * MB);
+
 	q->limit = SIZE_MAX;
 	q->watermark = SIZE_MAX;
 	q->used = 0;
+	q->use_curr = 0;
+	q->use_rate = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	fiber_cond_create(&q->cond);
+	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
+		      VY_QUOTA_UPDATE_INTERVAL);
+	q->timer.data = q;
+	ev_timer_start(loop(), &q->timer);
+	return 0;
 }
 
 void
 vy_quota_destroy(struct vy_quota *q)
 {
+	ev_timer_stop(loop(), &q->timer);
+	histogram_delete(q->dump_bw);
 	fiber_cond_broadcast(&q->cond);
 	fiber_cond_destroy(&q->cond);
 }
 
+size_t
+vy_quota_dump_bandwidth(struct vy_quota *q)
+{
+	/* See comment to vy_quota::dump_bw. */
+	return histogram_percentile(q->dump_bw, 10);
+}
+
 void
 vy_quota_set_limit(struct vy_quota *q, size_t limit)
 {
@@ -66,27 +163,24 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
-vy_quota_set_watermark(struct vy_quota *q, size_t watermark)
-{
-	q->watermark = watermark;
-	if (q->used >= watermark)
-		q->quota_exceeded_cb(q);
-}
-
-void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
+	q->use_curr += size;
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 }
 
 void
-vy_quota_dump(struct vy_quota *q, size_t size)
+vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 {
 	assert(q->used >= size);
 	q->used -= size;
 	fiber_cond_broadcast(&q->cond);
+
+	/* Account dump bandwidth. */
+	if (duration > 0)
+		histogram_collect(q->dump_bw, size / duration);
 }
 
 int
@@ -107,6 +201,7 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 	if (q->used + size > q->limit)
 		return -1;
 	q->used += size;
+	q->use_curr += size;
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 	return 0;
@@ -119,6 +214,10 @@ vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
 		size_t excess = reserved - used;
 		assert(q->used >= excess);
 		q->used -= excess;
+		if (q->use_curr >= excess)
+			q->use_curr -= excess;
+		else /* was reset by timeout */
+			q->use_curr = 0;
 		fiber_cond_broadcast(&q->cond);
 	}
 	if (reserved < used)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index cf70b1ab..3a7a24e7 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -32,6 +32,7 @@
  */
 
 #include <stddef.h>
+#include <tarantool_ev.h>
 #include "fiber_cond.h"
 
 #if defined(__cplusplus)
@@ -39,6 +40,7 @@ extern "C" {
 #endif /* defined(__cplusplus) */
 
 struct vy_quota;
+struct histogram;
 
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
@@ -76,14 +78,43 @@ struct vy_quota {
 	 * It is supposed to trigger memory reclaim.
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
+	/** Timer for updating quota watermark. */
+	ev_timer timer;
+	/**
+	 * Amount of quota used since the last
+	 * invocation of the quota timer callback.
+	 */
+	size_t use_curr;
+	/**
+	 * Quota use rate, in bytes per second.
+	 * Calculated as exponentially weighted
+	 * moving average of use_curr.
+	 */
+	size_t use_rate;
+	/**
+	 * Dump bandwidth is needed for calculating the quota watermark.
+	 * The higher the bandwidth, the later we can start dumping w/o
+	 * suffering from transaction throttling. So we want to be very
+	 * conservative about estimating the bandwidth.
+	 *
+	 * To make sure we don't overestimate it, we maintain a
+	 * histogram of all observed measurements and assume the
+	 * bandwidth to be equal to the 10th percentile, i.e. the
+	 * best result among 10% worst measurements.
+	 */
+	struct histogram *dump_bw;
 };
 
-void
+int
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
 
 void
 vy_quota_destroy(struct vy_quota *q);
 
+/** Return quota dump bandwidth. */
+size_t
+vy_quota_dump_bandwidth(struct vy_quota *q);
+
 /**
  * Set memory limit. If current memory usage exceeds
  * the new limit, invoke the callback.
@@ -92,13 +123,6 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
- * Set memory watermark. If current memory usage exceeds
- * the new watermark, invoke the callback.
- */
-void
-vy_quota_set_watermark(struct vy_quota *q, size_t watermark);
-
-/**
  * Consume @size bytes of memory. In contrast to vy_quota_try_use()
  * this function does not throttle the caller.
  */
@@ -108,9 +132,12 @@ vy_quota_force_use(struct vy_quota *q, size_t size);
 /**
  * Function called on dump completion to release quota after
  * freeing memory.
+ *
+ * @size: size of dumped memory.
+ * @duration: how long memory dump took.
  */
 void
-vy_quota_dump(struct vy_quota *q, size_t size);
+vy_quota_dump(struct vy_quota *q, size_t size, double duration);
 
 /**
  * Try to consume @size bytes of memory, throttle the caller
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-08-16 16:11 ` [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota Vladimir Davydov
@ 2018-08-16 16:11 ` Vladimir Davydov
  2018-08-20 11:09   ` Konstantin Osipov
  2018-08-27 18:36   ` Vladimir Davydov
  2018-08-16 16:11 ` [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one Vladimir Davydov
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

So that there's a single place where we can wait for quota. It should
make it easier to implement quota throttling.
---
 src/box/vy_quota.c | 7 -------
 src/box/vy_quota.h | 7 +++++--
 2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index c8177c69..3ea7c80f 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -223,10 +223,3 @@ vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
 	if (reserved < used)
 		vy_quota_force_use(q, used - reserved);
 }
-
-void
-vy_quota_wait(struct vy_quota *q)
-{
-	while (q->used > q->limit)
-		fiber_cond_wait(&q->cond);
-}
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 3a7a24e7..9bfb48f5 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -190,8 +190,11 @@ vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used);
 /**
  * Block the caller until the quota is not exceeded.
  */
-void
-vy_quota_wait(struct vy_quota *q);
+static inline void
+vy_quota_wait(struct vy_quota *q)
+{
+	vy_quota_try_use(q, 0, TIMEOUT_INFINITY);
+}
 
 #if defined(__cplusplus)
 } /* extern "C" */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-08-16 16:11 ` [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use Vladimir Davydov
@ 2018-08-16 16:11 ` Vladimir Davydov
  2018-08-20 11:11   ` Konstantin Osipov
  2018-08-28 13:19   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable Vladimir Davydov
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:11 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we wake up all fibers whenever we free some memory. This
is inefficient, because it might occur that all available quota gets
consumed by a few fibers while the rest will have to go back to sleep.
This is also kinda unfair, because waking up all fibers breaks the order
in which the fibers were put to sleep. This works now, because we free
memory and wake up fibers infrequently (on dump) and there normally
shouldn't be any fibers waiting for quota (if there were, the latency
would rocket sky high because of absence of any kind of throttling).
However, once throttling is introduced, fibers waiting for quota will
become the norm. So let's wake up fibers one by one: whenever we free
memory we wake up the first fiber in the line, which will wake up the
next fiber on success and so forth.
---
 src/box/vy_quota.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 3ea7c80f..3677e22e 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -159,7 +159,7 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 	q->limit = q->watermark = limit;
 	if (q->used >= limit)
 		q->quota_exceeded_cb(q);
-	fiber_cond_broadcast(&q->cond);
+	fiber_cond_signal(&q->cond);
 }
 
 void
@@ -176,7 +176,7 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	fiber_cond_broadcast(&q->cond);
+	fiber_cond_signal(&q->cond);
 
 	/* Account dump bandwidth. */
 	if (duration > 0)
@@ -204,6 +204,9 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 	q->use_curr += size;
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
+
+	/* Wake up the next fiber in the line waiting for quota. */
+	fiber_cond_signal(&q->cond);
 	return 0;
 }
 
@@ -218,7 +221,7 @@ vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
 			q->use_curr -= excess;
 		else /* was reset by timeout */
 			q->use_curr = 0;
-		fiber_cond_broadcast(&q->cond);
+		fiber_cond_signal(&q->cond);
 	}
 	if (reserved < used)
 		vy_quota_force_use(q, used - reserved);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-08-16 16:11 ` [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:13   ` Konstantin Osipov
  2018-08-16 16:12 ` [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets Vladimir Davydov
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

There's no point to do that as they go back to sleep anyway. What is
worse, such spurious wakeups may break the order in which fibers wait
for quota.
---
 src/box/vy_quota.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 3677e22e..8cea4deb 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -56,6 +56,17 @@ enum {
 	VY_QUOTA_RATE_AVG_PERIOD = 5,
 };
 
+/**
+ * Wake up the next fiber in the line waiting for quota
+ * provided quota is available.
+ */
+static inline void
+vy_quota_signal(struct vy_quota *q)
+{
+	if (q->used < q->limit)
+		fiber_cond_signal(&q->cond);
+}
+
 static void
 vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 {
@@ -159,7 +170,7 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 	q->limit = q->watermark = limit;
 	if (q->used >= limit)
 		q->quota_exceeded_cb(q);
-	fiber_cond_signal(&q->cond);
+	vy_quota_signal(q);
 }
 
 void
@@ -176,7 +187,7 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	fiber_cond_signal(&q->cond);
+	vy_quota_signal(q);
 
 	/* Account dump bandwidth. */
 	if (duration > 0)
@@ -206,7 +217,7 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 		q->quota_exceeded_cb(q);
 
 	/* Wake up the next fiber in the line waiting for quota. */
-	fiber_cond_signal(&q->cond);
+	vy_quota_signal(q);
 	return 0;
 }
 
@@ -221,7 +232,7 @@ vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
 			q->use_curr -= excess;
 		else /* was reset by timeout */
 			q->use_curr = 0;
-		fiber_cond_signal(&q->cond);
+		vy_quota_signal(q);
 	}
 	if (reserved < used)
 		vy_quota_force_use(q, used - reserved);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (5 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:15   ` Konstantin Osipov
  2018-08-28 15:37   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist Vladimir Davydov
                   ` (10 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Typically, dump bandwidth varies from 10 MB to 100 MB per second so
let's use 5 MB bucket granularity in this range. Values less than
10 MB/s can also be observed, because the user can limit disk rate
with box.cfg.snap_io_rate_limit so use 1 MB granularity between 1 MB
and 10 MB and 100 KB granularity between 100 KB and 1 MB. A write rate
greater than 100 MB/s is unlikely in practice, even on very fast disks,
since dump bandwidth is also limited by CPU, so use 100 MB granularity
there.
---
 src/box/vy_quota.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 8cea4deb..99da69ae 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -105,19 +105,16 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 int
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 {
-	enum { KB = 1000, MB = 1000 * 1000 };
+	enum { KB = 1024, MB = KB * KB };
 	static int64_t dump_bandwidth_buckets[] = {
-		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB,
-		  1 * MB,   2 * MB,   3 * MB,   4 * MB,   5 * MB,
-		 10 * MB,  20 * MB,  30 * MB,  40 * MB,  50 * MB,
-		 60 * MB,  70 * MB,  80 * MB,  90 * MB, 100 * MB,
-		110 * MB, 120 * MB, 130 * MB, 140 * MB, 150 * MB,
-		160 * MB, 170 * MB, 180 * MB, 190 * MB, 200 * MB,
-		220 * MB, 240 * MB, 260 * MB, 280 * MB, 300 * MB,
-		320 * MB, 340 * MB, 360 * MB, 380 * MB, 400 * MB,
-		450 * MB, 500 * MB, 550 * MB, 600 * MB, 650 * MB,
-		700 * MB, 750 * MB, 800 * MB, 850 * MB, 900 * MB,
-		950 * MB, 1000 * MB,
+		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB, 600 * KB,
+		700 * KB, 800 * KB, 900 * KB,   1 * MB,   2 * MB,   3 * MB,
+		  4 * MB,   5 * MB,   6 * MB,   7 * MB,   8 * MB,   9 * MB,
+		 10 * MB,  15 * MB,  20 * MB,  25 * MB,  30 * MB,  35 * MB,
+		 40 * MB,  45 * MB,  50 * MB,  55 * MB,  60 * MB,  65 * MB,
+		 70 * MB,  75 * MB,  80 * MB,  85 * MB,  90 * MB,  95 * MB,
+		100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
+		700 * MB, 800 * MB, 900 * MB,
 	};
 
 	q->dump_bw = histogram_new(dump_bandwidth_buckets,
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (6 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:15   ` Konstantin Osipov
  2018-08-28 16:04   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation Vladimir Davydov
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

No functional changes. Needed to facilitate review.
---
 src/box/vy_quota.c | 16 ++++++++--------
 src/box/vy_quota.h |  2 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 99da69ae..bcdc4c44 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -117,9 +117,9 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 		700 * MB, 800 * MB, 900 * MB,
 	};
 
-	q->dump_bw = histogram_new(dump_bandwidth_buckets,
-				   lengthof(dump_bandwidth_buckets));
-	if (q->dump_bw == NULL) {
+	q->dump_bw_hist = histogram_new(dump_bandwidth_buckets,
+					lengthof(dump_bandwidth_buckets));
+	if (q->dump_bw_hist == NULL) {
 		diag_set(OutOfMemory, 0, "histogram_new",
 			 "dump bandwidth histogram");
 		return -1;
@@ -128,7 +128,7 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	 * Until we dump anything, assume bandwidth to be 10 MB/s,
 	 * which should be fine for initial guess.
 	 */
-	histogram_collect(q->dump_bw, 10 * MB);
+	histogram_collect(q->dump_bw_hist, 10 * MB);
 
 	q->limit = SIZE_MAX;
 	q->watermark = SIZE_MAX;
@@ -149,7 +149,7 @@ void
 vy_quota_destroy(struct vy_quota *q)
 {
 	ev_timer_stop(loop(), &q->timer);
-	histogram_delete(q->dump_bw);
+	histogram_delete(q->dump_bw_hist);
 	fiber_cond_broadcast(&q->cond);
 	fiber_cond_destroy(&q->cond);
 }
@@ -157,8 +157,8 @@ vy_quota_destroy(struct vy_quota *q)
 size_t
 vy_quota_dump_bandwidth(struct vy_quota *q)
 {
-	/* See comment to vy_quota::dump_bw. */
-	return histogram_percentile(q->dump_bw, 10);
+	/* See comment to vy_quota::dump_bw_hist. */
+	return histogram_percentile(q->dump_bw_hist, 10);
 }
 
 void
@@ -188,7 +188,7 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 
 	/* Account dump bandwidth. */
 	if (duration > 0)
-		histogram_collect(q->dump_bw, size / duration);
+		histogram_collect(q->dump_bw_hist, size / duration);
 }
 
 int
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 9bfb48f5..0a7427f0 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -102,7 +102,7 @@ struct vy_quota {
 	 * bandwidth to be equal to the 10th percentile, i.e. the
 	 * best result among 10% worst measurements.
 	 */
-	struct histogram *dump_bw;
+	struct histogram *dump_bw_hist;
 };
 
 int
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (7 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:21   ` Konstantin Osipov
  2018-08-28 16:10   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
                   ` (8 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We don't need to compute a percentile of dump bandwidth histogram on
each invocation of quota timer callback, because it may only be updated
on dump completion. Let's cache it. Currently, it isn't that important,
because the timer period is set to 1 second. However, once we start
using the timer for throttling, we'll have to make it run more often and
so caching the dump bandwidth value will make sense.
---
 src/box/vinyl.c    |  2 +-
 src/box/vy_quota.c | 26 ++++++++++++++------------
 src/box/vy_quota.h |  6 ++----
 3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index a07f661f..8d315e87 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -255,7 +255,7 @@ vy_info_append_quota(struct vy_env *env, struct info_handler *h)
 	info_append_int(h, "limit", q->limit);
 	info_append_int(h, "watermark", q->watermark);
 	info_append_int(h, "use_rate", q->use_rate);
-	info_append_int(h, "dump_bandwidth", vy_quota_dump_bandwidth(q));
+	info_append_int(h, "dump_bandwidth", q->dump_bw);
 	info_table_end(h);
 }
 
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index bcdc4c44..dfe336c6 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -57,6 +57,12 @@ enum {
 };
 
 /**
+ * Histogram percentile used for estimating dump bandwidth.
+ * For details see the comment to vy_quota::dump_bw_hist.
+ */
+enum { VY_DUMP_BANDWIDTH_PCT = 10 };
+
+/**
  * Wake up the next fiber in the line waiting for quota
  * provided quota is available.
  */
@@ -95,9 +101,8 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	 *   ----------------- = --------------
 	 *        use_rate       dump_bandwidth
 	 */
-	size_t dump_bandwidth = vy_quota_dump_bandwidth(q);
-	q->watermark = ((double)q->limit * dump_bandwidth /
-			(dump_bandwidth + q->use_rate + 1));
+	q->watermark = ((double)q->limit * q->dump_bw /
+			(q->dump_bw + q->use_rate + 1));
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 }
@@ -128,7 +133,8 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	 * Until we dump anything, assume bandwidth to be 10 MB/s,
 	 * which should be fine for initial guess.
 	 */
-	histogram_collect(q->dump_bw_hist, 10 * MB);
+	q->dump_bw = 10 * MB;
+	histogram_collect(q->dump_bw_hist, q->dump_bw);
 
 	q->limit = SIZE_MAX;
 	q->watermark = SIZE_MAX;
@@ -154,13 +160,6 @@ vy_quota_destroy(struct vy_quota *q)
 	fiber_cond_destroy(&q->cond);
 }
 
-size_t
-vy_quota_dump_bandwidth(struct vy_quota *q)
-{
-	/* See comment to vy_quota::dump_bw_hist. */
-	return histogram_percentile(q->dump_bw_hist, 10);
-}
-
 void
 vy_quota_set_limit(struct vy_quota *q, size_t limit)
 {
@@ -187,8 +186,11 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 	vy_quota_signal(q);
 
 	/* Account dump bandwidth. */
-	if (duration > 0)
+	if (duration > 0) {
 		histogram_collect(q->dump_bw_hist, size / duration);
+		q->dump_bw = histogram_percentile(q->dump_bw_hist,
+						  VY_DUMP_BANDWIDTH_PCT);
+	}
 }
 
 int
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 0a7427f0..33672a39 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -91,6 +91,8 @@ struct vy_quota {
 	 * moving average of use_curr.
 	 */
 	size_t use_rate;
+	/** Current dump bandwidth estimate. */
+	size_t dump_bw;
 	/**
 	 * Dump bandwidth is needed for calculating the quota watermark.
 	 * The higher the bandwidth, the later we can start dumping w/o
@@ -111,10 +113,6 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
 void
 vy_quota_destroy(struct vy_quota *q);
 
-/** Return quota dump bandwidth. */
-size_t
-vy_quota_dump_bandwidth(struct vy_quota *q);
-
 /**
  * Set memory limit. If current memory usage exceeds
  * the new limit, invoke the callback.
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (8 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:23   ` Konstantin Osipov
                     ` (2 more replies)
  2018-08-16 16:12 ` [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate Vladimir Davydov
                   ` (7 subsequent siblings)
  17 siblings, 3 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Do not add the initial guess to the histogram, because otherwise it
takes more than 10 dumps to get the real dump bandwidth in case the
initial value is less (we use 10th percentile).
---
 src/box/vy_quota.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index dfe336c6..96f111f0 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -56,6 +56,12 @@ enum {
 	VY_QUOTA_RATE_AVG_PERIOD = 5,
 };
 
+/*
+ * Until we dump anything, assume bandwidth to be 10 MB/s,
+ * which should be fine for initial guess.
+ */
+static const size_t VY_DEFAULT_DUMP_BANDWIDTH = 10 * 1024 * 1024;
+
 /**
  * Histogram percentile used for estimating dump bandwidth.
  * For details see the comment to vy_quota::dump_bw_hist.
@@ -129,12 +135,6 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 			 "dump bandwidth histogram");
 		return -1;
 	}
-	/*
-	 * Until we dump anything, assume bandwidth to be 10 MB/s,
-	 * which should be fine for initial guess.
-	 */
-	q->dump_bw = 10 * MB;
-	histogram_collect(q->dump_bw_hist, q->dump_bw);
 
 	q->limit = SIZE_MAX;
 	q->watermark = SIZE_MAX;
@@ -142,6 +142,7 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	q->use_curr = 0;
 	q->use_rate = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
+	q->dump_bw = VY_DEFAULT_DUMP_BANDWIDTH;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	fiber_cond_create(&q->cond);
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (9 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:24   ` Konstantin Osipov
  2018-08-28 16:18   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 12/18] histogram: add function for computing lower bound percentile estimate Vladimir Davydov
                   ` (6 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a
value, which is less than the current estimate. To avoid stalls caused
by overestimating dump bandwidth, we must take into account the limit
for the initial guess and forget all observations whenever it changes.
---
 src/box/vinyl.c    | 4 +++-
 src/box/vy_quota.c | 7 +++++++
 src/box/vy_quota.h | 7 +++++++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 8d315e87..3699fff8 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2583,7 +2583,9 @@ vinyl_engine_set_too_long_threshold(struct vinyl_engine *vinyl,
 void
 vinyl_engine_set_snap_io_rate_limit(struct vinyl_engine *vinyl, double limit)
 {
-	vinyl->env->run_env.snap_io_rate_limit = limit * 1024 * 1024;
+	int64_t limit_in_bytes = limit * 1024 * 1024;
+	vinyl->env->run_env.snap_io_rate_limit = limit_in_bytes;
+	vy_quota_reset_dump_bw(&vinyl->env->quota, limit_in_bytes);
 }
 
 /** }}} Environment */
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 96f111f0..64c2ca04 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -171,6 +171,13 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
+vy_quota_reset_dump_bw(struct vy_quota *q, size_t max)
+{
+	histogram_reset(q->dump_bw_hist);
+	q->dump_bw = MIN(VY_DEFAULT_DUMP_BANDWIDTH, max);
+}
+
+void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 33672a39..287c50f2 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -121,6 +121,13 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
+ * Reset dump bandwidth histogram and update initial estimate.
+ * Called when box.cfg.snap_io_rate_limit is updated.
+ */
+void
+vy_quota_reset_dump_bw(struct vy_quota *q, size_t max);
+
+/**
  * Consume @size bytes of memory. In contrast to vy_quota_try_use()
  * this function does not throttle the caller.
  */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 12/18] histogram: add function for computing lower bound percentile estimate
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (10 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-20 11:29   ` [tarantool-patches] " Konstantin Osipov
  2018-08-28 16:39   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth Vladimir Davydov
                   ` (5 subsequent siblings)
  17 siblings, 2 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The value returned by histogram_percentile() is an upper bound estimate.
This is fine for measuring latency, because we are interested in the
worst, i.e. highest, observations, but doesn't suit particularly well
if we want to keep track of the lowest observations, as it is the case
with bandwidth. So this patch introduces histogram_percentile_lower(),
a function that is similar to histogram_percentile(), but returns a
lower bound estimate of the requested percentile.
---
 src/histogram.c       | 13 +++++++++++++
 src/histogram.h       |  7 +++++++
 test/unit/histogram.c |  4 ++++
 3 files changed, 24 insertions(+)

diff --git a/src/histogram.c b/src/histogram.c
index c934ee9f..8c6cd6c0 100644
--- a/src/histogram.c
+++ b/src/histogram.c
@@ -145,6 +145,19 @@ histogram_percentile(struct histogram *hist, int pct)
 	return hist->max;
 }
 
+int64_t
+histogram_percentile_lower(struct histogram *hist, int pct)
+{
+	size_t count = 0;
+
+	for (size_t i = 0; i < hist->n_buckets; i++) {
+		count += hist->buckets[i].count;
+		if (count * 100 > hist->total * pct)
+			return hist->buckets[i > 0 ? i - 1 : 0].max;
+	}
+	return hist->max;
+}
+
 int
 histogram_snprint(char *buf, int size, struct histogram *hist)
 {
diff --git a/src/histogram.h b/src/histogram.h
index 95d47d40..2c3df5d1 100644
--- a/src/histogram.h
+++ b/src/histogram.h
@@ -90,6 +90,13 @@ int64_t
 histogram_percentile(struct histogram *hist, int pct);
 
 /**
+ * Same as histogram_percentile(), but return a lower bound
+ * estimate of the percentile.
+ */
+int64_t
+histogram_percentile_lower(struct histogram *hist, int pct);
+
+/**
  * Print string representation of a histogram.
  */
 int
diff --git a/test/unit/histogram.c b/test/unit/histogram.c
index 4c980d24..a026be26 100644
--- a/test/unit/histogram.c
+++ b/test/unit/histogram.c
@@ -154,14 +154,18 @@ test_percentile(void)
 	for (int pct = 5; pct < 100; pct += 5) {
 		int64_t val = data[data_len * pct / 100];
 		int64_t expected = max;
+		int64_t expected_lo = max;
 		for (size_t b = 0; b < n_buckets; b++) {
 			if (buckets[b] >= val) {
 				expected = buckets[b];
+				expected_lo = buckets[b > 0 ? b - 1 : 0];
 				break;
 			}
 		}
 		int64_t result = histogram_percentile(hist, pct);
 		fail_if(result != expected);
+		int64_t result_lo = histogram_percentile_lower(hist, pct);
+		fail_if(result_lo != expected_lo);
 	}
 
 	histogram_delete(hist);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (11 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 12/18] histogram: add function for computing lower bound percentile estimate Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-28 16:51   ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 14/18] vinyl: do not try to trigger dump if it is already in progress Vladimir Davydov
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Use a lower bound estimate in order not to overestimate dump bandwidth.
For example, if an observation of 12 MB/s falls in bucket 10 .. 15, we
should use 10 MB/s to avoid stalls.
---
 src/box/vy_quota.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 64c2ca04..620155ab 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -196,8 +196,13 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 	/* Account dump bandwidth. */
 	if (duration > 0) {
 		histogram_collect(q->dump_bw_hist, size / duration);
-		q->dump_bw = histogram_percentile(q->dump_bw_hist,
-						  VY_DUMP_BANDWIDTH_PCT);
+		/*
+		 * To avoid unpredictably long stalls, we need to
+		 * know the worst (smallest) dump bandwidth so use
+		 * a lower-bound percentile estimate.
+		 */
+		q->dump_bw = histogram_percentile_lower(q->dump_bw_hist,
+							VY_DUMP_BANDWIDTH_PCT);
 	}
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 14/18] vinyl: do not try to trigger dump if it is already in progress
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (12 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 15/18] vinyl: improve dump start/stop logging Vladimir Davydov
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, vy_quota_try_use() calls vy_quota_exceeded_cb() every time
it sees that the memory usage exceeds the watermark. Actually, this is
pointless, because the callback will return immediately if dump is
already in progress. Let's introduce a flag vy_quota::dump_in_progress
and skip the call to vy_quota_exceeded_cb() if dump has already been
triggered. The flag is cleared when dump by vy_quota_dump(), which is
called when dump is complete. This also gives us a place in the code
where we can remember the time and memory usage at the time when dump
was started, which will be useful for transaction throttling.
---
 src/box/vinyl.c    |  5 +++--
 src/box/vy_quota.c | 46 +++++++++++++++++++++++++++++++++-------------
 src/box/vy_quota.h | 12 ++++++++++--
 3 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 3699fff8..25a937bc 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2366,7 +2366,7 @@ vinyl_engine_rollback_statement(struct engine *engine, struct txn *txn,
 
 /** {{{ Environment */
 
-static void
+static bool
 vy_env_quota_exceeded_cb(struct vy_quota *quota)
 {
 	struct vy_env *env = container_of(quota, struct vy_env, quota);
@@ -2393,9 +2393,10 @@ vy_env_quota_exceeded_cb(struct vy_quota *quota)
 		 * quota has been consumed by pending transactions.
 		 * There's nothing we can do about that.
 		 */
-		return;
+		return false;
 	}
 	vy_scheduler_trigger_dump(&env->scheduler);
+	return true;
 }
 
 static void
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 620155ab..0250b3ab 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -79,6 +79,17 @@ vy_quota_signal(struct vy_quota *q)
 		fiber_cond_signal(&q->cond);
 }
 
+/**
+ * Trigger memory dump if memory usage is above the wateramrk
+ * and dump hasn't been triggered yet.
+ */
+static inline void
+vy_quota_check_watermark(struct vy_quota *q)
+{
+	if (!q->dump_in_progress && q->used >= q->watermark)
+		q->dump_in_progress = q->quota_exceeded_cb(q);
+}
+
 static void
 vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 {
@@ -109,8 +120,7 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	 */
 	q->watermark = ((double)q->limit * q->dump_bw /
 			(q->dump_bw + q->use_rate + 1));
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
+	vy_quota_check_watermark(q);
 }
 
 int
@@ -144,6 +154,7 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->dump_bw = VY_DEFAULT_DUMP_BANDWIDTH;
 	q->quota_exceeded_cb = quota_exceeded_cb;
+	q->dump_in_progress = false;
 	fiber_cond_create(&q->cond);
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
 		      VY_QUOTA_UPDATE_INTERVAL);
@@ -165,8 +176,7 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit)
 {
 	q->limit = q->watermark = limit;
-	if (q->used >= limit)
-		q->quota_exceeded_cb(q);
+	vy_quota_check_watermark(q);
 	vy_quota_signal(q);
 }
 
@@ -182,8 +192,7 @@ vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
 	q->use_curr += size;
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
+	vy_quota_check_watermark(q);
 }
 
 void
@@ -191,6 +200,7 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 {
 	assert(q->used >= size);
 	q->used -= size;
+	q->dump_in_progress = false;
 	vy_quota_signal(q);
 
 	/* Account dump bandwidth. */
@@ -211,9 +221,14 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 {
 	double start_time = ev_monotonic_now(loop());
 	double deadline = start_time + timeout;
-	while (q->used + size > q->limit && timeout > 0) {
-		q->quota_exceeded_cb(q);
-		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
+
+	q->used += size;
+	while (q->used > q->limit) {
+		vy_quota_check_watermark(q);
+		q->used -= size;
+		int rc = fiber_cond_wait_deadline(&q->cond, deadline);
+		q->used += size;
+		if (rc != 0)
 			break; /* timed out */
 	}
 	double wait_time = ev_monotonic_now(loop()) - start_time;
@@ -221,12 +236,17 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 		say_warn("waited for %zu bytes of vinyl memory quota "
 			 "for too long: %.3f sec", size, wait_time);
 	}
-	if (q->used + size > q->limit)
+	if (q->used > q->limit) {
+		/*
+		 * In case of faiure diag should have been set by
+		 * fiber_cond_wait_deadline().
+		 */
+		assert(!diag_is_empty(diag_get()));
+		q->used -= size;
 		return -1;
-	q->used += size;
+	}
 	q->use_curr += size;
-	if (q->used >= q->watermark)
-		q->quota_exceeded_cb(q);
+	vy_quota_check_watermark(q);
 
 	/* Wake up the next fiber in the line waiting for quota. */
 	vy_quota_signal(q);
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 287c50f2..cb681386 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -31,6 +31,7 @@
  * SUCH DAMAGE.
  */
 
+#include <stdbool.h>
 #include <stddef.h>
 #include <tarantool_ev.h>
 #include "fiber_cond.h"
@@ -42,7 +43,7 @@ extern "C" {
 struct vy_quota;
 struct histogram;
 
-typedef void
+typedef bool
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
 /**
@@ -75,9 +76,16 @@ struct vy_quota {
 	struct fiber_cond cond;
 	/**
 	 * Called when quota is consumed if used >= watermark.
-	 * It is supposed to trigger memory reclaim.
+	 * It is supposed to trigger memory dump and return true
+	 * on success, false on failure.
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
+	/**
+	 * Set if the last triggered memory dump hasn't completed
+	 * yet, i.e. quota_exceeded_cb() was invoked and returned
+	 * true, but vy_quota_dump() hasn't been called yet.
+	 */
+	bool dump_in_progress;
 	/** Timer for updating quota watermark. */
 	ev_timer timer;
 	/**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 15/18] vinyl: improve dump start/stop logging
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (13 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 14/18] vinyl: do not try to trigger dump if it is already in progress Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-23 20:18   ` Konstantin Osipov
  2018-08-16 16:12 ` [PATCH 16/18] vinyl: confine quota watermark within sane value range Vladimir Davydov
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

When initiating memory dump, print how much memory is going to be
dumped, expected dump rate, ETA, and the recent write rate. Upon dump
completion, print observed dump rate in addition to dump size and
duration.

Example:

  2018-08-16 14:11:40.802 [19044] main/2096/main I> dumping 134217746 bytes, expected rate 4.0 MB/s, ETA 32.0 s, recent write rate 16.6 MB/s
  2018-08-16 14:11:40.843 [19044] main/103/vinyl.scheduler I> 513/0: dump started
  2018-08-16 14:11:40.844 [19044] vinyl.writer.0/102/task I> writing `./513/0/00000000000000000002.run'
  2018-08-16 14:12:01.394 [19044] vinyl.writer.0/102/task I> writing `./513/0/00000000000000000002.index'
  2018-08-16 14:12:01.467 [19044] main/103/vinyl.scheduler I> 513/0: dump completed
  2018-08-16 14:12:01.467 [19044] main/103/vinyl.scheduler I> dumped 134214971 bytes in 20.7 s, rate 6.2 MB/s
---
 src/box/vinyl.c    |  2 --
 src/box/vy_quota.c | 14 ++++++++++++--
 2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 25a937bc..ec7045b5 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2414,8 +2414,6 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	assert(mem_used_after <= mem_used_before);
 	size_t mem_dumped = mem_used_before - mem_used_after;
 	vy_quota_dump(quota, mem_dumped, dump_duration);
-
-	say_info("dumped %zu bytes in %.1f sec", mem_dumped, dump_duration);
 }
 
 static struct vy_squash_queue *
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 0250b3ab..c22a8519 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -86,8 +86,15 @@ vy_quota_signal(struct vy_quota *q)
 static inline void
 vy_quota_check_watermark(struct vy_quota *q)
 {
-	if (!q->dump_in_progress && q->used >= q->watermark)
-		q->dump_in_progress = q->quota_exceeded_cb(q);
+	if (!q->dump_in_progress &&
+	    q->used >= q->watermark && q->quota_exceeded_cb(q)) {
+		q->dump_in_progress = true;
+		say_info("dumping %zu bytes, expected rate %.1f MB/s, "
+			 "ETA %.1f s, recent write rate %.1f MB/s", q->used,
+			 (double)q->dump_bw / 1024 / 1024,
+			 (double)q->used / (q->dump_bw + 1),
+			 (double)q->use_rate / 1024 / 1024);
+	}
 }
 
 static void
@@ -214,6 +221,9 @@ vy_quota_dump(struct vy_quota *q, size_t size, double duration)
 		q->dump_bw = histogram_percentile_lower(q->dump_bw_hist,
 							VY_DUMP_BANDWIDTH_PCT);
 	}
+
+	say_info("dumped %zu bytes in %.1f s, rate %.1f MB/s",
+		 size, duration, size / duration / 1024 / 1024);
 }
 
 int
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 16/18] vinyl: confine quota watermark within sane value range
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (14 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 15/18] vinyl: improve dump start/stop logging Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 17/18] vinyl: set quota timer period to 100 ms Vladimir Davydov
  2018-08-16 16:12 ` [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up Vladimir Davydov
  17 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Make sure the watermark is within 50 .. 90% of the memory limit.
See the comment in the code for the rationale.
---
 src/box/vy_quota.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 53 insertions(+), 8 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index c22a8519..43fc645a 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -69,6 +69,51 @@ static const size_t VY_DEFAULT_DUMP_BANDWIDTH = 10 * 1024 * 1024;
 enum { VY_DUMP_BANDWIDTH_PCT = 10 };
 
 /**
+ * Min and max values of watermark, in percentage of limit.
+ *
+ * We set the watermark so that we can dump all memory below it
+ * before we hit the hard limit:
+ *
+ *   limit - watermark      watermark
+ *   ----------------- = --------------
+ *        use_rate       dump_bandwidth
+ *
+ * This is done that way, because due to the log structured
+ * nature of the allocator we cannot free memory in arbitrary
+ * chunks, only in whole generations, and we bump the generation
+ * counter only when a dump is triggered.  We could probably
+ * maintain more than two generations (active and the one being
+ * dumped), but that would make memory lookups more expensive
+ * (as we would have to maintain more than two in-memory trees
+ * for each index) and would also resulted in producing smaller
+ * run files, thus intensifying compaction.
+ *
+ * With such a memory dumping algorithm, setting the watermark to
+ * a value less than 50% doesn't make much sense. For instance,
+ * suppose the quota consumption rate is 3 times greater than the
+ * dump bandwidth. Then according to the formula we are supposed
+ * to set the watermark to 25%. If we did that, then by the time
+ * memory dump is complete we would have 75% of memory used up
+ * and hence would have to throttle the quota consumption rate
+ * down to one third of the dump bandwidth to avoid long stalls
+ * due to exhausted quota. Never setting watermark below 50%
+ * will give us a consistent RPS equal to the dump bandwidth.
+ *
+ * Setting the watermark to very high values (say 99%) is also
+ * not good, because in case the quota consumption rate suddenly
+ * raises we will have to throttle it to avoid stalls, and the
+ * higher the watermark the more repressive throttling we will
+ * have to exert until memory dump is complete. Limiting the max
+ * watermark to 90% can result in throttling to 1/10th of the
+ * dump bandwidth at worst, which is harsh, but tolerable (think
+ * of 1/100th for 99% watermark).
+ */
+enum {
+	VY_QUOTA_WATERMARK_MIN = 50,
+	VY_QUOTA_WATERMARK_MAX = 90,
+};
+
+/**
  * Wake up the next fiber in the line waiting for quota
  * provided quota is available.
  */
@@ -115,18 +160,18 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	q->use_curr = 0;
 
 	/*
-	 * Due to log structured nature of the lsregion allocator,
-	 * which is used for allocating statements, we cannot free
-	 * memory in chunks, only all at once. Therefore we should
-	 * configure the watermark so that by the time we hit the
-	 * limit, all memory have been dumped, i.e.
+	 * Update the quota watermark and trigger memory dump
+	 * if the watermark is exceeded.
 	 *
-	 *   limit - watermark      watermark
-	 *   ----------------- = --------------
-	 *        use_rate       dump_bandwidth
+	 * See the comment to VY_QUOTA_WATERMARK_MIN/MAX for
+	 * more details about the formula.
 	 */
 	q->watermark = ((double)q->limit * q->dump_bw /
 			(q->dump_bw + q->use_rate + 1));
+	q->watermark = MAX(q->limit * VY_QUOTA_WATERMARK_MIN / 100,
+			   q->watermark);
+	q->watermark = MIN(q->limit * VY_QUOTA_WATERMARK_MAX / 100,
+			   q->watermark);
 	vy_quota_check_watermark(q);
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 17/18] vinyl: set quota timer period to 100 ms
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (15 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 16/18] vinyl: confine quota watermark within sane value range Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-23 20:49   ` Konstantin Osipov
  2018-08-16 16:12 ` [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up Vladimir Davydov
  17 siblings, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, it's 1 second, which is OK for calculating watermark, but
too long for throttling (think of latency of 1 seconds that would be
introduced by throttling if such timeout were used).
---
 src/box/vy_quota.c | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 43fc645a..471a8bd0 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -43,18 +43,17 @@
 #include "histogram.h"
 #include "trivia/util.h"
 
-enum {
-	/**
-	 * Time interval between successive updates of
-	 * quota watermark and use rate, in seconds.
-	 */
-	VY_QUOTA_UPDATE_INTERVAL = 1,
-	/**
-	 * Period of time over which the quota use rate
-	 * is averaged, in seconds.
-	 */
-	VY_QUOTA_RATE_AVG_PERIOD = 5,
-};
+/**
+ * Time interval between successive updates of quota watermark
+ * and use rate, in seconds.
+ */
+static const double VY_QUOTA_UPDATE_INTERVAL = 0.1;
+
+/**
+ * Period of time over which the quota use rate is averaged,
+ * in seconds.
+ */
+static const double VY_QUOTA_RATE_AVG_PERIOD = 5;
 
 /*
  * Until we dump anything, assume bandwidth to be 10 MB/s,
@@ -154,7 +153,7 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	 * Update the quota use rate with the new measurement.
 	 */
 	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
-				      (double)VY_QUOTA_RATE_AVG_PERIOD);
+				      VY_QUOTA_RATE_AVG_PERIOD);
 	q->use_rate = (1 - weight) * q->use_rate +
 		weight * q->use_curr / VY_QUOTA_UPDATE_INTERVAL;
 	q->use_curr = 0;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up
  2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
                   ` (16 preceding siblings ...)
  2018-08-16 16:12 ` [PATCH 17/18] vinyl: set quota timer period to 100 ms Vladimir Davydov
@ 2018-08-16 16:12 ` Vladimir Davydov
  2018-08-23 20:54   ` Konstantin Osipov
  17 siblings, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-16 16:12 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If the rate at which transactions are ready to write to the database is
greater than the dump bandwidth, memory will get depleted before the
last dump is complete and all newer transactions will have to wait until
the dump has been completed, which may take seconds or even minutes:

  2018-08-16 15:45:11.739 [30874] main/1100/main vy_quota.c:291 W> waited for 555 bytes of vinyl memory quota for too long: 15.750 sec

This patch set implements transaction throttling that are supposed to
help avoid unpredictably long stalls. Now once a dump is started,
transaction write rate will be limited so that the hard limit cannot get
exceeded before the dump is complete. This is how it looks in the log:

  2018-08-16 16:01:09.412 [489] main/445/main I> dumping 134217901 bytes, expected rate 6.0 MB/s, ETA 21.3 s, recent write rate 10.5 MB/s
  2018-08-16 16:01:09.447 [489] main/103/vinyl.scheduler I> 513/0: dump started
  2018-08-16 16:01:09.447 [489] vinyl.writer.0/103/task I> writing `./513/0/00000000000000000004.run'
  2018-08-16 16:01:09.468 [489] main I> throttling enabled, max write rate 6.0 MB/s
  2018-08-16 16:01:30.004 [489] vinyl.writer.0/103/task I> writing `./513/0/00000000000000000004.index'
  2018-08-16 16:01:30.094 [489] main/103/vinyl.scheduler I> 513/0: dump completed
  2018-08-16 16:01:30.095 [489] main/103/vinyl.scheduler I> dumped 134216236 bytes in 20.7 s, rate 6.2 MB/s
  2018-08-16 16:01:30.167 [489] main I> throttling disabled

Closes #1862
---
 src/box/vy_quota.c           | 41 ++++++++++++++++++-
 src/box/vy_quota.h           | 13 ++++++
 test/vinyl/suite.ini         |  2 +-
 test/vinyl/throttle.result   | 95 ++++++++++++++++++++++++++++++++++++++++++++
 test/vinyl/throttle.test.lua | 47 ++++++++++++++++++++++
 5 files changed, 195 insertions(+), 3 deletions(-)
 create mode 100644 test/vinyl/throttle.result
 create mode 100644 test/vinyl/throttle.test.lua

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 471a8bd0..85e18d4b 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -119,7 +119,7 @@ enum {
 static inline void
 vy_quota_signal(struct vy_quota *q)
 {
-	if (q->used < q->limit)
+	if (q->used < q->limit && q->use_curr < q->use_max)
 		fiber_cond_signal(&q->cond);
 }
 
@@ -133,6 +133,8 @@ vy_quota_check_watermark(struct vy_quota *q)
 	if (!q->dump_in_progress &&
 	    q->used >= q->watermark && q->quota_exceeded_cb(q)) {
 		q->dump_in_progress = true;
+		q->dump_size = q->used;
+		q->dump_start = ev_monotonic_now(loop());
 		say_info("dumping %zu bytes, expected rate %.1f MB/s, "
 			 "ETA %.1f s, recent write rate %.1f MB/s", q->used,
 			 (double)q->dump_bw / 1024 / 1024,
@@ -148,6 +150,7 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	(void)events;
 
 	struct vy_quota *q = timer->data;
+	double now = ev_monotonic_now(loop());
 
 	/*
 	 * Update the quota use rate with the new measurement.
@@ -159,6 +162,34 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	q->use_curr = 0;
 
 	/*
+	 * To avoid unpredictably long stalls, we must limit
+	 * the write rate when a dump is in progress so that
+	 * we don't hit the hard limit before the dump has
+	 * completed, i.e.
+	 *
+	 *   left_to_use    left_to_dump
+	 *   ----------- <= ------------
+	 *     use_rate       dump_rate
+	 */
+	if (q->dump_in_progress) {
+		size_t dumped = q->dump_bw * (now - q->dump_start);
+		size_t left_to_dump = (dumped < q->dump_size ?
+				       q->dump_size - dumped : 0);
+		size_t left_to_use = (q->used < q->limit ?
+				      q->limit - q->used : 0);
+		double max_use_rate = (left_to_use * q->dump_bw /
+				       (left_to_dump + 1));
+		if (q->use_max == SIZE_MAX)
+			say_info("throttling enabled, max write rate "
+				 "%.1f MB/s", max_use_rate / 1024 / 1024);
+		q->use_max = VY_QUOTA_UPDATE_INTERVAL * max_use_rate;
+	} else {
+		if (q->use_max < SIZE_MAX)
+			say_info("throttling disabled");
+		q->use_max = SIZE_MAX;
+	}
+
+	/*
 	 * Update the quota watermark and trigger memory dump
 	 * if the watermark is exceeded.
 	 *
@@ -172,6 +203,9 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 	q->watermark = MIN(q->limit * VY_QUOTA_WATERMARK_MAX / 100,
 			   q->watermark);
 	vy_quota_check_watermark(q);
+
+	/* Wake up the next throttled fiber in the line. */
+	vy_quota_signal(q);
 }
 
 int
@@ -201,11 +235,14 @@ vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 	q->watermark = SIZE_MAX;
 	q->used = 0;
 	q->use_curr = 0;
+	q->use_max = SIZE_MAX;
 	q->use_rate = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->dump_bw = VY_DEFAULT_DUMP_BANDWIDTH;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	q->dump_in_progress = false;
+	q->dump_size = 0;
+	q->dump_start = 0;
 	fiber_cond_create(&q->cond);
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
 		      VY_QUOTA_UPDATE_INTERVAL);
@@ -277,7 +314,7 @@ vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
 	double deadline = start_time + timeout;
 
 	q->used += size;
-	while (q->used > q->limit) {
+	while (q->used > q->limit || q->use_curr >= q->use_max) {
 		vy_quota_check_watermark(q);
 		q->used -= size;
 		int rc = fiber_cond_wait_deadline(&q->cond, deadline);
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index cb681386..ef2fb6cb 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -86,6 +86,13 @@ struct vy_quota {
 	 * true, but vy_quota_dump() hasn't been called yet.
 	 */
 	bool dump_in_progress;
+	/**
+	 * Memory usage at the time when the last dump was started
+	 * (memory dump size).
+	 */
+	size_t dump_size;
+	/** Time when the last dump was started. */
+	double dump_start;
 	/** Timer for updating quota watermark. */
 	ev_timer timer;
 	/**
@@ -94,6 +101,12 @@ struct vy_quota {
 	 */
 	size_t use_curr;
 	/**
+	 * Maximal amount of quota that can be used between timer
+	 * callback invocations. It is set to such a value so that
+	 * the quota use rate never exceeds the dump bandwidth.
+	 */
+	size_t use_max;
+	/**
 	 * Quota use rate, in bytes per second.
 	 * Calculated as exponentially weighted
 	 * moving average of use_curr.
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index b9dae380..785bc63d 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -6,5 +6,5 @@ release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua part
 config = suite.cfg
 lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua
 use_unix_sockets = True
-long_run = stress.test.lua large.test.lua write_iterator_rand.test.lua dump_stress.test.lua select_consistency.test.lua
+long_run = stress.test.lua large.test.lua write_iterator_rand.test.lua dump_stress.test.lua select_consistency.test.lua throttle.test.lua
 is_parallel = False
diff --git a/test/vinyl/throttle.result b/test/vinyl/throttle.result
new file mode 100644
index 00000000..5634c40b
--- /dev/null
+++ b/test/vinyl/throttle.result
@@ -0,0 +1,95 @@
+test_run = require('test_run').new()
+---
+...
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+---
+- true
+...
+test_run:cmd(string.format("start server test with args='%d'", 32 * 1024 * 1024))
+---
+- true
+...
+test_run:cmd('switch test')
+---
+- true
+...
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+box.cfg{snap_io_rate_limit = 4}
+---
+...
+FIBER_COUNT = 5
+---
+...
+TUPLE_SIZE = 1000
+---
+...
+TX_TUPLE_COUNT = 10
+---
+...
+TX_SIZE = TUPLE_SIZE * TX_TUPLE_COUNT
+---
+...
+TX_COUNT = math.ceil(box.cfg.vinyl_memory / (TX_SIZE * FIBER_COUNT))
+---
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+_ = s:create_index('primary', {parts = {1, 'unsigned', 2, 'unsigned', 3, 'unsigned'}})
+---
+...
+latency = 0
+---
+...
+c = fiber.channel(FIBER_COUNT)
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+for i = 1, FIBER_COUNT do
+    fiber.create(function()
+        for j = 1, TX_COUNT do
+            local t1 = fiber.time()
+            box.begin()
+            for k = 1, TX_TUPLE_COUNT do
+                s:replace{i, j, k, digest.urandom(TUPLE_SIZE)}
+            end
+            box.commit()
+            local t2 = fiber.time()
+            latency = math.max(latency, t2 - t1)
+        end
+        c:put(true)
+    end)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+for i = 1, FIBER_COUNT do c:get() end
+---
+...
+latency < 0.5 or latency
+---
+- true
+...
+test_run:cmd('switch default')
+---
+- true
+...
+test_run:cmd("stop server test")
+---
+- true
+...
+test_run:cmd("cleanup server test")
+---
+- true
+...
diff --git a/test/vinyl/throttle.test.lua b/test/vinyl/throttle.test.lua
new file mode 100644
index 00000000..4efbbec7
--- /dev/null
+++ b/test/vinyl/throttle.test.lua
@@ -0,0 +1,47 @@
+test_run = require('test_run').new()
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+test_run:cmd(string.format("start server test with args='%d'", 32 * 1024 * 1024))
+test_run:cmd('switch test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+box.cfg{snap_io_rate_limit = 4}
+
+FIBER_COUNT = 5
+TUPLE_SIZE = 1000
+TX_TUPLE_COUNT = 10
+TX_SIZE = TUPLE_SIZE * TX_TUPLE_COUNT
+TX_COUNT = math.ceil(box.cfg.vinyl_memory / (TX_SIZE * FIBER_COUNT))
+
+s = box.schema.space.create('test', {engine = 'vinyl'})
+_ = s:create_index('primary', {parts = {1, 'unsigned', 2, 'unsigned', 3, 'unsigned'}})
+
+latency = 0
+c = fiber.channel(FIBER_COUNT)
+
+test_run:cmd("setopt delimiter ';'")
+for i = 1, FIBER_COUNT do
+    fiber.create(function()
+        for j = 1, TX_COUNT do
+            local t1 = fiber.time()
+            box.begin()
+            for k = 1, TX_TUPLE_COUNT do
+                s:replace{i, j, k, digest.urandom(TUPLE_SIZE)}
+            end
+            box.commit()
+            local t2 = fiber.time()
+            latency = math.max(latency, t2 - t1)
+        end
+        c:put(true)
+    end)
+end;
+test_run:cmd("setopt delimiter ''");
+
+for i = 1, FIBER_COUNT do c:get() end
+
+latency < 0.5 or latency
+
+test_run:cmd('switch default')
+test_run:cmd("stop server test")
+test_run:cmd("cleanup server test")
-- 
2.11.0

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] vinyl: rework internal quota API
  2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
@ 2018-08-20 11:07   ` Konstantin Osipov
  2018-08-24  8:32     ` Vladimir Davydov
  2018-08-27 18:29   ` Vladimir Davydov
  1 sibling, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:07 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> + *
> + * Usage pattern:
> + *
> + *   size_t reserved = <estimate>;
> + *   if (vy_quota_try_use(q, reserved, timeout) != 0)
> + *           return -1;
> + *   <allocate memory>
> + *   size_t used = <actually allocated>;
> + *   vy_quota_commit_use(q, reserved, used);

How is this different from vy_quota_use(<estimate>); followed by
vy_quota_release(<estimate> - <actually allocated>)? 

If vy_quota_commit_use() is actually release *or* force-use,
depending on the sign of the result (<estimate> - <actually
allocated>) then the new api is actually less clear than the old
one.

If you would like to introduce a new call which would either
release quota or force-use it, then this call should be called 
vy_quota_adjust() or something like that, and the old call
(vy_quota_use()) should be left intact. The new names are imho
less clear.

> + * because we may not yield after we start inserting statements
> + * into a space so we estimate the allocation size and wait for
> + * quota before committing statements. At the same time, we
> + * cannot precisely estimate the size of memory we are going to
> + * consume so we adjust the quota after the allocation.
> + *
> + * The size of memory allocated while committing a transaction
> + * may be greater than an estimate, because insertion of a
> + * statement into an in-memory index can trigger allocation
> + * of a new index extent. This should not normally result in a
> + * noticeable breach in the memory limit, because most memory
> + * is occupied by statements, but we need to adjust the quota
> + * accordingly after the allocation in this case.
> + *
> + * The actual memory allocation size may also be less than an
> + * estimate if the space has multiple indexes, because statements
> + * are stored in the common memory level, which isn't taken into
> + * account while estimating the size of a memory allocation.
>   */
>  static inline int
> -vy_quota_use(struct vy_quota *q, size_t size, double timeout)
> +vy_quota_try_use(struct vy_quota *q, size_t size, double timeout)
>  {
>  	double start_time = ev_monotonic_now(loop());

try_use() suggests that the call quickly fails if it can't succeed
(e.g. trylock()). 

>  	double deadline = start_time + timeout;
> @@ -178,6 +208,27 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
>  }
>  
>  /**
> + * Adjust quota after allocating memory.

interestingly you use the verb adjust in the comment yourself.
> + *
> + * @reserved: size of quota reserved by vy_quota_try_use().
> + * @used: size of memory actually allocated.
> + *
> + * See also vy_quota_try_use().
> + */
> +static inline void
> +vy_quota_commit_use(struct vy_quota *q, size_t reserved, size_t used)
> +{
> +	if (reserved > used) {
> +		size_t excess = reserved - used;
> +		assert(q->used >= excess);
> +		q->used -= excess;
> +		fiber_cond_broadcast(&q->cond);
> +	}
> +	if (reserved < used)
> +		vy_quota_force_use(q, used - reserved);
> +}

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c
  2018-08-16 16:11 ` [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c Vladimir Davydov
@ 2018-08-20 11:07   ` Konstantin Osipov
  2018-08-27 18:30   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:07 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> None of vy_quota methods is called from a hot path - even the most
> frequently called ones, vy_quota_try_use and vy_quota_commit_use, are
> only invoked once per a transactions. So there's no need to clog the
> header with the methods implementation.
> ---
>  src/box/CMakeLists.txt |   1 +
>  src/box/vy_quota.c     | 133 +++++++++++++++++++++++++++++++++++++++++++++++++
>  src/box/vy_quota.h     | 110 +++++++---------------------------------
>  3 files changed, 153 insertions(+), 91 deletions(-)
>  create mode 100644 src/box/vy_quota.c

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota
  2018-08-16 16:11 ` [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota Vladimir Davydov
@ 2018-08-20 11:08   ` Konstantin Osipov
  2018-08-27 18:33   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:08 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Watermark calculation is a private business of vy_quota. Let's move
> related stuff from vy_env to vy_quota. This will also make it easier
> to implement throttling opaque to the caller.

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use
  2018-08-16 16:11 ` [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use Vladimir Davydov
@ 2018-08-20 11:09   ` Konstantin Osipov
  2018-08-27 18:36   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:09 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> So that there's a single place where we can wait for quota. It should
> make it easier to implement quota throttling.
> ---
>  src/box/vy_quota.c | 7 -------
>  src/box/vy_quota.h | 7 +++++--
>  2 files changed, 5 insertions(+), 9 deletions(-)

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-16 16:11 ` [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one Vladimir Davydov
@ 2018-08-20 11:11   ` Konstantin Osipov
  2018-08-24  8:33     ` Vladimir Davydov
  2018-08-28 13:19   ` Vladimir Davydov
  1 sibling, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:11 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Currently, we wake up all fibers whenever we free some memory. This
> is inefficient, because it might occur that all available quota gets
> consumed by a few fibers while the rest will have to go back to sleep.
> This is also kinda unfair, because waking up all fibers breaks the order
> in which the fibers were put to sleep. This works now, because we free
> memory and wake up fibers infrequently (on dump) and there normally
> shouldn't be any fibers waiting for quota (if there were, the latency
> would rocket sky high because of absence of any kind of throttling).
> However, once throttling is introduced, fibers waiting for quota will
> become the norm. So let's wake up fibers one by one: whenever we free
> memory we wake up the first fiber in the line, which will wake up the
> next fiber on success and so forth.
> +
> +	/* Wake up the next fiber in the line waiting for quota. */
> +	fiber_cond_signal(&q->cond);
>  	return 0;

Please only wake up the next fiber if there is one, or, better
yet,  if this fiber had been put to sleep and then was woken up. 

>  }
-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable
  2018-08-16 16:12 ` [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable Vladimir Davydov
@ 2018-08-20 11:13   ` Konstantin Osipov
  0 siblings, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:13 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> There's no point to do that as they go back to sleep anyway. What is
> worse, such spurious wakeups may break the order in which fibers wait
> for quota.
> ---
>  src/box/vy_quota.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
> index 3677e22e..8cea4deb 100644
> --- a/src/box/vy_quota.c
> +++ b/src/box/vy_quota.c
> @@ -56,6 +56,17 @@ enum {
>  	VY_QUOTA_RATE_AVG_PERIOD = 5,
>  };
>  
> +/**
> + * Wake up the next fiber in the line waiting for quota
> + * provided quota is available.
> + */
> +static inline void
> +vy_quota_signal(struct vy_quota *q)
> +{
> +	if (q->used < q->limit)
> +		fiber_cond_signal(&q->cond);
> +}

Meh, this makes it only more obscure. This kind of addresses my
previous comment, but takes the opposite approach: it relaxes the
wake up discipline by making sure there are no spurious wakeups.
You could add another check, e.g., that there are fibers waiting
in the line, and make vy_quota_signal() entirely foolproof. But
this doesn't answer the policy question: when should
vy_quota_siganl() be called? 

And as long as there is no clear answer to this question, it's
going to be called *everywhere*, just in case.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets
  2018-08-16 16:12 ` [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets Vladimir Davydov
@ 2018-08-20 11:15   ` Konstantin Osipov
  2018-08-28 15:37   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:15 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Typically, dump bandwidth varies from 10 MB to 100 MB per second so
> let's use 5 MB bucket granularity in this range. Values less than
> 10 MB/s can also be observed, because the user can limit disk rate
> with box.cfg.snap_io_rate_limit so use 1 MB granularity between 1 MB
> and 10 MB and 100 KB granularity between 100 KB and 1 MB. A write rate
> greater than 100 MB/s is unlikely in practice, even on very fast disks,
> since dump bandwidth is also limited by CPU, so use 100 MB granularity
> there.

I like the idea and I assume you know what you're doing (aka you
measured the positive impact of the new histogram), so ok to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist
  2018-08-16 16:12 ` [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist Vladimir Davydov
@ 2018-08-20 11:15   ` Konstantin Osipov
  2018-08-28 16:04   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:15 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> No functional changes. Needed to facilitate review.
> ---
>  src/box/vy_quota.c | 16 ++++++++--------
>  src/box/vy_quota.h |  2 +-
>  2 files changed, 9 insertions(+), 9 deletions(-)

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation
  2018-08-16 16:12 ` [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation Vladimir Davydov
@ 2018-08-20 11:21   ` Konstantin Osipov
  2018-08-28 16:10   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:21 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> We don't need to compute a percentile of dump bandwidth histogram on
> each invocation of quota timer callback, because it may only be updated
> on dump completion. Let's cache it. Currently, it isn't that important,
> because the timer period is set to 1 second. However, once we start
> using the timer for throttling, we'll have to make it run more often and
> so caching the dump bandwidth value will make sense.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram
  2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
@ 2018-08-20 11:23   ` Konstantin Osipov
  2018-08-23 20:15   ` Konstantin Osipov
  2018-08-28 16:15   ` Vladimir Davydov
  2 siblings, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:23 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Do not add the initial guess to the histogram, because otherwise it
> takes more than 10 dumps to get the real dump bandwidth in case the
> initial value is less (we use 10th percentile).

Nice, OK to push. You don't mention it's also a raison d'etre for
dump_bw, please updaet the comment for dump_bw explaining this
nifty moment.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate
  2018-08-16 16:12 ` [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate Vladimir Davydov
@ 2018-08-20 11:24   ` Konstantin Osipov
  2018-08-24  8:31     ` Vladimir Davydov
  2018-08-28 16:18   ` Vladimir Davydov
  1 sibling, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:24 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a
> value, which is less than the current estimate. To avoid stalls caused
> by overestimating dump bandwidth, we must take into account the limit
> for the initial guess and forget all observations whenever it changes.

>  void
> +vy_quota_reset_dump_bw(struct vy_quota *q, size_t max)
> +{
> +	histogram_reset(q->dump_bw_hist);
> +	q->dump_bw = MIN(VY_DEFAULT_DUMP_BANDWIDTH, max);
> +}
> +

This sounds like vy_quota_set_dump_bw to me, not reset().

Otherwise OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [tarantool-patches] Re: [PATCH 12/18] histogram: add function for computing lower bound percentile estimate
  2018-08-16 16:12 ` [PATCH 12/18] histogram: add function for computing lower bound percentile estimate Vladimir Davydov
@ 2018-08-20 11:29   ` Konstantin Osipov
  2018-08-24  8:30     ` Vladimir Davydov
  2018-08-28 16:39   ` Vladimir Davydov
  1 sibling, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-20 11:29 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> The value returned by histogram_percentile() is an upper bound estimate.
> This is fine for measuring latency, because we are interested in the
> worst, i.e. highest, observations, but doesn't suit particularly well
> if we want to keep track of the lowest observations, as it is the case
> with bandwidth. So this patch introduces histogram_percentile_lower(),
> a function that is similar to histogram_percentile(), but returns a
> lower bound estimate of the requested percentile.
> ---

I need you to explain to me what's going on here (add comment?), sorry.

  
> +int64_t
> +histogram_percentile_lower(struct histogram *hist, int pct)
> +{
> +	size_t count = 0;
> +
> +	for (size_t i = 0; i < hist->n_buckets; i++) {
> +		count += hist->buckets[i].count;
> +		if (count * 100 > hist->total * pct)
> +			return hist->buckets[i > 0 ? i - 1 : 0].max;
> +	}
> +	return hist->max;
> +}
 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram
  2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
  2018-08-20 11:23   ` Konstantin Osipov
@ 2018-08-23 20:15   ` Konstantin Osipov
  2018-08-28 16:15   ` Vladimir Davydov
  2 siblings, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-23 20:15 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Do not add the initial guess to the histogram, because otherwise it
> takes more than 10 dumps to get the real dump bandwidth in case the
> initial value is less (we use 10th percentile).

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 15/18] vinyl: improve dump start/stop logging
  2018-08-16 16:12 ` [PATCH 15/18] vinyl: improve dump start/stop logging Vladimir Davydov
@ 2018-08-23 20:18   ` Konstantin Osipov
  0 siblings, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-23 20:18 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> When initiating memory dump, print how much memory is going to be
> dumped, expected dump rate, ETA, and the recent write rate. Upon dump
> completion, print observed dump rate in addition to dump size and
> duration.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 17/18] vinyl: set quota timer period to 100 ms
  2018-08-16 16:12 ` [PATCH 17/18] vinyl: set quota timer period to 100 ms Vladimir Davydov
@ 2018-08-23 20:49   ` Konstantin Osipov
  2018-08-24  8:18     ` Vladimir Davydov
  0 siblings, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-23 20:49 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> Currently, it's 1 second, which is OK for calculating watermark, but
> too long for throttling (think of latency of 1 seconds that would be
> introduced by throttling if such timeout were used).

1) I think the timer period should self-adjust based on the amount
   of available quota.

2) We narrowed down the problem to throttling only.
   We never do anticipatory dump/compaction today.  Let's design a
   system which swings the pendulum both ways - not only throttles
   the client when the load is high but performs an anticipatory
   dump/compaction when the load is low. See for example
   https://github.com/tarantool/tarantool/issues/3225

   In other words, I think we should not only enable "stalls" in
   the timer callback when the write stream is strong, but
   trigger compaction when the write stream is thin. Unfortunately
   we do not collect read statistics per lsm as a secondary
   indicator of compaction priority, so we may only trigger
   compaction based simply on the number of levels, but never the
   less.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up
  2018-08-16 16:12 ` [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up Vladimir Davydov
@ 2018-08-23 20:54   ` Konstantin Osipov
  2018-08-23 20:58     ` [tarantool-patches] " Konstantin Osipov
  2018-08-24  8:21     ` Vladimir Davydov
  0 siblings, 2 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-23 20:54 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> If the rate at which transactions are ready to write to the database is
> greater than the dump bandwidth, memory will get depleted before the
> last dump is complete and all newer transactions will have to wait until
> the dump has been completed, which may take seconds or even minutes:

Why are you limiting the rate only during a dump? What about
compaction throughtput? Write amplification from compaction is
10x+ times higher than from dump, compaction is actually the main
consumer of disk bandwidth. Shouldn't you throttle writes when
write bandwidth is higher than write amplification/disk bandwidth?

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up
  2018-08-23 20:54   ` Konstantin Osipov
@ 2018-08-23 20:58     ` Konstantin Osipov
  2018-08-24  8:21     ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-23 20:58 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Konstantin Osipov <kostja@tarantool.org> [18/08/23 23:55]:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > If the rate at which transactions are ready to write to the database is
> > greater than the dump bandwidth, memory will get depleted before the
> > last dump is complete and all newer transactions will have to wait until
> > the dump has been completed, which may take seconds or even minutes:
> 
> Why are you limiting the rate only during a dump? What about
> compaction throughtput? Write amplification from compaction is
> 10x+ times higher than from dump, compaction is actually the main
> consumer of disk bandwidth. Shouldn't you throttle writes when
> write bandwidth is higher than write amplification/disk bandwidth?


if (disk bandwidth/write amplification > write bandwidth)
    throttle writes;
else
    expedite compaction

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 17/18] vinyl: set quota timer period to 100 ms
  2018-08-23 20:49   ` Konstantin Osipov
@ 2018-08-24  8:18     ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:18 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Aug 23, 2018 at 11:49:36PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > Currently, it's 1 second, which is OK for calculating watermark, but
> > too long for throttling (think of latency of 1 seconds that would be
> > introduced by throttling if such timeout were used).
> 
> 1) I think the timer period should self-adjust based on the amount
>    of available quota.

Do we really need to bother? An execution of a timer callback is pretty
cheap - it's merely a few arithmetic calculations. I set it to 100 ms to
make sure the latency won't be too high when throttling is enabled.

> 
> 2) We narrowed down the problem to throttling only.
>    We never do anticipatory dump/compaction today.  Let's design a
>    system which swings the pendulum both ways - not only throttles
>    the client when the load is high but performs an anticipatory
>    dump/compaction when the load is low. See for example
>    https://github.com/tarantool/tarantool/issues/3225

It totally makes sense, of course, but it's a completely different
issue, which can be done separately on top of this patch set.

> 
>    In other words, I think we should not only enable "stalls" in
>    the timer callback when the write stream is strong, but
>    trigger compaction when the write stream is thin. Unfortunately
>    we do not collect read statistics per lsm as a secondary
>    indicator of compaction priority, so we may only trigger
>    compaction based simply on the number of levels, but never the
>    less.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up
  2018-08-23 20:54   ` Konstantin Osipov
  2018-08-23 20:58     ` [tarantool-patches] " Konstantin Osipov
@ 2018-08-24  8:21     ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:21 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Aug 23, 2018 at 11:54:43PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > If the rate at which transactions are ready to write to the database is
> > greater than the dump bandwidth, memory will get depleted before the
> > last dump is complete and all newer transactions will have to wait until
> > the dump has been completed, which may take seconds or even minutes:
> 
> Why are you limiting the rate only during a dump? What about
> compaction throughtput? Write amplification from compaction is
> 10x+ times higher than from dump, compaction is actually the main
> consumer of disk bandwidth. Shouldn't you throttle writes when
> write bandwidth is higher than write amplification/disk bandwidth?

We use the 10th percentile of all dump/compaction write rate
observations when estimating the dump bandwidth. That is, if
a dump task is ever run in parallel with other dump tasks or
compaction, we will take this into account. This might be not
perfect, but it works well in practice. Anyways, we can tune
that later if we find such an algorithm unsatisfactory.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 12/18] histogram: add function for computing lower bound percentile estimate
  2018-08-20 11:29   ` [tarantool-patches] " Konstantin Osipov
@ 2018-08-24  8:30     ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:30 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 20, 2018 at 02:29:16PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > The value returned by histogram_percentile() is an upper bound estimate.
> > This is fine for measuring latency, because we are interested in the
> > worst, i.e. highest, observations, but doesn't suit particularly well
> > if we want to keep track of the lowest observations, as it is the case
> > with bandwidth. So this patch introduces histogram_percentile_lower(),
> > a function that is similar to histogram_percentile(), but returns a
> > lower bound estimate of the requested percentile.
> > ---
> 
> I need you to explain to me what's going on here (add comment?), sorry.

There's a comment to the function declaration.

Sorry, but I don't quite understand what else you want me to write here.
This function simply calculates a histogram percentile, similarly to
histogram_percentile(), but uses the lower bound estimate. I don't know
what else to say...

> 
>   
> > +int64_t
> > +histogram_percentile_lower(struct histogram *hist, int pct)
> > +{
> > +	size_t count = 0;
> > +
> > +	for (size_t i = 0; i < hist->n_buckets; i++) {
> > +		count += hist->buckets[i].count;
> > +		if (count * 100 > hist->total * pct)
> > +			return hist->buckets[i > 0 ? i - 1 : 0].max;
> > +	}
> > +	return hist->max;
> > +}

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate
  2018-08-20 11:24   ` Konstantin Osipov
@ 2018-08-24  8:31     ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:31 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 20, 2018 at 02:24:31PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > The user can limit dump bandwidth with box.cfg.snap_io_rate_limit to a
> > value, which is less than the current estimate. To avoid stalls caused
> > by overestimating dump bandwidth, we must take into account the limit
> > for the initial guess and forget all observations whenever it changes.
> 
> >  void
> > +vy_quota_reset_dump_bw(struct vy_quota *q, size_t max)
> > +{
> > +	histogram_reset(q->dump_bw_hist);
> > +	q->dump_bw = MIN(VY_DEFAULT_DUMP_BANDWIDTH, max);
> > +}
> > +
> 
> This sounds like vy_quota_set_dump_bw to me, not reset().

I think it's rather 'reset', because it forgets all measurements that
have been observed so far.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] vinyl: rework internal quota API
  2018-08-20 11:07   ` Konstantin Osipov
@ 2018-08-24  8:32     ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:32 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 20, 2018 at 02:07:21PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > + *
> > + * Usage pattern:
> > + *
> > + *   size_t reserved = <estimate>;
> > + *   if (vy_quota_try_use(q, reserved, timeout) != 0)
> > + *           return -1;
> > + *   <allocate memory>
> > + *   size_t used = <actually allocated>;
> > + *   vy_quota_commit_use(q, reserved, used);
> 
> How is this different from vy_quota_use(<estimate>); followed by
> vy_quota_release(<estimate> - <actually allocated>)? 
> 
> If vy_quota_commit_use() is actually release *or* force-use,
> depending on the sign of the result (<estimate> - <actually
> allocated>) then the new api is actually less clear than the old
> one.
> 
> If you would like to introduce a new call which would either
> release quota or force-use it, then this call should be called 
> vy_quota_adjust() or something like that, and the old call
> (vy_quota_use()) should be left intact. The new names are imho
> less clear.

NP, I'll rename vy_quota_try_use() back to vy_quota_use() and
vy_quota_commit_use() to vy_quota_adjust().

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-20 11:11   ` Konstantin Osipov
@ 2018-08-24  8:33     ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-24  8:33 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 20, 2018 at 02:11:23PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/16 23:03]:
> > Currently, we wake up all fibers whenever we free some memory. This
> > is inefficient, because it might occur that all available quota gets
> > consumed by a few fibers while the rest will have to go back to sleep.
> > This is also kinda unfair, because waking up all fibers breaks the order
> > in which the fibers were put to sleep. This works now, because we free
> > memory and wake up fibers infrequently (on dump) and there normally
> > shouldn't be any fibers waiting for quota (if there were, the latency
> > would rocket sky high because of absence of any kind of throttling).
> > However, once throttling is introduced, fibers waiting for quota will
> > become the norm. So let's wake up fibers one by one: whenever we free
> > memory we wake up the first fiber in the line, which will wake up the
> > next fiber on success and so forth.
> > +
> > +	/* Wake up the next fiber in the line waiting for quota. */
> > +	fiber_cond_signal(&q->cond);
> >  	return 0;
> 
> Please only wake up the next fiber if there is one, or, better
> yet,  if this fiber had been put to sleep and then was woken up. 

I don't quite understand what you mean - fiber_cond_signal() will do
nothing if there's no fiber waiting on it.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 01/18] vinyl: rework internal quota API
  2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
  2018-08-20 11:07   ` Konstantin Osipov
@ 2018-08-27 18:29   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-27 18:29 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We decided to leave all method names as is and only introduce
vy_quota_adjust helper in this patch. Here's the final version
that was pushed to 1.10:

From 89a14237337e220b3373bda978ee5453ba70abee Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Mon, 27 Aug 2018 20:41:49 +0300
Subject: [PATCH] vinyl: add vy_quota_adjust helper

Let's introduce this helper to avoid code duplication and keep comments
regarding quota consumption protocol in one place.

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 6d5ea380..27a93c81 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2384,21 +2384,7 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t write_size = mem_used_after - mem_used_before;
-	/*
-	 * Insertion of a statement into an in-memory tree can trigger
-	 * an allocation of a new tree block. This should not normally
-	 * result in a noticeable excess of the memory limit, because
-	 * most memory is occupied by statements anyway, but we need to
-	 * adjust the quota accordingly in this case.
-	 *
-	 * The actual allocation size can also be less than reservation
-	 * if a statement is allocated from an lsregion slab allocated
-	 * by a previous transaction. Take this into account, too.
-	 */
-	if (write_size >= tx->write_size)
-		vy_quota_force_use(&env->quota, write_size - tx->write_size);
-	else
-		vy_quota_release(&env->quota, tx->write_size - write_size);
+	vy_quota_adjust(&env->quota, tx->write_size, write_size);
 
 	if (rc != 0)
 		return -1;
@@ -3292,11 +3278,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
-	if (used >= reserved)
-		vy_quota_force_use(&env->quota, used - reserved);
-	else
-		vy_quota_release(&env->quota, reserved - used);
-
+	vy_quota_adjust(&env->quota, reserved, used);
 	return rc;
 }
 
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d741c34a..726d3d44 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -153,6 +153,35 @@ vy_quota_release(struct vy_quota *q, size_t size)
  * Try to consume @size bytes of memory, throttle the caller
  * if the limit is exceeded. @timeout specifies the maximal
  * time to wait. Return 0 on success, -1 on timeout.
+ *
+ * Usage pattern:
+ *
+ *   size_t reserved = <estimate>;
+ *   if (vy_quota_use(q, reserved, timeout) != 0)
+ *           return -1;
+ *   <allocate memory>
+ *   size_t used = <actually allocated>;
+ *   vy_quota_adjust(q, reserved, used);
+ *
+ * We use two-step quota allocation strategy (reserve-consume),
+ * because we may not yield after we start inserting statements
+ * into a space so we estimate the allocation size and wait for
+ * quota before committing statements. At the same time, we
+ * cannot precisely estimate the size of memory we are going to
+ * consume so we adjust the quota after the allocation.
+ *
+ * The size of memory allocated while committing a transaction
+ * may be greater than an estimate, because insertion of a
+ * statement into an in-memory index can trigger allocation
+ * of a new index extent. This should not normally result in a
+ * noticeable breach in the memory limit, because most memory
+ * is occupied by statements, but we need to adjust the quota
+ * accordingly after the allocation in this case.
+ *
+ * The actual memory allocation size may also be less than an
+ * estimate if the space has multiple indexes, because statements
+ * are stored in the common memory level, which isn't taken into
+ * account while estimating the size of a memory allocation.
  */
 static inline int
 vy_quota_use(struct vy_quota *q, size_t size, double timeout)
@@ -178,6 +207,27 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 }
 
 /**
+ * Adjust quota after allocating memory.
+ *
+ * @reserved: size of quota reserved by vy_quota_use().
+ * @used: size of memory actually allocated.
+ *
+ * See also vy_quota_use().
+ */
+static inline void
+vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
+{
+	if (reserved > used) {
+		size_t excess = reserved - used;
+		assert(q->used >= excess);
+		q->used -= excess;
+		fiber_cond_broadcast(&q->cond);
+	}
+	if (reserved < used)
+		vy_quota_force_use(q, used - reserved);
+}
+
+/**
  * Block the caller until the quota is not exceeded.
  */
 static inline void

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c
  2018-08-16 16:11 ` [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c Vladimir Davydov
  2018-08-20 11:07   ` Konstantin Osipov
@ 2018-08-27 18:30   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-27 18:30 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Rebased and pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota
  2018-08-16 16:11 ` [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota Vladimir Davydov
  2018-08-20 11:08   ` Konstantin Osipov
@ 2018-08-27 18:33   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-27 18:33 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Factored out vy_quota_update_dump_bandwidth from vy_quota_release as
discussed and pushed to 1.10. Here's the final version:

From 33a0eddeafbc707fceae8923bf2f559a45aa44cd Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Sun, 12 Aug 2018 17:28:06 +0300
Subject: [PATCH] vinyl: move quota related methods and variables from vy_env
 to vy_quota

Watermark calculation is a private business of vy_quota. Let's move
related stuff from vy_env to vy_quota. This will also make it easier
to implement throttling opaque to the caller.

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 27a93c81..1705c8e6 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -45,7 +45,6 @@
 #include "vy_scheduler.h"
 #include "vy_stat.h"
 
-#include <math.h>
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>
@@ -106,31 +105,6 @@ struct vy_env {
 	struct mempool iterator_pool;
 	/** Memory quota */
 	struct vy_quota     quota;
-	/** Timer for updating quota watermark. */
-	ev_timer            quota_timer;
-	/**
-	 * Amount of quota used since the last
-	 * invocation of the quota timer callback.
-	 */
-	size_t quota_use_curr;
-	/**
-	 * Quota use rate, in bytes per second.
-	 * Calculated as exponentially weighted
-	 * moving average of quota_use_curr.
-	 */
-	size_t quota_use_rate;
-	/**
-	 * Dump bandwidth is needed for calculating the quota watermark.
-	 * The higher the bandwidth, the later we can start dumping w/o
-	 * suffering from transaction throttling. So we want to be very
-	 * conservative about estimating the bandwidth.
-	 *
-	 * To make sure we don't overestimate it, we maintain a
-	 * histogram of all observed measurements and assume the
-	 * bandwidth to be equal to the 10th percentile, i.e. the
-	 * best result among 10% worst measurements.
-	 */
-	struct histogram *dump_bw;
 	/** Common LSM tree environment. */
 	struct vy_lsm_env lsm_env;
 	/** Environment for cache subsystem */
@@ -170,26 +144,6 @@ struct vy_env {
 	bool force_recovery;
 };
 
-enum {
-	/**
-	 * Time interval between successive updates of
-	 * quota watermark and use rate, in seconds.
-	 */
-	VY_QUOTA_UPDATE_INTERVAL = 1,
-	/**
-	 * Period of time over which the quota use rate
-	 * is averaged, in seconds.
-	 */
-	VY_QUOTA_RATE_AVG_PERIOD = 5,
-};
-
-static inline int64_t
-vy_dump_bandwidth(struct vy_env *env)
-{
-	/* See comment to vy_env::dump_bw. */
-	return histogram_percentile(env->dump_bw, 10);
-}
-
 struct vinyl_engine {
 	struct engine base;
 	/** Vinyl environment. */
@@ -303,8 +257,8 @@ vy_info_append_quota(struct vy_env *env, struct info_handler *h)
 	info_append_int(h, "used", q->used);
 	info_append_int(h, "limit", q->limit);
 	info_append_int(h, "watermark", q->watermark);
-	info_append_int(h, "use_rate", env->quota_use_rate);
-	info_append_int(h, "dump_bandwidth", vy_dump_bandwidth(env));
+	info_append_int(h, "use_rate", q->use_rate);
+	info_append_int(h, "dump_bandwidth", vy_quota_dump_bandwidth(q));
 	info_table_end(h);
 }
 
@@ -2383,14 +2337,9 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	size_t write_size = mem_used_after - mem_used_before;
-	vy_quota_adjust(&env->quota, tx->write_size, write_size);
-
-	if (rc != 0)
-		return -1;
-
-	env->quota_use_curr += write_size;
-	return 0;
+	vy_quota_adjust(&env->quota, tx->write_size,
+			mem_used_after - mem_used_before);
+	return rc;
 }
 
 static void
@@ -2461,41 +2410,6 @@ vinyl_engine_rollback_statement(struct engine *engine, struct txn *txn,
 /** {{{ Environment */
 
 static void
-vy_env_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
-{
-	(void)loop;
-	(void)events;
-
-	struct vy_env *e = timer->data;
-
-	/*
-	 * Update the quota use rate with the new measurement.
-	 */
-	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
-				      (double)VY_QUOTA_RATE_AVG_PERIOD);
-	e->quota_use_rate = (1 - weight) * e->quota_use_rate +
-		weight * e->quota_use_curr / VY_QUOTA_UPDATE_INTERVAL;
-	e->quota_use_curr = 0;
-
-	/*
-	 * Due to log structured nature of the lsregion allocator,
-	 * which is used for allocating statements, we cannot free
-	 * memory in chunks, only all at once. Therefore we should
-	 * configure the watermark so that by the time we hit the
-	 * limit, all memory have been dumped, i.e.
-	 *
-	 *   limit - watermark      watermark
-	 *   ----------------- = --------------
-	 *     quota_use_rate    dump_bandwidth
-	 */
-	int64_t dump_bandwidth = vy_dump_bandwidth(e);
-	size_t watermark = ((double)e->quota.limit * dump_bandwidth /
-			    (dump_bandwidth + e->quota_use_rate + 1));
-
-	vy_quota_set_watermark(&e->quota, watermark);
-}
-
-static void
 vy_env_quota_exceeded_cb(struct vy_quota *quota)
 {
 	struct vy_env *env = container_of(quota, struct vy_env, quota);
@@ -2542,13 +2456,8 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	assert(mem_used_after <= mem_used_before);
 	size_t mem_dumped = mem_used_before - mem_used_after;
 	vy_quota_release(quota, mem_dumped);
-
+	vy_quota_update_dump_bandwidth(quota, mem_dumped, dump_duration);
 	say_info("dumped %zu bytes in %.1f sec", mem_dumped, dump_duration);
-
-	/* Account dump bandwidth. */
-	if (dump_duration > 0)
-		histogram_collect(env->dump_bw,
-				  mem_dumped / dump_duration);
 }
 
 static struct vy_squash_queue *
@@ -2563,21 +2472,6 @@ static struct vy_env *
 vy_env_new(const char *path, size_t memory,
 	   int read_threads, int write_threads, bool force_recovery)
 {
-	enum { KB = 1000, MB = 1000 * 1000 };
-	static int64_t dump_bandwidth_buckets[] = {
-		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB,
-		  1 * MB,   2 * MB,   3 * MB,   4 * MB,   5 * MB,
-		 10 * MB,  20 * MB,  30 * MB,  40 * MB,  50 * MB,
-		 60 * MB,  70 * MB,  80 * MB,  90 * MB, 100 * MB,
-		110 * MB, 120 * MB, 130 * MB, 140 * MB, 150 * MB,
-		160 * MB, 170 * MB, 180 * MB, 190 * MB, 200 * MB,
-		220 * MB, 240 * MB, 260 * MB, 280 * MB, 300 * MB,
-		320 * MB, 340 * MB, 360 * MB, 380 * MB, 400 * MB,
-		450 * MB, 500 * MB, 550 * MB, 600 * MB, 650 * MB,
-		700 * MB, 750 * MB, 800 * MB, 850 * MB, 900 * MB,
-		950 * MB, 1000 * MB,
-	};
-
 	struct vy_env *e = malloc(sizeof(*e));
 	if (unlikely(e == NULL)) {
 		diag_set(OutOfMemory, sizeof(*e), "malloc", "struct vy_env");
@@ -2597,19 +2491,6 @@ vy_env_new(const char *path, size_t memory,
 		goto error_path;
 	}
 
-	e->dump_bw = histogram_new(dump_bandwidth_buckets,
-				   lengthof(dump_bandwidth_buckets));
-	if (e->dump_bw == NULL) {
-		diag_set(OutOfMemory, 0, "histogram_new",
-			 "dump bandwidth histogram");
-		goto error_dump_bw;
-	}
-	/*
-	 * Until we dump anything, assume bandwidth to be 10 MB/s,
-	 * which should be fine for initial guess.
-	 */
-	histogram_collect(e->dump_bw, 10 * MB);
-
 	e->xm = tx_manager_new();
 	if (e->xm == NULL)
 		goto error_xm;
@@ -2627,18 +2508,18 @@ vy_env_new(const char *path, size_t memory,
 			      vy_squash_schedule, e) != 0)
 		goto error_lsm_env;
 
+	if (vy_quota_create(&e->quota, vy_env_quota_exceeded_cb) != 0)
+		goto error_quota;
+
 	struct slab_cache *slab_cache = cord_slab_cache();
 	mempool_create(&e->iterator_pool, slab_cache,
 	               sizeof(struct vinyl_iterator));
-	vy_quota_create(&e->quota, vy_env_quota_exceeded_cb);
-	ev_timer_init(&e->quota_timer, vy_env_quota_timer_cb, 0,
-		      VY_QUOTA_UPDATE_INTERVAL);
-	e->quota_timer.data = e;
-	ev_timer_start(loop(), &e->quota_timer);
 	vy_cache_env_create(&e->cache_env, slab_cache);
 	vy_run_env_create(&e->run_env);
 	vy_log_init(e->path);
 	return e;
+error_quota:
+	vy_lsm_env_destroy(&e->lsm_env);
 error_lsm_env:
 	vy_mem_env_destroy(&e->mem_env);
 	vy_scheduler_destroy(&e->scheduler);
@@ -2646,8 +2527,6 @@ error_lsm_env:
 error_squash_queue:
 	tx_manager_delete(e->xm);
 error_xm:
-	histogram_delete(e->dump_bw);
-error_dump_bw:
 	free(e->path);
 error_path:
 	free(e);
@@ -2657,12 +2536,10 @@ error_path:
 static void
 vy_env_delete(struct vy_env *e)
 {
-	ev_timer_stop(loop(), &e->quota_timer);
 	vy_scheduler_destroy(&e->scheduler);
 	vy_squash_queue_delete(e->squash_queue);
 	tx_manager_delete(e->xm);
 	free(e->path);
-	histogram_delete(e->dump_bw);
 	mempool_destroy(&e->iterator_pool);
 	vy_run_env_destroy(&e->run_env);
 	vy_lsm_env_destroy(&e->lsm_env);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 842f06a4..656c2012 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -32,26 +32,116 @@
 
 #include <assert.h>
 #include <stddef.h>
+#include <stdint.h>
+#include <math.h>
 #include <tarantool_ev.h>
 
+#include "diag.h"
 #include "fiber.h"
 #include "fiber_cond.h"
 #include "say.h"
+#include "histogram.h"
+#include "trivia/util.h"
 
-void
+enum {
+	/**
+	 * Time interval between successive updates of
+	 * quota watermark and use rate, in seconds.
+	 */
+	VY_QUOTA_UPDATE_INTERVAL = 1,
+	/**
+	 * Period of time over which the quota use rate
+	 * is averaged, in seconds.
+	 */
+	VY_QUOTA_RATE_AVG_PERIOD = 5,
+};
+
+static void
+vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
+{
+	(void)loop;
+	(void)events;
+
+	struct vy_quota *q = timer->data;
+
+	/*
+	 * Update the quota use rate with the new measurement.
+	 */
+	const double weight = 1 - exp(-VY_QUOTA_UPDATE_INTERVAL /
+				      (double)VY_QUOTA_RATE_AVG_PERIOD);
+	q->use_rate = (1 - weight) * q->use_rate +
+		weight * q->use_curr / VY_QUOTA_UPDATE_INTERVAL;
+	q->use_curr = 0;
+
+	/*
+	 * Due to log structured nature of the lsregion allocator,
+	 * which is used for allocating statements, we cannot free
+	 * memory in chunks, only all at once. Therefore we should
+	 * configure the watermark so that by the time we hit the
+	 * limit, all memory have been dumped, i.e.
+	 *
+	 *   limit - watermark      watermark
+	 *   ----------------- = --------------
+	 *        use_rate       dump_bandwidth
+	 */
+	size_t dump_bandwidth = vy_quota_dump_bandwidth(q);
+	q->watermark = ((double)q->limit * dump_bandwidth /
+			(dump_bandwidth + q->use_rate + 1));
+	if (q->used >= q->watermark)
+		q->quota_exceeded_cb(q);
+}
+
+int
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb)
 {
+	enum { KB = 1000, MB = 1000 * 1000 };
+	static int64_t dump_bandwidth_buckets[] = {
+		100 * KB, 200 * KB, 300 * KB, 400 * KB, 500 * KB,
+		  1 * MB,   2 * MB,   3 * MB,   4 * MB,   5 * MB,
+		 10 * MB,  20 * MB,  30 * MB,  40 * MB,  50 * MB,
+		 60 * MB,  70 * MB,  80 * MB,  90 * MB, 100 * MB,
+		110 * MB, 120 * MB, 130 * MB, 140 * MB, 150 * MB,
+		160 * MB, 170 * MB, 180 * MB, 190 * MB, 200 * MB,
+		220 * MB, 240 * MB, 260 * MB, 280 * MB, 300 * MB,
+		320 * MB, 340 * MB, 360 * MB, 380 * MB, 400 * MB,
+		450 * MB, 500 * MB, 550 * MB, 600 * MB, 650 * MB,
+		700 * MB, 750 * MB, 800 * MB, 850 * MB, 900 * MB,
+		950 * MB, 1000 * MB,
+	};
+
+	q->dump_bw = histogram_new(dump_bandwidth_buckets,
+				   lengthof(dump_bandwidth_buckets));
+	if (q->dump_bw == NULL) {
+		diag_set(OutOfMemory, 0, "histogram_new",
+			 "dump bandwidth histogram");
+		return -1;
+	}
+	/*
+	 * Until we dump anything, assume bandwidth to be 10 MB/s,
+	 * which should be fine for initial guess.
+	 */
+	histogram_collect(q->dump_bw, 10 * MB);
+
 	q->limit = SIZE_MAX;
 	q->watermark = SIZE_MAX;
 	q->used = 0;
+	q->use_curr = 0;
+	q->use_rate = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
 	fiber_cond_create(&q->cond);
+	ev_timer_init(&q->timer, vy_quota_timer_cb, 0,
+		      VY_QUOTA_UPDATE_INTERVAL);
+	q->timer.data = q;
+	ev_timer_start(loop(), &q->timer);
+	return 0;
 }
 
 void
 vy_quota_destroy(struct vy_quota *q)
 {
+	ev_timer_stop(loop(), &q->timer);
+	histogram_delete(q->dump_bw);
 	fiber_cond_broadcast(&q->cond);
 	fiber_cond_destroy(&q->cond);
 }
@@ -65,18 +155,26 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 	fiber_cond_broadcast(&q->cond);
 }
 
+size_t
+vy_quota_dump_bandwidth(struct vy_quota *q)
+{
+	/* See comment to vy_quota::dump_bw. */
+	return histogram_percentile(q->dump_bw, 10);
+}
+
 void
-vy_quota_set_watermark(struct vy_quota *q, size_t watermark)
+vy_quota_update_dump_bandwidth(struct vy_quota *q, size_t size,
+			       double duration)
 {
-	q->watermark = watermark;
-	if (q->used >= watermark)
-		q->quota_exceeded_cb(q);
+	if (duration > 0)
+		histogram_collect(q->dump_bw, size / duration);
 }
 
 void
 vy_quota_force_use(struct vy_quota *q, size_t size)
 {
 	q->used += size;
+	q->use_curr += size;
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 }
@@ -107,6 +205,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 	if (q->used + size > q->limit)
 		return -1;
 	q->used += size;
+	q->use_curr += size;
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 	return 0;
@@ -119,6 +218,10 @@ vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
 		size_t excess = reserved - used;
 		assert(q->used >= excess);
 		q->used -= excess;
+		if (q->use_curr >= excess)
+			q->use_curr -= excess;
+		else /* was reset by timeout */
+			q->use_curr = 0;
 		fiber_cond_broadcast(&q->cond);
 	}
 	if (reserved < used)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index fbdbdc9b..6be4a41e 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -32,6 +32,7 @@
  */
 
 #include <stddef.h>
+#include <tarantool_ev.h>
 #include "fiber_cond.h"
 
 #if defined(__cplusplus)
@@ -39,6 +40,7 @@ extern "C" {
 #endif /* defined(__cplusplus) */
 
 struct vy_quota;
+struct histogram;
 
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
@@ -76,9 +78,34 @@ struct vy_quota {
 	 * It is supposed to trigger memory reclaim.
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
+	/** Timer for updating quota watermark. */
+	ev_timer timer;
+	/**
+	 * Amount of quota used since the last
+	 * invocation of the quota timer callback.
+	 */
+	size_t use_curr;
+	/**
+	 * Quota use rate, in bytes per second.
+	 * Calculated as exponentially weighted
+	 * moving average of use_curr.
+	 */
+	size_t use_rate;
+	/**
+	 * Dump bandwidth is needed for calculating the quota watermark.
+	 * The higher the bandwidth, the later we can start dumping w/o
+	 * suffering from transaction throttling. So we want to be very
+	 * conservative about estimating the bandwidth.
+	 *
+	 * To make sure we don't overestimate it, we maintain a
+	 * histogram of all observed measurements and assume the
+	 * bandwidth to be equal to the 10th percentile, i.e. the
+	 * best result among 10% worst measurements.
+	 */
+	struct histogram *dump_bw;
 };
 
-void
+int
 vy_quota_create(struct vy_quota *q, vy_quota_exceeded_f quota_exceeded_cb);
 
 void
@@ -91,12 +118,19 @@ vy_quota_destroy(struct vy_quota *q);
 void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
+/** Return dump bandwidth. */
+size_t
+vy_quota_dump_bandwidth(struct vy_quota *q);
+
 /**
- * Set memory watermark. If current memory usage exceeds
- * the new watermark, invoke the callback.
+ * Update dump bandwidth.
+ *
+ * @size: size of dumped memory.
+ * @duration: how long memory dump took.
  */
 void
-vy_quota_set_watermark(struct vy_quota *q, size_t watermark);
+vy_quota_update_dump_bandwidth(struct vy_quota *q, size_t size,
+			       double duration);
 
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_use()

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use
  2018-08-16 16:11 ` [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use Vladimir Davydov
  2018-08-20 11:09   ` Konstantin Osipov
@ 2018-08-27 18:36   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-27 18:36 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Rebased and pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-16 16:11 ` [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one Vladimir Davydov
  2018-08-20 11:11   ` Konstantin Osipov
@ 2018-08-28 13:19   ` Vladimir Davydov
  2018-08-28 14:04     ` Konstantin Osipov
  1 sibling, 1 reply; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 13:19 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Discussed verbally. Agreed to wake up the next fiber on vy_quota_use()
only if we waited for quota (otherwise we don't need to do that). Here's
the updated patch:

From 7c10049ebf9c96adb197a719d4ed7c16f61a4f33 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Tue, 28 Aug 2018 16:15:33 +0300
Subject: [PATCH] vinyl: wake up fibers waiting for quota one by one

Currently, we wake up all fibers whenever we free some memory. This
is inefficient, because it might occur that all available quota gets
consumed by a few fibers while the rest will have to go back to sleep.
This is also kinda unfair, because waking up all fibers breaks the order
in which the fibers were put to sleep. This works now, because we free
memory and wake up fibers infrequently (on dump) and there normally
shouldn't be any fibers waiting for quota (if there were, the latency
would rocket sky high because of absence of any kind of throttling).
However, once throttling is introduced, fibers waiting for quota will
become the norm. So let's wake up fibers one by one: whenever we free
memory we wake up the first fiber in the line, which will wake up the
next fiber on success and so forth.

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 1e7c754f..b3f073c3 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -56,6 +56,16 @@ enum {
 	VY_QUOTA_RATE_AVG_PERIOD = 5,
 };
 
+/**
+ * Returns true if the quota limit is exceeded and so consumers
+ * have to wait.
+ */
+static inline bool
+vy_quota_is_exceeded(struct vy_quota *q)
+{
+	return q->used > q->limit;
+}
+
 static void
 vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 {
@@ -152,7 +162,7 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 	q->limit = q->watermark = limit;
 	if (q->used >= limit)
 		q->quota_exceeded_cb(q);
-	fiber_cond_broadcast(&q->cond);
+	fiber_cond_signal(&q->cond);
 }
 
 size_t
@@ -184,28 +194,40 @@ vy_quota_release(struct vy_quota *q, size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	fiber_cond_broadcast(&q->cond);
+	fiber_cond_signal(&q->cond);
 }
 
 int
 vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 {
-	double start_time = ev_monotonic_now(loop());
-	double deadline = start_time + timeout;
-	while (q->used + size > q->limit && timeout > 0) {
-		q->quota_exceeded_cb(q);
-		if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
-			break; /* timed out */
-	}
-	double wait_time = ev_monotonic_now(loop()) - start_time;
-	if (wait_time > q->too_long_threshold) {
-		say_warn("waited for %zu bytes of vinyl memory quota "
-			 "for too long: %.3f sec", size, wait_time);
-	}
-	if (q->used + size > q->limit)
-		return -1;
 	q->used += size;
 	q->use_curr += size;
+	if (vy_quota_is_exceeded(q)) {
+		/* Wait for quota. */
+		double start_time = ev_monotonic_now(loop());
+		double deadline = start_time + timeout;
+
+		do {
+			q->quota_exceeded_cb(q);
+			q->used -= size;
+			q->use_curr -= size;
+			if (fiber_cond_wait_deadline(&q->cond, deadline) != 0)
+				return -1; /* timed out */
+			q->used += size;
+			q->use_curr += size;
+		} while (vy_quota_is_exceeded(q));
+
+		double wait_time = ev_monotonic_now(loop()) - start_time;
+		if (wait_time > q->too_long_threshold) {
+			say_warn("waited for %zu bytes of vinyl memory quota "
+				 "for too long: %.3f sec", size, wait_time);
+		}
+		/*
+		 * Wake up the next fiber in the line waiting
+		 * for quota.
+		 */
+		fiber_cond_signal(&q->cond);
+	}
 	if (q->used >= q->watermark)
 		q->quota_exceeded_cb(q);
 	return 0;
@@ -222,7 +244,7 @@ vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
 			q->use_curr -= excess;
 		else /* was reset by timeout */
 			q->use_curr = 0;
-		fiber_cond_broadcast(&q->cond);
+		fiber_cond_signal(&q->cond);
 	}
 	if (reserved < used)
 		vy_quota_force_use(q, used - reserved);

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-28 13:19   ` Vladimir Davydov
@ 2018-08-28 14:04     ` Konstantin Osipov
  2018-08-28 14:39       ` Vladimir Davydov
  0 siblings, 1 reply; 56+ messages in thread
From: Konstantin Osipov @ 2018-08-28 14:04 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/28 16:19]:

OK to push.

> Discussed verbally. Agreed to wake up the next fiber on vy_quota_use()
> only if we waited for quota (otherwise we don't need to do that). Here's
> the updated patch:
-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one
  2018-08-28 14:04     ` Konstantin Osipov
@ 2018-08-28 14:39       ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 14:39 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets
  2018-08-16 16:12 ` [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets Vladimir Davydov
  2018-08-20 11:15   ` Konstantin Osipov
@ 2018-08-28 15:37   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 15:37 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist
  2018-08-16 16:12 ` [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist Vladimir Davydov
  2018-08-20 11:15   ` Konstantin Osipov
@ 2018-08-28 16:04   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:04 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation
  2018-08-16 16:12 ` [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation Vladimir Davydov
  2018-08-20 11:21   ` Konstantin Osipov
@ 2018-08-28 16:10   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:10 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram
  2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
  2018-08-20 11:23   ` Konstantin Osipov
  2018-08-23 20:15   ` Konstantin Osipov
@ 2018-08-28 16:15   ` Vladimir Davydov
  2 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate
  2018-08-16 16:12 ` [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate Vladimir Davydov
  2018-08-20 11:24   ` Konstantin Osipov
@ 2018-08-28 16:18   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:18 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 12/18] histogram: add function for computing lower bound percentile estimate
  2018-08-16 16:12 ` [PATCH 12/18] histogram: add function for computing lower bound percentile estimate Vladimir Davydov
  2018-08-20 11:29   ` [tarantool-patches] " Konstantin Osipov
@ 2018-08-28 16:39   ` Vladimir Davydov
  1 sibling, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:39 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth
  2018-08-16 16:12 ` [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth Vladimir Davydov
@ 2018-08-28 16:51   ` Vladimir Davydov
  0 siblings, 0 replies; 56+ messages in thread
From: Vladimir Davydov @ 2018-08-28 16:51 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2018-08-28 16:51 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-16 16:11 [PATCH 00/18] Implement write throttling for vinyl Vladimir Davydov
2018-08-16 16:11 ` [PATCH 01/18] vinyl: rework internal quota API Vladimir Davydov
2018-08-20 11:07   ` Konstantin Osipov
2018-08-24  8:32     ` Vladimir Davydov
2018-08-27 18:29   ` Vladimir Davydov
2018-08-16 16:11 ` [PATCH 02/18] vinyl: move quota methods implementation to vy_quota.c Vladimir Davydov
2018-08-20 11:07   ` Konstantin Osipov
2018-08-27 18:30   ` Vladimir Davydov
2018-08-16 16:11 ` [PATCH 03/18] vinyl: move quota related methods and variables from vy_env to vy_quota Vladimir Davydov
2018-08-20 11:08   ` Konstantin Osipov
2018-08-27 18:33   ` Vladimir Davydov
2018-08-16 16:11 ` [PATCH 04/18] vinyl: implement vy_quota_wait using vy_quota_try_use Vladimir Davydov
2018-08-20 11:09   ` Konstantin Osipov
2018-08-27 18:36   ` Vladimir Davydov
2018-08-16 16:11 ` [PATCH 05/18] vinyl: wake up fibers waiting for quota one by one Vladimir Davydov
2018-08-20 11:11   ` Konstantin Osipov
2018-08-24  8:33     ` Vladimir Davydov
2018-08-28 13:19   ` Vladimir Davydov
2018-08-28 14:04     ` Konstantin Osipov
2018-08-28 14:39       ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 06/18] vinyl: do not wake up fibers waiting for quota if quota is unavailable Vladimir Davydov
2018-08-20 11:13   ` Konstantin Osipov
2018-08-16 16:12 ` [PATCH 07/18] vinyl: tune dump bandwidth histogram buckets Vladimir Davydov
2018-08-20 11:15   ` Konstantin Osipov
2018-08-28 15:37   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 08/18] vinyl: rename vy_quota::dump_bw to dump_bw_hist Vladimir Davydov
2018-08-20 11:15   ` Konstantin Osipov
2018-08-28 16:04   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 09/18] vinyl: cache dump bandwidth for timer invocation Vladimir Davydov
2018-08-20 11:21   ` Konstantin Osipov
2018-08-28 16:10   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 10/18] vinyl: do not add initial guess to dump bandwidth histogram Vladimir Davydov
2018-08-20 11:23   ` Konstantin Osipov
2018-08-23 20:15   ` Konstantin Osipov
2018-08-28 16:15   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 11/18] vinyl: use snap_io_rate_limit for initial dump bandwidth estimate Vladimir Davydov
2018-08-20 11:24   ` Konstantin Osipov
2018-08-24  8:31     ` Vladimir Davydov
2018-08-28 16:18   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 12/18] histogram: add function for computing lower bound percentile estimate Vladimir Davydov
2018-08-20 11:29   ` [tarantool-patches] " Konstantin Osipov
2018-08-24  8:30     ` Vladimir Davydov
2018-08-28 16:39   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 13/18] vinyl: use lower bound percentile estimate for dump bandwidth Vladimir Davydov
2018-08-28 16:51   ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 14/18] vinyl: do not try to trigger dump if it is already in progress Vladimir Davydov
2018-08-16 16:12 ` [PATCH 15/18] vinyl: improve dump start/stop logging Vladimir Davydov
2018-08-23 20:18   ` Konstantin Osipov
2018-08-16 16:12 ` [PATCH 16/18] vinyl: confine quota watermark within sane value range Vladimir Davydov
2018-08-16 16:12 ` [PATCH 17/18] vinyl: set quota timer period to 100 ms Vladimir Davydov
2018-08-23 20:49   ` Konstantin Osipov
2018-08-24  8:18     ` Vladimir Davydov
2018-08-16 16:12 ` [PATCH 18/18] vinyl: throttle tx rate if dump does not catch up Vladimir Davydov
2018-08-23 20:54   ` Konstantin Osipov
2018-08-23 20:58     ` [tarantool-patches] " Konstantin Osipov
2018-08-24  8:21     ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox