[PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps

From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: tarantool-patches@freelists.org
Subject: [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps
Date: Mon, 21 Jan 2019 00:17:08 +0300
Message-ID: <a48046dbdd143ba262592ad9ca3bdc0b3474257a.1548017258.git.vdavydov.dev@gmail.com>
In-Reply-To: <cover.1548017258.git.vdavydov.dev@gmail.com>

Every byte of data written to a vinyl database eventually gets compacted
with data written to the database earlier. The ratio of the size of data
actually written to disk to the size of data written to the database is
called write amplification. Write amplification depends on the LSM tree
configuration and the workload parameters and varies in a wide range,
from 2-3 to 10-20 or even higher in some extreme cases. If the database
engine can't keep up with writing this extra data, the LSM tree shape
gets distorted, which increases read and space amplification and, in
turn, slows down reads and wastes disk space.
That's why it's so important to ensure the database engine has enough
compaction power.
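
To make the definition concrete, here is a minimal sketch of how the
ratio is computed. The function name and parameters are illustrative
and not part of the vinyl code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Write amplification: bytes physically written to disk (dumps plus
 * compaction) divided by bytes logically written to the database.
 * Hypothetical helper, not a vinyl function.
 */
static double
write_amplification(uint64_t disk_bytes_written, uint64_t db_bytes_written)
{
	/* The ratio is undefined until something has been written. */
	assert(db_bytes_written > 0);
	return (double)disk_bytes_written / (double)db_bytes_written;
}
```

For example, if 100 bytes of user data eventually cause 300 bytes of
disk writes, the write amplification is 3.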

One way to ensure that is to increase the number of compaction threads
by tuning the box.cfg.vinyl_write_threads configuration knob, but one can't
increase it beyond the capacity of the server running the instance. So
the database engine must throttle writes if it detects that compaction
threads are struggling to keep up. This patch implements a very simple
algorithm to achieve that: it keeps track of recently observed write
amplification and data compaction speed, uses them to calculate the max
transaction rate that the database engine can handle while steadily
maintaining the current level of write amplification, and sets the rate
limit to half that so as to give the engine enough room to increase
write amplification if needed.
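
The calculation can be sketched as a standalone function. This is a
simplified illustration rather than the code added by the patch: the
helper name is hypothetical, dump_input is assumed to be in bytes and
compaction_time in seconds, and returning SIZE_MAX when no compaction
has been observed yet is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimate the transaction rate limit, in
 * bytes per second, from recently observed statistics.
 *
 *   max_tx_rate = compaction_threads * dump_input / compaction_time
 *
 * The limit is set to half of max_tx_rate to leave room for write
 * amplification to grow.
 */
static size_t
tx_rate_limit(int compaction_threads, uint64_t dump_input,
	      double compaction_time)
{
	assert(compaction_threads > 0);
	/* Assumption of this sketch: no data yet means no limit. */
	if (compaction_time <= 0)
		return SIZE_MAX;
	double max_tx_rate = compaction_threads * (double)dump_input /
			     compaction_time;
	double limit = 0.5 * max_tx_rate;
	/* Clamp before converting back to an integer type. */
	if (limit >= (double)SIZE_MAX)
		return SIZE_MAX;
	return (size_t)limit;
}
```

With 4 compaction threads, 100 MiB of dumped data and 20 seconds of
accumulated compaction time, the sketch yields 10485760 bytes per
second (10 MiB/s). Doubling the number of threads doubles the limit.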

The algorithm is obviously pessimistic: it undervalues the transaction
rate the database can handle after write amplification has steadied. But
this is compensated by its simplicity and stability - there shouldn't be
any abrupt drops or peaks in RPS due to its decisions. Besides, it
adapts fairly quickly to an increase in write amplification as a database
fills up. If one finds that the algorithm is being too cautious by
undervaluing the limit, it's easy to fix by simply increasing the number
of compaction threads - the rate limit will scale proportionately if the
system is underloaded.

The current value of the rate limit set by the algorithm is reported by
box.stat.vinyl() as regulator.rate_limit.

Closes #3721
---
 src/box/vinyl.c        |  6 ++++
 src/box/vy_quota.c     |  9 ++++++
 src/box/vy_quota.h     |  6 ++++
 src/box/vy_regulator.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--
 src/box/vy_regulator.h | 27 ++++++++++++++++
 5 files changed, 129 insertions(+), 3 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index aaef858e..650d5c26 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -274,6 +274,8 @@ vy_info_append_regulator(struct vy_env *env, struct info_handler *h)
 	info_append_int(h, "write_rate", r->write_rate);
 	info_append_int(h, "dump_bandwidth", r->dump_bandwidth);
 	info_append_int(h, "dump_watermark", r->dump_watermark);
+	info_append_int(h, "rate_limit", vy_quota_get_rate_limit(r->quota,
+							VY_QUOTA_CONSUMER_TX));
 	info_table_end(h); /* regulator */
 }
 
@@ -532,6 +534,7 @@ vinyl_engine_reset_stat(struct engine *engine)
 	memset(&xm->stat, 0, sizeof(xm->stat));
 
 	vy_scheduler_reset_stat(&env->scheduler);
+	vy_regulator_reset_stat(&env->regulator);
 }
 
 /** }}} Introspection */
@@ -2475,6 +2478,9 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	 */
 	vy_regulator_dump_complete(&env->regulator, mem_dumped, dump_duration);
 	vy_quota_release(quota, mem_dumped);
+
+	vy_regulator_update_rate_limit(&env->regulator, &scheduler->stat,
+				       scheduler->compaction_pool.size);
 }
 
 static struct vy_squash_queue *
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 20d322de..4dd961c9 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -244,6 +244,15 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
 	vy_rate_limit_set(&q->rate_limit[type], rate);
 }
 
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type)
+{
+	size_t rate = SIZE_MAX;
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		rate = MIN(rate, rl->rate);
+	return rate;
+}
+
 void
 vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
 		   size_t size)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d90922b2..7ff98cc1 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -255,6 +255,12 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
 			size_t rate);
 
 /**
+ * Return the rate limit applied to a consumer of the given type.
+ */
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type);
+
+/**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index e14b01aa..b406cf97 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -34,6 +34,7 @@
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>
+#include <string.h>
 #include <tarantool_ev.h>
 
 #include "fiber.h"
@@ -42,6 +43,7 @@
 #include "trivia/util.h"
 
 #include "vy_quota.h"
+#include "vy_stat.h"
 
 /**
  * Regulator timer period, in seconds.
@@ -73,6 +75,14 @@ static const size_t VY_DUMP_BANDWIDTH_DEFAULT = 10 * 1024 * 1024;
  */
 static const size_t VY_DUMP_SIZE_ACCT_MIN = 1024 * 1024;
 
+/**
+ * Number of dumps to take into account for rate limit calculation.
+ * Shouldn't be too small to avoid uneven RPS. Shouldn't be too big
+ * either - otherwise the rate limit will adapt too slowly to workload
+ * changes. 100 feels like a good choice.
+ */
+static const int VY_RECENT_DUMP_COUNT = 100;
+
 static void
 vy_regulator_trigger_dump(struct vy_regulator *regulator)
 {
@@ -182,6 +192,7 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
 		100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
 		700 * MB, 800 * MB, 900 * MB,
 	};
+	memset(regulator, 0, sizeof(*regulator));
 	regulator->dump_bandwidth_hist = histogram_new(dump_bandwidth_buckets,
 					lengthof(dump_bandwidth_buckets));
 	if (regulator->dump_bandwidth_hist == NULL)
@@ -192,11 +203,8 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
 	ev_timer_init(&regulator->timer, vy_regulator_timer_cb, 0,
 		      VY_REGULATOR_TIMER_PERIOD);
 	regulator->timer.data = regulator;
-	regulator->write_rate = 0;
-	regulator->quota_used_last = 0;
 	regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
 	regulator->dump_watermark = SIZE_MAX;
-	regulator->dump_in_progress = false;
 }
 
 void
@@ -269,3 +277,73 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
 	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
 				regulator->dump_bandwidth);
 }
+
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator)
+{
+	memset(&regulator->sched_stat_last, 0,
+	       sizeof(regulator->sched_stat_last));
+}
+
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+			       const struct vy_scheduler_stat *stat,
+			       int compaction_threads)
+{
+	struct vy_scheduler_stat *last = &regulator->sched_stat_last;
+	struct vy_scheduler_stat *recent = &regulator->sched_stat_recent;
+	/*
+	 * The maximal dump rate the database can handle while
+	 * maintaining the current level of write amplification
+	 * equals:
+	 *
+	 *                                        dump_output
+	 *   max_dump_rate = compaction_rate * -----------------
+	 *                                     compaction_output
+	 *
+	 * The average compaction rate can be estimated with:
+	 *
+	 *                                          compaction_output
+	 *   compaction_rate = compaction_threads * -----------------
+	 *                                           compaction_time
+	 *
+	 * Putting it all together and taking into account data
+	 * compaction during memory dump, we get for the max
+	 * transaction rate:
+	 *
+	 *                                 dump_input
+	 *   max_tx_rate = max_dump_rate * ----------- =
+	 *                                 dump_output
+	 *
+	 *                                        dump_input
+	 *                 compaction_threads * ---------------
+	 *                                      compaction_time
+	 *
+	 * We set the rate limit to half that to leave the database
+	 * engine enough room needed for growing write amplification.
+	 */
+	recent->dump_count += stat->dump_count - last->dump_count;
+	recent->dump_input += stat->dump_input - last->dump_input;
+	recent->compaction_time += stat->compaction_time -
+				   last->compaction_time;
+	*last = *stat;
+
+	if (recent->compaction_time == 0 ||
+	    recent->dump_input < (int)VY_DUMP_SIZE_ACCT_MIN)
+		return;
+
+	double rate = 0.5 * compaction_threads * recent->dump_input /
+						 recent->compaction_time;
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
+				MIN(rate, SIZE_MAX));
+
+	/*
+	 * Periodically rotate statistics for quicker adaptation
+	 * to workload changes.
+	 */
+	if (recent->dump_count > VY_RECENT_DUMP_COUNT) {
+		recent->dump_count /= 2;
+		recent->dump_input /= 2;
+		recent->compaction_time /= 2;
+	}
+}
diff --git a/src/box/vy_regulator.h b/src/box/vy_regulator.h
index 0188da26..341f41df 100644
--- a/src/box/vy_regulator.h
+++ b/src/box/vy_regulator.h
@@ -35,6 +35,8 @@
 #include <stddef.h>
 #include <tarantool_ev.h>
 
+#include "vy_stat.h"
+
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
@@ -107,6 +109,16 @@ struct vy_regulator {
 	 * but vy_regulator_dump_complete() hasn't been called yet.
 	 */
 	bool dump_in_progress;
+	/**
+	 * Snapshot of scheduler statistics taken at the time of
+	 * the last rate limit update.
+	 */
+	struct vy_scheduler_stat sched_stat_last;
+	/**
+	 * Scheduler statistics for the most recent few dumps.
+	 * Used for calculating the rate limit.
+	 */
+	struct vy_scheduler_stat sched_stat_recent;
 };
 
 void
@@ -146,6 +158,21 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 void
 vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max);
 
+/**
+ * Called when global statistics are reset by box.stat.reset().
+ */
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator);
+
+/**
+ * Set transaction rate limit so as to ensure that compaction
+ * will keep up with dumps.
+ */
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+			       const struct vy_scheduler_stat *stat,
+			       int compaction_threads);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
-- 
2.11.0
