From: Vladimir Davydov <vdavydov.dev@gmail.com>
To: tarantool-patches@freelists.org
Subject: Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
Date: Tue, 22 Jan 2019 15:54:58 +0300
Message-ID: <20190122125458.cutoz5rtfd2sb6el@esperanza>
In-Reply-To: <44f34fbaf09af5d1054f2e4843a77e095afe1e71.1548017258.git.vdavydov.dev@gmail.com>

On Mon, Jan 21, 2019 at 12:17:06AM +0300, Vladimir Davydov wrote:
> Since all ranges constituting an LSM tree have the same configuration,
> they tend to get compacted at approximately the same time. This entails
> IO load spikes, which, in turn, lead to deviation of the LSM tree from
> the target shape and hence increased read amplification. To prevent this
> from happening, this patch implements compaction randomization: with 10%
> probability we defer compaction at each LSM tree level, i.e. if the
> number of runs at a level exceeds the configured run_count_per_level,
> the level will be compacted with 90% probability, but with 10%
> probability it won't - compaction will be deferred until another run
> is added to the level.
> 
> Our simulations show that such a simple algorithm performs fairly well:
> it randomizes compaction pace among ranges, spreading IO load evenly in
> time, while the write amplification is increased by no more than 5-10%,
> which seems to be a reasonable price for elimination of IO load spikes.
> 
> Closes #3944
> ---
>  src/box/vy_range.c                 | 20 +++++++++++++++++++-
>  src/box/vy_run.c                   |  1 +
>  src/box/vy_run.h                   |  5 +++++
>  test/vinyl/write_iterator.result   | 16 ++++++++++++++++
>  test/vinyl/write_iterator.test.lua |  6 ++++++
>  5 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/vy_range.c b/src/box/vy_range.c
> index db4a7ab0..747bfae3 100644
> --- a/src/box/vy_range.c
> +++ b/src/box/vy_range.c
> @@ -369,7 +369,25 @@ vy_range_update_compaction_priority(struct vy_range *range,
>  			 * we find an appropriate level for it.
>  			 */
>  		}
> -		if (level_run_count > opts->run_count_per_level) {
> +		/*
> +		 * Since all ranges constituting an LSM tree have
> +		 * the same configuration, they tend to get compacted
> +		 * simultaneously, leading to IO load spikes and, as
> +		 * a result, distortion of the LSM tree shape and
> +		 * increased read amplification. To prevent this from
> +		 * happening, we constantly randomize compaction pace
> +		 * among ranges by deferring compaction at each LSM
> +		 * tree level with some fixed small probability.
> +		 *
> +		 * Note, we can't use rand() directly here, because
> +		 * this function is called on every memory dump and
> +		 * scans all LSM tree levels. Instead we use the
> +		 * value of rand() from the slice creation time.
> +		 */
> +		uint32_t max_run_count = opts->run_count_per_level;
> +		if (slice->seed < RAND_MAX / 10)
> +			max_run_count++;
> +		if (level_run_count > max_run_count) {
>  			/*
>  			 * The number of runs at the current level
>  			 * exceeds the configured maximum. Arrange
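
For reference, the deferral logic quoted above boils down to the
following standalone sketch (hypothetical names and simplified types,
not the actual vinyl code). The key point from the comment in the
patch is that rand() is sampled once, at slice creation time, so that
repeated priority updates see a stable per-slice decision:

#include <stdint.h>
#include <stdlib.h>

struct slice_sketch {
	/* Random value sampled once, when the slice is created. */
	int seed;
};

static void
slice_sketch_create(struct slice_sketch *slice)
{
	/* Called once per slice, NOT on every memory dump. */
	slice->seed = rand();
}

static uint32_t
max_run_count_sketch(const struct slice_sketch *slice,
		     uint32_t run_count_per_level)
{
	uint32_t max_run_count = run_count_per_level;
	/*
	 * With ~10% probability allow one extra run at the level,
	 * i.e. defer compaction until yet another run is added.
	 */
	if (slice->seed < RAND_MAX / 10)
		max_run_count++;
	return max_run_count;
}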

I ran some tests and, surprisingly, it turned out that randomization
didn't help at all: the compaction queue size jumped up to 30% and even
40% from time to time, even though there was plenty of compaction
power - compaction threads were busy only half of the time. When I
looked closer, I saw that the queue size behaved weirdly - it jumped up
right after a dump and stayed high for a very short period of time,
until the next dump pushed it back to 10%. This made me wonder how that
could happen at all - normally, the compaction queue should only grow
after a dump, never shrink.

I think I've finally found the root cause of the problem. The code
computing compaction priority (see vy_range_update_compaction_priority)
is unstable: the size of the first level is defined by the size of the
newest run, so if memory dumps produce runs of varying sizes - which is
what happens in practice, in contrast to the simulation - the shape of
the tree varies as well, resulting in different compaction priorities
and unstable queue behavior.
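
To make the failure mode concrete, here's a toy illustration (not the
actual vinyl code; sizes in MB are made up). With run_size_ratio = 4,
the level a 30 MB run falls into depends entirely on the size of the
most recent dump:

#include <stdint.h>
#include <stdio.h>

/* Level of a run given the first-level target size and the ratio. */
static int
level_of(int64_t run_size, int64_t first_level_size, int64_t ratio)
{
	int level = 1;
	int64_t target = first_level_size;
	while (run_size > target) {
		target *= ratio;
		level++;
	}
	return level;
}

int
main(void)
{
	/* Newest run is 10 MB: the 30 MB run lands on level 2. */
	printf("%d\n", level_of(30, 10, 4));
	/* Newest run is 2 MB: the very same run sinks to level 3. */
	printf("%d\n", level_of(30, 2, 4));
	return 0;
}

So a couple of dumps of different sizes are enough to reshuffle all the
levels and hence flip the compaction priority back and forth.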

We must fix this somehow. One way to do it is to compute the first
level size based on the size of the last level run, which is constant
most of the time. Indeed, when I did that and reran the tests, the
queue stayed below 10% all the time, while turning randomization off
made it jump up to 30%, which was expected. The diff is below. I'll
spruce it up a little, wrap it nicely, and submit it separately later.

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 747bfae3..38a64632 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -315,28 +315,33 @@ vy_range_update_compaction_priority(struct vy_range *range,
 	struct vy_disk_stmt_counter total_stmt_count;
 	vy_disk_stmt_counter_reset(&total_stmt_count);
 	/* Total number of checked runs. */
-	uint32_t total_run_count = 0;
+	int32_t total_run_count = 0;
 	/* Estimated size of a compacted run, if compaction is scheduled. */
-	uint64_t est_new_run_size = 0;
+	int64_t est_new_run_size = 0;
 	/* The number of runs at the current level. */
-	uint32_t level_run_count = 0;
+	int32_t level_run_count = 0;
 	/*
 	 * The target (perfect) size of a run at the current level.
 	 * For the first level, it's the size of the newest run.
 	 * For lower levels it's computed as first level run size
 	 * times run_size_ratio.
 	 */
-	uint64_t target_run_size = 0;
+	int64_t target_run_size = 0;
 
+	int64_t size;
 	struct vy_slice *slice;
+
+	slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
+	size = slice->count.bytes;
+
+	slice = rlist_first_entry(&range->slices, struct vy_slice, in_range);
+	do {
+		target_run_size = size;
+		size = DIV_ROUND_UP(target_run_size, opts->run_size_ratio);
+	} while (size > MAX(slice->count.bytes, 1));
+
 	rlist_foreach_entry(slice, &range->slices, in_range) {
-		uint64_t size = slice->count.bytes;
-		/*
-		 * The size of the first level is defined by
-		 * the size of the most recent run.
-		 */
-		if (target_run_size == 0)
-			target_run_size = size;
+		size = slice->count.bytes;
 		level_run_count++;
 		total_run_count++;
 		vy_disk_stmt_counter_add(&total_stmt_count, &slice->count);
@@ -384,7 +389,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
 		 * scans all LSM tree levels. Instead we use the
 		 * value of rand() from the slice creation time.
 		 */
-		uint32_t max_run_count = opts->run_count_per_level;
+		int32_t max_run_count = opts->run_count_per_level;
 		if (slice->seed < RAND_MAX / 10)
 			max_run_count++;
 		if (level_run_count > max_run_count) {
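
To spell out the intent of the new loop, here's a standalone
restatement with hypothetical types (a sketch, not the actual patch):
derive the first-level target from the last (oldest, largest) run by
repeatedly dividing it by run_size_ratio, stopping before the next
division would drop to or below the newest run's size. The result only
changes when the last-level run changes, which is rare, so the computed
priority stays stable across dumps:

#include <stdint.h>

#define DIV_ROUND_UP(a, b) (((a) + (b) - 1) / (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

static int64_t
first_level_target(int64_t last_run_size, int64_t newest_run_size,
		   int64_t run_size_ratio)
{
	int64_t target, size = last_run_size;
	do {
		target = size;
		size = DIV_ROUND_UP(target, run_size_ratio);
	} while (size > MAX(newest_run_size, 1));
	return target;
}

For example, with last_run_size = 32, newest_run_size = 2 and
run_size_ratio = 4 this yields 8: the first level is sized so that the
newest run fits under it regardless of how big the next dump happens
to be.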

Thread overview: 39+ messages
2019-01-20 21:16 [PATCH 0/9] vinyl: compaction randomization and throttling Vladimir Davydov
2019-01-20 21:17 ` [PATCH 1/9] vinyl: update lsm->range_heap in one go on dump completion Vladimir Davydov
2019-01-24 16:55   ` Vladimir Davydov
2019-02-05 16:37   ` [tarantool-patches] " Konstantin Osipov
2019-01-20 21:17 ` [PATCH 2/9] vinyl: ignore unknown .run, .index and .vylog keys Vladimir Davydov
2019-01-24 16:56   ` Vladimir Davydov
2019-01-20 21:17 ` [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction Vladimir Davydov
2019-01-21  9:42   ` Vladimir Davydov
2019-02-05 16:49     ` [tarantool-patches] " Konstantin Osipov
2019-02-06  8:55       ` Vladimir Davydov
2019-02-06 10:46         ` Konstantin Osipov
2019-02-06 10:55           ` Vladimir Davydov
2019-02-05 16:43   ` Konstantin Osipov
2019-02-06 16:48     ` Vladimir Davydov
2019-01-20 21:17 ` [PATCH 4/9] vinyl: rename lsm->range_heap to max_compaction_priority Vladimir Davydov
2019-01-20 21:17 ` [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
2019-02-05 16:58   ` [tarantool-patches] " Konstantin Osipov
2019-02-06  9:20     ` Vladimir Davydov
2019-02-06 16:54       ` Vladimir Davydov
2019-01-20 21:17 ` [PATCH 6/9] vinyl: set range size automatically Vladimir Davydov
2019-01-22  9:17   ` Vladimir Davydov
2019-02-05 17:09   ` [tarantool-patches] " Konstantin Osipov
2019-02-06  9:23     ` Vladimir Davydov
2019-02-06 17:04       ` Vladimir Davydov
2019-01-20 21:17 ` [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
2019-01-22 12:54   ` Vladimir Davydov [this message]
2019-02-05 17:39     ` [tarantool-patches] " Konstantin Osipov
2019-02-06  8:53       ` Vladimir Davydov
2019-02-06 10:44         ` Konstantin Osipov
2019-02-06 10:52           ` Vladimir Davydov
2019-02-06 11:06             ` Konstantin Osipov
2019-02-06 11:49               ` Vladimir Davydov
2019-02-06 13:43                 ` Konstantin Osipov
2019-02-06 14:00                   ` Vladimir Davydov
2019-02-05 17:14   ` Konstantin Osipov
2019-01-20 21:17 ` [PATCH 8/9] vinyl: introduce quota consumer types Vladimir Davydov
2019-01-20 21:17 ` [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
2019-01-21 14:14   ` Vladimir Davydov
2019-01-22  9:09   ` Vladimir Davydov
