[PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes

Vladimir Davydov vdavydov.dev at gmail.com
Tue Jan 22 15:54:58 MSK 2019


On Mon, Jan 21, 2019 at 12:17:06AM +0300, Vladimir Davydov wrote:
> Since all ranges constituting an LSM tree have the same configuration,
> they tend to get compacted at approximately the same time. This entails
> IO load spikes, which, in turn, lead to deviation of the LSM tree from
> the target shape and hence increased read amplification. To prevent this
> from happening, this patch implements compaction randomization: with 10%
> probability we defer compaction at each LSM tree level, i.e. if the
> number of runs at a level exceeds the configured run_count_per_level,
> the level will be compacted with 90% probability, but with 10%
> probability it won't - compaction will be deferred until another run
> is added to the level.
> 
> Our simulations show that such a simple algorithm performs fairly well:
> it randomizes compaction pace among ranges, spreading IO load evenly in
> time, while the write amplification is increased by not more than 5-10%,
> which seems to be a reasonable price for elimination of IO load spikes.
> 
> Closes #3944
> ---
>  src/box/vy_range.c                 | 20 +++++++++++++++++++-
>  src/box/vy_run.c                   |  1 +
>  src/box/vy_run.h                   |  5 +++++
>  test/vinyl/write_iterator.result   | 16 ++++++++++++++++
>  test/vinyl/write_iterator.test.lua |  6 ++++++
>  5 files changed, 47 insertions(+), 1 deletion(-)
> 
> diff --git a/src/box/vy_range.c b/src/box/vy_range.c
> index db4a7ab0..747bfae3 100644
> --- a/src/box/vy_range.c
> +++ b/src/box/vy_range.c
> @@ -369,7 +369,25 @@ vy_range_update_compaction_priority(struct vy_range *range,
>  			 * we find an appropriate level for it.
>  			 */
>  		}
> -		if (level_run_count > opts->run_count_per_level) {
> +		/*
> +		 * Since all ranges constituting an LSM tree have
> +		 * the same configuration, they tend to get compacted
> +		 * simultaneously, leading to IO load spikes and, as
> +		 * a result, distortion of the LSM tree shape and
> +		 * increased read amplification. To prevent this from
> +		 * happening, we constantly randomize compaction pace
> +		 * among ranges by deferring compaction at each LSM
> +		 * tree level with some fixed small probability.
> +		 *
> +		 * Note, we can't use rand() directly here, because
> +		 * this function is called on every memory dump and
> +		 * scans all LSM tree levels. Instead we use the
> +		 * value of rand() from the slice creation time.
> +		 */
> +		uint32_t max_run_count = opts->run_count_per_level;
> +		if (slice->seed < RAND_MAX / 10)
> +			max_run_count++;
> +		if (level_run_count > max_run_count) {
>  			/*
>  			 * The number of runs at the current level
>  			 * exceeds the configured maximum. Arrange
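
For reference, stripped of the surrounding loop, the new check amounts
to the following (a simplified standalone sketch: vy_level_needs_compaction
is a made-up helper name, the real check lives inline in
vy_range_update_compaction_priority, and slice_seed is the rand() value
saved at slice creation time):

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static bool
vy_level_needs_compaction(uint32_t level_run_count,
			  uint32_t run_count_per_level,
			  uint32_t slice_seed)
{
	uint32_t max_run_count = run_count_per_level;
	/*
	 * With ~10% probability tolerate one extra run at this
	 * level, deferring compaction until the next run is added.
	 */
	if (slice_seed < (uint32_t)RAND_MAX / 10)
		max_run_count++;
	return level_run_count > max_run_count;
}

Because the seed is fixed at slice creation, the decision for a given
level doesn't flip between invocations, even though the priority is
recomputed on every memory dump.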

I ran some tests and, surprisingly, it turned out that randomization
didn't help at all: the compaction queue size jumped up to 30% and even
40% from time to time, even though there was plenty of compaction
power - the compaction threads were busy only half of the time. When I
looked closer, I saw that the queue size behaved strangely: it jumped
right after a dump, stayed high for a very short period, and then the
next dump pushed it back to 10%. This made me wonder how that could
happen at all - normally, the compaction queue should only grow after a
dump, never shrink.

I think I've finally found the root cause of the problem. The code that
computes the compaction priority (see vy_range_update_compaction_priority)
is unstable: the size of the first level is defined by the size of the
newest run, so if memory dumps produce runs of varying sizes - which is
what happens in practice, in contrast to simulation - the shape of the
tree varies from dump to dump, resulting in different compaction
priorities and unstable queue behavior.
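
To illustrate the instability with made-up numbers (run_size_ratio = 3.5,
which is the default), here is a toy program that maps a fixed set of
older runs to levels for two different dump sizes, the same way the
priority code does:

#include <stdio.h>

int main(void)
{
	const double ratio = 3.5;              /* default run_size_ratio */
	const double runs[] = { 100, 30, 8 };  /* older runs, MB */
	const double dumps[] = { 1, 4 };       /* two possible dump sizes, MB */

	for (int d = 0; d < 2; d++) {
		printf("dump %.0f MB:", dumps[d]);
		for (int i = 0; i < 3; i++) {
			/* The first level is sized after the newest run. */
			double target = dumps[d];
			int level = 1;
			while (runs[i] > target) {
				target *= ratio;
				level++;
			}
			printf("  run %.0f MB -> level %d", runs[i], level);
		}
		printf("\n");
	}
	return 0;
}

It prints levels 5/4/3 for a 1 MB dump but 4/3/2 for a 4 MB dump, i.e.
a 3 MB difference in dump size shifts every run to another level, which
changes level_run_count and hence the computed priority.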

We must fix this somehow. One way to do it is to compute the first level
size based on the size of the last-level run, which is constant most of
the time. Indeed, when I did that and reran the tests, the queue stayed
below 10% all the time, while turning randomization off made it jump up
to 30%, as expected. The diff is below. I'll spruce it up a little, wrap
it nicely, and submit it separately later.

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 747bfae3..38a64632 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -315,28 +315,33 @@ vy_range_update_compaction_priority(struct vy_range *range,
 	struct vy_disk_stmt_counter total_stmt_count;
 	vy_disk_stmt_counter_reset(&total_stmt_count);
 	/* Total number of checked runs. */
-	uint32_t total_run_count = 0;
+	int32_t total_run_count = 0;
 	/* Estimated size of a compacted run, if compaction is scheduled. */
-	uint64_t est_new_run_size = 0;
+	int64_t est_new_run_size = 0;
 	/* The number of runs at the current level. */
-	uint32_t level_run_count = 0;
+	int32_t level_run_count = 0;
 	/*
 	 * The target (perfect) size of a run at the current level.
 	 * For the first level, it's the size of the newest run.
 	 * For lower levels it's computed as first level run size
 	 * times run_size_ratio.
 	 */
-	uint64_t target_run_size = 0;
+	int64_t target_run_size = 0;
 
+	int64_t size;
 	struct vy_slice *slice;
+
+	slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
+	size = slice->count.bytes;
+
+	slice = rlist_first_entry(&range->slices, struct vy_slice, in_range);
+	do {
+		target_run_size = size;
+		size = DIV_ROUND_UP(target_run_size, opts->run_size_ratio);
+	} while (size > MAX(slice->count.bytes, 1));
+
 	rlist_foreach_entry(slice, &range->slices, in_range) {
-		uint64_t size = slice->count.bytes;
-		/*
-		 * The size of the first level is defined by
-		 * the size of the most recent run.
-		 */
-		if (target_run_size == 0)
-			target_run_size = size;
+		size = slice->count.bytes;
 		level_run_count++;
 		total_run_count++;
 		vy_disk_stmt_counter_add(&total_stmt_count, &slice->count);
@@ -384,7 +389,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
 		 * scans all LSM tree levels. Instead we use the
 		 * value of rand() from the slice creation time.
 		 */
-		uint32_t max_run_count = opts->run_count_per_level;
+		int32_t max_run_count = opts->run_count_per_level;
 		if (slice->seed < RAND_MAX / 10)
 			max_run_count++;
 		if (level_run_count > max_run_count) {
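
In case the intent of the new do-while loop isn't obvious, here it is as
a standalone function (a sketch: DIV_ROUND_UP is approximated with
ceil(), and the two slice lookups are replaced by plain parameters):

#include <math.h>
#include <stdint.h>

static int64_t
first_level_size(int64_t last_run_size, int64_t newest_run_size,
		 double run_size_ratio)
{
	int64_t target_run_size;
	int64_t size = last_run_size;
	int64_t floor_size = newest_run_size > 1 ? newest_run_size : 1;
	/*
	 * Divide the last run size by run_size_ratio until the next
	 * step would drop below the newest run size; what's left is
	 * the target size of the first level.
	 */
	do {
		target_run_size = size;
		size = (int64_t)ceil((double)target_run_size /
				     run_size_ratio);
	} while (size > floor_size);
	return target_run_size;
}

For example, with a 100 MB last run and a ratio of 3.5 it returns 3 MB
for a newest run of either 1 MB or 2 MB, whereas the old code would have
set the first level to 1 MB or 2 MB respectively - so the tree shape no
longer follows every fluctuation in dump size.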
