Tarantool development patches archive
* [PATCH v2 0/8] vinyl: compaction randomization and throttling
@ 2019-01-24 17:12 Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 1/8] vinyl: use uncompressed run size for range split/coalesce/compaction Vladimir Davydov
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

This patch set achieves two goals. The first is randomizing the
compaction pace among ranges so as to avoid IO load bursts, which
badly affect read and space amplification and complicate the
implementation of transaction throttling. The second is making sure
that compaction always keeps up with dumps, throttling transactions
if it doesn't. For more details, see the comments to the individual
patches and the GitHub issues.

https://github.com/tarantool/tarantool/issues/3944
https://github.com/tarantool/tarantool/issues/3721
https://github.com/tarantool/tarantool/commits/dv/gh-3721-3944-vy-compaction-randomization-and-throttling

Changes in v2:
 - Fixed compaction priority calculation that resulted in unstable
   compaction behavior and prevented randomization from smoothing out
   IO load generated by compaction.
 - Tuned automatic range sizing: don't limit the total number of ranges,
   as that isn't going to play nicely with time series workloads; set
   the max range size to 2 GB, as we don't want compaction to take an
   eternity no matter what; use 4 * dumps_per_compaction for
   range_count - better safe than sorry.
 - Fixed dumps_per_compaction update in case compaction was forced
   manually using index.compact(). We don't want to update this metric
   in such a case to avoid undesired side effects, such as splitting or
   coalescing ranges for no good reason.
 - Pushed the first two trivial patches and rebased.
 - Fixed a few failing tests.

v1: https://www.freelists.org/post/tarantool-patches/PATCH-09-vinyl-compaction-randomization-and-throttling

Vladimir Davydov (8):
  vinyl: use uncompressed run size for range split/coalesce/compaction
  vinyl: fix compaction priority calculation
  vinyl: rename lsm->range_heap to max_compaction_priority
  vinyl: keep track of dumps per compaction for each LSM tree
  vinyl: set range size automatically
  vinyl: randomize range compaction to avoid IO load spikes
  vinyl: introduce quota consumer types
  vinyl: throttle tx to ensure compaction keeps up with dumps

 src/box/alter.cc                    |   8 +-
 src/box/box.cc                      |   6 +-
 src/box/index_def.c                 |   2 +-
 src/box/lua/load_cfg.lua            |   2 +-
 src/box/lua/space.cc                |   6 +-
 src/box/vinyl.c                     |  39 ++++--
 src/box/vy_log.c                    |  26 +++-
 src/box/vy_log.h                    |  10 +-
 src/box/vy_lsm.c                    |  91 ++++++++++++--
 src/box/vy_lsm.h                    |  12 +-
 src/box/vy_quota.c                  | 132 ++++++++++++++++-----
 src/box/vy_quota.h                  |  97 ++++++++++++---
 src/box/vy_range.c                  |  95 +++++++++++----
 src/box/vy_range.h                  |  60 ++++++++--
 src/box/vy_regulator.c              |  98 +++++++++++++--
 src/box/vy_regulator.h              |  27 +++++
 src/box/vy_run.c                    |   1 +
 src/box/vy_run.h                    |  20 ++++
 src/box/vy_scheduler.c              |  48 ++++++--
 test/app-tap/init_script.result     |  21 ++--
 test/box-tap/cfg.test.lua           |   3 +-
 test/box/admin.result               |   2 -
 test/box/cfg.result                 |   4 -
 test/vinyl/ddl.result               |   5 -
 test/vinyl/ddl.test.lua             |   1 -
 test/vinyl/deferred_delete.result   |  21 +++-
 test/vinyl/deferred_delete.test.lua |  11 +-
 test/vinyl/errinj.result            |   4 +-
 test/vinyl/errinj.test.lua          |   3 +-
 test/vinyl/gc.result                |   3 +-
 test/vinyl/gc.test.lua              |   3 +-
 test/vinyl/layout.result            | 178 ++++++++++++++++++++-------
 test/vinyl/layout.test.lua          |   5 +-
 test/vinyl/misc.result              |  78 ++++++++++++
 test/vinyl/misc.test.lua            |  26 ++++
 test/vinyl/replica_rejoin.result    |   2 +-
 test/vinyl/replica_rejoin.test.lua  |   2 +-
 test/vinyl/stat.result              | 231 ++++++++++++++++++++++++++++--------
 test/vinyl/stat.test.lua            |  63 ++++++++++
 test/vinyl/update_optimize.result   |  76 ++++++------
 test/vinyl/update_optimize.test.lua |  56 +++++----
 test/vinyl/write_iterator.result    |  20 +++-
 test/vinyl/write_iterator.test.lua  |  10 +-
 43 files changed, 1259 insertions(+), 349 deletions(-)

-- 
2.11.0

* [PATCH v2 1/8] vinyl: use uncompressed run size for range split/coalesce/compaction
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 2/8] vinyl: fix compaction priority calculation Vladimir Davydov
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

Historically, when considering splitting or coalescing a range or
updating compaction priority, we have used the sizes of compressed
runs (see bytes_compressed). This makes the algorithms dependent on
whether compression is used and how effective it is, which is weird,
because compression is merely a way of storing data on disk - it
shouldn't affect how the data is partitioned. E.g. if we turned off
compression at the first LSM tree level, which would make sense
because that level is relatively small, we would inadvertently change
the behavior of the compaction algorithm.

So let's use uncompressed run sizes when considering range tree
transformations.
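
To make the distinction concrete, here is a minimal sketch (the
counter fields mirror the ones used in the diff below; the helper
itself is illustrative and not part of the patch):

    /* Per-run statement counter (only the fields relevant here). */
    struct vy_disk_stmt_counter {
            int64_t bytes;            /* size before compression */
            int64_t bytes_compressed; /* size as stored on disk */
    };

    /*
     * Run size for partitioning decisions (split, coalesce,
     * compaction priority): use the uncompressed size, so that
     * the result doesn't depend on the codec or on how well the
     * data happens to compress.
     */
    static inline int64_t
    vy_run_partitioning_size(const struct vy_disk_stmt_counter *count)
    {
            return count->bytes;
    }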
---
 src/box/vy_range.c                  | 12 ++++++------
 test/vinyl/deferred_delete.result   | 21 +++++++++++++++------
 test/vinyl/deferred_delete.test.lua | 11 +++++++----
 3 files changed, 28 insertions(+), 16 deletions(-)

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index f649aff7..87c4c6b9 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -329,7 +329,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
 
 	struct vy_slice *slice;
 	rlist_foreach_entry(slice, &range->slices, in_range) {
-		uint64_t size = slice->count.bytes_compressed;
+		uint64_t size = slice->count.bytes;
 		/*
 		 * The size of the first level is defined by
 		 * the size of the most recent run.
@@ -377,7 +377,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
 			 */
 			range->compaction_priority = total_run_count;
 			range->compaction_queue = total_stmt_count;
-			est_new_run_size = total_stmt_count.bytes_compressed;
+			est_new_run_size = total_stmt_count.bytes;
 		}
 	}
 
@@ -419,7 +419,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
 	slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
 
 	/* The range is too small to be split. */
-	if (slice->count.bytes_compressed < opts->range_size * 4 / 3)
+	if (slice->count.bytes < opts->range_size * 4 / 3)
 		return false;
 
 	/* Find the median key in the oldest run (approximately). */
@@ -481,7 +481,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
 	struct vy_range *it;
 
 	/* Size of the coalesced range. */
-	uint64_t total_size = range->count.bytes_compressed;
+	uint64_t total_size = range->count.bytes;
 	/* Coalesce ranges until total_size > max_size. */
 	uint64_t max_size = opts->range_size / 2;
 
@@ -496,7 +496,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
 	for (it = vy_range_tree_next(tree, range);
 	     it != NULL && !vy_range_is_scheduled(it);
 	     it = vy_range_tree_next(tree, it)) {
-		uint64_t size = it->count.bytes_compressed;
+		uint64_t size = it->count.bytes;
 		if (total_size + size > max_size)
 			break;
 		total_size += size;
@@ -505,7 +505,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
 	for (it = vy_range_tree_prev(tree, range);
 	     it != NULL && !vy_range_is_scheduled(it);
 	     it = vy_range_tree_prev(tree, it)) {
-		uint64_t size = it->count.bytes_compressed;
+		uint64_t size = it->count.bytes;
 		if (total_size + size > max_size)
 			break;
 		total_size += size;
diff --git a/test/vinyl/deferred_delete.result b/test/vinyl/deferred_delete.result
index 29945f8d..61f81ce2 100644
--- a/test/vinyl/deferred_delete.result
+++ b/test/vinyl/deferred_delete.result
@@ -668,16 +668,13 @@ test_run:cmd("switch test")
 fiber = require('fiber')
 ---
 ...
-digest = require('digest')
----
-...
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
 ...
-pk = s:create_index('pk', {run_count_per_level = 10})
+pk = s:create_index('pk', {run_count_per_level = 10, run_size_ratio = 2})
 ---
 ...
-sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+sk = s:create_index('sk', {run_count_per_level = 10, run_size_ratio = 2, parts = {2, 'unsigned', 3, 'string'}, unique = false})
 ---
 ...
 -- Write a run big enough to prevent major compaction from kicking in
@@ -685,13 +682,25 @@ sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3,
 dummy_rows = 100
 ---
 ...
-for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, digest.urandom(100)} end
+pad = string.rep('z', 50 * 1024)
+---
+...
+for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, pad} end
 ---
 ...
 box.snapshot()
 ---
 - ok
 ...
+pk:compact()
+---
+...
+sk:compact()
+---
+...
+while box.stat.vinyl().scheduler.compaction_queue > 0 do fiber.sleep(0.001) end
+---
+...
 pad = string.rep('x', 10 * 1024)
 ---
 ...
diff --git a/test/vinyl/deferred_delete.test.lua b/test/vinyl/deferred_delete.test.lua
index d38802da..93b5b358 100644
--- a/test/vinyl/deferred_delete.test.lua
+++ b/test/vinyl/deferred_delete.test.lua
@@ -252,17 +252,20 @@ test_run:cmd("start server test with args='1048576'")
 test_run:cmd("switch test")
 
 fiber = require('fiber')
-digest = require('digest')
 
 s = box.schema.space.create('test', {engine = 'vinyl'})
-pk = s:create_index('pk', {run_count_per_level = 10})
-sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+pk = s:create_index('pk', {run_count_per_level = 10, run_size_ratio = 2})
+sk = s:create_index('sk', {run_count_per_level = 10, run_size_ratio = 2, parts = {2, 'unsigned', 3, 'string'}, unique = false})
 
 -- Write a run big enough to prevent major compaction from kicking in
 -- (run_count_per_level is ignored on the last level - see gh-3657).
 dummy_rows = 100
-for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, digest.urandom(100)} end
+pad = string.rep('z', 50 * 1024)
+for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, pad} end
 box.snapshot()
+pk:compact()
+sk:compact()
+while box.stat.vinyl().scheduler.compaction_queue > 0 do fiber.sleep(0.001) end
 
 pad = string.rep('x', 10 * 1024)
 for i = 1, 120 do s:replace{i, i, pad} end
-- 
2.11.0

* [PATCH v2 2/8] vinyl: fix compaction priority calculation
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 1/8] vinyl: use uncompressed run size for range split/coalesce/compaction Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-08 17:18   ` [tarantool-patches] " Konstantin Osipov
  2019-01-24 17:12 ` [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority Vladimir Davydov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

When computing the number of runs that need to be compacted for a range
to conform to the target LSM tree shape, we use the newest run size for
the size of the first LSM tree level. This isn't quite correct for two
reasons.

First, the size of the newest run is unstable - it may vary in a
relatively wide range from dump to dump. This leads to frequent
changes in the target LSM tree shape and, as a result, to
unpredictable compaction behavior. In particular, it breaks compaction
randomization, which is supposed to smooth out the IO load generated
by compaction.

Second, this can increase space amplification. We trigger compaction
at the last level whenever it contains more than one run, irrespective
of the run_count_per_level configuration option. We expect this to
keep space amplification below 2, provided run_count_per_level is not
greater than (run_size_ratio - 1). However, if the newest run happens
to have such a size that multiplying it by run_size_ratio several
times gives us a value only slightly less than the size of the oldest
run, we can accumulate up to run_count_per_level extra runs that are
approximately as big as the last-level run without triggering
compaction, thus increasing space amplification by up to
run_count_per_level.
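
For a concrete illustration with made-up numbers: let run_size_ratio
be 3 and run_count_per_level be 2, and suppose the newest run is 10 MB
while the oldest one is 95 MB. Deriving the levels from the newest run
gives targets of 10, 30, 90 and 270 MB, so the 95 MB run alone
occupies the last (270 MB) level. Two more ~85 MB runs can then settle
on the 90 MB level without exceeding run_count_per_level, leaving us
with ~265 MB on disk where ~95 MB would do.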

To fix these problems, let's derive the size of the first LSM tree
level from the oldest run instead: keep dividing its size by
run_size_ratio for as long as the result still exceeds the size of
the newest run, and use the last value that does.
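
For example, with run_size_ratio = 3.5, an oldest run of 100 MB and a
newest run of 9 MB (hypothetical sizes), the chain is

    100 -> DIV_ROUND_UP(100, 3.5) = 29 -> DIV_ROUND_UP(29, 3.5) = 9

Division stops once the value no longer exceeds the newest run size,
so the first level target is 29 MB, and walking back up by
run_size_ratio puts the last level target at ~101 MB - slightly above
the oldest run, as intended.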

Follow-up #3657
---
 src/box/vy_range.c                  |  42 ++++++---
 test/vinyl/errinj.result            |   4 +-
 test/vinyl/errinj.test.lua          |   3 +-
 test/vinyl/gc.result                |   3 +-
 test/vinyl/gc.test.lua              |   3 +-
 test/vinyl/layout.result            | 178 +++++++++++++++++++++++++++---------
 test/vinyl/layout.test.lua          |   5 +-
 test/vinyl/replica_rejoin.result    |   2 +-
 test/vinyl/replica_rejoin.test.lua  |   2 +-
 test/vinyl/update_optimize.result   |  76 +++++++--------
 test/vinyl/update_optimize.test.lua |  56 ++++++------
 11 files changed, 241 insertions(+), 133 deletions(-)

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 87c4c6b9..7211cfb2 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -321,21 +321,41 @@ vy_range_update_compaction_priority(struct vy_range *range,
 	uint32_t level_run_count = 0;
 	/*
 	 * The target (perfect) size of a run at the current level.
-	 * For the first level, it's the size of the newest run.
-	 * For lower levels it's computed as first level run size
-	 * times run_size_ratio.
+	 * Calculated recurrently: the size of the next level equals
+	 * the size of the previous level times run_size_ratio.
+	 *
+	 * For the last level we want it to be slightly greater
+	 * than the size of the last (biggest, oldest) run so that
+	 * all newer runs are at least run_size_ratio times smaller,
+	 * because in conjunction with the fact that we never store
+	 * more than one run at the last level, this will keep space
+	 * amplification below 2 provided run_count_per_level is not
+	 * greater than (run_size_ratio - 1).
+	 *
+	 * So to calculate the target size of the first level, we
+	 * keep dividing the size of the oldest run by run_size_ratio
+	 * for as long as the result still exceeds the size of the
+	 * newest run. Note, DIV_ROUND_UP is important here, because
+	 * if we used integral division, then after descending to the
+	 * last level we would get a value slightly less than the last
+	 * run size, not slightly greater, as we wanted, which could
+	 * increase space amplification by run_count_per_level in the
+	 * worst case scenario.
 	 */
-	uint64_t target_run_size = 0;
+	uint64_t target_run_size;
 
+	uint64_t size;
 	struct vy_slice *slice;
+	slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
+	size = slice->count.bytes;
+	slice = rlist_first_entry(&range->slices, struct vy_slice, in_range);
+	do {
+		target_run_size = size;
+		size = DIV_ROUND_UP(target_run_size, opts->run_size_ratio);
+	} while (size > (uint64_t)MAX(slice->count.bytes, 1));
+
 	rlist_foreach_entry(slice, &range->slices, in_range) {
-		uint64_t size = slice->count.bytes;
-		/*
-		 * The size of the first level is defined by
-		 * the size of the most recent run.
-		 */
-		if (target_run_size == 0)
-			target_run_size = size;
+		size = slice->count.bytes;
 		level_run_count++;
 		total_run_count++;
 		vy_disk_stmt_counter_add(&total_stmt_count, &slice->count);
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index 4a3df6ae..990c7e85 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1016,9 +1016,9 @@ s:replace{1, 10}
 ---
 - [1, 10]
 ...
-s:replace{10, 100} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i, i} end
 ---
-- [10, 100]
 ...
 box.snapshot()
 ---
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index c9d04aaf..d374a910 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -371,7 +371,8 @@ s = box.schema.space.create('test', {engine = 'vinyl'})
 _ = s:create_index('pk', {run_count_per_level = 10})
 _ = s:create_index('sk', {unique = false, parts = {2, 'unsigned'}})
 s:replace{1, 10}
-s:replace{10, 100} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i, i} end
 box.snapshot()
 s:replace{1, 20}
 box.snapshot()
diff --git a/test/vinyl/gc.result b/test/vinyl/gc.result
index 098c17c2..11e31619 100644
--- a/test/vinyl/gc.result
+++ b/test/vinyl/gc.result
@@ -168,7 +168,8 @@ function count_runs() return #fio.glob(fio.pathjoin(box.cfg.vinyl_dir, s.id, s.i
 _ = s:replace{1}
 ---
 ...
-_ = s:replace{2} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i} end
 ---
 ...
 box.snapshot()
diff --git a/test/vinyl/gc.test.lua b/test/vinyl/gc.test.lua
index b5e42d6b..02cb6d32 100644
--- a/test/vinyl/gc.test.lua
+++ b/test/vinyl/gc.test.lua
@@ -84,7 +84,8 @@ _ = s:create_index('pk', {run_count_per_level = 3})
 function count_runs() return #fio.glob(fio.pathjoin(box.cfg.vinyl_dir, s.id, s.index.pk.id, '*.run')) end
 
 _ = s:replace{1}
-_ = s:replace{2} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i} end
 box.snapshot()
 _ = s:replace{3}
 box.snapshot()
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 14201c5d..3be2bb91 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -38,8 +38,6 @@ box.snapshot()
 space.index.sk:alter{parts = {{2, 'unsigned', is_nullable = true}}}
 ---
 ...
--- Note, the first run is bigger than the second one to prevent
--- last-level compaction (gh-3657).
 space:replace{'ЭЭЭ', box.NULL}
 ---
 - ['ЭЭЭ', null]
@@ -52,9 +50,9 @@ space:replace{'ёёё', box.NULL}
 ---
 - ['ёёё', null]
 ...
-space:replace{'ЭЮЯ', 666}
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do space:replace{tostring(i), i} end
 ---
-- ['ЭЮЯ', 666]
 ...
 box.snapshot()
 ---
@@ -132,16 +130,16 @@ test_run:cmd("push filter 'offset: .*' to 'offset: <offset>'")
 ...
 result
 ---
-- - - 00000000000000000011.vylog
+- - - 00000000000000000020.vylog
     - - HEADER:
           type: INSERT
         BODY:
           tuple: [0, {6: 512, 7: [{'field': 0, 'collation': 1, 'type': 'string'}],
-              9: 11, 12: 3, 13: 7}]
+              9: 20, 12: 3, 13: 7}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {2: 8, 9: 11}]
+          tuple: [5, {2: 8, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -162,11 +160,11 @@ result
           type: INSERT
         BODY:
           tuple: [0, {0: 2, 5: 1, 6: 512, 7: [{'field': 1, 'is_nullable': true, 'type': 'unsigned'}],
-              9: 11, 12: 4, 13: 7}]
+              9: 20, 12: 4, 13: 7}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 6, 9: 11}]
+          tuple: [5, {0: 2, 2: 6, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -206,7 +204,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 10, 9: 14}]
+          tuple: [5, {0: 2, 2: 10, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -216,7 +214,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [10, {0: 2, 9: 14}]
+          tuple: [10, {0: 2, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -226,7 +224,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {2: 12, 9: 14}]
+          tuple: [5, {2: 12, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -236,29 +234,79 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [10, {9: 14}]
+          tuple: [10, {9: 23}]
   - - 00000000000000000008.index
     - - HEADER:
           type: RUNINFO
         BODY:
           min_lsn: 8
           bloom_filter: <bloom_filter>
-          max_key: ['ЭЮЯ']
+          max_key: ['ЭЭЭ']
           page_count: 1
-          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 4}
-          max_lsn: 11
-          min_key: ['ёёё']
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 13}
+          max_lsn: 20
+          min_key: ['1001']
       - HEADER:
           type: PAGEINFO
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 108
-          unpacked_size: 89
-          row_count: 4
-          min_key: ['ёёё']
+          size: 286
+          unpacked_size: 267
+          row_count: 13
+          min_key: ['1001']
   - - 00000000000000000008.run
     - - HEADER:
+          lsn: 11
+          type: REPLACE
+        BODY:
+          tuple: ['1001', 1001]
+      - HEADER:
+          lsn: 12
+          type: REPLACE
+        BODY:
+          tuple: ['1002', 1002]
+      - HEADER:
+          lsn: 13
+          type: REPLACE
+        BODY:
+          tuple: ['1003', 1003]
+      - HEADER:
+          lsn: 14
+          type: REPLACE
+        BODY:
+          tuple: ['1004', 1004]
+      - HEADER:
+          lsn: 15
+          type: REPLACE
+        BODY:
+          tuple: ['1005', 1005]
+      - HEADER:
+          lsn: 16
+          type: REPLACE
+        BODY:
+          tuple: ['1006', 1006]
+      - HEADER:
+          lsn: 17
+          type: REPLACE
+        BODY:
+          tuple: ['1007', 1007]
+      - HEADER:
+          lsn: 18
+          type: REPLACE
+        BODY:
+          tuple: ['1008', 1008]
+      - HEADER:
+          lsn: 19
+          type: REPLACE
+        BODY:
+          tuple: ['1009', 1009]
+      - HEADER:
+          lsn: 20
+          type: REPLACE
+        BODY:
+          tuple: ['1010', 1010]
+      - HEADER:
           lsn: 10
           type: REPLACE
         BODY:
@@ -274,24 +322,19 @@ result
         BODY:
           tuple: ['ЭЭЭ', null]
       - HEADER:
-          lsn: 11
-          type: REPLACE
-        BODY:
-          tuple: ['ЭЮЯ', 666]
-      - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 \0\0\00"
+          row_index: !!binary AAAAAAAAABAAAAAgAAAAMAAAAEAAAABQAAAAYAAAAHAAAACAAAAAkAAAAKAAAACwAAAAwA==
   - - 00000000000000000012.index
     - - HEADER:
           type: RUNINFO
         BODY:
-          min_lsn: 12
+          min_lsn: 21
           bloom_filter: <bloom_filter>
           max_key: ['ЮЮЮ']
           page_count: 1
           stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
-          max_lsn: 14
+          max_lsn: 23
           min_key: ['ёёё']
       - HEADER:
           type: PAGEINFO
@@ -304,19 +347,19 @@ result
           min_key: ['ёёё']
   - - 00000000000000000012.run
     - - HEADER:
-          lsn: 12
+          lsn: 21
           type: REPLACE
         BODY:
           tuple: ['ёёё', 123]
           tuple_meta: {1: 1}
       - HEADER:
-          lsn: 14
+          lsn: 23
           type: REPLACE
         BODY:
           tuple: ['ююю', 789]
           tuple_meta: {1: 1}
       - HEADER:
-          lsn: 13
+          lsn: 22
           type: REPLACE
         BODY:
           tuple: ['ЮЮЮ', 456]
@@ -331,19 +374,19 @@ result
         BODY:
           min_lsn: 8
           bloom_filter: <bloom_filter>
-          max_key: [666, 'ЭЮЯ']
+          max_key: [1010, '1010']
           page_count: 1
-          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 4}
-          max_lsn: 11
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 13}
+          max_lsn: 20
           min_key: [null, 'ёёё']
       - HEADER:
           type: PAGEINFO
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 108
-          unpacked_size: 89
-          row_count: 4
+          size: 286
+          unpacked_size: 267
+          row_count: 13
           min_key: [null, 'ёёё']
   - - 00000000000000000006.run
     - - HEADER:
@@ -365,21 +408,66 @@ result
           lsn: 11
           type: REPLACE
         BODY:
-          tuple: [666, 'ЭЮЯ']
+          tuple: [1001, '1001']
+      - HEADER:
+          lsn: 12
+          type: REPLACE
+        BODY:
+          tuple: [1002, '1002']
+      - HEADER:
+          lsn: 13
+          type: REPLACE
+        BODY:
+          tuple: [1003, '1003']
+      - HEADER:
+          lsn: 14
+          type: REPLACE
+        BODY:
+          tuple: [1004, '1004']
+      - HEADER:
+          lsn: 15
+          type: REPLACE
+        BODY:
+          tuple: [1005, '1005']
+      - HEADER:
+          lsn: 16
+          type: REPLACE
+        BODY:
+          tuple: [1006, '1006']
+      - HEADER:
+          lsn: 17
+          type: REPLACE
+        BODY:
+          tuple: [1007, '1007']
+      - HEADER:
+          lsn: 18
+          type: REPLACE
+        BODY:
+          tuple: [1008, '1008']
+      - HEADER:
+          lsn: 19
+          type: REPLACE
+        BODY:
+          tuple: [1009, '1009']
+      - HEADER:
+          lsn: 20
+          type: REPLACE
+        BODY:
+          tuple: [1010, '1010']
       - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 \0\0\00"
+          row_index: !!binary AAAAAAAAABAAAAAgAAAAMAAAAEAAAABQAAAAYAAAAHAAAACAAAAAkAAAAKAAAACwAAAAwA==
   - - 00000000000000000010.index
     - - HEADER:
           type: RUNINFO
         BODY:
-          min_lsn: 12
+          min_lsn: 21
           bloom_filter: <bloom_filter>
           max_key: [789, 'ююю']
           page_count: 1
           stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
-          max_lsn: 14
+          max_lsn: 23
           min_key: [123, 'ёёё']
       - HEADER:
           type: PAGEINFO
@@ -392,17 +480,17 @@ result
           min_key: [123, 'ёёё']
   - - 00000000000000000010.run
     - - HEADER:
-          lsn: 12
+          lsn: 21
           type: REPLACE
         BODY:
           tuple: [123, 'ёёё']
       - HEADER:
-          lsn: 13
+          lsn: 22
           type: REPLACE
         BODY:
           tuple: [456, 'ЮЮЮ']
       - HEADER:
-          lsn: 14
+          lsn: 23
           type: REPLACE
         BODY:
           tuple: [789, 'ююю']
diff --git a/test/vinyl/layout.test.lua b/test/vinyl/layout.test.lua
index 60f22c76..f6591027 100644
--- a/test/vinyl/layout.test.lua
+++ b/test/vinyl/layout.test.lua
@@ -17,12 +17,11 @@ box.snapshot()
 
 space.index.sk:alter{parts = {{2, 'unsigned', is_nullable = true}}}
 
--- Note, the first run is bigger than the second one to prevent
--- last-level compaction (gh-3657).
 space:replace{'ЭЭЭ', box.NULL}
 space:replace{'эээ', box.NULL}
 space:replace{'ёёё', box.NULL}
-space:replace{'ЭЮЯ', 666}
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do space:replace{tostring(i), i} end
 box.snapshot()
 
 space:replace{'ёёё', 123}
diff --git a/test/vinyl/replica_rejoin.result b/test/vinyl/replica_rejoin.result
index d153e346..1dfcb91b 100644
--- a/test/vinyl/replica_rejoin.result
+++ b/test/vinyl/replica_rejoin.result
@@ -17,7 +17,7 @@ _ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
 _ = box.space.test:create_index('pk')
 ---
 ...
-pad = string.rep('x', 15 * 1024)
+pad = string.rep('x', 12 * 1024)
 ---
 ...
 for i = 1, 100 do box.space.test:replace{i, pad} end
diff --git a/test/vinyl/replica_rejoin.test.lua b/test/vinyl/replica_rejoin.test.lua
index 8226fb94..2c5a69e0 100644
--- a/test/vinyl/replica_rejoin.test.lua
+++ b/test/vinyl/replica_rejoin.test.lua
@@ -8,7 +8,7 @@ test_run = env.new()
 box.schema.user.grant('guest', 'replication')
 _ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
 _ = box.space.test:create_index('pk')
-pad = string.rep('x', 15 * 1024)
+pad = string.rep('x', 12 * 1024)
 for i = 1, 100 do box.space.test:replace{i, pad} end
 box.snapshot()
 
diff --git a/test/vinyl/update_optimize.result b/test/vinyl/update_optimize.result
index d8ff9fc4..09370e7d 100644
--- a/test/vinyl/update_optimize.result
+++ b/test/vinyl/update_optimize.result
@@ -28,10 +28,10 @@ test_run:cmd("setopt delimiter ';'")
 - true
 ...
 function wait_for_dump(index, old_count)
-	while index:stat().run_count == old_count do
+	while index:stat().disk.dump.count == old_count do
 		fiber.sleep(0)
 	end
-	return index:stat().run_count
+	return index:stat().disk.dump.count
 end;
 ---
 ...
@@ -39,10 +39,7 @@ test_run:cmd("setopt delimiter ''");
 ---
 - true
 ...
-index_run_count = index:stat().run_count
----
-...
-index2_run_count = index2:stat().run_count
+dump_count = index:stat().disk.dump.count
 ---
 ...
 old_stmt_count = dumped_stmt_count()
@@ -69,7 +66,7 @@ box.snapshot()
 - ok
 ...
 -- Wait for dump both indexes.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -93,7 +90,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 space:update({1}, {{'!', 4, 20}}) -- move range containing index field
@@ -104,7 +101,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 space:update({1}, {{'#', 3, 1}}) -- same
@@ -115,7 +112,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -151,7 +148,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 -- Move range that doesn't contain indexed fields.
@@ -163,7 +160,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 space:update({2}, {{'#', 6, 1}}) -- same
@@ -174,7 +171,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -224,13 +221,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = index:stat().run_count
----
-...
-index2_run_count = index2:stat().run_count
----
-...
-index3_run_count = index3:stat().run_count
+dump_count = index:stat().disk.dump.count
 ---
 ...
 old_stmt_count = dumped_stmt_count()
@@ -256,7 +247,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -278,7 +269,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 index:update({2}, {{'!', 3, 20}}) -- move range containing all indexes
@@ -289,7 +280,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 index:update({2}, {{'=', 7, 100}, {'+', 5, 10}, {'#', 3, 1}}) -- change two cols but then move range with all indexed fields
@@ -300,7 +291,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -343,7 +334,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -365,7 +356,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -387,7 +378,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -439,7 +430,7 @@ box.snapshot()
 - ok
 ...
 -- Make update of not indexed field with pos > 64.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 old_stmt_count = dumped_stmt_count()
@@ -453,7 +444,7 @@ box.snapshot()
 - ok
 ...
 -- Check the only primary index to be changed.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -481,7 +472,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -524,7 +515,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 old_stmt_count = dumped_stmt_count()
@@ -538,7 +529,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 new_stmt_count = dumped_stmt_count()
@@ -604,7 +595,7 @@ box.snapshot()
 ---
 - ok
 ...
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 ---
 ...
 old_stmt_count = dumped_stmt_count()
@@ -728,9 +719,9 @@ s:insert{1, 10}
 ---
 - [1, 10]
 ...
-s:insert{10, 100} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i, i} end
 ---
-- [10, 100]
 ...
 box.snapshot()
 ---
@@ -747,11 +738,11 @@ box.snapshot()
 ---
 - ok
 ...
--- Should be 3: INSERT{10, 1} and INSERT{100, 10} in the first run
+-- Should be 12: INSERT{10, 1} and INSERT[1001..1010] in the first run
 -- plus DELETE{10, 1} in the second run.
 s.index.sk:stat().rows
 ---
-- 3
+- 12
 ...
 s:insert{1, 20}
 ---
@@ -760,7 +751,16 @@ s:insert{1, 20}
 s.index.sk:select()
 ---
 - - [1, 20]
-  - [10, 100]
+  - [1001, 1001]
+  - [1002, 1002]
+  - [1003, 1003]
+  - [1004, 1004]
+  - [1005, 1005]
+  - [1006, 1006]
+  - [1007, 1007]
+  - [1008, 1008]
+  - [1009, 1009]
+  - [1010, 1010]
 ...
 s:drop()
 ---
diff --git a/test/vinyl/update_optimize.test.lua b/test/vinyl/update_optimize.test.lua
index 41ff964b..a0de6e4c 100644
--- a/test/vinyl/update_optimize.test.lua
+++ b/test/vinyl/update_optimize.test.lua
@@ -12,15 +12,14 @@ function dumped_stmt_count() return index:stat().disk.dump.output.rows + index2:
 box.snapshot()
 test_run:cmd("setopt delimiter ';'")
 function wait_for_dump(index, old_count)
-	while index:stat().run_count == old_count do
+	while index:stat().disk.dump.count == old_count do
 		fiber.sleep(0)
 	end
-	return index:stat().run_count
+	return index:stat().disk.dump.count
 end;
 test_run:cmd("setopt delimiter ''");
 
-index_run_count = index:stat().run_count
-index2_run_count = index2:stat().run_count
+dump_count = index:stat().disk.dump.count
 old_stmt_count = dumped_stmt_count()
 space:insert({1, 2, 3, 4, 5})
 space:insert({2, 3, 4, 5, 6})
@@ -28,7 +27,7 @@ space:insert({3, 4, 5, 6, 7})
 space:insert({4, 5, 6, 7, 8})
 box.snapshot()
 -- Wait for dump both indexes.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 8
 old_stmt_count = new_stmt_count
@@ -39,13 +38,13 @@ space:update({1}, {{'=', 5, 10}}) -- change secondary index field
 -- statements in vy_write_iterator during dump.
 
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 space:update({1}, {{'!', 4, 20}}) -- move range containing index field
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 space:update({1}, {{'#', 3, 1}}) -- same
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 9
 old_stmt_count = new_stmt_count
@@ -55,14 +54,14 @@ index2:select{}
 -- optimized updates
 space:update({2}, {{'=', 6, 10}}) -- change not indexed field
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 -- Move range that doesn't contain indexed fields.
 space:update({2}, {{'!', 7, 20}})
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 space:update({2}, {{'#', 6, 1}}) -- same
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 3
 old_stmt_count = new_stmt_count
@@ -78,16 +77,14 @@ index2 = space:create_index('secondary', { parts = {4, 'unsigned', 3, 'unsigned'
 index3 = space:create_index('third', { parts = {5, 'unsigned'}, run_count_per_level = 20 })
 function dumped_stmt_count() return index:stat().disk.dump.output.rows + index2:stat().disk.dump.output.rows + index3:stat().disk.dump.output.rows end
 box.snapshot()
-index_run_count = index:stat().run_count
-index2_run_count = index2:stat().run_count
-index3_run_count = index3:stat().run_count
+dump_count = index:stat().disk.dump.count
 old_stmt_count = dumped_stmt_count()
 space:insert({1, 2, 3, 4, 5})
 space:insert({2, 3, 4, 5, 6})
 space:insert({3, 4, 5, 6, 7})
 space:insert({4, 5, 6, 7, 8})
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 12
 old_stmt_count = new_stmt_count
@@ -95,13 +92,13 @@ old_stmt_count = new_stmt_count
 -- not optimizes updates
 index:update({2}, {{'+', 1, 10}, {'+', 3, 10}, {'+', 4, 10}, {'+', 5, 10}}) -- change all fields
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 index:update({2}, {{'!', 3, 20}}) -- move range containing all indexes
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 index:update({2}, {{'=', 7, 100}, {'+', 5, 10}, {'#', 3, 1}}) -- change two cols but then move range with all indexed fields
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 15
 old_stmt_count = new_stmt_count
@@ -112,21 +109,21 @@ index3:select{}
 -- optimize one 'secondary' index update
 index:update({3}, {{'+', 1, 10}, {'-', 5, 2}, {'!', 6, 100}}) -- change only index 'third'
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 3
 old_stmt_count = new_stmt_count
 -- optimize one 'third' index update
 index:update({3}, {{'=', 1, 20}, {'+', 3, 5}, {'=', 4, 30}, {'!', 6, 110}}) -- change only index 'secondary'
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 3
 old_stmt_count = new_stmt_count
 -- optimize both indexes
 index:update({3}, {{'+', 1, 10}, {'#', 6, 1}}) -- don't change any indexed fields
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 1
 old_stmt_count = new_stmt_count
@@ -144,13 +141,13 @@ _ = space:replace(long_tuple)
 box.snapshot()
 
 -- Make update of not indexed field with pos > 64.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 old_stmt_count = dumped_stmt_count()
 _ = index:update({2}, {{'=', 65, 1000}})
 box.snapshot()
 
 -- Check the only primary index to be changed.
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 1
 old_stmt_count = new_stmt_count
@@ -161,7 +158,7 @@ space:get{2}[65]
 --
 index:update({2}, {{'#', -65, 65}})
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 new_stmt_count - old_stmt_count == 1
 old_stmt_count = new_stmt_count
@@ -172,12 +169,12 @@ index3:select{}
 -- Optimize index2 with negative update op.
 space:replace{10, 20, 30, 40, 50}
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 old_stmt_count = dumped_stmt_count()
 
 index:update({20}, {{'=', -1, 500}})
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 new_stmt_count = dumped_stmt_count()
 -- 3 = REPLACE in index1 and DELETE + REPLACE in index3.
 new_stmt_count - old_stmt_count == 3
@@ -195,7 +192,7 @@ space:replace{20, 200, 2000, 20000, 200000, 2000000}
 index:update({200}, {{'=', 6, 2}})
 box.commit()
 box.snapshot()
-index_run_count = wait_for_dump(index, index_run_count)
+dump_count = wait_for_dump(index, dump_count)
 old_stmt_count = dumped_stmt_count()
 index:select{}
 index2:select{}
@@ -244,14 +241,15 @@ _ = s:create_index('pk')
 _ = s:create_index('sk', {parts = {2, 'unsigned'}, run_count_per_level = 10})
 
 s:insert{1, 10}
-s:insert{10, 100} -- to prevent last-level compaction (gh-3657)
+-- Some padding to prevent last-level compaction (gh-3657).
+for i = 1001, 1010 do s:replace{i, i} end
 box.snapshot()
 
 s:update(1, {{'=', 2, 10}})
 s:delete(1)
 box.snapshot()
 
--- Should be 3: INSERT{10, 1} and INSERT{100, 10} in the first run
+-- Should be 12: INSERT{10, 1} and INSERT[1001..1010] in the first run
 -- plus DELETE{10, 1} in the second run.
 s.index.sk:stat().rows
 
-- 
2.11.0

* [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 1/8] vinyl: use uncompressed run size for range split/coalesce/compaction Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 2/8] vinyl: fix compaction priority calculation Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-08 17:19   ` [tarantool-patches] " Konstantin Osipov
  2019-01-24 17:12 ` [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

The name 'range_heap' is ambiguous, because it's unclear what range is
supposed to be on top of the heap. We are going to introduce another
heap of ranges, ordered differently, so let's rename this one to
max_compaction_priority to avoid confusion.
---
 src/box/vy_lsm.c       | 24 ++++++++++++++----------
 src/box/vy_lsm.h       |  2 +-
 src/box/vy_range.c     |  2 +-
 src/box/vy_range.h     | 18 ++++++++++--------
 src/box/vy_scheduler.c | 24 +++++++++++++-----------
 5 files changed, 39 insertions(+), 31 deletions(-)

diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 5eb3fd77..851785ee 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -180,7 +180,7 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
 	vy_cache_create(&lsm->cache, cache_env, cmp_def, index_def->iid == 0);
 	rlist_create(&lsm->sealed);
 	vy_range_tree_new(lsm->tree);
-	vy_range_heap_create(&lsm->range_heap);
+	vy_max_compaction_priority_create(&lsm->max_compaction_priority);
 	rlist_create(&lsm->runs);
 	lsm->pk = pk;
 	if (pk != NULL)
@@ -257,7 +257,7 @@ vy_lsm_delete(struct vy_lsm *lsm)
 		vy_lsm_remove_run(lsm, run);
 
 	vy_range_tree_iter(lsm->tree, NULL, vy_range_tree_free_cb, NULL);
-	vy_range_heap_destroy(&lsm->range_heap);
+	vy_max_compaction_priority_destroy(&lsm->max_compaction_priority);
 	tuple_format_unref(lsm->disk_format);
 	key_def_delete(lsm->cmp_def);
 	key_def_delete(lsm->key_def);
@@ -665,10 +665,12 @@ vy_lsm_generation(struct vy_lsm *lsm)
 int
 vy_lsm_compaction_priority(struct vy_lsm *lsm)
 {
-	struct heap_node *n = vy_range_heap_top(&lsm->range_heap);
-	if (n == NULL)
+	struct heap_node *node;
+	node = vy_max_compaction_priority_top(&lsm->max_compaction_priority);
+	if (node == NULL)
 		return 0;
-	struct vy_range *range = container_of(n, struct vy_range, heap_node);
+	struct vy_range *range = container_of(node, struct vy_range,
+					      compaction_priority_node);
 	return range->compaction_priority;
 }
 
@@ -732,8 +734,9 @@ vy_lsm_remove_run(struct vy_lsm *lsm, struct vy_run *run)
 void
 vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
 {
-	assert(range->heap_node.pos == UINT32_MAX);
-	vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+	assert(range->compaction_priority_node.pos == UINT32_MAX);
+	vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+					  &range->compaction_priority_node);
 	vy_range_tree_insert(lsm->tree, range);
 	lsm->range_count++;
 }
@@ -741,8 +744,9 @@ vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
 void
 vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range)
 {
-	assert(range->heap_node.pos != UINT32_MAX);
-	vy_range_heap_delete(&lsm->range_heap, &range->heap_node);
+	assert(range->compaction_priority_node.pos != UINT32_MAX);
+	vy_max_compaction_priority_delete(&lsm->max_compaction_priority,
+					  &range->compaction_priority_node);
 	vy_range_tree_remove(lsm->tree, range);
 	lsm->range_count--;
 }
@@ -1224,5 +1228,5 @@ vy_lsm_force_compaction(struct vy_lsm *lsm)
 		vy_lsm_acct_range(lsm, range);
 	}
 
-	vy_range_heap_update_all(&lsm->range_heap);
+	vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
 }
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 74033627..a1d872e9 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -252,7 +252,7 @@ struct vy_lsm {
 	/** Number of ranges in this LSM tree. */
 	int range_count;
 	/** Heap of ranges, prioritized by compaction_priority. */
-	heap_t range_heap;
+	heap_t max_compaction_priority;
 	/**
 	 * List of all runs created for this LSM tree,
 	 * linked by vy_run->in_lsm.
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 7211cfb2..2e351599 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -197,7 +197,7 @@ vy_range_new(int64_t id, struct tuple *begin, struct tuple *end,
 	}
 	range->cmp_def = cmp_def;
 	rlist_create(&range->slices);
-	range->heap_node.pos = UINT32_MAX;
+	range->compaction_priority_node.pos = UINT32_MAX;
 	return range;
 }
 
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 05195d08..7c0a16e2 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -107,6 +107,8 @@ struct vy_range {
 	 * how we  decide how many runs to compact next time.
 	 */
 	int compaction_priority;
+	/** Link in vy_lsm->max_compaction_priority. */
+	struct heap_node compaction_priority_node;
 	/** Number of statements that need to be compacted. */
 	struct vy_disk_stmt_counter compaction_queue;
 	/**
@@ -121,8 +123,6 @@ struct vy_range {
 	int n_compactions;
 	/** Link in vy_lsm->tree. */
 	rb_node(struct vy_range) tree_node;
-	/** Link in vy_lsm->range_heap. */
-	struct heap_node heap_node;
 	/**
 	 * Incremented whenever a run is added to or deleted
 	 * from this range. Used invalidate read iterators.
@@ -134,15 +134,17 @@ struct vy_range {
  * Heap of all ranges of the same LSM tree, prioritized by
  * vy_range->compaction_priority.
  */
-#define HEAP_NAME vy_range_heap
+#define HEAP_NAME vy_max_compaction_priority
 static inline bool
-vy_range_heap_less(struct heap_node *a, struct heap_node *b)
+vy_max_compaction_priority_less(struct heap_node *a, struct heap_node *b)
 {
-	struct vy_range *r1 = container_of(a, struct vy_range, heap_node);
-	struct vy_range *r2 = container_of(b, struct vy_range, heap_node);
+	struct vy_range *r1 = container_of(a, struct vy_range,
+					   compaction_priority_node);
+	struct vy_range *r2 = container_of(b, struct vy_range,
+					   compaction_priority_node);
 	return r1->compaction_priority > r2->compaction_priority;
 }
-#define HEAP_LESS(h, l, r) vy_range_heap_less(l, r)
+#define HEAP_LESS(h, l, r) vy_max_compaction_priority_less(l, r)
 #include "salad/heap.h"
 #undef HEAP_LESS
 #undef HEAP_NAME
@@ -151,7 +153,7 @@ vy_range_heap_less(struct heap_node *a, struct heap_node *b)
 static inline bool
 vy_range_is_scheduled(struct vy_range *range)
 {
-	return range->heap_node.pos == UINT32_MAX;
+	return range->compaction_priority_node.pos == UINT32_MAX;
 }
 
 /**
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 5ec6d171..16ecafed 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1228,7 +1228,7 @@ vy_task_dump_complete(struct vy_task *task)
 		vy_range_update_compaction_priority(range, &lsm->opts);
 		vy_lsm_acct_range(lsm, range);
 	}
-	vy_range_heap_update_all(&lsm->range_heap);
+	vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
 	free(new_slices);
 
 delete_mems:
@@ -1610,8 +1610,9 @@ vy_task_compaction_complete(struct vy_task *task)
 	/* The iterator has been cleaned up in worker. */
 	task->wi->iface->close(task->wi);
 
-	assert(range->heap_node.pos == UINT32_MAX);
-	vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+	assert(range->compaction_priority_node.pos == UINT32_MAX);
+	vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+					  &range->compaction_priority_node);
 	vy_scheduler_update_lsm(scheduler, lsm);
 
 	say_info("%s: completed compacting range %s",
@@ -1642,8 +1643,9 @@ vy_task_compaction_abort(struct vy_task *task)
 
 	vy_run_discard(task->new_run);
 
-	assert(range->heap_node.pos == UINT32_MAX);
-	vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+	assert(range->compaction_priority_node.pos == UINT32_MAX);
+	vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+					  &range->compaction_priority_node);
 	vy_scheduler_update_lsm(scheduler, lsm);
 }
 
@@ -1657,14 +1659,14 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 		.abort = vy_task_compaction_abort,
 	};
 
-	struct heap_node *range_node;
+	struct heap_node *node;
 	struct vy_range *range;
 
 	assert(!lsm->is_dropped);
 
-	range_node = vy_range_heap_top(&lsm->range_heap);
-	assert(range_node != NULL);
-	range = container_of(range_node, struct vy_range, heap_node);
+	node = vy_max_compaction_priority_top(&lsm->max_compaction_priority);
+	assert(node != NULL);
+	range = container_of(node, struct vy_range, compaction_priority_node);
 	assert(range->compaction_priority > 1);
 
 	if (vy_lsm_split_range(lsm, range) ||
@@ -1722,8 +1724,8 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 	 * Remove the range we are going to compact from the heap
 	 * so that it doesn't get selected again.
 	 */
-	vy_range_heap_delete(&lsm->range_heap, range_node);
-	range_node->pos = UINT32_MAX;
+	vy_max_compaction_priority_delete(&lsm->max_compaction_priority, node);
+	node->pos = UINT32_MAX;
 	vy_scheduler_update_lsm(scheduler, lsm);
 
 	say_info("%s: started compacting range %s, runs %d/%d",
-- 
2.11.0

* [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
                   ` (2 preceding siblings ...)
  2019-01-24 17:12 ` [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-08 17:42   ` Vladimir Davydov
  2019-02-11 18:17   ` Konstantin Osipov
  2019-01-24 17:12 ` [PATCH v2 5/8] vinyl: set range size automatically Vladimir Davydov
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

This patch adds a dumps_per_compaction metric to per-index statistics.
It shows the number of dumps it takes to trigger a major compaction of
a range in a given LSM tree. We need it to automatically choose the
optimal number of ranges that would smooth out the load generated by
range compaction.

To calculate this metric, we assign a dump_count to each run. It shows
how many dumps it took to create the run. If a run was created by a
memory dump, it is set to 1. If a run was created by a minor
compaction, it is set to the sum of the dump counts of the compacted
runs. If a run was created by a major compaction, it is set to the sum
of the dump counts of the compacted runs minus the dump count of the
last-level run. The dump_count is stored in vylog.
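
In code, this bookkeeping could look roughly as follows (a simplified
sketch, not the exact helper from the patch; it assumes struct vy_run
carries the new dump_count field):

    /*
     * Dump count of a new run produced by compacting the first
     * n slices of a range. A run produced by a memory dump simply
     * gets dump_count = 1.
     */
    static uint32_t
    vy_new_run_dump_count(struct vy_range *range, int n, bool is_major)
    {
            uint32_t dump_count = 0;
            struct vy_slice *slice;
            rlist_foreach_entry(slice, &range->slices, in_range) {
                    if (n-- == 0)
                            break;
                    dump_count += slice->run->dump_count;
            }
            if (is_major) {
                    /*
                     * Don't count the dumps already accounted for
                     * by the old last-level run we are rewriting.
                     */
                    slice = rlist_last_entry(&range->slices,
                                             struct vy_slice, in_range);
                    dump_count -= slice->run->dump_count;
            }
            return dump_count;
    }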

This allows us to estimate the number of dumps it takes to trigger
compaction in a range as the dump_count of the last-level run stored
in the range. Finally, we report the dumps_per_compaction of an LSM
tree as the minimal dumps_per_compaction among all ranges constituting
the tree. To achieve that, we maintain a heap of ranges ordered by
dumps_per_compaction for each LSM tree.
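
Reporting then mirrors vy_lsm_compaction_priority() from the previous
patch. A sketch, assuming a min-ordered heap with illustrative names
(only vy_lsm_dumps_per_compaction() itself appears in the diff below):

    int
    vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
    {
            struct heap_node *node;
            node = vy_min_dumps_per_compaction_top(
                    &lsm->min_dumps_per_compaction);
            if (node == NULL)
                    return 0;
            struct vy_range *range = container_of(node, struct vy_range,
                                                  dumps_per_compaction_node);
            return range->dumps_per_compaction;
    }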

Needed for #3944
---
 src/box/vinyl.c          |   2 +
 src/box/vy_log.c         |  26 ++++++++-
 src/box/vy_log.h         |  10 +++-
 src/box/vy_lsm.c         |  23 ++++++++
 src/box/vy_lsm.h         |   6 ++
 src/box/vy_range.c       |  13 +++++
 src/box/vy_range.h       |  32 +++++++++++
 src/box/vy_run.h         |  15 +++++
 src/box/vy_scheduler.c   |  24 +++++++-
 test/vinyl/layout.result |   8 +--
 test/vinyl/stat.result   | 147 +++++++++++++++++++++++++++++++++++++++++++++--
 test/vinyl/stat.test.lua |  63 ++++++++++++++++++++
 12 files changed, 353 insertions(+), 16 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d6117f44..dc4fc830 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -463,6 +463,8 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
 	info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
 	histogram_snprint(buf, sizeof(buf), lsm->run_hist);
 	info_append_str(h, "run_histogram", buf);
+	info_append_int(h, "dumps_per_compaction",
+			vy_lsm_dumps_per_compaction(lsm));
 
 	info_end(h);
 }
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index d3fa0c7a..d7cf4996 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -84,6 +84,7 @@ enum vy_log_key {
 	VY_LOG_KEY_MODIFY_LSN		= 13,
 	VY_LOG_KEY_DROP_LSN		= 14,
 	VY_LOG_KEY_GROUP_ID		= 15,
+	VY_LOG_KEY_DUMP_COUNT		= 16,
 };
 
 /** vy_log_key -> human readable name. */
@@ -104,6 +105,7 @@ static const char *vy_log_key_name[] = {
 	[VY_LOG_KEY_MODIFY_LSN]		= "modify_lsn",
 	[VY_LOG_KEY_DROP_LSN]		= "drop_lsn",
 	[VY_LOG_KEY_GROUP_ID]		= "group_id",
+	[VY_LOG_KEY_DUMP_COUNT]		= "dump_count",
 };
 
 /** vy_log_type -> human readable name. */
@@ -285,6 +287,10 @@ vy_log_record_snprint(char *buf, int size, const struct vy_log_record *record)
 		SNPRINT(total, snprintf, buf, size, "%s=%"PRIi64", ",
 			vy_log_key_name[VY_LOG_KEY_GC_LSN],
 			record->gc_lsn);
+	if (record->dump_count > 0)
+		SNPRINT(total, snprintf, buf, size, "%s=%"PRIu32", ",
+			vy_log_key_name[VY_LOG_KEY_DUMP_COUNT],
+			record->dump_count);
 	SNPRINT(total, snprintf, buf, size, "}");
 	return total;
 }
@@ -411,6 +417,11 @@ vy_log_record_encode(const struct vy_log_record *record,
 		size += mp_sizeof_uint(record->gc_lsn);
 		n_keys++;
 	}
+	if (record->dump_count > 0) {
+		size += mp_sizeof_uint(VY_LOG_KEY_DUMP_COUNT);
+		size += mp_sizeof_uint(record->dump_count);
+		n_keys++;
+	}
 	size += mp_sizeof_map(n_keys);
 
 	/*
@@ -493,6 +504,10 @@ vy_log_record_encode(const struct vy_log_record *record,
 		pos = mp_encode_uint(pos, VY_LOG_KEY_GC_LSN);
 		pos = mp_encode_uint(pos, record->gc_lsn);
 	}
+	if (record->dump_count > 0) {
+		pos = mp_encode_uint(pos, VY_LOG_KEY_DUMP_COUNT);
+		pos = mp_encode_uint(pos, record->dump_count);
+	}
 	assert(pos == tuple + size);
 
 	/*
@@ -620,6 +635,9 @@ vy_log_record_decode(struct vy_log_record *record,
 		case VY_LOG_KEY_GC_LSN:
 			record->gc_lsn = mp_decode_uint(&pos);
 			break;
+		case VY_LOG_KEY_DUMP_COUNT:
+			record->dump_count = mp_decode_uint(&pos);
+			break;
 		default:
 			mp_next(&pos); /* unknown key, ignore */
 			break;
@@ -1558,6 +1576,7 @@ vy_recovery_do_create_run(struct vy_recovery *recovery, int64_t run_id)
 	run->id = run_id;
 	run->dump_lsn = -1;
 	run->gc_lsn = -1;
+	run->dump_count = 0;
 	run->is_incomplete = false;
 	run->is_dropped = false;
 	run->data = NULL;
@@ -1612,7 +1631,7 @@ vy_recovery_prepare_run(struct vy_recovery *recovery, int64_t lsm_id,
  */
 static int
 vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
-		       int64_t run_id, int64_t dump_lsn)
+		       int64_t run_id, int64_t dump_lsn, uint32_t dump_count)
 {
 	struct vy_lsm_recovery_info *lsm;
 	lsm = vy_recovery_lookup_lsm(recovery, lsm_id);
@@ -1637,6 +1656,7 @@ vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
 			return -1;
 	}
 	run->dump_lsn = dump_lsn;
+	run->dump_count = dump_count;
 	run->is_incomplete = false;
 	rlist_move_entry(&lsm->runs, run, in_lsm);
 	return 0;
@@ -1998,7 +2018,8 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 		break;
 	case VY_LOG_CREATE_RUN:
 		rc = vy_recovery_create_run(recovery, record->lsm_id,
-					    record->run_id, record->dump_lsn);
+					    record->run_id, record->dump_lsn,
+					    record->dump_count);
 		break;
 	case VY_LOG_DROP_RUN:
 		rc = vy_recovery_drop_run(recovery, record->run_id,
@@ -2348,6 +2369,7 @@ vy_log_append_lsm(struct xlog *xlog, struct vy_lsm_recovery_info *lsm)
 		} else {
 			record.type = VY_LOG_CREATE_RUN;
 			record.dump_lsn = run->dump_lsn;
+			record.dump_count = run->dump_count;
 		}
 		record.lsm_id = lsm->id;
 		record.run_id = run->id;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 70e25245..ee38c193 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -96,7 +96,7 @@ enum vy_log_record_type {
 	VY_LOG_PREPARE_RUN		= 4,
 	/**
 	 * Commit a vinyl run file creation.
-	 * Requires vy_log_record::lsm_id, run_id, dump_lsn.
+	 * Requires vy_log_record::lsm_id, run_id, dump_lsn, dump_count.
 	 *
 	 * Written after a run file was successfully created.
 	 */
@@ -271,6 +271,8 @@ struct vy_log_record {
 	 * that uses this run.
 	 */
 	int64_t gc_lsn;
+	/** For runs: number of dumps it took to create the run. */
+	uint32_t dump_count;
 	/** Link in vy_log::tx. */
 	struct stailq_entry in_tx;
 };
@@ -389,6 +391,8 @@ struct vy_run_recovery_info {
 	 * that uses this run.
 	 */
 	int64_t gc_lsn;
+	/** Number of dumps it took to create the run. */
+	uint32_t dump_count;
 	/**
 	 * True if the run was not committed (there's
 	 * VY_LOG_PREPARE_RUN, but no VY_LOG_CREATE_RUN).
@@ -710,7 +714,8 @@ vy_log_prepare_run(int64_t lsm_id, int64_t run_id)
 
 /** Helper to log a vinyl run creation. */
 static inline void
-vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
+vy_log_create_run(int64_t lsm_id, int64_t run_id,
+		  int64_t dump_lsn, uint32_t dump_count)
 {
 	struct vy_log_record record;
 	vy_log_record_init(&record);
@@ -718,6 +723,7 @@ vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
 	record.lsm_id = lsm_id;
 	record.run_id = run_id;
 	record.dump_lsn = dump_lsn;
+	record.dump_count = dump_count;
 	vy_log_write(&record);
 }
 
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 851785ee..6ec86c22 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -181,6 +181,7 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
 	rlist_create(&lsm->sealed);
 	vy_range_tree_new(lsm->tree);
 	vy_max_compaction_priority_create(&lsm->max_compaction_priority);
+	vy_min_dumps_per_compaction_create(&lsm->min_dumps_per_compaction);
 	rlist_create(&lsm->runs);
 	lsm->pk = pk;
 	if (pk != NULL)
@@ -258,6 +259,7 @@ vy_lsm_delete(struct vy_lsm *lsm)
 
 	vy_range_tree_iter(lsm->tree, NULL, vy_range_tree_free_cb, NULL);
 	vy_max_compaction_priority_destroy(&lsm->max_compaction_priority);
+	vy_min_dumps_per_compaction_destroy(&lsm->min_dumps_per_compaction);
 	tuple_format_unref(lsm->disk_format);
 	key_def_delete(lsm->cmp_def);
 	key_def_delete(lsm->key_def);
@@ -351,6 +353,7 @@ vy_lsm_recover_run(struct vy_lsm *lsm, struct vy_run_recovery_info *run_info,
 		return NULL;
 
 	run->dump_lsn = run_info->dump_lsn;
+	run->dump_count = run_info->dump_count;
 	if (vy_run_recover(run, lsm->env->path,
 			   lsm->space_id, lsm->index_id) != 0 &&
 	    (!force_recovery ||
@@ -636,6 +639,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
 					    (long long)range->id));
 			return -1;
 		}
+		vy_range_update_dumps_per_compaction(range);
 		vy_lsm_acct_range(lsm, range);
 	}
 	if (prev == NULL) {
@@ -651,6 +655,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
 				    (long long)prev->id));
 		return -1;
 	}
+	vy_min_dumps_per_compaction_update_all(&lsm->min_dumps_per_compaction);
 	return 0;
 }
 
@@ -674,6 +679,18 @@ vy_lsm_compaction_priority(struct vy_lsm *lsm)
 	return range->compaction_priority;
 }
 
+int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
+{
+	struct heap_node *node;
+	node = vy_min_dumps_per_compaction_top(&lsm->min_dumps_per_compaction);
+	if (node == NULL)
+		return 0;
+	struct vy_range *range = container_of(node, struct vy_range,
+					      dumps_per_compaction_node);
+	return range->dumps_per_compaction;
+}
+
 void
 vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run)
 {
@@ -737,6 +754,8 @@ vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
 	assert(range->compaction_priority_node.pos == UINT32_MAX);
 	vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
 					  &range->compaction_priority_node);
+	vy_min_dumps_per_compaction_insert(&lsm->min_dumps_per_compaction,
+					   &range->dumps_per_compaction_node);
 	vy_range_tree_insert(lsm->tree, range);
 	lsm->range_count++;
 }
@@ -747,6 +766,8 @@ vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range)
 	assert(range->compaction_priority_node.pos != UINT32_MAX);
 	vy_max_compaction_priority_delete(&lsm->max_compaction_priority,
 					  &range->compaction_priority_node);
+	vy_min_dumps_per_compaction_delete(&lsm->min_dumps_per_compaction,
+					  &range->dumps_per_compaction_node);
 	vy_range_tree_remove(lsm->tree, range);
 	lsm->range_count--;
 }
@@ -1080,6 +1101,7 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
 		}
 		part->needs_compaction = range->needs_compaction;
 		vy_range_update_compaction_priority(part, &lsm->opts);
+		vy_range_update_dumps_per_compaction(part);
 	}
 
 	/*
@@ -1197,6 +1219,7 @@ vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
 	 * as it fits the configured LSM tree shape.
 	 */
 	vy_range_update_compaction_priority(result, &lsm->opts);
+	vy_range_update_dumps_per_compaction(result);
 	vy_lsm_acct_range(lsm, result);
 	vy_lsm_add_range(lsm, result);
 	lsm->range_tree_version++;
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index a1d872e9..4df9d19a 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -253,6 +253,8 @@ struct vy_lsm {
 	int range_count;
 	/** Heap of ranges, prioritized by compaction_priority. */
 	heap_t max_compaction_priority;
+	/** Heap of ranges, prioritized by dumps_per_compaction. */
+	heap_t min_dumps_per_compaction;
 	/**
 	 * List of all runs created for this LSM tree,
 	 * linked by vy_run->in_lsm.
@@ -438,6 +440,10 @@ vy_lsm_generation(struct vy_lsm *lsm);
 int
 vy_lsm_compaction_priority(struct vy_lsm *lsm);
 
+/** Return min dumps_per_compaction among ranges of an LSM tree. */
+int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm);
+
 /** Add a run to the list of runs of an LSM tree. */
 void
 vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run);
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 2e351599..4ba9ec5b 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -198,6 +198,7 @@ vy_range_new(int64_t id, struct tuple *begin, struct tuple *end,
 	range->cmp_def = cmp_def;
 	rlist_create(&range->slices);
 	range->compaction_priority_node.pos = UINT32_MAX;
+	range->dumps_per_compaction_node.pos = UINT32_MAX;
 	return range;
 }
 
@@ -411,6 +412,18 @@ vy_range_update_compaction_priority(struct vy_range *range,
 	}
 }
 
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range)
+{
+	if (!rlist_empty(&range->slices)) {
+		struct vy_slice *slice = rlist_last_entry(&range->slices,
+						struct vy_slice, in_range);
+		range->dumps_per_compaction = slice->run->dump_count;
+	} else {
+		range->dumps_per_compaction = 0;
+	}
+}
+
 /**
  * Return true and set split_key accordingly if the range needs to be
  * split in two.
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 7c0a16e2..f19c2c6b 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -121,6 +121,13 @@ struct vy_range {
 	bool needs_compaction;
 	/** Number of times the range was compacted. */
 	int n_compactions;
+	/**
+	 * Number of dumps it takes to trigger major compaction in
+	 * this range, see vy_run::dump_count for more details.
+	 */
+	int dumps_per_compaction;
+	/** Link in vy_lsm->min_dumps_per_compaction. */
+	struct heap_node dumps_per_compaction_node;
 	/** Link in vy_lsm->tree. */
 	rb_node(struct vy_range) tree_node;
 	/**
@@ -149,6 +156,25 @@ vy_max_compaction_priority_less(struct heap_node *a, struct heap_node *b)
 #undef HEAP_LESS
 #undef HEAP_NAME
 
+/**
+ * Heap of all ranges of the same LSM tree, prioritized by
+ * vy_range->dumps_per_compaction.
+ */
+#define HEAP_NAME vy_min_dumps_per_compaction
+static inline bool
+vy_min_dumps_per_compaction_less(struct heap_node *a, struct heap_node *b)
+{
+	struct vy_range *r1 = container_of(a, struct vy_range,
+					   dumps_per_compaction_node);
+	struct vy_range *r2 = container_of(b, struct vy_range,
+					   dumps_per_compaction_node);
+	return r1->dumps_per_compaction < r2->dumps_per_compaction;
+}
+#define HEAP_LESS(h, l, r) vy_min_dumps_per_compaction_less(l, r)
+#include "salad/heap.h"
+#undef HEAP_LESS
+#undef HEAP_NAME
+
 /** Return true if a task is scheduled for a given range. */
 static inline bool
 vy_range_is_scheduled(struct vy_range *range)
@@ -245,6 +271,12 @@ vy_range_update_compaction_priority(struct vy_range *range,
 				    const struct index_opts *opts);
 
 /**
+ * Update the value of range->dumps_per_compaction.
+ */
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range);
+
+/**
  * Check if a range needs to be split in two.
  *
  * @param range             The range.
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 990daffa..28fd6a50 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -130,6 +130,21 @@ struct vy_run {
 	/** Max LSN stored on disk. */
 	int64_t dump_lsn;
 	/**
+	 * Number of dumps it took to create this run.
+	 *
+	 * If the run was produced by a memory dump, it is 1.
+	 * If the run was produced by a minor compaction, it
+	 * is the sum of dump counts of compacted runs.
+	 * If the run was produced by a major compaction, it
+	 * is the sum of dump counts of compacted runs
+	 * minus the dump count of the last (greatest) run.
+	 *
+	 * This way, by looking at the last level run in an LSM
+	 * tree, we can tell how many dumps it took to compact
+	 * it last time.
+	 */
+	uint32_t dump_count;
+	/**
 	 * Run reference counter, the run is deleted once it hits 0.
 	 * A new run is created with the reference counter set to 1.
 	 * A run is referenced by each slice created for it and each
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 16ecafed..70f538ef 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1193,7 +1193,7 @@ vy_task_dump_complete(struct vy_task *task)
 	 * Log change in metadata.
 	 */
 	vy_log_tx_begin();
-	vy_log_create_run(lsm->id, new_run->id, dump_lsn);
+	vy_log_create_run(lsm->id, new_run->id, dump_lsn, new_run->dump_count);
 	for (range = begin_range, i = 0; range != end_range;
 	     range = vy_range_tree_next(lsm->tree, range), i++) {
 		assert(i < lsm->range_count);
@@ -1226,9 +1226,11 @@ vy_task_dump_complete(struct vy_task *task)
 		vy_lsm_unacct_range(lsm, range);
 		vy_range_add_slice(range, slice);
 		vy_range_update_compaction_priority(range, &lsm->opts);
+		vy_range_update_dumps_per_compaction(range);
 		vy_lsm_acct_range(lsm, range);
 	}
 	vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
+	vy_min_dumps_per_compaction_update_all(&lsm->min_dumps_per_compaction);
 	free(new_slices);
 
 delete_mems:
@@ -1396,6 +1398,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 	if (new_run == NULL)
 		goto err_run;
 
+	new_run->dump_count = 1;
 	new_run->dump_lsn = dump_lsn;
 
 	/*
@@ -1528,7 +1531,8 @@ vy_task_compaction_complete(struct vy_task *task)
 	rlist_foreach_entry(run, &unused_runs, in_unused)
 		vy_log_drop_run(run->id, gc_lsn);
 	if (new_slice != NULL) {
-		vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn);
+		vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn,
+				  new_run->dump_count);
 		vy_log_insert_slice(range->id, new_run->id, new_slice->id,
 				    tuple_data_or_null(new_slice->begin),
 				    tuple_data_or_null(new_slice->end));
@@ -1589,6 +1593,7 @@ vy_task_compaction_complete(struct vy_task *task)
 	}
 	range->n_compactions++;
 	vy_range_update_compaction_priority(range, &lsm->opts);
+	vy_range_update_dumps_per_compaction(range);
 	vy_lsm_acct_range(lsm, range);
 	vy_lsm_acct_compaction(lsm, compaction_time,
 			       &compaction_input, &compaction_output);
@@ -1613,6 +1618,8 @@ vy_task_compaction_complete(struct vy_task *task)
 	assert(range->compaction_priority_node.pos == UINT32_MAX);
 	vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
 					  &range->compaction_priority_node);
+	vy_min_dumps_per_compaction_update(&lsm->min_dumps_per_compaction,
+					   &range->dumps_per_compaction_node);
 	vy_scheduler_update_lsm(scheduler, lsm);
 
 	say_info("%s: completed compacting range %s",
@@ -1695,12 +1702,14 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 		goto err_wi;
 
 	struct vy_slice *slice;
+	int32_t dump_count = 0;
 	int n = range->compaction_priority;
 	rlist_foreach_entry(slice, &range->slices, in_range) {
 		if (vy_write_iterator_new_slice(wi, slice) != 0)
 			goto err_wi_sub;
 		new_run->dump_lsn = MAX(new_run->dump_lsn,
 					slice->run->dump_lsn);
+		dump_count += slice->run->dump_count;
 		/* Remember the slices we are compacting. */
 		if (task->first_slice == NULL)
 			task->first_slice = slice;
@@ -1711,6 +1720,17 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 	}
 	assert(n == 0);
 	assert(new_run->dump_lsn >= 0);
+	if (range->compaction_priority == range->slice_count)
+		dump_count -= slice->run->dump_count;
+	/*
+	 * Do not update dumps_per_compaction in case compaction
+	 * was triggered manually to avoid unexpected side effects,
+	 * such as splitting/coalescing ranges for no good reason.
+	 */
+	if (range->needs_compaction)
+		new_run->dump_count = slice->run->dump_count;
+	else
+		new_run->dump_count = dump_count;
 
 	range->needs_compaction = false;
 
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 3be2bb91..6d58f747 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -139,7 +139,7 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {2: 8, 9: 20}]
+          tuple: [5, {2: 8, 16: 1, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -164,7 +164,7 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 6, 9: 20}]
+          tuple: [5, {0: 2, 2: 6, 16: 1, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -204,7 +204,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 10, 9: 23}]
+          tuple: [5, {0: 2, 2: 10, 16: 1, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -224,7 +224,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {2: 12, 9: 23}]
+          tuple: [5, {2: 12, 16: 1, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index 419d3e6c..e79c32f0 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,7 +129,8 @@ test_run:cmd("setopt delimiter ''");
 -- initially stats are empty
 istat()
 ---
-- rows: 0
+- dumps_per_compaction: 0
+  rows: 0
   run_avg: 0
   bytes: 0
   upsert:
@@ -294,10 +295,7 @@ wait(istat, st, 'disk.dump.count', 1)
 ...
 stat_diff(istat(), st)
 ---
-- rows: 25
-  run_avg: 1
-  run_count: 1
-  disk:
+- disk:
     last_level:
       bytes: 26049
       pages: 7
@@ -321,6 +319,10 @@ stat_diff(istat(), st)
     pages: 7
     bytes_compressed: <bytes_compressed>
     bloom_size: 70
+  rows: 25
+  run_avg: 1
+  run_count: 1
+  dumps_per_compaction: 1
   bytes: 26049
   put:
     rows: 25
@@ -998,7 +1000,8 @@ box.stat.reset()
 ...
 istat()
 ---
-- rows: 306
+- dumps_per_compaction: 1
+  rows: 306
   run_avg: 1
   bytes: 317731
   upsert:
@@ -1732,6 +1735,138 @@ box.stat.vinyl().disk.data_compacted
 ---
 - 0
 ...
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function dump(a, b)
+    for i = a, b do
+        s:replace{i, digest.urandom(100)}
+    end
+    box.snapshot()
+end;
+---
+...
+function wait_compaction(count)
+    test_run:wait_cond(function()
+        return i:stat().disk.compaction.count == count
+    end, 10)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+dump(1, 100)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 100) -- compaction
+---
+...
+dump(1, 100) -- split + compaction
+---
+...
+wait_compaction(3)
+---
+...
+i:stat().range_count -- 2
+---
+- 2
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 10)
+---
+...
+dump(1, 40) -- compaction in range 1
+---
+...
+wait_compaction(4)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(90, 100)
+---
+...
+dump(60, 100) -- compaction in range 2
+---
+...
+wait_compaction(5)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+---
+...
+i:compact()
+---
+...
+wait_compaction(7)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+test_run:cmd('restart server test')
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+s = box.space.test
+---
+...
+i = s.index.primary
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+---
+- true
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+s:drop()
+---
+...
 test_run:cmd('switch default')
 ---
 - true
diff --git a/test/vinyl/stat.test.lua b/test/vinyl/stat.test.lua
index 4a955682..4a360f33 100644
--- a/test/vinyl/stat.test.lua
+++ b/test/vinyl/stat.test.lua
@@ -528,6 +528,69 @@ s:drop()
 
 box.stat.vinyl().disk.data_compacted
 
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+
+test_run:cmd("setopt delimiter ';'")
+function dump(a, b)
+    for i = a, b do
+        s:replace{i, digest.urandom(100)}
+    end
+    box.snapshot()
+end;
+function wait_compaction(count)
+    test_run:wait_cond(function()
+        return i:stat().disk.compaction.count == count
+    end, 10)
+end;
+test_run:cmd("setopt delimiter ''");
+
+dump(1, 100)
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 100) -- compaction
+dump(1, 100) -- split + compaction
+wait_compaction(3)
+i:stat().range_count -- 2
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 10)
+dump(1, 40) -- compaction in range 1
+wait_compaction(4)
+i:stat().dumps_per_compaction -- 1
+
+dump(90, 100)
+dump(60, 100) -- compaction in range 2
+wait_compaction(5)
+i:stat().dumps_per_compaction -- 2
+
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+i:compact()
+wait_compaction(7)
+i:stat().dumps_per_compaction -- 2
+
+test_run:cmd('restart server test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+s = box.space.test
+i = s.index.primary
+
+i:stat().dumps_per_compaction -- 2
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+box.snapshot()
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+
+i:stat().dumps_per_compaction -- 1
+
+s:drop()
+
 test_run:cmd('switch default')
 test_run:cmd('stop server test')
 test_run:cmd('cleanup server test')
-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 5/8] vinyl: set range size automatically
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
                   ` (3 preceding siblings ...)
  2019-01-24 17:12 ` [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
  2019-01-24 17:12 ` [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

The key space of a vinyl index consists of multiple ranges that can be
compacted independently. This design was initially invented to enable
parallel compaction, so the range size is configured statically, by
the range_size index option, which defaults to 1 GB. However, it turns
out that ranges are also useful for smoothing IO load: if we compact
approximately the same number of ranges after each dump, we avoid IO
bursts, which can distort the LSM tree shape and thus increase read
amplification.

To achieve that, we need to maintain at least as many ranges as the
number of dumps it takes to trigger major compaction of a range. With
the default range size, this condition holds only if the index is huge
(tens to hundreds of gigabytes). If the database isn't that big or
consists of many small indexes, the range count will never even
approach that number. So this patch makes the range size scale
dynamically to satisfy that condition.
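
A rough worked example, with hypothetical figures: if it takes 8 dumps
to trigger major compaction of a range and the last level stores 4 GB,
the target is 4 * 8 = 32 ranges of about 128 MB each. The multiplier
of 4 and the [16 MB, 2 GB] clamping window appear in the vy_lsm.c hunk
below.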

The range size configuration options, both global and per index,
aren't removed though. The patch just changes the default value of
box.cfg.vinyl_range_size to nil, which enables automatic range sizing
for all new indexes created without passing range_size explicitly.
Existing indexes will still use the range size stored in index options
(we don't want to alter the behavior of an existing production setup).
We are not planning to drop the range_size option altogether - it can
still be useful for testing and performance analysis.

The actual range size value is now reported in index.stat().
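For example, the new field can be read as i:stat().range_size; on a
freshly created index with automatic sizing it reports 16777216, i.e.
the 16 MB floor, as exercised by the updated vinyl/misc tests below.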

Needed for #3944
---
 src/box/alter.cc                |   8 +--
 src/box/box.cc                  |   6 +--
 src/box/index_def.c             |   2 +-
 src/box/lua/load_cfg.lua        |   2 +-
 src/box/lua/space.cc            |   6 ++-
 src/box/vinyl.c                 |   1 +
 src/box/vy_lsm.c                |  44 +++++++++++++++-
 src/box/vy_lsm.h                |   4 ++
 src/box/vy_range.c              |  10 ++--
 src/box/vy_range.h              |  10 ++--
 test/app-tap/init_script.result |  21 ++++----
 test/box-tap/cfg.test.lua       |   3 +-
 test/box/admin.result           |   2 -
 test/box/cfg.result             |   4 --
 test/vinyl/ddl.result           |   5 --
 test/vinyl/ddl.test.lua         |   1 -
 test/vinyl/misc.result          |  78 +++++++++++++++++++++++++++++
 test/vinyl/misc.test.lua        |  26 ++++++++++
 test/vinyl/stat.result          | 108 ++++++++++++++++++++--------------------
 19 files changed, 236 insertions(+), 105 deletions(-)

diff --git a/src/box/alter.cc b/src/box/alter.cc
index 0589c967..83953a88 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -187,12 +187,8 @@ index_opts_decode(struct index_opts *opts, const char *map,
 			  BOX_INDEX_FIELD_OPTS, "distance must be either "\
 			  "'euclid' or 'manhattan'");
 	}
-	if (opts->range_size <= 0) {
-		tnt_raise(ClientError, ER_WRONG_INDEX_OPTIONS,
-			  BOX_INDEX_FIELD_OPTS,
-			  "range_size must be greater than 0");
-	}
-	if (opts->page_size <= 0 || opts->page_size > opts->range_size) {
+	if (opts->page_size <= 0 || (opts->range_size > 0 &&
+				     opts->page_size > opts->range_size)) {
 		tnt_raise(ClientError, ER_WRONG_INDEX_OPTIONS,
 			  BOX_INDEX_FIELD_OPTS,
 			  "page_size must be greater than 0 and "
diff --git a/src/box/box.cc b/src/box/box.cc
index 9f2fd6da..b045e465 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -592,11 +592,7 @@ box_check_vinyl_options(void)
 		tnt_raise(ClientError, ER_CFG, "vinyl_write_threads",
 			  "must be greater than or equal to 2");
 	}
-	if (range_size <= 0) {
-		tnt_raise(ClientError, ER_CFG, "vinyl_range_size",
-			  "must be greater than 0");
-	}
-	if (page_size <= 0 || page_size > range_size) {
+	if (page_size <= 0 || (range_size > 0 && page_size > range_size)) {
 		tnt_raise(ClientError, ER_CFG, "vinyl_page_size",
 			  "must be greater than 0 and less than "
 			  "or equal to vinyl_range_size");
diff --git a/src/box/index_def.c b/src/box/index_def.c
index 2ba57ee9..c82bc01c 100644
--- a/src/box/index_def.c
+++ b/src/box/index_def.c
@@ -40,7 +40,7 @@ const struct index_opts index_opts_default = {
 	/* .unique              = */ true,
 	/* .dimension           = */ 2,
 	/* .distance            = */ RTREE_INDEX_DISTANCE_TYPE_EUCLID,
-	/* .range_size          = */ 1073741824,
+	/* .range_size          = */ 0,
 	/* .page_size           = */ 8192,
 	/* .run_count_per_level = */ 2,
 	/* .run_size_ratio      = */ 3.5,
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index 6dc4a2af..fc4e560d 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -41,7 +41,7 @@ local default_cfg = {
     vinyl_timeout       = 60,
     vinyl_run_count_per_level = 2,
     vinyl_run_size_ratio      = 3.5,
-    vinyl_range_size          = 1024 * 1024 * 1024,
+    vinyl_range_size          = nil, -- set automatically
     vinyl_page_size           = 8 * 1024,
     vinyl_bloom_fpr           = 0.05,
     log                 = nil,
diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index 7cae436f..abebaa87 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -334,8 +334,10 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i)
 			lua_pushstring(L, "options");
 			lua_newtable(L);
 
-			lua_pushnumber(L, index_opts->range_size);
-			lua_setfield(L, -2, "range_size");
+			if (index_opts->range_size > 0) {
+				lua_pushnumber(L, index_opts->range_size);
+				lua_setfield(L, -2, "range_size");
+			}
 
 			lua_pushnumber(L, index_opts->page_size);
 			lua_setfield(L, -2, "page_size");
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index dc4fc830..0936932b 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -458,6 +458,7 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
 	info_table_end(h); /* iterator */
 	info_table_end(h); /* txw */
 
+	info_append_int(h, "range_size", vy_lsm_range_size(lsm));
 	info_append_int(h, "range_count", lsm->range_count);
 	info_append_int(h, "run_count", lsm->run_count);
 	info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 6ec86c22..e61338f7 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -54,6 +54,20 @@
 #include "vy_history.h"
 #include "vy_read_set.h"
 
+/*
+ * It doesn't make much sense to create too small ranges as this
+ * would make the overhead associated with file creation prominent
+ * and increase the number of open files. So we never create ranges
+ * smaller than 16 MB.
+ */
+static const int64_t VY_MIN_RANGE_SIZE = 16 * 1024 * 1024;
+
+/**
+ * We want a single compaction job to finish in reasonable time
+ * so we limit the range size to 2 GB.
+ */
+static const int64_t VY_MAX_RANGE_SIZE = 2LL * 1024 * 1024 * 1024;
+
 int
 vy_lsm_env_create(struct vy_lsm_env *env, const char *path,
 		  int64_t *p_generation,
@@ -691,6 +705,31 @@ vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
 	return range->dumps_per_compaction;
 }
 
+int64_t
+vy_lsm_range_size(struct vy_lsm *lsm)
+{
+	/* Use the configured range size if available. */
+	if (lsm->opts.range_size > 0)
+		return lsm->opts.range_size;
+	/*
+	 * Ideally, we want to compact roughly the same amount of
+	 * data after each dump so as to avoid IO bursts caused by
+	 * simultaneous major compaction of a bunch of ranges,
+	 * because such IO bursts can lead to a deviation of the
+	 * LSM tree from the configured shape and, as a result,
+	 * increased read amplification. To achieve that, we need
+	 * to have at least as many ranges as the number of dumps
+	 * it takes to trigger major compaction in a range. We
+	 * create four times more than that for better smoothing.
+	 */
+	int range_count = 4 * vy_lsm_dumps_per_compaction(lsm);
+	int64_t range_size = range_count == 0 ? 0 :
+		lsm->stat.disk.last_level_count.bytes / range_count;
+	range_size = MAX(range_size, VY_MIN_RANGE_SIZE);
+	range_size = MIN(range_size, VY_MAX_RANGE_SIZE);
+	return range_size;
+}
+
 void
 vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run)
 {
@@ -1055,7 +1094,8 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
 	struct tuple_format *key_format = lsm->env->key_format;
 
 	const char *split_key_raw;
-	if (!vy_range_needs_split(range, &lsm->opts, &split_key_raw))
+	if (!vy_range_needs_split(range, vy_lsm_range_size(lsm),
+				  &split_key_raw))
 		return false;
 
 	/* Split a range in two parts. */
@@ -1163,7 +1203,7 @@ bool
 vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
 {
 	struct vy_range *first, *last;
-	if (!vy_range_needs_coalesce(range, lsm->tree, &lsm->opts,
+	if (!vy_range_needs_coalesce(range, lsm->tree, vy_lsm_range_size(lsm),
 				     &first, &last))
 		return false;
 
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 4df9d19a..d7cba109 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -444,6 +444,10 @@ vy_lsm_compaction_priority(struct vy_lsm *lsm);
 int
 vy_lsm_dumps_per_compaction(struct vy_lsm *lsm);
 
+/** Return the target size of a range in an LSM tree. */
+int64_t
+vy_lsm_range_size(struct vy_lsm *lsm);
+
 /** Add a run to the list of runs of an LSM tree. */
 void
 vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run);
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 4ba9ec5b..1a2a2c12 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -438,7 +438,7 @@ vy_range_update_dumps_per_compaction(struct vy_range *range)
  *   4/3 * range_size.
  */
 bool
-vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
+vy_range_needs_split(struct vy_range *range, int64_t range_size,
 		     const char **p_split_key)
 {
 	struct vy_slice *slice;
@@ -452,7 +452,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
 	slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
 
 	/* The range is too small to be split. */
-	if (slice->count.bytes < opts->range_size * 4 / 3)
+	if (slice->count.bytes < range_size * 4 / 3)
 		return false;
 
 	/* Find the median key in the oldest run (approximately). */
@@ -508,15 +508,15 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
  */
 bool
 vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
-			const struct index_opts *opts,
-			struct vy_range **p_first, struct vy_range **p_last)
+			int64_t range_size, struct vy_range **p_first,
+			struct vy_range **p_last)
 {
 	struct vy_range *it;
 
 	/* Size of the coalesced range. */
 	uint64_t total_size = range->count.bytes;
 	/* Coalesce ranges until total_size > max_size. */
-	uint64_t max_size = opts->range_size / 2;
+	uint64_t max_size = range_size / 2;
 
 	/*
 	 * We can't coalesce a range that was scheduled for dump
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index f19c2c6b..1df71dbf 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -280,13 +280,13 @@ vy_range_update_dumps_per_compaction(struct vy_range *range);
  * Check if a range needs to be split in two.
  *
  * @param range             The range.
- * @param opts              Index options.
+ * @param range_size        Target range size.
  * @param[out] p_split_key  Key to split the range by.
  *
  * @retval true             If the range needs to be split.
  */
 bool
-vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
+vy_range_needs_split(struct vy_range *range, int64_t range_size,
 		     const char **p_split_key);
 
 /**
@@ -295,7 +295,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
  *
  * @param range         The range.
  * @param tree          The range tree.
- * @param opts          Index options.
+ * @param range_size    Target range size.
  * @param[out] p_first  The first range in the tree to coalesce.
  * @param[out] p_last   The last range in the tree to coalesce.
  *
@@ -303,8 +303,8 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
  */
 bool
 vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
-			const struct index_opts *opts,
-			struct vy_range **p_first, struct vy_range **p_last);
+			int64_t range_size, struct vy_range **p_first,
+			struct vy_range **p_last);
 
 #if defined(__cplusplus)
 } /* extern "C" */
diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 70a4b258..559ef521 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -39,17 +39,16 @@ box.cfg
 34	vinyl_max_tuple_size:1048576
 35	vinyl_memory:134217728
 36	vinyl_page_size:8192
-37	vinyl_range_size:1073741824
-38	vinyl_read_threads:1
-39	vinyl_run_count_per_level:2
-40	vinyl_run_size_ratio:3.5
-41	vinyl_timeout:60
-42	vinyl_write_threads:4
-43	wal_dir:.
-44	wal_dir_rescan_delay:2
-45	wal_max_size:268435456
-46	wal_mode:write
-47	worker_pool_threads:4
+37	vinyl_read_threads:1
+38	vinyl_run_count_per_level:2
+39	vinyl_run_size_ratio:3.5
+40	vinyl_timeout:60
+41	vinyl_write_threads:4
+42	wal_dir:.
+43	wal_dir_rescan_delay:2
+44	wal_max_size:268435456
+45	wal_mode:write
+46	worker_pool_threads:4
 --
 -- Test insert from detached fiber
 --
diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
index d8715e27..f791cc3f 100755
--- a/test/box-tap/cfg.test.lua
+++ b/test/box-tap/cfg.test.lua
@@ -6,7 +6,7 @@ local socket = require('socket')
 local fio = require('fio')
 local uuid = require('uuid')
 local msgpack = require('msgpack')
-test:plan(103)
+test:plan(102)
 
 --------------------------------------------------------------------------------
 -- Invalid values
@@ -45,7 +45,6 @@ invalid('log', ':test:')
 invalid('vinyl_memory', -1)
 invalid('vinyl_read_threads', 0)
 invalid('vinyl_write_threads', 1)
-invalid('vinyl_range_size', 0)
 invalid('vinyl_page_size', 0)
 invalid('vinyl_run_count_per_level', 0)
 invalid('vinyl_run_size_ratio', 1)
diff --git a/test/box/admin.result b/test/box/admin.result
index 0b233889..e6fc1f30 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -98,8 +98,6 @@ cfg_filter(box.cfg)
     - 134217728
   - - vinyl_page_size
     - 8192
-  - - vinyl_range_size
-    - 1073741824
   - - vinyl_read_threads
     - 1
   - - vinyl_run_count_per_level
diff --git a/test/box/cfg.result b/test/box/cfg.result
index 68465669..7778f82a 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -86,8 +86,6 @@ cfg_filter(box.cfg)
     - 134217728
   - - vinyl_page_size
     - 8192
-  - - vinyl_range_size
-    - 1073741824
   - - vinyl_read_threads
     - 1
   - - vinyl_run_count_per_level
@@ -187,8 +185,6 @@ cfg_filter(box.cfg)
     - 134217728
   - - vinyl_page_size
     - 8192
-  - - vinyl_range_size
-    - 1073741824
   - - vinyl_read_threads
     - 1
   - - vinyl_run_count_per_level
diff --git a/test/vinyl/ddl.result b/test/vinyl/ddl.result
index 68bb6b3a..864050b3 100644
--- a/test/vinyl/ddl.result
+++ b/test/vinyl/ddl.result
@@ -8,10 +8,6 @@ test_run = require('test_run').new()
 space = box.schema.space.create('test', {engine = 'vinyl' })
 ---
 ...
-space:create_index('pk', {range_size = 0})
----
-- error: 'Wrong index options (field 4): range_size must be greater than 0'
-...
 space:create_index('pk', {page_size = 0})
 ---
 - error: 'Wrong index options (field 4): page_size must be greater than 0 and less
@@ -586,7 +582,6 @@ box.space.test.index.pk
     run_count_per_level: 2
     run_size_ratio: 3.5
     bloom_fpr: 0.05
-    range_size: 1073741824
   name: pk
   type: TREE
 ...
diff --git a/test/vinyl/ddl.test.lua b/test/vinyl/ddl.test.lua
index 9b870f35..46189828 100644
--- a/test/vinyl/ddl.test.lua
+++ b/test/vinyl/ddl.test.lua
@@ -3,7 +3,6 @@ test_run = require('test_run').new()
 
 -- sanity checks
 space = box.schema.space.create('test', {engine = 'vinyl' })
-space:create_index('pk', {range_size = 0})
 space:create_index('pk', {page_size = 0})
 space:create_index('pk', {page_size = 8192, range_size = 4096})
 space:create_index('pk', {run_count_per_level = 0})
diff --git a/test/vinyl/misc.result b/test/vinyl/misc.result
index 59492f77..5f67271e 100644
--- a/test/vinyl/misc.result
+++ b/test/vinyl/misc.result
@@ -1,3 +1,6 @@
+test_run = require('test_run').new()
+---
+...
 fiber = require('fiber')
 ---
 ...
@@ -204,3 +207,78 @@ s:insert{1, 1, 2} -- error
 s:drop()
 ---
 ...
+--
+-- gh-3944: automatic range size configuration.
+--
+-- Passing range_size explicitly on index creation.
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('pk', {range_size = 0})
+---
+...
+i.options.range_size -- nil
+---
+- null
+...
+i:stat().range_size -- 16 MB
+---
+- 16777216
+...
+box.space._index:get{s.id, i.id}[5].range_size -- 0
+---
+- 0
+...
+s:drop()
+---
+...
+-- Inheriting global settings.
+test_run:cmd('create server test with script = "vinyl/stat.lua"')
+---
+- true
+...
+test_run:cmd('start server test')
+---
+- true
+...
+test_run:cmd('switch test')
+---
+- true
+...
+box.cfg.vinyl_range_size -- nil
+---
+- null
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('pk')
+---
+...
+i.options.range_size -- nil
+---
+- null
+...
+i:stat().range_size -- 16 MB
+---
+- 16777216
+...
+box.space._index:get{s.id, i.id}[5].range_size -- nil
+---
+- null
+...
+s:drop()
+---
+...
+test_run:cmd('switch default')
+---
+- true
+...
+test_run:cmd('stop server test')
+---
+- true
+...
+test_run:cmd('cleanup server test')
+---
+- true
+...
diff --git a/test/vinyl/misc.test.lua b/test/vinyl/misc.test.lua
index ba7403ec..1c3a9517 100644
--- a/test/vinyl/misc.test.lua
+++ b/test/vinyl/misc.test.lua
@@ -1,3 +1,4 @@
+test_run = require('test_run').new()
 fiber = require('fiber')
 
 --
@@ -88,3 +89,28 @@ _ = s:create_index('sk', {unique = true, parts = {2, 'unsigned'}})
 s:insert{1, 1, 1}
 s:insert{1, 1, 2} -- error
 s:drop()
+
+--
+-- gh-3944: automatic range size configuration.
+--
+-- Passing range_size explicitly on index creation.
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('pk', {range_size = 0})
+i.options.range_size -- nil
+i:stat().range_size -- 16 MB
+box.space._index:get{s.id, i.id}[5].range_size -- 0
+s:drop()
+-- Inheriting global settings.
+test_run:cmd('create server test with script = "vinyl/stat.lua"')
+test_run:cmd('start server test')
+test_run:cmd('switch test')
+box.cfg.vinyl_range_size -- nil
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('pk')
+i.options.range_size -- nil
+i:stat().range_size -- 16 MB
+box.space._index:get{s.id, i.id}[5].range_size -- nil
+s:drop()
+test_run:cmd('switch default')
+test_run:cmd('stop server test')
+test_run:cmd('cleanup server test')
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index e79c32f0..01da5f14 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,15 +129,10 @@ test_run:cmd("setopt delimiter ''");
 -- initially stats are empty
 istat()
 ---
-- dumps_per_compaction: 0
-  rows: 0
-  run_avg: 0
-  bytes: 0
-  upsert:
+- upsert:
     squashed: 0
     applied: 0
-  lookup: 0
-  run_count: 0
+  bytes: 0
   cache:
     invalidate:
       rows: 0
@@ -155,10 +150,7 @@ istat()
     get:
       rows: 0
       bytes: 0
-  range_count: 1
-  put:
-    rows: 0
-    bytes: 0
+  run_histogram: '[0]:1'
   disk:
     last_level:
       bytes_compressed: 0
@@ -216,6 +208,12 @@ istat()
     pages: 0
     bytes_compressed: 0
     bytes: 0
+  range_size: 16384
+  rows: 0
+  run_avg: 0
+  dumps_per_compaction: 0
+  lookup: 0
+  range_count: 1
   txw:
     bytes: 0
     rows: 0
@@ -224,7 +222,10 @@ istat()
       get:
         rows: 0
         bytes: 0
-  run_histogram: '[0]:1'
+  run_count: 0
+  put:
+    rows: 0
+    bytes: 0
   memory:
     bytes: 0
     index_size: 0
@@ -295,7 +296,14 @@ wait(istat, st, 'disk.dump.count', 1)
 ...
 stat_diff(istat(), st)
 ---
-- disk:
+- put:
+    rows: 25
+    bytes: 26525
+  rows: 25
+  run_avg: 1
+  run_count: 1
+  dumps_per_compaction: 1
+  disk:
     last_level:
       bytes: 26049
       pages: 7
@@ -319,14 +327,7 @@ stat_diff(istat(), st)
     pages: 7
     bytes_compressed: <bytes_compressed>
     bloom_size: 70
-  rows: 25
-  run_avg: 1
-  run_count: 1
-  dumps_per_compaction: 1
   bytes: 26049
-  put:
-    rows: 25
-    bytes: 26525
 ...
 -- put + dump + compaction
 st = istat()
@@ -344,7 +345,12 @@ wait(istat, st, 'disk.compaction.count', 1)
 ...
 stat_diff(istat(), st)
 ---
-- disk:
+- put:
+    rows: 50
+    bytes: 53050
+  rows: 25
+  bytes: 26042
+  disk:
     last_level:
       bytes: 26042
       pages: 6
@@ -379,11 +385,6 @@ stat_diff(istat(), st)
         pages: 13
         bytes_compressed: <bytes_compressed>
         rows: 50
-  put:
-    rows: 50
-    bytes: 53050
-  rows: 25
-  bytes: 26042
 ...
 -- point lookup from disk + cache put
 st = istat()
@@ -403,7 +404,6 @@ stat_diff(istat(), st)
     put:
       rows: 1
       bytes: 1061
-  lookup: 1
   disk:
     iterator:
       read:
@@ -415,6 +415,7 @@ stat_diff(istat(), st)
       get:
         rows: 1
         bytes: 1061
+  lookup: 1
   memory:
     iterator:
       lookup: 1
@@ -654,6 +655,19 @@ stat_diff(istat(), st)
     put:
       rows: 51
       bytes: 54111
+  lookup: 1
+  txw:
+    iterator:
+      lookup: 1
+      get:
+        rows: 50
+        bytes: 53050
+  memory:
+    iterator:
+      lookup: 1
+      get:
+        rows: 100
+        bytes: 106100
   disk:
     iterator:
       read:
@@ -665,19 +679,6 @@ stat_diff(istat(), st)
       get:
         rows: 100
         bytes: 106100
-  txw:
-    iterator:
-      lookup: 1
-      get:
-        rows: 50
-        bytes: 53050
-  memory:
-    iterator:
-      lookup: 1
-      get:
-        rows: 100
-        bytes: 106100
-  lookup: 1
   get:
     rows: 100
     bytes: 106100
@@ -1000,15 +1001,10 @@ box.stat.reset()
 ...
 istat()
 ---
-- dumps_per_compaction: 1
-  rows: 306
-  run_avg: 1
+- upsert:
+    squashed: 0
+    applied: 0
   bytes: 317731
-  upsert:
-    squashed: 0
-    applied: 0
-  lookup: 0
-  run_count: 2
   cache:
     invalidate:
       rows: 0
@@ -1026,10 +1022,7 @@ istat()
     get:
       rows: 0
       bytes: 0
-  range_count: 2
-  put:
-    rows: 0
-    bytes: 0
+  run_histogram: '[1]:2'
   disk:
     last_level:
       bytes_compressed: <bytes_compressed>
@@ -1087,6 +1080,12 @@ istat()
     pages: 25
     bytes_compressed: <bytes_compressed>
     bytes: 104300
+  range_size: 16384
+  rows: 306
+  run_avg: 1
+  dumps_per_compaction: 1
+  lookup: 0
+  range_count: 2
   txw:
     bytes: 0
     rows: 0
@@ -1095,7 +1094,10 @@ istat()
       get:
         rows: 0
         bytes: 0
-  run_histogram: '[1]:2'
+  run_count: 2
+  put:
+    rows: 0
+    bytes: 0
   memory:
     bytes: 213431
     index_size: 49152
-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
                   ` (4 preceding siblings ...)
  2019-01-24 17:12 ` [PATCH v2 5/8] vinyl: set range size automatically Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
  2019-01-24 17:12 ` [PATCH v2 7/8] vinyl: introduce quota consumer types Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 8/8] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

Since all ranges constituting an LSM tree have the same configuration,
they tend to get compacted at approximately the same time. This entails
IO load spikes, which, in turn, lead to deviation of the LSM tree from
the target shape and hence increased read amplification. To prevent
this from happening, this patch implements compaction randomization:
at each LSM tree level, compaction is deferred with 10% probability,
i.e. if the number of runs at a level exceeds the configured
run_count_per_level, the level is compacted with 90% probability, and
with 10% probability compaction is deferred until another run is added
to the level.
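
Condensed, the per-level decision looks like this (a sketch mirroring
the vy_range.c hunk below; compact_this_level() is a hypothetical
stand-in for the actual scheduling logic):

    uint32_t max_run_count = opts->run_count_per_level;
    if (slice->seed < RAND_MAX / 10)     /* fires with ~10% probability */
        max_run_count++;                 /* tolerate one extra run */
    if (level_run_count > max_run_count)
        compact_this_level();            /* hypothetical helper */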

Our simulations show that such a simple algorithm performs fairly well:
it randomizes compaction pace among ranges, spreading IO load evenly
over time, while write amplification increases by no more than 5-10%,
which seems a reasonable price for eliminating IO load spikes.

Closes #3944
---
 src/box/vy_range.c                 | 20 +++++++++++++++++++-
 src/box/vy_run.c                   |  1 +
 src/box/vy_run.h                   |  5 +++++
 test/vinyl/write_iterator.result   | 20 ++++++++++++++++++--
 test/vinyl/write_iterator.test.lua | 10 ++++++++--
 5 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 1a2a2c12..7b6ddb40 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -389,7 +389,25 @@ vy_range_update_compaction_priority(struct vy_range *range,
 			 * we find an appropriate level for it.
 			 */
 		}
-		if (level_run_count > opts->run_count_per_level) {
+		/*
+		 * Since all ranges constituting an LSM tree have
+		 * the same configuration, they tend to get compacted
+		 * simultaneously, leading to IO load spikes and, as
+		 * a result, distortion of the LSM tree shape and
+		 * increased read amplification. To prevent this from
+		 * happening, we constantly randomize compaction pace
+		 * among ranges by deferring compaction at each LSM
+		 * tree level with some fixed small probability.
+		 *
+		 * Note, we can't use rand() directly here, because
+		 * this function is called on every memory dump and
+		 * scans all LSM tree levels. Instead we use the
+		 * value of rand() from the slice creation time.
+		 */
+		uint32_t max_run_count = opts->run_count_per_level;
+		if (slice->seed < RAND_MAX / 10)
+			max_run_count++;
+		if (level_run_count > max_run_count) {
 			/*
 			 * The number of runs at the current level
 			 * exceeds the configured maximum. Arrange
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index cee90458..4c7e637c 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -377,6 +377,7 @@ vy_slice_new(int64_t id, struct vy_run *run, struct tuple *begin,
 	memset(slice, 0, sizeof(*slice));
 	slice->id = id;
 	slice->run = run;
+	slice->seed = rand();
 	vy_run_ref(run);
 	run->slice_count++;
 	if (begin != NULL)
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 28fd6a50..18ca1729 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -185,6 +185,11 @@ struct vy_slice {
 	struct tuple *begin;
 	struct tuple *end;
 	/**
+	 * Random seed used for compaction randomization.
+	 * Lies in the range [0, RAND_MAX].
+	 */
+	int seed;
+	/**
 	 * Number of async users of this slice. Slice must not
 	 * be removed until it hits 0. Used by the iterator to
 	 * prevent use-after-free after waiting for IO.
diff --git a/test/vinyl/write_iterator.result b/test/vinyl/write_iterator.result
index 88a1c287..39212572 100644
--- a/test/vinyl/write_iterator.result
+++ b/test/vinyl/write_iterator.result
@@ -755,7 +755,7 @@ sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned
 PAD1 = 100
 ---
 ...
-PAD2 = 10
+PAD2 = 15
 ---
 ...
 -- Create a big run to prevent major compaction.
@@ -766,6 +766,14 @@ box.snapshot()
 ---
 - ok
 ...
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
 -- Generate some INSERT statements and dump them to disk.
 _ = s:insert{1, 1} -- insert
 ---
@@ -887,7 +895,7 @@ sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned
 PAD1 = 100
 ---
 ...
-PAD2 = 10
+PAD2 = 15
 ---
 ...
 -- Create a big run to prevent major compaction.
@@ -922,6 +930,14 @@ box.snapshot()
 ---
 - ok
 ...
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
 -- Generate DELETE+INSERT statements and write them to disk.
 s:delete{1} s:insert{1, 100}
 ---
diff --git a/test/vinyl/write_iterator.test.lua b/test/vinyl/write_iterator.test.lua
index 069c7f69..d7871ffc 100644
--- a/test/vinyl/write_iterator.test.lua
+++ b/test/vinyl/write_iterator.test.lua
@@ -323,10 +323,13 @@ _ = s:on_replace(function() end)
 pk = s:create_index('primary', {run_count_per_level = 1})
 sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned'}})
 PAD1 = 100
-PAD2 = 10
+PAD2 = 15
 -- Create a big run to prevent major compaction.
 for i = 1001, 1000 + PAD1 do s:replace{i, i} end
 box.snapshot()
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+box.snapshot()
 -- Generate some INSERT statements and dump them to disk.
 _ = s:insert{1, 1} -- insert
 _ = s:replace{2, 2} -- replace, no old tuple
@@ -373,7 +376,7 @@ s = box.schema.space.create('test', {engine = 'vinyl'})
 pk = s:create_index('primary', {run_count_per_level = 1})
 sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned'}})
 PAD1 = 100
-PAD2 = 10
+PAD2 = 15
 -- Create a big run to prevent major compaction.
 for i = 1001, 1000 + PAD1 do s:insert{i, i} end
 _ = s:insert{1, 1}
@@ -385,6 +388,9 @@ _ = s:insert{6, 6}
 _ = s:insert{7, 7}
 _ = s:insert{8, 8}
 box.snapshot()
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+box.snapshot()
 -- Generate DELETE+INSERT statements and write them to disk.
 s:delete{1} s:insert{1, 100}
 box.begin() s:delete{2} s:insert{2, 200} box.commit()
-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 7/8] vinyl: introduce quota consumer types
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
                   ` (5 preceding siblings ...)
  2019-01-24 17:12 ` [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  2019-02-12 15:48   ` Vladimir Davydov
  2019-01-24 17:12 ` [PATCH v2 8/8] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
  7 siblings, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

Currently, we only limit quota consumption rate so that writers won't
hit the hard limit before memory dump is complete. However, it isn't
enough, because we also need to consider compaction: if it doesn't keep
up with dumps, read and space amplification will grow uncontrollably.

The problem is that compaction may be a quota consumer itself, as it
may generate deferred DELETE statements for secondary indexes. We
can't ignore quota completely there, because if we did, we might hit
the memory limit and stall all writers, which is unacceptable. But we
do want compaction to ignore the rate limit imposed to make sure that
compaction keeps up with dumps - otherwise throttling transactions
wouldn't help compaction catch up.

To tackle this problem, this patch introduces the concept of quota
consumer types and resources. Now vy_quota maintains one rate limit
per resource and one wait queue per consumer type. There are two types
of consumers, compaction jobs and usual transactions, and two resources
managed by vy_quota, disk and memory. The memory-based rate limit
ensures that transactions won't hit the hard memory limit and stall
before a memory dump is complete. It is respected by all types of
consumers. The disk-based rate limit is supposed to be set when
compaction doesn't keep up with dumps. It is only used by usual
transactions and ignored by compaction jobs.
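
Schematically, each consumer type declares the resources it respects
as a bit mask. The TX entry below is taken from the vy_quota.c hunk
that follows; the compaction entry is inferred from the description
above:

    [VY_QUOTA_CONSUMER_TX]         = (1 << VY_QUOTA_RESOURCE_DISK) |
                                     (1 << VY_QUOTA_RESOURCE_MEMORY),
    [VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),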

Since there are now two wait queues, we need to balance wakeups between
them in case consumers in both queues are ready to proceed. To ensure
there's no starvation, we maintain a monotonically growing counter and
assign its value to each consumer put to sleep (its ticket). We use it
to wake up the consumer that has waited longest when both queues are
ready.
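
A minimal sketch of the tie-break, with hypothetical variable and
helper names (the actual queueing code lives in vy_quota.c):

    /*
     * Both queues have a consumer that can be satisfied: wake up
     * the one that has waited longest, i.e. the smaller ticket.
     */
    if (tx_waiter != NULL && compaction_waiter != NULL)
        wakeup(tx_waiter->ticket < compaction_waiter->ticket ?
               tx_waiter : compaction_waiter);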

Note, the patch doesn't implement the logic of disk-based throttling in
the regulator module. It is still left for future work.

Needed for #3721
---
 src/box/vinyl.c        |  30 +++++++-----
 src/box/vy_quota.c     | 123 ++++++++++++++++++++++++++++++++++++++-----------
 src/box/vy_quota.h     |  91 ++++++++++++++++++++++++++++++------
 src/box/vy_regulator.c |  12 +++--
 4 files changed, 198 insertions(+), 58 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 0936932b..aaef858e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2342,7 +2342,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 	 * the transaction to be sent to read view or aborted, we call
 	 * it before checking for conflicts.
 	 */
-	if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 tx->write_size, timeout) != 0)
 		return -1;
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -2351,8 +2352,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
 
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_adjust(&env->quota, tx->write_size,
-			mem_used_after - mem_used_before);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX,
+			tx->write_size, mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 	return rc;
 }
@@ -2377,7 +2378,8 @@ vinyl_engine_commit(struct engine *engine, struct txn *txn)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	/* We can't abort the transaction at this point, use force. */
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 
 	txn->engine_tx = NULL;
@@ -3188,7 +3190,8 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	 * quota accounting.
 	 */
 	size_t reserved = tx->write_size;
-	if (vy_quota_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
+	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			 reserved, TIMEOUT_INFINITY) != 0)
 		unreachable();
 
 	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -3207,7 +3210,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
 	size_t used = mem_used_after - mem_used_before;
-	vy_quota_adjust(&env->quota, reserved, used);
+	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX, reserved, used);
 	vy_regulator_check_dump_watermark(&env->regulator);
 	return rc;
 }
@@ -3529,7 +3532,7 @@ vy_squash_process(struct vy_squash *squash)
 		 * so there's no need in invalidating the cache.
 		 */
 		vy_mem_commit_stmt(mem, region_stmt);
-		vy_quota_force_use(&env->quota,
+		vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
 				   mem_used_after - mem_used_before);
 		vy_regulator_check_dump_watermark(&env->regulator);
 	}
@@ -4005,9 +4008,10 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
 	/* Consume memory quota. Throttle if it is exceeded. */
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
-	vy_quota_wait(&env->quota);
+	vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_TX);
 	return rc;
 }
 
@@ -4133,7 +4137,8 @@ vy_build_recover(struct vy_env *env, struct vy_lsm *lsm, struct vy_lsm *pk)
 
 	mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+			   mem_used_after - mem_used_before);
 	return rc;
 }
 
@@ -4349,7 +4354,7 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	 */
 	struct vy_env *env = vy_env(space->engine);
 	if (is_first_statement)
-		vy_quota_wait(&env->quota);
+		vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_COMPACTION);
 
 	/* Create the deferred DELETE statement. */
 	struct vy_lsm *pk = vy_lsm(space->index[0]);
@@ -4436,7 +4441,8 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	}
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
-	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+	vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_COMPACTION,
+			   mem_used_after - mem_used_before);
 	vy_regulator_check_dump_watermark(&env->regulator);
 
 	tuple_unref(delete);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 07cd5856..20d322de 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -52,6 +52,43 @@
 static const double VY_QUOTA_TIMER_PERIOD = 0.1;
 
 /**
+ * Bit mask of resources used by a particular consumer type.
+ */
+static unsigned
+vy_quota_consumer_resource_map[] = {
+	/**
+	 * Transaction throttling pursues two goals. First, it caps
+	 * the memory consumption rate so that the hard memory
+	 * limit will not be hit before a memory dump has completed
+	 * (memory-based throttling). Second, it makes sure that
+	 * compaction jobs keep up with dumps to keep read and
+	 * space amplification within bounds (disk-based throttling).
+	 * Transactions ought to respect them both.
+	 */
+	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
+				 (1 << VY_QUOTA_RESOURCE_MEMORY),
+	/**
+	 * Compaction jobs may need some quota too, because they
+	 * may generate deferred DELETEs for secondary indexes.
+	 * Obviously, we must not impose on them the disk-based
+	 * rate limit that is meant to help compaction keep up;
+	 * however, they still have to respect memory-based
+	 * throttling to avoid long stalls.
+	 */
+	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
+};
+
+/**
+ * Iterate over rate limit states that are enforced for a consumer
+ * of the given type.
+ */
+#define vy_quota_consumer_for_each_rate_limit(quota, type, rl) \
+	for (struct vy_rate_limit *rl = (quota)->rate_limit; \
+	     rl - (quota)->rate_limit < vy_quota_resource_type_MAX; rl++) \
+		if (vy_quota_consumer_resource_map[type] & \
+		    (1 << (rl - (quota)->rate_limit)))
+
+/**
  * Return true if the requested amount of memory may be consumed
  * right now, false if consumers have to wait.
  *
@@ -60,7 +97,8 @@ static const double VY_QUOTA_TIMER_PERIOD = 0.1;
  * it can start memory reclaim immediately.
  */
 static inline bool
-vy_quota_may_use(struct vy_quota *q, size_t size)
+vy_quota_may_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		 size_t size)
 {
 	if (!q->is_enabled)
 		return true;
@@ -68,8 +106,10 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
 		q->quota_exceeded_cb(q);
 		return false;
 	}
-	if (!vy_rate_limit_may_use(&q->rate_limit))
-		return false;
+	vy_quota_consumer_for_each_rate_limit(q, type, rl) {
+		if (!vy_rate_limit_may_use(rl))
+			return false;
+	}
 	return true;
 }
 
@@ -77,10 +117,12 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
  * Consume the given amount of memory without checking the limit.
  */
 static inline void
-vy_quota_do_use(struct vy_quota *q, size_t size)
+vy_quota_do_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t size)
 {
 	q->used += size;
-	vy_rate_limit_use(&q->rate_limit, size);
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		vy_rate_limit_use(rl, size);
 }
 
 /**
@@ -88,11 +130,13 @@ vy_quota_do_use(struct vy_quota *q, size_t size)
  * This function is an exact opposite of vy_quota_do_use().
  */
 static inline void
-vy_quota_do_unuse(struct vy_quota *q, size_t size)
+vy_quota_do_unuse(struct vy_quota *q, enum vy_quota_consumer_type type,
+		  size_t size)
 {
 	assert(q->used >= size);
 	q->used -= size;
-	vy_rate_limit_unuse(&q->rate_limit, size);
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		vy_rate_limit_unuse(rl, size);
 }
 
 /**
@@ -112,17 +156,31 @@ vy_quota_check_limit(struct vy_quota *q)
 static void
 vy_quota_signal(struct vy_quota *q)
 {
-	if (!rlist_empty(&q->wait_queue)) {
+	/*
+	 * To prevent starvation, wake up the consumer that has
+	 * waited longest, irrespective of its type.
+	 */
+	struct vy_quota_wait_node *oldest = NULL;
+	for (int i = 0; i < vy_quota_consumer_type_MAX; i++) {
+		struct rlist *wq = &q->wait_queue[i];
+		if (rlist_empty(wq))
+			continue;
+
 		struct vy_quota_wait_node *n;
-		n = rlist_first_entry(&q->wait_queue,
-				      struct vy_quota_wait_node, in_wait_queue);
+		n = rlist_first_entry(wq, struct vy_quota_wait_node,
+				      in_wait_queue);
 		/*
 		 * No need in waking up a consumer if it will have
 		 * to go back to sleep immediately.
 		 */
-		if (vy_quota_may_use(q, n->size))
-			fiber_wakeup(n->fiber);
+		if (!vy_quota_may_use(q, i, n->size))
+			continue;
+
+		if (oldest == NULL || oldest->ticket > n->ticket)
+			oldest = n;
 	}
+	if (oldest != NULL)
+		fiber_wakeup(oldest->fiber);
 }
 
 static void
@@ -133,7 +191,8 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
 
 	struct vy_quota *q = timer->data;
 
-	vy_rate_limit_refill(&q->rate_limit, VY_QUOTA_TIMER_PERIOD);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+		vy_rate_limit_refill(&q->rate_limit[i], VY_QUOTA_TIMER_PERIOD);
 	vy_quota_signal(q);
 }
 
@@ -146,8 +205,11 @@ vy_quota_create(struct vy_quota *q, size_t limit,
 	q->used = 0;
 	q->too_long_threshold = TIMEOUT_INFINITY;
 	q->quota_exceeded_cb = quota_exceeded_cb;
-	rlist_create(&q->wait_queue);
-	vy_rate_limit_create(&q->rate_limit);
+	q->wait_ticket = 0;
+	for (int i = 0; i < vy_quota_consumer_type_MAX; i++)
+		rlist_create(&q->wait_queue[i]);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+		vy_rate_limit_create(&q->rate_limit[i]);
 	ev_timer_init(&q->timer, vy_quota_timer_cb, 0, VY_QUOTA_TIMER_PERIOD);
 	q->timer.data = q;
 }
@@ -176,15 +238,17 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
 }
 
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate)
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+			size_t rate)
 {
-	vy_rate_limit_set(&q->rate_limit, rate);
+	vy_rate_limit_set(&q->rate_limit[type], rate);
 }
 
 void
-vy_quota_force_use(struct vy_quota *q, size_t size)
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		   size_t size)
 {
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, type, size);
 	vy_quota_check_limit(q);
 }
 
@@ -201,7 +265,8 @@ vy_quota_release(struct vy_quota *q, size_t size)
 }
 
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout)
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+	     size_t size, double timeout)
 {
 	/*
 	 * Fail early if the configured memory limit never allows
@@ -212,8 +277,8 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 		return -1;
 	}
 
-	if (vy_quota_may_use(q, size)) {
-		vy_quota_do_use(q, size);
+	if (vy_quota_may_use(q, type, size)) {
+		vy_quota_do_use(q, type, size);
 		return 0;
 	}
 
@@ -224,8 +289,9 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 	struct vy_quota_wait_node wait_node = {
 		.fiber = fiber(),
 		.size = size,
+		.ticket = ++q->wait_ticket,
 	};
-	rlist_add_tail_entry(&q->wait_queue, &wait_node, in_wait_queue);
+	rlist_add_tail_entry(&q->wait_queue[type], &wait_node, in_wait_queue);
 
 	do {
 		double now = ev_monotonic_now(loop());
@@ -235,7 +301,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 			diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
 			return -1;
 		}
-	} while (!vy_quota_may_use(q, size));
+	} while (!vy_quota_may_use(q, type, size));
 
 	rlist_del_entry(&wait_node, in_wait_queue);
 
@@ -246,7 +312,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 				     wait_time);
 	}
 
-	vy_quota_do_use(q, size);
+	vy_quota_do_use(q, type, size);
 	/*
 	 * Blocked consumers are awaken one by one to preserve
 	 * the order they were put to sleep. It's a responsibility
@@ -258,14 +324,15 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
 }
 
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t reserved, size_t used)
 {
 	if (reserved > used) {
-		vy_quota_do_unuse(q, reserved - used);
+		vy_quota_do_unuse(q, type, reserved - used);
 		vy_quota_signal(q);
 	}
 	if (reserved < used) {
-		vy_quota_do_use(q, used - reserved);
+		vy_quota_do_use(q, type, used - reserved);
 		vy_quota_check_limit(q);
 	}
 }
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 79755e89..d90922b2 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -110,6 +110,50 @@ vy_rate_limit_refill(struct vy_rate_limit *rl, double time)
 typedef void
 (*vy_quota_exceeded_f)(struct vy_quota *quota);
 
+/**
+ * Apart from memory usage accounting and limiting, vy_quota is
+ * responsible for consumption rate limiting (aka throttling).
+ * There are multiple rate limits, each of which is associated
+ * with a particular resource type. Different kinds of consumers
+ * respect different limits. The following enumeration defines
+ * the resource types for which vy_quota enables throttling.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_resource_type {
+	/**
+	 * The goal of disk-based throttling is to keep LSM trees
+	 * in a good shape so that read and space amplification
+	 * stay within bounds. It is enabled when compaction does
+	 * not keep up with dumps.
+	 */
+	VY_QUOTA_RESOURCE_DISK = 0,
+	/**
+	 * Memory-based throttling is needed to avoid long stalls
+	 * caused by hitting the hard memory limit. It is set so
+	 * that by the time the hard limit is hit, the last memory
+	 * dump will have completed.
+	 */
+	VY_QUOTA_RESOURCE_MEMORY = 1,
+
+	vy_quota_resource_type_MAX,
+};
+
+/**
+ * Quota consumer type determines how a quota consumer will be
+ * rate limited.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_consumer_type {
+	/** Transaction processor. */
+	VY_QUOTA_CONSUMER_TX = 0,
+	/** Compaction job. */
+	VY_QUOTA_CONSUMER_COMPACTION = 1,
+
+	vy_quota_consumer_type_MAX,
+};
+
 struct vy_quota_wait_node {
 	/** Link in vy_quota::wait_queue. */
 	struct rlist in_wait_queue;
@@ -117,6 +161,11 @@ struct vy_quota_wait_node {
 	struct fiber *fiber;
 	/** Amount of requested memory. */
 	size_t size;
+	/**
+	 * Ticket assigned to this fiber when it was put to
+	 * sleep, see vy_quota::wait_ticket for more details.
+	 */
+	int64_t ticket;
 };
 
 /**
@@ -144,13 +193,23 @@ struct vy_quota {
 	 */
 	vy_quota_exceeded_f quota_exceeded_cb;
 	/**
-	 * Queue of consumers waiting for quota, linked by
-	 * vy_quota_wait_node::state. Newcomers are added
-	 * to the tail.
+	 * Monotonically growing counter assigned to consumers
+	 * waiting for quota. It is used for balancing wakeups
+	 * among wait queues: if two fibers from different wait
+	 * queues may proceed, the one with the lowest ticket
+	 * will be picked.
+	 *
+	 * See also vy_quota_wait_node::ticket.
 	 */
-	struct rlist wait_queue;
-	/** Rate limit state. */
-	struct vy_rate_limit rate_limit;
+	int64_t wait_ticket;
+	/**
+	 * Queues of consumers waiting for quota, one per
+	 * consumer type, linked by vy_quota_wait_node::in_wait_queue.
+	 * Newcomers are added to the tail.
+	 */
+	struct rlist wait_queue[vy_quota_consumer_type_MAX];
+	/** Rate limit state, one per resource type. */
+	struct vy_rate_limit rate_limit[vy_quota_resource_type_MAX];
 	/**
 	 * Periodic timer that is used for refilling the rate
 	 * limit value.
@@ -188,18 +247,20 @@ void
 vy_quota_set_limit(struct vy_quota *q, size_t limit);
 
 /**
- * Set the max rate at which quota may be consumed,
- * in bytes per second.
+ * Set the rate limit corresponding to the resource of the given
+ * type. The rate limit is given in bytes per second.
  */
 void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate);
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+			size_t rate);
 
 /**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
 void
-vy_quota_force_use(struct vy_quota *q, size_t size);
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+		   size_t size);
 
 /**
  * Release @size bytes of memory.
@@ -242,7 +303,8 @@ vy_quota_release(struct vy_quota *q, size_t size);
  * account while estimating the size of a memory allocation.
  */
 int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout);
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+	     size_t size, double timeout);
 
 /**
  * Adjust quota after allocating memory.
@@ -253,15 +315,16 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout);
  * See also vy_quota_use().
  */
 void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used);
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+		size_t reserved, size_t used);
 
 /**
  * Block the caller until the quota is not exceeded.
  */
 static inline void
-vy_quota_wait(struct vy_quota *q)
+vy_quota_wait(struct vy_quota *q, enum vy_quota_consumer_type type)
 {
-	vy_quota_use(q, 0, TIMEOUT_INFINITY);
+	vy_quota_use(q, type, 0, TIMEOUT_INFINITY);
 }
 
 #if defined(__cplusplus)
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 2e09b93c..e14b01aa 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -101,7 +101,8 @@ vy_regulator_trigger_dump(struct vy_regulator *regulator)
 	size_t max_write_rate = (double)mem_left / (mem_used + 1) *
 					regulator->dump_bandwidth;
 	max_write_rate = MIN(max_write_rate, regulator->dump_bandwidth);
-	vy_quota_set_rate_limit(quota, max_write_rate);
+	vy_quota_set_rate_limit(quota, VY_QUOTA_RESOURCE_MEMORY,
+				max_write_rate);
 }
 
 static void
@@ -202,7 +203,8 @@ void
 vy_regulator_start(struct vy_regulator *regulator)
 {
 	regulator->quota_used_last = regulator->quota->used;
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 	ev_timer_start(loop(), &regulator->timer);
 }
 
@@ -253,7 +255,8 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 	 * limit to the dump bandwidth rather than disabling it
 	 * completely.
 	 */
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 }
 
 void
@@ -263,5 +266,6 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
 	regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
 	if (max > 0 && regulator->dump_bandwidth > max)
 		regulator->dump_bandwidth = max;
-	vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+				regulator->dump_bandwidth);
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 8/8] vinyl: throttle tx to ensure compaction keeps up with dumps
  2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
                   ` (6 preceding siblings ...)
  2019-01-24 17:12 ` [PATCH v2 7/8] vinyl: introduce quota consumer types Vladimir Davydov
@ 2019-01-24 17:12 ` Vladimir Davydov
  7 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-01-24 17:12 UTC (permalink / raw)
  To: tarantool-patches

Every byte of data written to a vinyl database eventually gets compacted
with data written to the database earlier. The ratio of the size of data
actually written to disk to the size of data written to the database is
called write amplification. Write amplification depends on the LSM tree
configuration and the workload parameters and varies in a wide range,
from 2-3 to 10-20 or even higher in some extreme cases. If the database
engine doesn't manage to write that extra data, the LSM tree shape will
get distorted, resulting in increased read and space amplification,
which, in turn, slows down reads and wastes disk space. That's why it's
so important to ensure the database engine has enough compaction power.
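
To put rough numbers on this (purely illustrative, not measurements
from the patch): with a write amplification of 5, every megabyte
written by transactions eventually costs five megabytes of disk writes,
only about one of which is the dump itself.

#include <stdio.h>

int main(void)
{
	/* Hypothetical workload numbers, for illustration only. */
	double tx_rate = 50.0;   /* MB/s written by transactions */
	double write_amp = 5.0;  /* observed write amplification */

	/* Total disk write rate the engine must sustain. */
	double disk_rate = tx_rate * write_amp;         /* 250 MB/s */
	/* Roughly everything beyond the dump is compaction work
	 * (assuming dump output is about the size of its input). */
	double compaction_rate = disk_rate - tx_rate;   /* 200 MB/s */

	printf("disk: %.0f MB/s, compaction: %.0f MB/s\n",
	       disk_rate, compaction_rate);
	return 0;
}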

One way to ensure that is to increase the number of compaction threads
via the box.cfg.vinyl_write_threads configuration knob, but it can't be
raised beyond the capacity of the server running the instance. So the
database engine must throttle writes if it detects that compaction
threads are struggling to keep up. This patch implements a very simple
algorithm to achieve that: it keeps track of recently observed write
amplification and data compaction speed, uses them to calculate the max
transaction rate that the database engine can handle while steadily
maintaining the current level of write amplification, and sets the rate
limit to half that so as to give the engine enough room to increase
write amplification if needed.
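
Plugging hypothetical numbers into that calculation (the real inputs
come from recent scheduler statistics, see the vy_regulator.c hunk
below): with 4 compaction threads that spent 80 seconds compacting the
last 1 GB of dumped data, the limit would come out at about 25 MB/s.

#include <stdio.h>

int main(void)
{
	/* Hypothetical inputs; in the patch they are taken from
	 * recent scheduler statistics. */
	int    compaction_threads = 4;
	double dump_input = 1024.0;     /* MB dumped recently */
	double compaction_time = 80.0;  /* seconds spent compacting it */

	/* max_tx_rate = threads * dump_input / compaction_time;
	 * the limit is set to half that, as explained above. */
	double rate_limit = 0.5 * compaction_threads *
			    dump_input / compaction_time;

	printf("rate limit: %.1f MB/s\n", rate_limit); /* 25.6 */
	return 0;
}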

The algorithm is obviously pessimistic: it undervalues the transaction
rate the database can handle after write amplification has steadied.
But this is compensated by its simplicity and stability - there
shouldn't be any abrupt drops or peaks in RPS due to its decisions.
Besides, it adapts fairly quickly to increases in write amplification
as a database fills up. If the algorithm turns out to be too cautious
and undervalues the limit, this is easy to fix by simply increasing the
number of compaction threads - the rate limit will scale
proportionately if the system is underloaded.

The current value of the rate limit set by the algorithm is reported by
box.stat.vinyl() under the regulator.rate_limit section.

Closes #3721
---
 src/box/vinyl.c        |  6 ++++
 src/box/vy_quota.c     |  9 ++++++
 src/box/vy_quota.h     |  6 ++++
 src/box/vy_regulator.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++--
 src/box/vy_regulator.h | 27 ++++++++++++++++
 5 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index aaef858e..650d5c26 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -274,6 +274,8 @@ vy_info_append_regulator(struct vy_env *env, struct info_handler *h)
 	info_append_int(h, "write_rate", r->write_rate);
 	info_append_int(h, "dump_bandwidth", r->dump_bandwidth);
 	info_append_int(h, "dump_watermark", r->dump_watermark);
+	info_append_int(h, "rate_limit", vy_quota_get_rate_limit(r->quota,
+							VY_QUOTA_CONSUMER_TX));
 	info_table_end(h); /* regulator */
 }
 
@@ -532,6 +534,7 @@ vinyl_engine_reset_stat(struct engine *engine)
 	memset(&xm->stat, 0, sizeof(xm->stat));
 
 	vy_scheduler_reset_stat(&env->scheduler);
+	vy_regulator_reset_stat(&env->regulator);
 }
 
 /** }}} Introspection */
@@ -2475,6 +2478,9 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
 	 */
 	vy_regulator_dump_complete(&env->regulator, mem_dumped, dump_duration);
 	vy_quota_release(quota, mem_dumped);
+
+	vy_regulator_update_rate_limit(&env->regulator, &scheduler->stat,
+				       scheduler->compaction_pool.size);
 }
 
 static struct vy_squash_queue *
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 20d322de..4dd961c9 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -244,6 +244,15 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
 	vy_rate_limit_set(&q->rate_limit[type], rate);
 }
 
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type)
+{
+	size_t rate = SIZE_MAX;
+	vy_quota_consumer_for_each_rate_limit(q, type, rl)
+		rate = MIN(rate, rl->rate);
+	return rate;
+}
+
 void
 vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
 		   size_t size)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d90922b2..7ff98cc1 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -255,6 +255,12 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
 			size_t rate);
 
 /**
+ * Return the rate limit applied to a consumer of the given type.
+ */
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type);
+
+/**
  * Consume @size bytes of memory. In contrast to vy_quota_use()
  * this function does not throttle the caller.
  */
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index e14b01aa..75f1f798 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -34,6 +34,7 @@
 #include <stdbool.h>
 #include <stddef.h>
 #include <stdint.h>
+#include <string.h>
 #include <tarantool_ev.h>
 
 #include "fiber.h"
@@ -42,6 +43,7 @@
 #include "trivia/util.h"
 
 #include "vy_quota.h"
+#include "vy_stat.h"
 
 /**
  * Regulator timer period, in seconds.
@@ -73,6 +75,14 @@ static const size_t VY_DUMP_BANDWIDTH_DEFAULT = 10 * 1024 * 1024;
  */
 static const size_t VY_DUMP_SIZE_ACCT_MIN = 1024 * 1024;
 
+/**
+ * Number of dumps to take into account for rate limit calculation.
+ * It shouldn't be too small, otherwise RPS will be uneven, and it
+ * shouldn't be too big either, otherwise the rate limit will adapt
+ * too slowly to workload changes. 100 feels like a good choice.
+ */
+static const int VY_RECENT_DUMP_COUNT = 100;
+
 static void
 vy_regulator_trigger_dump(struct vy_regulator *regulator)
 {
@@ -182,6 +192,7 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
 		100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
 		700 * MB, 800 * MB, 900 * MB,
 	};
+	memset(regulator, 0, sizeof(*regulator));
 	regulator->dump_bandwidth_hist = histogram_new(dump_bandwidth_buckets,
 					lengthof(dump_bandwidth_buckets));
 	if (regulator->dump_bandwidth_hist == NULL)
@@ -192,11 +203,8 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
 	ev_timer_init(&regulator->timer, vy_regulator_timer_cb, 0,
 		      VY_REGULATOR_TIMER_PERIOD);
 	regulator->timer.data = regulator;
-	regulator->write_rate = 0;
-	regulator->quota_used_last = 0;
 	regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
 	regulator->dump_watermark = SIZE_MAX;
-	regulator->dump_in_progress = false;
 }
 
 void
@@ -269,3 +277,75 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
 	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
 				regulator->dump_bandwidth);
 }
+
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator)
+{
+	memset(&regulator->sched_stat_last, 0,
+	       sizeof(regulator->sched_stat_last));
+}
+
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+			       const struct vy_scheduler_stat *stat,
+			       int compaction_threads)
+{
+	struct vy_scheduler_stat *last = &regulator->sched_stat_last;
+	struct vy_scheduler_stat *recent = &regulator->sched_stat_recent;
+	/*
+	 * The maximal dump rate the database can handle while
+	 * maintaining the current level of write amplification
+	 * equals:
+	 *
+	 *                                        dump_output
+	 *   max_dump_rate = compaction_rate * -----------------
+	 *                                     compaction_output
+	 *
+	 * The average compaction rate can be estimated with:
+	 *
+	 *                                          compaction_output
+	 *   compaction_rate = compaction_threads * -----------------
+	 *                                           compaction_time
+	 *
+	 * Putting it all together and taking into account data
+	 * compaction during memory dump, we get for the max
+	 * transaction rate:
+	 *
+	 *                                 dump_input
+	 *   max_tx_rate = max_dump_rate * ----------- =
+	 *                                 dump_output
+	 *
+	 *                                        dump_input
+	 *                 compaction_threads * ---------------
+	 *                                      compaction_time
+	 *
+	 * We set the rate limit to half that to leave the database
+	 * engine enough room for growing write amplification.
+	 */
+	int32_t dump_count = stat->dump_count - last->dump_count;
+	int64_t dump_input = stat->dump_input - last->dump_input;
+	double compaction_time = stat->compaction_time - last->compaction_time;
+	*last = *stat;
+
+	if (dump_input < (ssize_t)VY_DUMP_SIZE_ACCT_MIN)
+		return;
+
+	recent->dump_count += dump_count;
+	recent->dump_input += dump_input;
+	recent->compaction_time += compaction_time;
+
+	double rate = 0.5 * compaction_threads * recent->dump_input /
+						 recent->compaction_time;
+	vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
+				MIN(rate, SIZE_MAX));
+
+	/*
+	 * Periodically rotate statistics for quicker adaptation
+	 * to workload changes.
+	 */
+	if (recent->dump_count > VY_RECENT_DUMP_COUNT) {
+		recent->dump_count /= 2;
+		recent->dump_input /= 2;
+		recent->compaction_time /= 2;
+	}
+}
diff --git a/src/box/vy_regulator.h b/src/box/vy_regulator.h
index 0188da26..341f41df 100644
--- a/src/box/vy_regulator.h
+++ b/src/box/vy_regulator.h
@@ -35,6 +35,8 @@
 #include <stddef.h>
 #include <tarantool_ev.h>
 
+#include "vy_stat.h"
+
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
@@ -107,6 +109,16 @@ struct vy_regulator {
 	 * but vy_regulator_dump_complete() hasn't been called yet.
 	 */
 	bool dump_in_progress;
+	/**
+	 * Snapshot of scheduler statistics taken at the time of
+	 * the last rate limit update.
+	 */
+	struct vy_scheduler_stat sched_stat_last;
+	/**
+	 * Scheduler statistics for the most recent few dumps.
+	 * Used for calculating the rate limit.
+	 */
+	struct vy_scheduler_stat sched_stat_recent;
 };
 
 void
@@ -146,6 +158,21 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
 void
 vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max);
 
+/**
+ * Called when global statistics are reset by box.stat.reset().
+ */
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator);
+
+/**
+ * Set transaction rate limit so as to ensure that compaction
+ * will keep up with dumps.
+ */
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+			       const struct vy_scheduler_stat *stat,
+			       int compaction_threads);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH v2 2/8] vinyl: fix compaction priority calculation
  2019-01-24 17:12 ` [PATCH v2 2/8] vinyl: fix compaction priority calculation Vladimir Davydov
@ 2019-02-08 17:18   ` Konstantin Osipov
  2019-02-11 14:42     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-08 17:18 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority
  2019-01-24 17:12 ` [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority Vladimir Davydov
@ 2019-02-08 17:19   ` Konstantin Osipov
  2019-02-08 17:26     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-08 17:19 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> The name 'range_heap' is ambiguous, because it's unclear what range
> should be on top of the heap. We need to introduce another heap of
> ranges ordered differently, so let's rename to max_compaction_priority
> to avoid confusion.

I think the old name was better. And I think we could avoid
introducing another heap ;)


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority
  2019-02-08 17:19   ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-08 17:26     ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-08 17:26 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Fri, Feb 08, 2019 at 08:19:46PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> > The name 'range_heap' is ambiguous, because it's unclear what range
> > should be on top of the heap. We need to introduce another heap of
> > ranges ordered differently, so let's rename to max_compaction_priority
> > to avoid confusion.
> 
> I think the old name was better. And I think we could avoid
> introducing another heap ;)

Yep, I removed the heap in the new version and left the old name.
See the branch pls.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree
  2019-01-24 17:12 ` [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
@ 2019-02-08 17:42   ` Vladimir Davydov
  2019-02-11 18:17     ` [tarantool-patches] " Konstantin Osipov
  2019-02-11 18:17   ` Konstantin Osipov
  1 sibling, 1 reply; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-08 17:42 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Jan 24, 2019 at 08:12:40PM +0300, Vladimir Davydov wrote:
> This patch adds dumps_per_compaction metric to per index statistics. It
> shows the number of dumps it takes to trigger a major compaction of a
> range in a given LSM tree. We need it to automatically choose the
> optimal number of ranges that would smooth out the load generated by
> range compaction.
> 
> To calculate this metric, we assign dump_count to each run. It shows how
> many dumps it took to create the run. If a run was created by a memory
> dump, it is set to 1. If a run was created by a minor compaction, it is
> set to the sum of dump counts of compacted ranges. If a run was created
> by a major compaction, it is set to the sum of dump counts of compacted
> ranges minus dump count of the last level run. The dump_count is stored
> in vylog.
> 
> This allows us to estimate the number of dumps that triggers compaction
> in a range as dump_count of the last level run stored in the range.
> Finally, we report dumps_per_compaction of an LSM tree as the minimal
> dumps_per_compaction among all ranges constituting the tree. To achieve
> that, we maintain a heap of ranges per each LSM tree ordered by
> dumps_per_compaction.
> 
> Needed for #3944

Rewritten this patch without using a heap - see below.

From cce2fbba22eeeed78b05f76dc9e7d22ef218f3e9 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Wed, 6 Feb 2019 19:23:58 +0300
Subject: [PATCH] vinyl: keep track of dumps per compaction for each LSM tree

This patch adds a dumps_per_compaction metric to per-index statistics.
It shows the number of dumps it takes to trigger a major compaction of
a range in a given LSM tree. We need it to automatically choose the
optimal number of ranges that would smooth out the load generated by
range compaction.

To calculate this metric, we assign a dump_count to each run. It shows
how many dumps it took to create the run. If a run was created by a
memory dump, it is set to 1. If a run was created by a minor
compaction, it is set to the sum of the dump counts of the compacted
runs. If a run was created by a major compaction, it is set to the sum
of the dump counts of the compacted runs minus the dump count of the
last level run. The dump_count is stored in vylog.

This allows us to estimate the number of dumps that triggers compaction
in a range as the dump_count of the last level run stored in the range.
Finally, we report the dumps_per_compaction of an LSM tree as the
average dumps_per_compaction among all ranges constituting the tree.
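
A hypothetical sequence to make the accounting concrete: four memory
dumps produce four runs with dump_count = 1; a minor compaction of the
first three produces a run with dump_count = 3; a major compaction of
that (last level) run together with the fourth dump then yields
(3 + 1) - 3 = 1.

#include <stdio.h>

int main(void)
{
	/* Toy model of the dump_count accounting described above. */
	int runs[] = {1, 1, 1, 1};  /* four memory dumps */

	/* Minor compaction of the first three runs:
	 * sum of the dump counts of the compacted runs. */
	int minor = runs[0] + runs[1] + runs[2];  /* = 3 */

	/* Major compaction of the minor-compaction output (the
	 * last level run) with the fourth dump: the sum minus
	 * the last level run's own dump count. */
	int major = (minor + runs[3]) - minor;    /* = 1 */

	printf("minor: %d, major: %d\n", minor, major);
	return 0;
}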

Needed for #3944

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 065a309f..e1ff65ae 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -463,6 +463,8 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
 	info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
 	histogram_snprint(buf, sizeof(buf), lsm->run_hist);
 	info_append_str(h, "run_histogram", buf);
+	info_append_int(h, "dumps_per_compaction",
+			vy_lsm_dumps_per_compaction(lsm));
 
 	info_end(h);
 }
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index f94b60ff..06ab7247 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -84,6 +84,7 @@ enum vy_log_key {
 	VY_LOG_KEY_MODIFY_LSN		= 13,
 	VY_LOG_KEY_DROP_LSN		= 14,
 	VY_LOG_KEY_GROUP_ID		= 15,
+	VY_LOG_KEY_DUMP_COUNT		= 16,
 };
 
 /** vy_log_key -> human readable name. */
@@ -104,6 +105,7 @@ static const char *vy_log_key_name[] = {
 	[VY_LOG_KEY_MODIFY_LSN]		= "modify_lsn",
 	[VY_LOG_KEY_DROP_LSN]		= "drop_lsn",
 	[VY_LOG_KEY_GROUP_ID]		= "group_id",
+	[VY_LOG_KEY_DUMP_COUNT]		= "dump_count",
 };
 
 /** vy_log_type -> human readable name. */
@@ -285,6 +287,10 @@ vy_log_record_snprint(char *buf, int size, const struct vy_log_record *record)
 		SNPRINT(total, snprintf, buf, size, "%s=%"PRIi64", ",
 			vy_log_key_name[VY_LOG_KEY_GC_LSN],
 			record->gc_lsn);
+	if (record->dump_count > 0)
+		SNPRINT(total, snprintf, buf, size, "%s=%"PRIu32", ",
+			vy_log_key_name[VY_LOG_KEY_DUMP_COUNT],
+			record->dump_count);
 	SNPRINT(total, snprintf, buf, size, "}");
 	return total;
 }
@@ -411,6 +417,11 @@ vy_log_record_encode(const struct vy_log_record *record,
 		size += mp_sizeof_uint(record->gc_lsn);
 		n_keys++;
 	}
+	if (record->dump_count > 0) {
+		size += mp_sizeof_uint(VY_LOG_KEY_DUMP_COUNT);
+		size += mp_sizeof_uint(record->dump_count);
+		n_keys++;
+	}
 	size += mp_sizeof_map(n_keys);
 
 	/*
@@ -493,6 +504,10 @@ vy_log_record_encode(const struct vy_log_record *record,
 		pos = mp_encode_uint(pos, VY_LOG_KEY_GC_LSN);
 		pos = mp_encode_uint(pos, record->gc_lsn);
 	}
+	if (record->dump_count > 0) {
+		pos = mp_encode_uint(pos, VY_LOG_KEY_DUMP_COUNT);
+		pos = mp_encode_uint(pos, record->dump_count);
+	}
 	assert(pos == tuple + size);
 
 	/*
@@ -621,6 +636,9 @@ vy_log_record_decode(struct vy_log_record *record,
 		case VY_LOG_KEY_GC_LSN:
 			record->gc_lsn = mp_decode_uint(&pos);
 			break;
+		case VY_LOG_KEY_DUMP_COUNT:
+			record->dump_count = mp_decode_uint(&pos);
+			break;
 		default:
 			mp_next(&pos); /* unknown key, ignore */
 			break;
@@ -1593,6 +1611,7 @@ vy_recovery_do_create_run(struct vy_recovery *recovery, int64_t run_id)
 	run->id = run_id;
 	run->dump_lsn = -1;
 	run->gc_lsn = -1;
+	run->dump_count = 0;
 	run->is_incomplete = false;
 	run->is_dropped = false;
 	run->data = NULL;
@@ -1647,7 +1666,7 @@ vy_recovery_prepare_run(struct vy_recovery *recovery, int64_t lsm_id,
  */
 static int
 vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
-		       int64_t run_id, int64_t dump_lsn)
+		       int64_t run_id, int64_t dump_lsn, uint32_t dump_count)
 {
 	struct vy_lsm_recovery_info *lsm;
 	lsm = vy_recovery_lookup_lsm(recovery, lsm_id);
@@ -1672,6 +1691,7 @@ vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
 			return -1;
 	}
 	run->dump_lsn = dump_lsn;
+	run->dump_count = dump_count;
 	run->is_incomplete = false;
 	rlist_move_entry(&lsm->runs, run, in_lsm);
 	return 0;
@@ -2033,7 +2053,8 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 		break;
 	case VY_LOG_CREATE_RUN:
 		rc = vy_recovery_create_run(recovery, record->lsm_id,
-					    record->run_id, record->dump_lsn);
+					    record->run_id, record->dump_lsn,
+					    record->dump_count);
 		break;
 	case VY_LOG_DROP_RUN:
 		rc = vy_recovery_drop_run(recovery, record->run_id,
@@ -2383,6 +2404,7 @@ vy_log_append_lsm(struct xlog *xlog, struct vy_lsm_recovery_info *lsm)
 		} else {
 			record.type = VY_LOG_CREATE_RUN;
 			record.dump_lsn = run->dump_lsn;
+			record.dump_count = run->dump_count;
 		}
 		record.lsm_id = lsm->id;
 		record.run_id = run->id;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 70e25245..ee38c193 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -96,7 +96,7 @@ enum vy_log_record_type {
 	VY_LOG_PREPARE_RUN		= 4,
 	/**
 	 * Commit a vinyl run file creation.
-	 * Requires vy_log_record::lsm_id, run_id, dump_lsn.
+	 * Requires vy_log_record::lsm_id, run_id, dump_lsn, dump_count.
 	 *
 	 * Written after a run file was successfully created.
 	 */
@@ -271,6 +271,8 @@ struct vy_log_record {
 	 * that uses this run.
 	 */
 	int64_t gc_lsn;
+	/** For runs: number of dumps it took to create the run. */
+	uint32_t dump_count;
 	/** Link in vy_log::tx. */
 	struct stailq_entry in_tx;
 };
@@ -389,6 +391,8 @@ struct vy_run_recovery_info {
 	 * that uses this run.
 	 */
 	int64_t gc_lsn;
+	/** Number of dumps it took to create the run. */
+	uint32_t dump_count;
 	/**
 	 * True if the run was not committed (there's
 	 * VY_LOG_PREPARE_RUN, but no VY_LOG_CREATE_RUN).
@@ -710,7 +714,8 @@ vy_log_prepare_run(int64_t lsm_id, int64_t run_id)
 
 /** Helper to log a vinyl run creation. */
 static inline void
-vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
+vy_log_create_run(int64_t lsm_id, int64_t run_id,
+		  int64_t dump_lsn, uint32_t dump_count)
 {
 	struct vy_log_record record;
 	vy_log_record_init(&record);
@@ -718,6 +723,7 @@ vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
 	record.lsm_id = lsm_id;
 	record.run_id = run_id;
 	record.dump_lsn = dump_lsn;
+	record.dump_count = dump_count;
 	vy_log_write(&record);
 }
 
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 570e783a..6b70cd75 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -353,6 +353,7 @@ vy_lsm_recover_run(struct vy_lsm *lsm, struct vy_run_recovery_info *run_info,
 		return NULL;
 
 	run->dump_lsn = run_info->dump_lsn;
+	run->dump_count = run_info->dump_count;
 	if (vy_run_recover(run, lsm->env->path,
 			   lsm->space_id, lsm->index_id) != 0 &&
 	    (!force_recovery ||
@@ -638,6 +639,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
 					    (long long)range->id));
 			return -1;
 		}
+		vy_range_update_dumps_per_compaction(range);
 		vy_lsm_acct_range(lsm, range);
 	}
 	if (prev == NULL) {
@@ -753,6 +755,7 @@ void
 vy_lsm_acct_range(struct vy_lsm *lsm, struct vy_range *range)
 {
 	histogram_collect(lsm->run_hist, range->slice_count);
+	lsm->sum_dumps_per_compaction += range->dumps_per_compaction;
 	vy_disk_stmt_counter_add(&lsm->stat.disk.compaction.queue,
 				 &range->compaction_queue);
 	lsm->env->compaction_queue_size += range->compaction_queue.bytes;
@@ -770,6 +773,7 @@ void
 vy_lsm_unacct_range(struct vy_lsm *lsm, struct vy_range *range)
 {
 	histogram_discard(lsm->run_hist, range->slice_count);
+	lsm->sum_dumps_per_compaction -= range->dumps_per_compaction;
 	vy_disk_stmt_counter_sub(&lsm->stat.disk.compaction.queue,
 				 &range->compaction_queue);
 	lsm->env->compaction_queue_size -= range->compaction_queue.bytes;
@@ -1078,6 +1082,7 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
 		}
 		part->needs_compaction = range->needs_compaction;
 		vy_range_update_compaction_priority(part, &lsm->opts);
+		vy_range_update_dumps_per_compaction(part);
 	}
 
 	/*
@@ -1195,6 +1200,7 @@ vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
 	 * as it fits the configured LSM tree shape.
 	 */
 	vy_range_update_compaction_priority(result, &lsm->opts);
+	vy_range_update_dumps_per_compaction(result);
 	vy_lsm_acct_range(lsm, result);
 	vy_lsm_add_range(lsm, result);
 	lsm->range_tree_version++;
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 74033627..eb7cbbf0 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -251,6 +251,8 @@ struct vy_lsm {
 	vy_range_tree_t *tree;
 	/** Number of ranges in this LSM tree. */
 	int range_count;
+	/** Sum dumps_per_compaction across all ranges. */
+	int sum_dumps_per_compaction;
 	/** Heap of ranges, prioritized by compaction_priority. */
 	heap_t range_heap;
 	/**
@@ -351,6 +353,16 @@ vy_lsm_is_empty(struct vy_lsm *lsm)
 }
 
 /**
+ * Return the average number of dumps it takes to trigger major
+ * compaction of a range in this LSM tree.
+ */
+static inline int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
+{
+	return lsm->sum_dumps_per_compaction / lsm->range_count;
+}
+
+/**
  * Increment the reference counter of an LSM tree.
  * An LSM tree cannot be deleted if its reference
  * counter is elevated.
@@ -464,8 +476,8 @@ vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range);
  * Account a range in an LSM tree.
  *
  * This function updates the following LSM tree statistics:
- *  - vy_lsm::run_hist after a slice is added to or removed from
- *    a range of the LSM tree.
+ *  - vy_lsm::run_hist and vy_lsm::sum_dumps_per_compaction after
+ *    a slice is added to or removed from a range of the LSM tree.
  *  - vy_lsm::stat::disk::compaction::queue after compaction priority
  *    of a range is updated.
  *  - vy_lsm::stat::disk::last_level_count after a range is compacted.
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 7211cfb2..19ada26b 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -411,6 +411,18 @@ vy_range_update_compaction_priority(struct vy_range *range,
 	}
 }
 
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range)
+{
+	if (!rlist_empty(&range->slices)) {
+		struct vy_slice *slice = rlist_last_entry(&range->slices,
+						struct vy_slice, in_range);
+		range->dumps_per_compaction = slice->run->dump_count;
+	} else {
+		range->dumps_per_compaction = 0;
+	}
+}
+
 /**
  * Return true and set split_key accordingly if the range needs to be
  * split in two.
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 05195d08..0b3a78c3 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -119,6 +119,11 @@ struct vy_range {
 	bool needs_compaction;
 	/** Number of times the range was compacted. */
 	int n_compactions;
+	/**
+	 * Number of dumps it takes to trigger major compaction in
+	 * this range, see vy_run::dump_count for more details.
+	 */
+	int dumps_per_compaction;
 	/** Link in vy_lsm->tree. */
 	rb_node(struct vy_range) tree_node;
 	/** Link in vy_lsm->range_heap. */
@@ -243,6 +248,12 @@ vy_range_update_compaction_priority(struct vy_range *range,
 				    const struct index_opts *opts);
 
 /**
+ * Update the value of range->dumps_per_compaction.
+ */
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range);
+
+/**
  * Check if a range needs to be split in two.
  *
  * @param range             The range.
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 990daffa..28fd6a50 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -130,6 +130,21 @@ struct vy_run {
 	/** Max LSN stored on disk. */
 	int64_t dump_lsn;
 	/**
+	 * Number of dumps it took to create this run.
+	 *
+	 * If the run was produced by a memory dump, it is 1.
+	 * If the run was produced by a minor compaction, it
+	 * is the sum of the dump counts of the compacted runs.
+	 * If the run was produced by a major compaction, it
+	 * is the sum of the dump counts of the compacted runs
+	 * minus the dump count of the last (greatest) run.
+	 *
+	 * This way, by looking at the last level run in an LSM
+	 * tree, we can tell how many dumps it took to compact
+	 * it last time.
+	 */
+	uint32_t dump_count;
+	/**
 	 * Run reference counter, the run is deleted once it hits 0.
 	 * A new run is created with the reference counter set to 1.
 	 * A run is referenced by each slice created for it and each
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 5ec6d171..5c53b423 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1193,7 +1193,7 @@ vy_task_dump_complete(struct vy_task *task)
 	 * Log change in metadata.
 	 */
 	vy_log_tx_begin();
-	vy_log_create_run(lsm->id, new_run->id, dump_lsn);
+	vy_log_create_run(lsm->id, new_run->id, dump_lsn, new_run->dump_count);
 	for (range = begin_range, i = 0; range != end_range;
 	     range = vy_range_tree_next(lsm->tree, range), i++) {
 		assert(i < lsm->range_count);
@@ -1226,6 +1226,7 @@ vy_task_dump_complete(struct vy_task *task)
 		vy_lsm_unacct_range(lsm, range);
 		vy_range_add_slice(range, slice);
 		vy_range_update_compaction_priority(range, &lsm->opts);
+		vy_range_update_dumps_per_compaction(range);
 		vy_lsm_acct_range(lsm, range);
 	}
 	vy_range_heap_update_all(&lsm->range_heap);
@@ -1396,6 +1397,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 	if (new_run == NULL)
 		goto err_run;
 
+	new_run->dump_count = 1;
 	new_run->dump_lsn = dump_lsn;
 
 	/*
@@ -1528,7 +1530,8 @@ vy_task_compaction_complete(struct vy_task *task)
 	rlist_foreach_entry(run, &unused_runs, in_unused)
 		vy_log_drop_run(run->id, gc_lsn);
 	if (new_slice != NULL) {
-		vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn);
+		vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn,
+				  new_run->dump_count);
 		vy_log_insert_slice(range->id, new_run->id, new_slice->id,
 				    tuple_data_or_null(new_slice->begin),
 				    tuple_data_or_null(new_slice->end));
@@ -1589,6 +1592,7 @@ vy_task_compaction_complete(struct vy_task *task)
 	}
 	range->n_compactions++;
 	vy_range_update_compaction_priority(range, &lsm->opts);
+	vy_range_update_dumps_per_compaction(range);
 	vy_lsm_acct_range(lsm, range);
 	vy_lsm_acct_compaction(lsm, compaction_time,
 			       &compaction_input, &compaction_output);
@@ -1693,12 +1697,14 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 		goto err_wi;
 
 	struct vy_slice *slice;
+	int32_t dump_count = 0;
 	int n = range->compaction_priority;
 	rlist_foreach_entry(slice, &range->slices, in_range) {
 		if (vy_write_iterator_new_slice(wi, slice) != 0)
 			goto err_wi_sub;
 		new_run->dump_lsn = MAX(new_run->dump_lsn,
 					slice->run->dump_lsn);
+		dump_count += slice->run->dump_count;
 		/* Remember the slices we are compacting. */
 		if (task->first_slice == NULL)
 			task->first_slice = slice;
@@ -1709,6 +1715,17 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 	}
 	assert(n == 0);
 	assert(new_run->dump_lsn >= 0);
+	if (range->compaction_priority == range->slice_count)
+		dump_count -= slice->run->dump_count;
+	/*
+	 * Do not update dumps_per_compaction in case compaction
+	 * was triggered manually to avoid unexpected side effects,
+	 * such as splitting/coalescing ranges for no good reason.
+	 */
+	if (range->needs_compaction)
+		new_run->dump_count = slice->run->dump_count;
+	else
+		new_run->dump_count = dump_count;
 
 	range->needs_compaction = false;
 
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 3be2bb91..6d58f747 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -139,7 +139,7 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {2: 8, 9: 20}]
+          tuple: [5, {2: 8, 16: 1, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -164,7 +164,7 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 6, 9: 20}]
+          tuple: [5, {0: 2, 2: 6, 16: 1, 9: 20}]
       - HEADER:
           type: INSERT
         BODY:
@@ -204,7 +204,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 10, 9: 23}]
+          tuple: [5, {0: 2, 2: 10, 16: 1, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -224,7 +224,7 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {2: 12, 9: 23}]
+          tuple: [5, {2: 12, 16: 1, 9: 23}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index 419d3e6c..e79c32f0 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,7 +129,8 @@ test_run:cmd("setopt delimiter ''");
 -- initially stats are empty
 istat()
 ---
-- rows: 0
+- dumps_per_compaction: 0
+  rows: 0
   run_avg: 0
   bytes: 0
   upsert:
@@ -294,10 +295,7 @@ wait(istat, st, 'disk.dump.count', 1)
 ...
 stat_diff(istat(), st)
 ---
-- rows: 25
-  run_avg: 1
-  run_count: 1
-  disk:
+- disk:
     last_level:
       bytes: 26049
       pages: 7
@@ -321,6 +319,10 @@ stat_diff(istat(), st)
     pages: 7
     bytes_compressed: <bytes_compressed>
     bloom_size: 70
+  rows: 25
+  run_avg: 1
+  run_count: 1
+  dumps_per_compaction: 1
   bytes: 26049
   put:
     rows: 25
@@ -998,7 +1000,8 @@ box.stat.reset()
 ...
 istat()
 ---
-- rows: 306
+- dumps_per_compaction: 1
+  rows: 306
   run_avg: 1
   bytes: 317731
   upsert:
@@ -1732,6 +1735,138 @@ box.stat.vinyl().disk.data_compacted
 ---
 - 0
 ...
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function dump(a, b)
+    for i = a, b do
+        s:replace{i, digest.urandom(100)}
+    end
+    box.snapshot()
+end;
+---
+...
+function wait_compaction(count)
+    test_run:wait_cond(function()
+        return i:stat().disk.compaction.count == count
+    end, 10)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+dump(1, 100)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 100) -- compaction
+---
+...
+dump(1, 100) -- split + compaction
+---
+...
+wait_compaction(3)
+---
+...
+i:stat().range_count -- 2
+---
+- 2
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 10)
+---
+...
+dump(1, 40) -- compaction in range 1
+---
+...
+wait_compaction(4)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(90, 100)
+---
+...
+dump(60, 100) -- compaction in range 2
+---
+...
+wait_compaction(5)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+---
+...
+i:compact()
+---
+...
+wait_compaction(7)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+test_run:cmd('restart server test')
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+s = box.space.test
+---
+...
+i = s.index.primary
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+---
+- true
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+s:drop()
+---
+...
 test_run:cmd('switch default')
 ---
 - true
diff --git a/test/vinyl/stat.test.lua b/test/vinyl/stat.test.lua
index 4a955682..4a360f33 100644
--- a/test/vinyl/stat.test.lua
+++ b/test/vinyl/stat.test.lua
@@ -528,6 +528,69 @@ s:drop()
 
 box.stat.vinyl().disk.data_compacted
 
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+
+test_run:cmd("setopt delimiter ';'")
+function dump(a, b)
+    for i = a, b do
+        s:replace{i, digest.urandom(100)}
+    end
+    box.snapshot()
+end;
+function wait_compaction(count)
+    test_run:wait_cond(function()
+        return i:stat().disk.compaction.count == count
+    end, 10)
+end;
+test_run:cmd("setopt delimiter ''");
+
+dump(1, 100)
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 100) -- compaction
+dump(1, 100) -- split + compaction
+wait_compaction(3)
+i:stat().range_count -- 2
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 10)
+dump(1, 40) -- compaction in range 1
+wait_compaction(4)
+i:stat().dumps_per_compaction -- 1
+
+dump(90, 100)
+dump(60, 100) -- compaction in range 2
+wait_compaction(5)
+i:stat().dumps_per_compaction -- 2
+
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+i:compact()
+wait_compaction(7)
+i:stat().dumps_per_compaction -- 2
+
+test_run:cmd('restart server test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+s = box.space.test
+i = s.index.primary
+
+i:stat().dumps_per_compaction -- 2
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+box.snapshot()
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+
+i:stat().dumps_per_compaction -- 1
+
+s:drop()
+
 test_run:cmd('switch default')
 test_run:cmd('stop server test')
 test_run:cmd('cleanup server test')

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 2/8] vinyl: fix compaction priority calculation
  2019-02-08 17:18   ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-11 14:42     ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-11 14:42 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Fri, Feb 08, 2019 at 08:18:56PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> 
> OK to push.

Pushed to 2.1 and 1.10.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree
  2019-01-24 17:12 ` [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
  2019-02-08 17:42   ` Vladimir Davydov
@ 2019-02-11 18:17   ` Konstantin Osipov
  2019-02-12 10:15     ` Vladimir Davydov
  1 sibling, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-11 18:17 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> This patch adds dumps_per_compaction metric to per index statistics. It
> shows the number of dumps it takes to trigger a major compaction of a
> range in a given LSM tree. We need it to automatically choose the
> optimal number of ranges that would smooth out the load generated by
> range compaction.

The version of this patch which is on the branch is OK to push
(adding average instead of heap).



-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [tarantool-patches] Re: [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree
  2019-02-08 17:42   ` Vladimir Davydov
@ 2019-02-11 18:17     ` Konstantin Osipov
  0 siblings, 0 replies; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-11 18:17 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/08 20:42]:
> On Thu, Jan 24, 2019 at 08:12:40PM +0300, Vladimir Davydov wrote:
> > This patch adds a dumps_per_compaction metric to per-index statistics.
> > It shows the number of dumps it takes to trigger a major compaction of
> > a range in a given LSM tree. We need it to automatically choose the
> > optimal number of ranges that would smooth out the load generated by
> > range compaction.
> > 
> > To calculate this metric, we assign a dump_count to each run. It shows
> > how many dumps it took to create the run. If a run was created by a
> > memory dump, it is set to 1. If a run was created by a minor compaction,
> > it is set to the sum of the dump counts of the compacted runs. If a run
> > was created by a major compaction, it is set to the sum of the dump
> > counts of the compacted runs minus the dump count of the last-level run.
> > The dump_count is stored in vylog.
> > 
> > This allows us to estimate the number of dumps that triggers compaction
> > in a range as the dump_count of the last-level run stored in the range.
> > Finally, we report the dumps_per_compaction of an LSM tree as the
> > minimum dumps_per_compaction among all ranges constituting the tree. To
> > achieve that, we maintain a heap of ranges for each LSM tree, ordered by
> > dumps_per_compaction.
> > 
> > Needed for #3944
> 
> Rewritten this patch without using a heap - see below.

OK to push.
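
For illustration, the dump_count rule quoted above boils down to the
following sketch. The helper name, signature, and run ordering are
assumptions made for this example; the actual bookkeeping in the patch
may be structured differently:

#include <stdbool.h>
#include <stdint.h>

struct vy_run {
	uint32_t dump_count;	/* stand-in for the real struct vy_run */
};

/*
 * Hypothetical helper: compute the dump_count of a run produced by
 * compacting runs[0..run_count-1], assuming runs[run_count - 1] is
 * the last-level run. For a major compaction, the old last-level
 * run's dumps were already reported by the previous major compaction,
 * so they are subtracted.
 */
static uint32_t
vy_new_run_dump_count(struct vy_run **runs, int run_count, bool is_major)
{
	uint32_t dump_count = 0;
	for (int i = 0; i < run_count; i++)
		dump_count += runs[i]->dump_count;
	if (is_major)
		dump_count -= runs[run_count - 1]->dump_count;
	return dump_count;
}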


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* [tarantool-patches] Re: [PATCH v2 5/8] vinyl: set range size automatically
  2019-01-24 17:12 ` [PATCH v2 5/8] vinyl: set range size automatically Vladimir Davydov
@ 2019-02-11 18:21   ` Konstantin Osipov
  2019-02-12 10:16     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-11 18:21 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:

OK to push (reviewed the patch which is on the branch).


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* [tarantool-patches] Re: [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes
  2019-01-24 17:12 ` [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
@ 2019-02-11 18:21   ` Konstantin Osipov
  2019-02-12 10:16     ` Vladimir Davydov
  0 siblings, 1 reply; 22+ messages in thread
From: Konstantin Osipov @ 2019-02-11 18:21 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> Since all ranges constituting an LSM tree have the same configuration,
> they tend to get compacted at approximately the same time. This entails
> IO load spikes, which, in turn, lead to deviation of the LSM tree from
> the target shape and hence increased read amplification. To prevent this
> from happening, this patch implements compaction randomization: with 10%
> probability we defer compaction at each LSM tree level, i.e. if the
> number of runs at a level exceeds the configured run_count_per_level,
> the level will be compacted with 90% probability, but with 10%
> probability it won't - compaction will be deferred until another run
> is added to the level.
> 
> Our simulations show that such a simple algorithm performs fairly well:
> it randomizes compaction pace among ranges, spreading IO load evenly in
> time, while write amplification increases by no more than 5-10%,
> which seems a reasonable price for eliminating IO load spikes.
> 
> Closes #3944

OK to push.
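
A minimal sketch of the decision rule described above, with a
hypothetical helper name (the in-tree implementation may roll the dice
differently):

#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper: once a level exceeds run_count_per_level,
 * compact it with 90% probability; with 10% probability the decision
 * is deferred until the next run is added to the level.
 */
static bool
vy_level_needs_compaction(int run_count, int run_count_per_level)
{
	if (run_count <= run_count_per_level)
		return false;
	return rand() % 100 < 90;
}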



-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [tarantool-patches] Re: [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree
  2019-02-11 18:17   ` Konstantin Osipov
@ 2019-02-12 10:15     ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-12 10:15 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Feb 11, 2019 at 09:17:25PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> > This patch adds a dumps_per_compaction metric to per-index statistics.
> > It shows the number of dumps it takes to trigger a major compaction of
> > a range in a given LSM tree. We need it to automatically choose the
> > optimal number of ranges that would smooth out the load generated by
> > range compaction.
> 
> The version of this patch that is on the branch is OK to push
> (it uses an average instead of a heap).

Pushed to 2.1 and 1.10.

* Re: [tarantool-patches] Re: [PATCH v2 5/8] vinyl: set range size automatically
  2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-12 10:16     ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-12 10:16 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Feb 11, 2019 at 09:21:11PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> 
> OK to push (reviewed the patch which is on the branch).

Pushed to 2.1 and 1.10.

* Re: [tarantool-patches] Re: [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes
  2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-12 10:16     ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-12 10:16 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Feb 11, 2019 at 09:21:57PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/24 20:16]:
> > Since all ranges constituting an LSM tree have the same configuration,
> > they tend to get compacted at approximately the same time. This entails
> > IO load spikes, which, in turn, lead to deviation of the LSM tree from
> > the target shape and hence increased read amplification. To prevent this
> > from happening, this patch implements compaction randomization: with 10%
> > probability we defer compaction at each LSM tree level, i.e. if the
> > number of runs at a level exceeds the configured run_count_per_level,
> > the level will be compacted with 90% probability, but with 10%
> > probability it won't - compaction will be deferred until another run
> > is added to the level.
> > 
> > Our simulations show that such a simple algorithm performs fairly well:
> > it randomizes compaction pace among ranges, spreading IO load evenly in
> > time, while write amplification increases by no more than 5-10%,
> > which seems a reasonable price for eliminating IO load spikes.
> > 
> > Closes #3944
> 
> OK to push.

Pushed to 2.1 and 1.10.

* Re: [PATCH v2 7/8] vinyl: introduce quota consumer types
  2019-01-24 17:12 ` [PATCH v2 7/8] vinyl: introduce quota consumer types Vladimir Davydov
@ 2019-02-12 15:48   ` Vladimir Davydov
  0 siblings, 0 replies; 22+ messages in thread
From: Vladimir Davydov @ 2019-02-12 15:48 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Jan 24, 2019 at 08:12:43PM +0300, Vladimir Davydov wrote:
> diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
> index 07cd5856..20d322de 100644
> --- a/src/box/vy_quota.c
> +++ b/src/box/vy_quota.c
> @@ -52,6 +52,43 @@
>  static const double VY_QUOTA_TIMER_PERIOD = 0.1;
>  
>  /**
> + * Bit mask of resources used by a particular consumer type.
> + */
> +static unsigned
> +vy_quota_consumer_resource_map[] = {
> +	/**
> +	 * Transaction throttling pursues two goals. First, it is
> +	 * capping memory consumption rate so that the hard memory
> +	 * limit will not be hit before memory dump has completed
> +	 * (memory-based throttling). Second, we must make sure
> +	 * that compaction jobs keep up with dumps to keep read and
> +	 * space amplification within bounds (disk-based throttling).
> +	 * Transactions ought to respect them both.
> +	 */
> +	[VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
> +				 (1 << VY_QUOTA_RESOURCE_MEMORY),
> +	/**
> +	 * Compaction jobs may need some quota too, because they
> +	 * may generate deferred DELETEs for secondary indexes.
> +	 * Apparently, we must not impose the rate limit that is
> +	 * supposed to speed up compaction on them (disk-based),
> +	 * however they still have to respect memory-based throttling
> +	 * to avoid long stalls.
> +	 */
> +	[VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
> +};
> +
> +/**
> + * Iterate over rate limit states that are enforced for a consumer
> + * of the given type.
> + */
> +#define vy_quota_consumer_for_each_rate_limit(quota, type, rl) \
> +	for (struct vy_rate_limit *rl = (quota)->rate_limit; \
> +	     rl - (quota)->rate_limit < vy_quota_resource_type_MAX; rl++) \
> +		if (vy_quota_consumer_resource_map[type] & \
> +		    (1 << (rl - (quota)->rate_limit)))
> +

Kostja: This macro looks confusing, let's inline it.
I: OK, here we go:

diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 20d322de..dc452adb 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -79,14 +79,16 @@ vy_quota_consumer_resource_map[] = {
 };
 
 /**
- * Iterate over rate limit states that are enforced for a consumer
- * of the given type.
+ * Check if the rate limit corresponding to resource @resource_type
+ * should be applied to a consumer of type @consumer_type.
  */
-#define vy_quota_consumer_for_each_rate_limit(quota, type, rl) \
-	for (struct vy_rate_limit *rl = (quota)->rate_limit; \
-	     rl - (quota)->rate_limit < vy_quota_resource_type_MAX; rl++) \
-		if (vy_quota_consumer_resource_map[type] & \
-		    (1 << (rl - (quota)->rate_limit)))
+static inline bool
+vy_rate_limit_is_applicable(enum vy_quota_consumer_type consumer_type,
+			    enum vy_quota_resource_type resource_type)
+{
+	return (vy_quota_consumer_resource_map[consumer_type] &
+					(1 << resource_type)) != 0;
+}
 
 /**
  * Return true if the requested amount of memory may be consumed
@@ -106,8 +108,10 @@ vy_quota_may_use(struct vy_quota *q, enum vy_quota_consumer_type type,
 		q->quota_exceeded_cb(q);
 		return false;
 	}
-	vy_quota_consumer_for_each_rate_limit(q, type, rl) {
-		if (!vy_rate_limit_may_use(rl))
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++) {
+		struct vy_rate_limit *rl = &q->rate_limit[i];
+		if (vy_rate_limit_is_applicable(type, i) &&
+		    !vy_rate_limit_may_use(rl))
 			return false;
 	}
 	return true;
@@ -121,8 +125,11 @@ vy_quota_do_use(struct vy_quota *q, enum vy_quota_consumer_type type,
 		size_t size)
 {
 	q->used += size;
-	vy_quota_consumer_for_each_rate_limit(q, type, rl)
-		vy_rate_limit_use(rl, size);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++) {
+		struct vy_rate_limit *rl = &q->rate_limit[i];
+		if (vy_rate_limit_is_applicable(type, i))
+			vy_rate_limit_use(rl, size);
+	}
 }
 
 /**
@@ -135,8 +142,11 @@ vy_quota_do_unuse(struct vy_quota *q, enum vy_quota_consumer_type type,
 {
 	assert(q->used >= size);
 	q->used -= size;
-	vy_quota_consumer_for_each_rate_limit(q, type, rl)
-		vy_rate_limit_unuse(rl, size);
+	for (int i = 0; i < vy_quota_resource_type_MAX; i++) {
+		struct vy_rate_limit *rl = &q->rate_limit[i];
+		if (vy_rate_limit_is_applicable(type, i))
+			vy_rate_limit_unuse(rl, size);
+	}
 }
 
 /**
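
For reference, the inlined helper reproduces exactly the mapping spelled
out in vy_quota_consumer_resource_map; an illustrative check, not part
of the patch:

#include <assert.h>

static void
vy_rate_limit_map_check(void)
{
	/* Transactions respect both memory- and disk-based limits. */
	assert(vy_rate_limit_is_applicable(VY_QUOTA_CONSUMER_TX,
					   VY_QUOTA_RESOURCE_MEMORY));
	assert(vy_rate_limit_is_applicable(VY_QUOTA_CONSUMER_TX,
					   VY_QUOTA_RESOURCE_DISK));
	/* Compaction jobs respect only the memory-based limit. */
	assert(vy_rate_limit_is_applicable(VY_QUOTA_CONSUMER_COMPACTION,
					   VY_QUOTA_RESOURCE_MEMORY));
	assert(!vy_rate_limit_is_applicable(VY_QUOTA_CONSUMER_COMPACTION,
					    VY_QUOTA_RESOURCE_DISK));
}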

Thread overview: 22+ messages
2019-01-24 17:12 [PATCH v2 0/8] vinyl: compaction randomization and throttling Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 1/8] vinyl: use uncompressed run size for range split/coalesce/compaction Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 2/8] vinyl: fix compaction priority calculation Vladimir Davydov
2019-02-08 17:18   ` [tarantool-patches] " Konstantin Osipov
2019-02-11 14:42     ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 3/8] vinyl: rename lsm->range_heap to max_compaction_priority Vladimir Davydov
2019-02-08 17:19   ` [tarantool-patches] " Konstantin Osipov
2019-02-08 17:26     ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 4/8] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
2019-02-08 17:42   ` Vladimir Davydov
2019-02-11 18:17     ` [tarantool-patches] " Konstantin Osipov
2019-02-11 18:17   ` Konstantin Osipov
2019-02-12 10:15     ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 5/8] vinyl: set range size automatically Vladimir Davydov
2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
2019-02-12 10:16     ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 6/8] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
2019-02-11 18:21   ` [tarantool-patches] " Konstantin Osipov
2019-02-12 10:16     ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 7/8] vinyl: introduce quota consumer types Vladimir Davydov
2019-02-12 15:48   ` Vladimir Davydov
2019-01-24 17:12 ` [PATCH v2 8/8] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
