* [PATCH 1/9] vinyl: update lsm->range_heap in one go on dump completion
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Upon LSM tree dump completion, we iterate over all ranges of the LSM
tree to update their priority and position in the compaction heap.
Since we typically need to update all ranges anyway, it is better to
use the update_all heap method instead of updating the heap entries
one by one.
---
src/box/vy_scheduler.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index c59bedee..5ec6d171 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1227,10 +1227,8 @@ vy_task_dump_complete(struct vy_task *task)
vy_range_add_slice(range, slice);
vy_range_update_compaction_priority(range, &lsm->opts);
vy_lsm_acct_range(lsm, range);
- if (!vy_range_is_scheduled(range))
- vy_range_heap_update(&lsm->range_heap,
- &range->heap_node);
}
+ vy_range_heap_update_all(&lsm->range_heap);
free(new_slices);
delete_mems:
--
2.11.0
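For context, a minimal self-contained sketch (not the actual salad/heap.h
implementation) of why a single update_all pass is preferable here:
rebuilding a binary heap bottom-up is O(n), whereas sifting n entries
one by one costs O(n log n).

#include <stddef.h>

struct range {
	int compaction_priority;
};

static inline int
heap_less(const struct range *a, const struct range *b)
{
	/* Max-heap: the range with the highest priority is on top. */
	return a->compaction_priority > b->compaction_priority;
}

static void
sift_down(struct range **heap, size_t count, size_t pos)
{
	for (;;) {
		size_t left = 2 * pos + 1;
		size_t right = left + 1;
		size_t best = pos;
		if (left < count && heap_less(heap[left], heap[best]))
			best = left;
		if (right < count && heap_less(heap[right], heap[best]))
			best = right;
		if (best == pos)
			break;
		struct range *tmp = heap[pos];
		heap[pos] = heap[best];
		heap[best] = tmp;
		pos = best;
	}
}

/* Analogue of vy_range_heap_update_all(): a single bottom-up pass. */
static void
heap_update_all(struct range **heap, size_t count)
{
	for (size_t i = count / 2; i-- > 0; )
		sift_down(heap, count, i);
}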
* Re: [PATCH 1/9] vinyl: update lsm->range_heap in one go on dump completion
From: Vladimir Davydov @ 2019-01-24 16:55 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:00AM +0300, Vladimir Davydov wrote:
> Upon LSM tree dump completion, we iterate over all ranges of the LSM
> tree to update their priority and the position in the compaction heap.
> Since typically we need to update all ranges, we better use update_all
> heap method instead of updating the heap entries one by one.
> ---
> src/box/vy_scheduler.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
> index c59bedee..5ec6d171 100644
> --- a/src/box/vy_scheduler.c
> +++ b/src/box/vy_scheduler.c
> @@ -1227,10 +1227,8 @@ vy_task_dump_complete(struct vy_task *task)
> vy_range_add_slice(range, slice);
> vy_range_update_compaction_priority(range, &lsm->opts);
> vy_lsm_acct_range(lsm, range);
> - if (!vy_range_is_scheduled(range))
> - vy_range_heap_update(&lsm->range_heap,
> - &range->heap_node);
> }
> + vy_range_heap_update_all(&lsm->range_heap);
> free(new_slices);
>
> delete_mems:
This one is trivial and should raise no questions.
I pushed it to 2.1 and 1.10.
* [tarantool-patches] Re: [PATCH 1/9] vinyl: update lsm->range_heap in one go on dump completion
From: Konstantin Osipov @ 2019-02-05 16:37 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> Upon LSM tree dump completion, we iterate over all ranges of the LSM
> tree to update their priority and the position in the compaction heap.
> Since typically we need to update all ranges, we better use update_all
> heap method instead of updating the heap entries one by one.
Trivial, OK to push.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
* [PATCH 2/9] vinyl: ignore unknown .run, .index and .vylog keys
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Currently, if we encounter an unknown key while parsing a .run, .index,
or .vylog file, we raise an error. As a result, adding a new key to any
of those entities would break forward compatibility, even though there's
no real reason for that. To avoid this, let's silently ignore unknown
keys, just as we do for xrow header keys.
---
src/box/vy_log.c | 9 ++-------
src/box/vy_run.c | 12 ++++--------
2 files changed, 6 insertions(+), 15 deletions(-)
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index c9e0713c..d3fa0c7a 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -620,14 +620,9 @@ vy_log_record_decode(struct vy_log_record *record,
case VY_LOG_KEY_GC_LSN:
record->gc_lsn = mp_decode_uint(&pos);
break;
- case VY_LOG_KEY_TRUNCATE_COUNT:
- /* Not used anymore, ignore. */
- break;
default:
- diag_set(ClientError, ER_INVALID_VYLOG_FILE,
- tt_sprintf("Bad record: unknown key %u",
- (unsigned)key));
- goto fail;
+ mp_next(&pos); /* unknown key, ignore */
+ break;
}
}
if (record->type == VY_LOG_CREATE_LSM) {
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index d82f1e37..cee90458 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -524,10 +524,8 @@ vy_page_info_decode(struct vy_page_info *page, const struct xrow_header *xrow,
page->row_index_offset = mp_decode_uint(&pos);
break;
default:
- diag_set(ClientError, ER_INVALID_INDEX_FILE, filename,
- tt_sprintf("Can't decode page info: "
- "unknown key %u", (unsigned)key));
- return -1;
+ mp_next(&pos); /* unknown key, ignore */
+ break;
}
}
if (key_map) {
@@ -634,10 +632,8 @@ vy_run_info_decode(struct vy_run_info *run_info,
vy_stmt_stat_decode(&run_info->stmt_stat, &pos);
break;
default:
- diag_set(ClientError, ER_INVALID_INDEX_FILE, filename,
- "Can't decode run info: unknown key %u",
- (unsigned)key);
- return -1;
+ mp_next(&pos); /* unknown key, ignore */
+ break;
}
}
if (key_map) {
--
2.11.0
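For reference, the decode pattern this patch switches to looks roughly
like the sketch below. It relies on the msgpuck calls already used in
these files (mp_decode_map, mp_decode_uint, mp_next); the record type
and key constants are illustrative stand-ins, not the real vy_log
definitions. The important part is the default branch: an unrecognized
key is skipped with mp_next() instead of aborting decoding, so files
written by a newer version remain readable.

#include <stdint.h>
#include "msgpuck.h"

enum {
	KEY_DUMP_LSN = 1,	/* illustrative key ids */
	KEY_GC_LSN = 2,
};

struct record {
	int64_t dump_lsn;
	int64_t gc_lsn;
};

static void
record_decode(struct record *record, const char *pos)
{
	uint32_t n_keys = mp_decode_map(&pos);
	for (uint32_t i = 0; i < n_keys; i++) {
		uint64_t key = mp_decode_uint(&pos);
		switch (key) {
		case KEY_DUMP_LSN:
			record->dump_lsn = mp_decode_uint(&pos);
			break;
		case KEY_GC_LSN:
			record->gc_lsn = mp_decode_uint(&pos);
			break;
		default:
			mp_next(&pos); /* unknown key, ignore */
			break;
		}
	}
}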
* Re: [PATCH 2/9] vinyl: ignore unknown .run, .index and .vylog keys
From: Vladimir Davydov @ 2019-01-24 16:56 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:01AM +0300, Vladimir Davydov wrote:
> Currently, if we encounter an unknown key while parsing a .run, .index,
> or .vylog file we raise an error. As a result, if we add a new key to
> either of those entities, we will break forward compatibility although
> there's actually no reason for that. To avoid that, let's silently
> ignore unknown keys, as we do in case of xrow header keys.
> ---
> src/box/vy_log.c | 9 ++-------
> src/box/vy_run.c | 12 ++++--------
> 2 files changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/src/box/vy_log.c b/src/box/vy_log.c
> index c9e0713c..d3fa0c7a 100644
> --- a/src/box/vy_log.c
> +++ b/src/box/vy_log.c
> @@ -620,14 +620,9 @@ vy_log_record_decode(struct vy_log_record *record,
> case VY_LOG_KEY_GC_LSN:
> record->gc_lsn = mp_decode_uint(&pos);
> break;
> - case VY_LOG_KEY_TRUNCATE_COUNT:
> - /* Not used anymore, ignore. */
> - break;
> default:
> - diag_set(ClientError, ER_INVALID_VYLOG_FILE,
> - tt_sprintf("Bad record: unknown key %u",
> - (unsigned)key));
> - goto fail;
> + mp_next(&pos); /* unknown key, ignore */
> + break;
> }
> }
> if (record->type == VY_LOG_CREATE_LSM) {
> diff --git a/src/box/vy_run.c b/src/box/vy_run.c
> index d82f1e37..cee90458 100644
> --- a/src/box/vy_run.c
> +++ b/src/box/vy_run.c
> @@ -524,10 +524,8 @@ vy_page_info_decode(struct vy_page_info *page, const struct xrow_header *xrow,
> page->row_index_offset = mp_decode_uint(&pos);
> break;
> default:
> - diag_set(ClientError, ER_INVALID_INDEX_FILE, filename,
> - tt_sprintf("Can't decode page info: "
> - "unknown key %u", (unsigned)key));
> - return -1;
> + mp_next(&pos); /* unknown key, ignore */
> + break;
> }
> }
> if (key_map) {
> @@ -634,10 +632,8 @@ vy_run_info_decode(struct vy_run_info *run_info,
> vy_stmt_stat_decode(&run_info->stmt_stat, &pos);
> break;
> default:
> - diag_set(ClientError, ER_INVALID_INDEX_FILE, filename,
> - "Can't decode run info: unknown key %u",
> - (unsigned)key);
> - return -1;
> + mp_next(&pos); /* unknown key, ignore */
> + break;
> }
> }
> if (key_map) {
This one is trivial and should raise no questions.
I pushed it to 2.1 and 1.10.
* [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Historically, when deciding whether to split or coalesce a range, or
when updating compaction priority, we have used the sizes of compressed
runs (see bytes_compressed). This makes the algorithms dependent on
whether compression is enabled and how effective it is, which is odd,
because compression is just a way of storing data on disk - it shouldn't
affect the way data is partitioned. E.g. if we turned off compression
at the first LSM tree level, which would make sense because that level
is relatively small, we would inadvertently change the behavior of the
compaction algorithm.
So let's use uncompressed run sizes when considering range tree
transformations.
---
src/box/vy_range.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index f649aff7..87c4c6b9 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -329,7 +329,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
struct vy_slice *slice;
rlist_foreach_entry(slice, &range->slices, in_range) {
- uint64_t size = slice->count.bytes_compressed;
+ uint64_t size = slice->count.bytes;
/*
* The size of the first level is defined by
* the size of the most recent run.
@@ -377,7 +377,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
*/
range->compaction_priority = total_run_count;
range->compaction_queue = total_stmt_count;
- est_new_run_size = total_stmt_count.bytes_compressed;
+ est_new_run_size = total_stmt_count.bytes;
}
}
@@ -419,7 +419,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
/* The range is too small to be split. */
- if (slice->count.bytes_compressed < opts->range_size * 4 / 3)
+ if (slice->count.bytes < opts->range_size * 4 / 3)
return false;
/* Find the median key in the oldest run (approximately). */
@@ -481,7 +481,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
struct vy_range *it;
/* Size of the coalesced range. */
- uint64_t total_size = range->count.bytes_compressed;
+ uint64_t total_size = range->count.bytes;
/* Coalesce ranges until total_size > max_size. */
uint64_t max_size = opts->range_size / 2;
@@ -496,7 +496,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
for (it = vy_range_tree_next(tree, range);
it != NULL && !vy_range_is_scheduled(it);
it = vy_range_tree_next(tree, it)) {
- uint64_t size = it->count.bytes_compressed;
+ uint64_t size = it->count.bytes;
if (total_size + size > max_size)
break;
total_size += size;
@@ -505,7 +505,7 @@ vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
for (it = vy_range_tree_prev(tree, range);
it != NULL && !vy_range_is_scheduled(it);
it = vy_range_tree_prev(tree, it)) {
- uint64_t size = it->count.bytes_compressed;
+ uint64_t size = it->count.bytes;
if (total_size + size > max_size)
break;
total_size += size;
--
2.11.0
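To see why the choice of counter matters, take some illustrative numbers
(not from the patch): with range_size = 128 MB the split threshold is
range_size * 4 / 3, i.e. about 170 MB. A last-level run holding 256 MB
of user data that compresses 3x occupies only ~85 MB on disk, so a check
against bytes_compressed would never split the range, while a check
against the uncompressed size splits it as intended. A simplified sketch
of the check, with stand-in types:

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for vy_disk_stmt_counter. */
struct stmt_counter {
	uint64_t bytes;			/* uncompressed size */
	uint64_t bytes_compressed;	/* size on disk */
};

/* The split condition after the patch: the compression ratio is irrelevant. */
static bool
range_needs_split(const struct stmt_counter *oldest_slice,
		  uint64_t range_size)
{
	return oldest_slice->bytes >= range_size * 4 / 3;
}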
* Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Vladimir Davydov @ 2019-01-21 9:42 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:02AM +0300, Vladimir Davydov wrote:
> Historically, when considering splitting or coalescing a range or
> updating compaction priority, we use sizes of compressed runs (see
> bytes_compressed). This makes the algorithms dependent on whether
> compression is used or not and how effective it is, which is weird,
> because compression is a way of storing data on disk - it shouldn't
> affect the way data is partitioned. E.g. if we turned off compression
> at the first LSM tree level, which would make sense, because it's
> relatively small, we would affect the compaction algorithm because
> of this.
>
> That said, let's use uncompressed run sizes when considering range
> tree transformations.
This results in occasional failures of vinyl/deferred_delete.test.lua.
I amended the patch on the branch to fix this. Here's the diff:
diff --git a/test/vinyl/deferred_delete.result b/test/vinyl/deferred_delete.result
index 29945f8d..61f81ce2 100644
--- a/test/vinyl/deferred_delete.result
+++ b/test/vinyl/deferred_delete.result
@@ -668,16 +668,13 @@ test_run:cmd("switch test")
fiber = require('fiber')
---
...
-digest = require('digest')
----
-...
s = box.schema.space.create('test', {engine = 'vinyl'})
---
...
-pk = s:create_index('pk', {run_count_per_level = 10})
+pk = s:create_index('pk', {run_count_per_level = 10, run_size_ratio = 2})
---
...
-sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+sk = s:create_index('sk', {run_count_per_level = 10, run_size_ratio = 2, parts = {2, 'unsigned', 3, 'string'}, unique = false})
---
...
-- Write a run big enough to prevent major compaction from kicking in
@@ -685,13 +682,25 @@ sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3,
dummy_rows = 100
---
...
-for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, digest.urandom(100)} end
+pad = string.rep('z', 50 * 1024)
+---
+...
+for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, pad} end
---
...
box.snapshot()
---
- ok
...
+pk:compact()
+---
+...
+sk:compact()
+---
+...
+while box.stat.vinyl().scheduler.compaction_queue > 0 do fiber.sleep(0.001) end
+---
+...
pad = string.rep('x', 10 * 1024)
---
...
diff --git a/test/vinyl/deferred_delete.test.lua b/test/vinyl/deferred_delete.test.lua
index d38802da..93b5b358 100644
--- a/test/vinyl/deferred_delete.test.lua
+++ b/test/vinyl/deferred_delete.test.lua
@@ -252,17 +252,20 @@ test_run:cmd("start server test with args='1048576'")
test_run:cmd("switch test")
fiber = require('fiber')
-digest = require('digest')
s = box.schema.space.create('test', {engine = 'vinyl'})
-pk = s:create_index('pk', {run_count_per_level = 10})
-sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+pk = s:create_index('pk', {run_count_per_level = 10, run_size_ratio = 2})
+sk = s:create_index('sk', {run_count_per_level = 10, run_size_ratio = 2, parts = {2, 'unsigned', 3, 'string'}, unique = false})
-- Write a run big enough to prevent major compaction from kicking in
-- (run_count_per_level is ignored on the last level - see gh-3657).
dummy_rows = 100
-for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, digest.urandom(100)} end
+pad = string.rep('z', 50 * 1024)
+for i = 1, dummy_rows do s:replace{i + 1000, i + 1000, pad} end
box.snapshot()
+pk:compact()
+sk:compact()
+while box.stat.vinyl().scheduler.compaction_queue > 0 do fiber.sleep(0.001) end
pad = string.rep('x', 10 * 1024)
for i = 1, 120 do s:replace{i, i, pad} end
* [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Konstantin Osipov @ 2019-02-05 16:49 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 13:15]:
> On Mon, Jan 21, 2019 at 12:17:02AM +0300, Vladimir Davydov wrote:
> > Historically, when considering splitting or coalescing a range or
> > updating compaction priority, we use sizes of compressed runs (see
> > bytes_compressed). This makes the algorithms dependent on whether
> > compression is used or not and how effective it is, which is weird,
> > because compression is a way of storing data on disk - it shouldn't
> > affect the way data is partitioned. E.g. if we turned off compression
> > at the first LSM tree level, which would make sense, because it's
> > relatively small, we would affect the compaction algorithm because
> > of this.
> >
> > That said, let's use uncompressed run sizes when considering range
> > tree transformations.
>
> This results in occasional failures of vinyl/deferred_delete.test.lua.
> I amended the patch on the branch to fix this. Here's the diff:
>
I don't understand why you had to replace random data with
string.rep('z', ...). Otherwise lgtm.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
* Re: [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Vladimir Davydov @ 2019-02-06 8:55 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Tue, Feb 05, 2019 at 07:49:05PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 13:15]:
> > On Mon, Jan 21, 2019 at 12:17:02AM +0300, Vladimir Davydov wrote:
> > > Historically, when considering splitting or coalescing a range or
> > > updating compaction priority, we use sizes of compressed runs (see
> > > bytes_compressed). This makes the algorithms dependent on whether
> > > compression is used or not and how effective it is, which is weird,
> > > because compression is a way of storing data on disk - it shouldn't
> > > affect the way data is partitioned. E.g. if we turned off compression
> > > at the first LSM tree level, which would make sense, because it's
> > > relatively small, we would affect the compaction algorithm because
> > > of this.
> > >
> > > That said, let's use uncompressed run sizes when considering range
> > > tree transformations.
> >
> > This results in occasional failures of vinyl/deferred_delete.test.lua.
> > I amended the patch on the branch to fix this. Here's the diff:
> >
>
> I don't understand why you had to replace random data with
> string.rep('z', ...). Otherwise lgtm.
Only because after the patch I don't need to use random data to trigger
compaction - the compaction logic uses uncompressed sizes now, so we can
use string.rep(), which is faster.
* Re: [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Konstantin Osipov @ 2019-02-06 10:46 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> Only because after the patch I don't need to use random() to generate
> data that would trigger compaction - compaction logic uses uncompressed
> sizes now so we can use string.rep(), which is faster.
It is faster, but it may impact other properties of the runs - like the
physical run size. For example, you start getting many more pages per
run, for all runs except the top-level one.
I would avoid using string.rep() anywhere in vinyl test data, unless I
intentionally want to check the effects of inserting data with a high
compression rate.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
* Re: [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Vladimir Davydov @ 2019-02-06 10:55 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 01:46:28PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> > Only because after the patch I don't need to use random() to generate
> > data that would trigger compaction - compaction logic uses uncompressed
> > sizes now so we can use string.rep(), which is faster.
>
> It is faster but it may impact other properties of the runs - like
> physical run size. For example you begin getting much more pages per
> run, for all runs except the top-level one.
No. We don't take compression into account when splitting a run into
pages. We never did. Actually, after this patch, it doesn't matter
whether the test data is easily compressible or not - the result will
be virtually the same.
> I would avoid using string.rep() anywhere in vinyl test data,
> unless I intentionally want to check the effects of inserting data
> with high compression rate.
Okay, I can change that if you wish. It's just that there's no difference.
* [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Konstantin Osipov @ 2019-02-05 16:43 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> Historically, when considering splitting or coalescing a range or
> updating compaction priority, we use sizes of compressed runs (see
> bytes_compressed). This makes the algorithms dependent on whether
> compression is used or not and how effective it is, which is weird,
> because compression is a way of storing data on disk - it shouldn't
> affect the way data is partitioned. E.g. if we turned off compression
> at the first LSM tree level, which would make sense, because it's
> relatively small, we would affect the compaction algorithm because
> of this.
>
OK to push.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
* Re: [tarantool-patches] Re: [PATCH 3/9] vinyl: use uncompressed run size for range split/coalesce/compaction
From: Vladimir Davydov @ 2019-02-06 16:48 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Tue, Feb 05, 2019 at 07:43:55PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> > Historically, when considering splitting or coalescing a range or
> > updating compaction priority, we use sizes of compressed runs (see
> > bytes_compressed). This makes the algorithms dependent on whether
> > compression is used or not and how effective it is, which is weird,
> > because compression is a way of storing data on disk - it shouldn't
> > affect the way data is partitioned. E.g. if we turned off compression
> > at the first LSM tree level, which would make sense, because it's
> > relatively small, we would affect the compaction algorithm because
> > of this.
> >
>
> OK to push.
Pushed to 2.1 and 1.10.
* [PATCH 4/9] vinyl: rename lsm->range_heap to max_compaction_priority
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
The name 'range_heap' is ambiguous, because it doesn't say which range
is supposed to be at the top of the heap. We are going to introduce
another heap of ranges ordered differently, so let's rename this one to
max_compaction_priority to avoid confusion.
---
src/box/vy_lsm.c | 24 ++++++++++++++----------
src/box/vy_lsm.h | 2 +-
src/box/vy_range.c | 2 +-
src/box/vy_range.h | 18 ++++++++++--------
src/box/vy_scheduler.c | 24 +++++++++++++-----------
5 files changed, 39 insertions(+), 31 deletions(-)
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 5eb3fd77..851785ee 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -180,7 +180,7 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
vy_cache_create(&lsm->cache, cache_env, cmp_def, index_def->iid == 0);
rlist_create(&lsm->sealed);
vy_range_tree_new(lsm->tree);
- vy_range_heap_create(&lsm->range_heap);
+ vy_max_compaction_priority_create(&lsm->max_compaction_priority);
rlist_create(&lsm->runs);
lsm->pk = pk;
if (pk != NULL)
@@ -257,7 +257,7 @@ vy_lsm_delete(struct vy_lsm *lsm)
vy_lsm_remove_run(lsm, run);
vy_range_tree_iter(lsm->tree, NULL, vy_range_tree_free_cb, NULL);
- vy_range_heap_destroy(&lsm->range_heap);
+ vy_max_compaction_priority_destroy(&lsm->max_compaction_priority);
tuple_format_unref(lsm->disk_format);
key_def_delete(lsm->cmp_def);
key_def_delete(lsm->key_def);
@@ -665,10 +665,12 @@ vy_lsm_generation(struct vy_lsm *lsm)
int
vy_lsm_compaction_priority(struct vy_lsm *lsm)
{
- struct heap_node *n = vy_range_heap_top(&lsm->range_heap);
- if (n == NULL)
+ struct heap_node *node;
+ node = vy_max_compaction_priority_top(&lsm->max_compaction_priority);
+ if (node == NULL)
return 0;
- struct vy_range *range = container_of(n, struct vy_range, heap_node);
+ struct vy_range *range = container_of(node, struct vy_range,
+ compaction_priority_node);
return range->compaction_priority;
}
@@ -732,8 +734,9 @@ vy_lsm_remove_run(struct vy_lsm *lsm, struct vy_run *run)
void
vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
{
- assert(range->heap_node.pos == UINT32_MAX);
- vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+ assert(range->compaction_priority_node.pos == UINT32_MAX);
+ vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+ &range->compaction_priority_node);
vy_range_tree_insert(lsm->tree, range);
lsm->range_count++;
}
@@ -741,8 +744,9 @@ vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
void
vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range)
{
- assert(range->heap_node.pos != UINT32_MAX);
- vy_range_heap_delete(&lsm->range_heap, &range->heap_node);
+ assert(range->compaction_priority_node.pos != UINT32_MAX);
+ vy_max_compaction_priority_delete(&lsm->max_compaction_priority,
+ &range->compaction_priority_node);
vy_range_tree_remove(lsm->tree, range);
lsm->range_count--;
}
@@ -1224,5 +1228,5 @@ vy_lsm_force_compaction(struct vy_lsm *lsm)
vy_lsm_acct_range(lsm, range);
}
- vy_range_heap_update_all(&lsm->range_heap);
+ vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
}
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 74033627..a1d872e9 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -252,7 +252,7 @@ struct vy_lsm {
/** Number of ranges in this LSM tree. */
int range_count;
/** Heap of ranges, prioritized by compaction_priority. */
- heap_t range_heap;
+ heap_t max_compaction_priority;
/**
* List of all runs created for this LSM tree,
* linked by vy_run->in_lsm.
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 87c4c6b9..a2cb4558 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -197,7 +197,7 @@ vy_range_new(int64_t id, struct tuple *begin, struct tuple *end,
}
range->cmp_def = cmp_def;
rlist_create(&range->slices);
- range->heap_node.pos = UINT32_MAX;
+ range->compaction_priority_node.pos = UINT32_MAX;
return range;
}
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 05195d08..7c0a16e2 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -107,6 +107,8 @@ struct vy_range {
* how we decide how many runs to compact next time.
*/
int compaction_priority;
+ /** Link in vy_lsm->max_compaction_priority. */
+ struct heap_node compaction_priority_node;
/** Number of statements that need to be compacted. */
struct vy_disk_stmt_counter compaction_queue;
/**
@@ -121,8 +123,6 @@ struct vy_range {
int n_compactions;
/** Link in vy_lsm->tree. */
rb_node(struct vy_range) tree_node;
- /** Link in vy_lsm->range_heap. */
- struct heap_node heap_node;
/**
* Incremented whenever a run is added to or deleted
* from this range. Used invalidate read iterators.
@@ -134,15 +134,17 @@ struct vy_range {
* Heap of all ranges of the same LSM tree, prioritized by
* vy_range->compaction_priority.
*/
-#define HEAP_NAME vy_range_heap
+#define HEAP_NAME vy_max_compaction_priority
static inline bool
-vy_range_heap_less(struct heap_node *a, struct heap_node *b)
+vy_max_compaction_priority_less(struct heap_node *a, struct heap_node *b)
{
- struct vy_range *r1 = container_of(a, struct vy_range, heap_node);
- struct vy_range *r2 = container_of(b, struct vy_range, heap_node);
+ struct vy_range *r1 = container_of(a, struct vy_range,
+ compaction_priority_node);
+ struct vy_range *r2 = container_of(b, struct vy_range,
+ compaction_priority_node);
return r1->compaction_priority > r2->compaction_priority;
}
-#define HEAP_LESS(h, l, r) vy_range_heap_less(l, r)
+#define HEAP_LESS(h, l, r) vy_max_compaction_priority_less(l, r)
#include "salad/heap.h"
#undef HEAP_LESS
#undef HEAP_NAME
@@ -151,7 +153,7 @@ vy_range_heap_less(struct heap_node *a, struct heap_node *b)
static inline bool
vy_range_is_scheduled(struct vy_range *range)
{
- return range->heap_node.pos == UINT32_MAX;
+ return range->compaction_priority_node.pos == UINT32_MAX;
}
/**
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 5ec6d171..16ecafed 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1228,7 +1228,7 @@ vy_task_dump_complete(struct vy_task *task)
vy_range_update_compaction_priority(range, &lsm->opts);
vy_lsm_acct_range(lsm, range);
}
- vy_range_heap_update_all(&lsm->range_heap);
+ vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
free(new_slices);
delete_mems:
@@ -1610,8 +1610,9 @@ vy_task_compaction_complete(struct vy_task *task)
/* The iterator has been cleaned up in worker. */
task->wi->iface->close(task->wi);
- assert(range->heap_node.pos == UINT32_MAX);
- vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+ assert(range->compaction_priority_node.pos == UINT32_MAX);
+ vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+ &range->compaction_priority_node);
vy_scheduler_update_lsm(scheduler, lsm);
say_info("%s: completed compacting range %s",
@@ -1642,8 +1643,9 @@ vy_task_compaction_abort(struct vy_task *task)
vy_run_discard(task->new_run);
- assert(range->heap_node.pos == UINT32_MAX);
- vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
+ assert(range->compaction_priority_node.pos == UINT32_MAX);
+ vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
+ &range->compaction_priority_node);
vy_scheduler_update_lsm(scheduler, lsm);
}
@@ -1657,14 +1659,14 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
.abort = vy_task_compaction_abort,
};
- struct heap_node *range_node;
+ struct heap_node *node;
struct vy_range *range;
assert(!lsm->is_dropped);
- range_node = vy_range_heap_top(&lsm->range_heap);
- assert(range_node != NULL);
- range = container_of(range_node, struct vy_range, heap_node);
+ node = vy_max_compaction_priority_top(&lsm->max_compaction_priority);
+ assert(node != NULL);
+ range = container_of(node, struct vy_range, compaction_priority_node);
assert(range->compaction_priority > 1);
if (vy_lsm_split_range(lsm, range) ||
@@ -1722,8 +1724,8 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
* Remove the range we are going to compact from the heap
* so that it doesn't get selected again.
*/
- vy_range_heap_delete(&lsm->range_heap, range_node);
- range_node->pos = UINT32_MAX;
+ vy_max_compaction_priority_delete(&lsm->max_compaction_priority, node);
+ node->pos = UINT32_MAX;
vy_scheduler_update_lsm(scheduler, lsm);
say_info("%s: started compacting range %s, runs %d/%d",
--
2.11.0
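The rename touches every call site because the heap functions are
generated from HEAP_NAME by salad/heap.h; the heap itself only stores
bare struct heap_node links embedded in struct vy_range, and the owning
range is recovered with container_of(). A self-contained sketch of that
pattern (types and names are simplified stand-ins, not the real
vy_range/vy_lsm definitions):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct heap_node {
	uint32_t pos;
};

struct range {
	int compaction_priority;
	/* Embedded link, named after the heap it belongs to. */
	struct heap_node compaction_priority_node;
};

/* Analogue of vy_lsm_compaction_priority(): look at the heap top. */
static int
top_compaction_priority(struct heap_node *top)
{
	if (top == NULL)
		return 0;
	struct range *range = container_of(top, struct range,
					   compaction_priority_node);
	return range->compaction_priority;
}

int
main(void)
{
	struct range r = { .compaction_priority = 3,
			   .compaction_priority_node = { .pos = 0 } };
	printf("%d\n", top_compaction_priority(&r.compaction_priority_node));
	return 0;
}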
* [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
This patch adds a dumps_per_compaction metric to per-index statistics.
It shows the number of dumps it takes to trigger a major compaction of
a range in a given LSM tree. We need it to automatically choose the
optimal number of ranges that would smooth out the load generated by
range compaction.
To calculate this metric, we assign a dump_count to each run. It shows
how many dumps it took to create the run. If a run was created by a
memory dump, it is set to 1. If a run was created by a minor compaction,
it is set to the sum of the dump counts of the compacted runs. If a run
was created by a major compaction, it is set to the sum of the dump
counts of the compacted runs minus the dump count of the last-level run.
The dump_count is stored in vylog.
This allows us to estimate the number of dumps that triggers compaction
in a range as the dump_count of the last-level run stored in the range.
Finally, we report dumps_per_compaction of an LSM tree as the minimal
dumps_per_compaction among all ranges constituting the tree. To achieve
that, we maintain a heap of ranges per LSM tree ordered by
dumps_per_compaction.
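As a quick illustration of the arithmetic (the numbers and the helper
below are made up for the example, not part of the patch): suppose a
range holds runs with dump counts [4, 2, 1, 1], oldest first. A minor
compaction of the two newest runs produces a run with
dump_count = 1 + 1 = 2. A major compaction of all four produces
dump_count = (4 + 2 + 1 + 1) - 4 = 4, i.e. the number of memory dumps
accumulated since the previous major compaction, which is exactly what
dumps_per_compaction is meant to report.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper mirroring the rules described above. */
static uint32_t
new_run_dump_count(const uint32_t *dump_counts, int n_compacted,
		   bool is_major, uint32_t last_level_dump_count)
{
	uint32_t sum = 0;
	for (int i = 0; i < n_compacted; i++)
		sum += dump_counts[i];
	/*
	 * A major compaction includes the last-level run in its input,
	 * so that run's own dump count is subtracted back out.
	 */
	return is_major ? sum - last_level_dump_count : sum;
}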
Needed for #3944
---
src/box/vinyl.c | 2 +
src/box/vy_log.c | 26 ++++++++-
src/box/vy_log.h | 10 +++-
src/box/vy_lsm.c | 23 ++++++++
src/box/vy_lsm.h | 6 +++
src/box/vy_range.c | 13 +++++
src/box/vy_range.h | 32 ++++++++++++
src/box/vy_run.h | 15 ++++++
src/box/vy_scheduler.c | 14 ++++-
test/vinyl/layout.result | 8 +--
test/vinyl/stat.result | 133 ++++++++++++++++++++++++++++++++++++++++++++---
test/vinyl/stat.test.lua | 57 ++++++++++++++++++++
12 files changed, 323 insertions(+), 16 deletions(-)
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d6117f44..dc4fc830 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -463,6 +463,8 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
histogram_snprint(buf, sizeof(buf), lsm->run_hist);
info_append_str(h, "run_histogram", buf);
+ info_append_int(h, "dumps_per_compaction",
+ vy_lsm_dumps_per_compaction(lsm));
info_end(h);
}
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index d3fa0c7a..d7cf4996 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -84,6 +84,7 @@ enum vy_log_key {
VY_LOG_KEY_MODIFY_LSN = 13,
VY_LOG_KEY_DROP_LSN = 14,
VY_LOG_KEY_GROUP_ID = 15,
+ VY_LOG_KEY_DUMP_COUNT = 16,
};
/** vy_log_key -> human readable name. */
@@ -104,6 +105,7 @@ static const char *vy_log_key_name[] = {
[VY_LOG_KEY_MODIFY_LSN] = "modify_lsn",
[VY_LOG_KEY_DROP_LSN] = "drop_lsn",
[VY_LOG_KEY_GROUP_ID] = "group_id",
+ [VY_LOG_KEY_DUMP_COUNT] = "dump_count",
};
/** vy_log_type -> human readable name. */
@@ -285,6 +287,10 @@ vy_log_record_snprint(char *buf, int size, const struct vy_log_record *record)
SNPRINT(total, snprintf, buf, size, "%s=%"PRIi64", ",
vy_log_key_name[VY_LOG_KEY_GC_LSN],
record->gc_lsn);
+ if (record->dump_count > 0)
+ SNPRINT(total, snprintf, buf, size, "%s=%"PRIu32", ",
+ vy_log_key_name[VY_LOG_KEY_DUMP_COUNT],
+ record->dump_count);
SNPRINT(total, snprintf, buf, size, "}");
return total;
}
@@ -411,6 +417,11 @@ vy_log_record_encode(const struct vy_log_record *record,
size += mp_sizeof_uint(record->gc_lsn);
n_keys++;
}
+ if (record->dump_count > 0) {
+ size += mp_sizeof_uint(VY_LOG_KEY_DUMP_COUNT);
+ size += mp_sizeof_uint(record->dump_count);
+ n_keys++;
+ }
size += mp_sizeof_map(n_keys);
/*
@@ -493,6 +504,10 @@ vy_log_record_encode(const struct vy_log_record *record,
pos = mp_encode_uint(pos, VY_LOG_KEY_GC_LSN);
pos = mp_encode_uint(pos, record->gc_lsn);
}
+ if (record->dump_count > 0) {
+ pos = mp_encode_uint(pos, VY_LOG_KEY_DUMP_COUNT);
+ pos = mp_encode_uint(pos, record->dump_count);
+ }
assert(pos == tuple + size);
/*
@@ -620,6 +635,9 @@ vy_log_record_decode(struct vy_log_record *record,
case VY_LOG_KEY_GC_LSN:
record->gc_lsn = mp_decode_uint(&pos);
break;
+ case VY_LOG_KEY_DUMP_COUNT:
+ record->dump_count = mp_decode_uint(&pos);
+ break;
default:
mp_next(&pos); /* unknown key, ignore */
break;
@@ -1558,6 +1576,7 @@ vy_recovery_do_create_run(struct vy_recovery *recovery, int64_t run_id)
run->id = run_id;
run->dump_lsn = -1;
run->gc_lsn = -1;
+ run->dump_count = 0;
run->is_incomplete = false;
run->is_dropped = false;
run->data = NULL;
@@ -1612,7 +1631,7 @@ vy_recovery_prepare_run(struct vy_recovery *recovery, int64_t lsm_id,
*/
static int
vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
- int64_t run_id, int64_t dump_lsn)
+ int64_t run_id, int64_t dump_lsn, uint32_t dump_count)
{
struct vy_lsm_recovery_info *lsm;
lsm = vy_recovery_lookup_lsm(recovery, lsm_id);
@@ -1637,6 +1656,7 @@ vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
return -1;
}
run->dump_lsn = dump_lsn;
+ run->dump_count = dump_count;
run->is_incomplete = false;
rlist_move_entry(&lsm->runs, run, in_lsm);
return 0;
@@ -1998,7 +2018,8 @@ vy_recovery_process_record(struct vy_recovery *recovery,
break;
case VY_LOG_CREATE_RUN:
rc = vy_recovery_create_run(recovery, record->lsm_id,
- record->run_id, record->dump_lsn);
+ record->run_id, record->dump_lsn,
+ record->dump_count);
break;
case VY_LOG_DROP_RUN:
rc = vy_recovery_drop_run(recovery, record->run_id,
@@ -2348,6 +2369,7 @@ vy_log_append_lsm(struct xlog *xlog, struct vy_lsm_recovery_info *lsm)
} else {
record.type = VY_LOG_CREATE_RUN;
record.dump_lsn = run->dump_lsn;
+ record.dump_count = run->dump_count;
}
record.lsm_id = lsm->id;
record.run_id = run->id;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 70e25245..ee38c193 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -96,7 +96,7 @@ enum vy_log_record_type {
VY_LOG_PREPARE_RUN = 4,
/**
* Commit a vinyl run file creation.
- * Requires vy_log_record::lsm_id, run_id, dump_lsn.
+ * Requires vy_log_record::lsm_id, run_id, dump_lsn, dump_count.
*
* Written after a run file was successfully created.
*/
@@ -271,6 +271,8 @@ struct vy_log_record {
* that uses this run.
*/
int64_t gc_lsn;
+ /** For runs: number of dumps it took to create the run. */
+ uint32_t dump_count;
/** Link in vy_log::tx. */
struct stailq_entry in_tx;
};
@@ -389,6 +391,8 @@ struct vy_run_recovery_info {
* that uses this run.
*/
int64_t gc_lsn;
+ /** Number of dumps it took to create the run. */
+ uint32_t dump_count;
/**
* True if the run was not committed (there's
* VY_LOG_PREPARE_RUN, but no VY_LOG_CREATE_RUN).
@@ -710,7 +714,8 @@ vy_log_prepare_run(int64_t lsm_id, int64_t run_id)
/** Helper to log a vinyl run creation. */
static inline void
-vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
+vy_log_create_run(int64_t lsm_id, int64_t run_id,
+ int64_t dump_lsn, uint32_t dump_count)
{
struct vy_log_record record;
vy_log_record_init(&record);
@@ -718,6 +723,7 @@ vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
record.lsm_id = lsm_id;
record.run_id = run_id;
record.dump_lsn = dump_lsn;
+ record.dump_count = dump_count;
vy_log_write(&record);
}
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 851785ee..6ec86c22 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -181,6 +181,7 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
rlist_create(&lsm->sealed);
vy_range_tree_new(lsm->tree);
vy_max_compaction_priority_create(&lsm->max_compaction_priority);
+ vy_min_dumps_per_compaction_create(&lsm->min_dumps_per_compaction);
rlist_create(&lsm->runs);
lsm->pk = pk;
if (pk != NULL)
@@ -258,6 +259,7 @@ vy_lsm_delete(struct vy_lsm *lsm)
vy_range_tree_iter(lsm->tree, NULL, vy_range_tree_free_cb, NULL);
vy_max_compaction_priority_destroy(&lsm->max_compaction_priority);
+ vy_min_dumps_per_compaction_destroy(&lsm->min_dumps_per_compaction);
tuple_format_unref(lsm->disk_format);
key_def_delete(lsm->cmp_def);
key_def_delete(lsm->key_def);
@@ -351,6 +353,7 @@ vy_lsm_recover_run(struct vy_lsm *lsm, struct vy_run_recovery_info *run_info,
return NULL;
run->dump_lsn = run_info->dump_lsn;
+ run->dump_count = run_info->dump_count;
if (vy_run_recover(run, lsm->env->path,
lsm->space_id, lsm->index_id) != 0 &&
(!force_recovery ||
@@ -636,6 +639,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
(long long)range->id));
return -1;
}
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
}
if (prev == NULL) {
@@ -651,6 +655,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
(long long)prev->id));
return -1;
}
+ vy_min_dumps_per_compaction_update_all(&lsm->min_dumps_per_compaction);
return 0;
}
@@ -674,6 +679,18 @@ vy_lsm_compaction_priority(struct vy_lsm *lsm)
return range->compaction_priority;
}
+int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
+{
+ struct heap_node *node;
+ node = vy_min_dumps_per_compaction_top(&lsm->min_dumps_per_compaction);
+ if (node == NULL)
+ return 0;
+ struct vy_range *range = container_of(node, struct vy_range,
+ dumps_per_compaction_node);
+ return range->dumps_per_compaction;
+}
+
void
vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run)
{
@@ -737,6 +754,8 @@ vy_lsm_add_range(struct vy_lsm *lsm, struct vy_range *range)
assert(range->compaction_priority_node.pos == UINT32_MAX);
vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
&range->compaction_priority_node);
+ vy_min_dumps_per_compaction_insert(&lsm->min_dumps_per_compaction,
+ &range->dumps_per_compaction_node);
vy_range_tree_insert(lsm->tree, range);
lsm->range_count++;
}
@@ -747,6 +766,8 @@ vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range)
assert(range->compaction_priority_node.pos != UINT32_MAX);
vy_max_compaction_priority_delete(&lsm->max_compaction_priority,
&range->compaction_priority_node);
+ vy_min_dumps_per_compaction_delete(&lsm->min_dumps_per_compaction,
+ &range->dumps_per_compaction_node);
vy_range_tree_remove(lsm->tree, range);
lsm->range_count--;
}
@@ -1080,6 +1101,7 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
}
part->needs_compaction = range->needs_compaction;
vy_range_update_compaction_priority(part, &lsm->opts);
+ vy_range_update_dumps_per_compaction(part);
}
/*
@@ -1197,6 +1219,7 @@ vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
* as it fits the configured LSM tree shape.
*/
vy_range_update_compaction_priority(result, &lsm->opts);
+ vy_range_update_dumps_per_compaction(result);
vy_lsm_acct_range(lsm, result);
vy_lsm_add_range(lsm, result);
lsm->range_tree_version++;
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index a1d872e9..4df9d19a 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -253,6 +253,8 @@ struct vy_lsm {
int range_count;
/** Heap of ranges, prioritized by compaction_priority. */
heap_t max_compaction_priority;
+ /** Heap of ranges, prioritized by dumps_per_compaction. */
+ heap_t min_dumps_per_compaction;
/**
* List of all runs created for this LSM tree,
* linked by vy_run->in_lsm.
@@ -438,6 +440,10 @@ vy_lsm_generation(struct vy_lsm *lsm);
int
vy_lsm_compaction_priority(struct vy_lsm *lsm);
+/** Return min dumps_per_compaction among ranges of an LSM tree. */
+int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm);
+
/** Add a run to the list of runs of an LSM tree. */
void
vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run);
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index a2cb4558..7cb1b4ba 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -198,6 +198,7 @@ vy_range_new(int64_t id, struct tuple *begin, struct tuple *end,
range->cmp_def = cmp_def;
rlist_create(&range->slices);
range->compaction_priority_node.pos = UINT32_MAX;
+ range->dumps_per_compaction_node.pos = UINT32_MAX;
return range;
}
@@ -391,6 +392,18 @@ vy_range_update_compaction_priority(struct vy_range *range,
}
}
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range)
+{
+ if (!rlist_empty(&range->slices)) {
+ struct vy_slice *slice = rlist_last_entry(&range->slices,
+ struct vy_slice, in_range);
+ range->dumps_per_compaction = slice->run->dump_count;
+ } else {
+ range->dumps_per_compaction = 0;
+ }
+}
+
/**
* Return true and set split_key accordingly if the range needs to be
* split in two.
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 7c0a16e2..f19c2c6b 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -121,6 +121,13 @@ struct vy_range {
bool needs_compaction;
/** Number of times the range was compacted. */
int n_compactions;
+ /**
+ * Number of dumps it takes to trigger major compaction in
+ * this range, see vy_run::dump_count for more details.
+ */
+ int dumps_per_compaction;
+ /** Link in vy_lsm->min_dumps_per_compaction. */
+ struct heap_node dumps_per_compaction_node;
/** Link in vy_lsm->tree. */
rb_node(struct vy_range) tree_node;
/**
@@ -149,6 +156,25 @@ vy_max_compaction_priority_less(struct heap_node *a, struct heap_node *b)
#undef HEAP_LESS
#undef HEAP_NAME
+/**
+ * Heap of all ranges of the same LSM tree, prioritized by
+ * vy_range->dumps_per_compaction.
+ */
+#define HEAP_NAME vy_min_dumps_per_compaction
+static inline bool
+vy_min_dumps_per_compaction_less(struct heap_node *a, struct heap_node *b)
+{
+ struct vy_range *r1 = container_of(a, struct vy_range,
+ dumps_per_compaction_node);
+ struct vy_range *r2 = container_of(b, struct vy_range,
+ dumps_per_compaction_node);
+ return r1->dumps_per_compaction < r2->dumps_per_compaction;
+}
+#define HEAP_LESS(h, l, r) vy_min_dumps_per_compaction_less(l, r)
+#include "salad/heap.h"
+#undef HEAP_LESS
+#undef HEAP_NAME
+
/** Return true if a task is scheduled for a given range. */
static inline bool
vy_range_is_scheduled(struct vy_range *range)
@@ -245,6 +271,12 @@ vy_range_update_compaction_priority(struct vy_range *range,
const struct index_opts *opts);
/**
+ * Update the value of range->dumps_per_compaction.
+ */
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range);
+
+/**
* Check if a range needs to be split in two.
*
* @param range The range.
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 990daffa..28fd6a50 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -130,6 +130,21 @@ struct vy_run {
/** Max LSN stored on disk. */
int64_t dump_lsn;
/**
+ * Number of dumps it took to create this run.
+ *
+ * If the run was produced by a memory dump, it is 1.
+ * If the run was produced by a minor compaction, it
+ * is the sum of dump counts of compacted runs.
+ * If the run was produced by a major compaction, it
+ * is the sum of dump counts of compacted runs
+ * minus the dump count of the last (greatest) run.
+ *
+ * This way, by looking at the last level run in an LSM
+ * tree, we can tell how many dumps it took to compact
+ * it last time.
+ */
+ uint32_t dump_count;
+ /**
* Run reference counter, the run is deleted once it hits 0.
* A new run is created with the reference counter set to 1.
* A run is referenced by each slice created for it and each
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 16ecafed..f14a199b 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1193,7 +1193,7 @@ vy_task_dump_complete(struct vy_task *task)
* Log change in metadata.
*/
vy_log_tx_begin();
- vy_log_create_run(lsm->id, new_run->id, dump_lsn);
+ vy_log_create_run(lsm->id, new_run->id, dump_lsn, new_run->dump_count);
for (range = begin_range, i = 0; range != end_range;
range = vy_range_tree_next(lsm->tree, range), i++) {
assert(i < lsm->range_count);
@@ -1226,9 +1226,11 @@ vy_task_dump_complete(struct vy_task *task)
vy_lsm_unacct_range(lsm, range);
vy_range_add_slice(range, slice);
vy_range_update_compaction_priority(range, &lsm->opts);
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
}
vy_max_compaction_priority_update_all(&lsm->max_compaction_priority);
+ vy_min_dumps_per_compaction_update_all(&lsm->min_dumps_per_compaction);
free(new_slices);
delete_mems:
@@ -1396,6 +1398,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
if (new_run == NULL)
goto err_run;
+ new_run->dump_count = 1;
new_run->dump_lsn = dump_lsn;
/*
@@ -1528,7 +1531,8 @@ vy_task_compaction_complete(struct vy_task *task)
rlist_foreach_entry(run, &unused_runs, in_unused)
vy_log_drop_run(run->id, gc_lsn);
if (new_slice != NULL) {
- vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn);
+ vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn,
+ new_run->dump_count);
vy_log_insert_slice(range->id, new_run->id, new_slice->id,
tuple_data_or_null(new_slice->begin),
tuple_data_or_null(new_slice->end));
@@ -1589,6 +1593,7 @@ vy_task_compaction_complete(struct vy_task *task)
}
range->n_compactions++;
vy_range_update_compaction_priority(range, &lsm->opts);
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
vy_lsm_acct_compaction(lsm, compaction_time,
&compaction_input, &compaction_output);
@@ -1613,6 +1618,8 @@ vy_task_compaction_complete(struct vy_task *task)
assert(range->compaction_priority_node.pos == UINT32_MAX);
vy_max_compaction_priority_insert(&lsm->max_compaction_priority,
&range->compaction_priority_node);
+ vy_min_dumps_per_compaction_update(&lsm->min_dumps_per_compaction,
+ &range->dumps_per_compaction_node);
vy_scheduler_update_lsm(scheduler, lsm);
say_info("%s: completed compacting range %s",
@@ -1701,6 +1708,7 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
goto err_wi_sub;
new_run->dump_lsn = MAX(new_run->dump_lsn,
slice->run->dump_lsn);
+ new_run->dump_count += slice->run->dump_count;
/* Remember the slices we are compacting. */
if (task->first_slice == NULL)
task->first_slice = slice;
@@ -1709,6 +1717,8 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
if (--n == 0)
break;
}
+ if (range->compaction_priority == range->slice_count)
+ new_run->dump_count -= slice->run->dump_count;
assert(n == 0);
assert(new_run->dump_lsn >= 0);
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 14201c5d..0e9d7260 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -141,7 +141,7 @@ result
- HEADER:
type: INSERT
BODY:
- tuple: [5, {2: 8, 9: 11}]
+ tuple: [5, {2: 8, 16: 1, 9: 11}]
- HEADER:
type: INSERT
BODY:
@@ -166,7 +166,7 @@ result
- HEADER:
type: INSERT
BODY:
- tuple: [5, {0: 2, 2: 6, 9: 11}]
+ tuple: [5, {0: 2, 2: 6, 16: 1, 9: 11}]
- HEADER:
type: INSERT
BODY:
@@ -206,7 +206,7 @@ result
timestamp: <timestamp>
type: INSERT
BODY:
- tuple: [5, {0: 2, 2: 10, 9: 14}]
+ tuple: [5, {0: 2, 2: 10, 16: 1, 9: 14}]
- HEADER:
timestamp: <timestamp>
type: INSERT
@@ -226,7 +226,7 @@ result
timestamp: <timestamp>
type: INSERT
BODY:
- tuple: [5, {2: 12, 9: 14}]
+ tuple: [5, {2: 12, 16: 1, 9: 14}]
- HEADER:
timestamp: <timestamp>
type: INSERT
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index 419d3e6c..b0b569ab 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,7 +129,8 @@ test_run:cmd("setopt delimiter ''");
-- initially stats are empty
istat()
---
-- rows: 0
+- dumps_per_compaction: 0
+ rows: 0
run_avg: 0
bytes: 0
upsert:
@@ -294,10 +295,7 @@ wait(istat, st, 'disk.dump.count', 1)
...
stat_diff(istat(), st)
---
-- rows: 25
- run_avg: 1
- run_count: 1
- disk:
+- disk:
last_level:
bytes: 26049
pages: 7
@@ -321,6 +319,10 @@ stat_diff(istat(), st)
pages: 7
bytes_compressed: <bytes_compressed>
bloom_size: 70
+ rows: 25
+ run_avg: 1
+ run_count: 1
+ dumps_per_compaction: 1
bytes: 26049
put:
rows: 25
@@ -998,7 +1000,8 @@ box.stat.reset()
...
istat()
---
-- rows: 306
+- dumps_per_compaction: 1
+ rows: 306
run_avg: 1
bytes: 317731
upsert:
@@ -1732,6 +1735,124 @@ box.stat.vinyl().disk.data_compacted
---
- 0
...
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 2, run_size_ratio = 5})
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function dump(a, b)
+ for i = a, b do
+ s:replace{i, digest.urandom(100)}
+ end
+ box.snapshot()
+end;
+---
+...
+function wait_compaction(count)
+ test_run:wait_cond(function()
+ return i:stat().disk.compaction.count == count
+ end, 10)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+dump(1, 100)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 100) -- compaction
+---
+...
+dump(1, 100) -- split + compaction
+---
+...
+wait_compaction(3)
+---
+...
+i:stat().range_count -- 2
+---
+- 2
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 20)
+---
+...
+dump(1, 10) -- compaction in range 1
+---
+...
+wait_compaction(4)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(80, 100)
+---
+...
+dump(90, 100) -- compaction in range 2
+---
+...
+wait_compaction(5)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+test_run:cmd('restart server test')
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+s = box.space.test
+---
+...
+i = s.index.primary
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+---
+- true
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+s:drop()
+---
+...
test_run:cmd('switch default')
---
- true
diff --git a/test/vinyl/stat.test.lua b/test/vinyl/stat.test.lua
index 4a955682..73729f49 100644
--- a/test/vinyl/stat.test.lua
+++ b/test/vinyl/stat.test.lua
@@ -528,6 +528,63 @@ s:drop()
box.stat.vinyl().disk.data_compacted
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 2, run_size_ratio = 5})
+
+test_run:cmd("setopt delimiter ';'")
+function dump(a, b)
+ for i = a, b do
+ s:replace{i, digest.urandom(100)}
+ end
+ box.snapshot()
+end;
+function wait_compaction(count)
+ test_run:wait_cond(function()
+ return i:stat().disk.compaction.count == count
+ end, 10)
+end;
+test_run:cmd("setopt delimiter ''");
+
+dump(1, 100)
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 100) -- compaction
+dump(1, 100) -- split + compaction
+wait_compaction(3)
+i:stat().range_count -- 2
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 20)
+dump(1, 10) -- compaction in range 1
+wait_compaction(4)
+i:stat().dumps_per_compaction -- 1
+
+dump(80, 100)
+dump(90, 100) -- compaction in range 2
+wait_compaction(5)
+i:stat().dumps_per_compaction -- 2
+
+test_run:cmd('restart server test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+s = box.space.test
+i = s.index.primary
+
+i:stat().dumps_per_compaction -- 2
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+box.snapshot()
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+
+i:stat().dumps_per_compaction -- 1
+
+s:drop()
+
test_run:cmd('switch default')
test_run:cmd('stop server test')
test_run:cmd('cleanup server test')
--
2.11.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* [tarantool-patches] Re: [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree
2019-01-20 21:17 ` [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
@ 2019-02-05 16:58 ` Konstantin Osipov
2019-02-06 9:20 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-05 16:58 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> This patch adds dumps_per_compaction metric to per index statistics. It
> shows the number of dumps it takes to trigger a major compaction of a
> range in a given LSM tree. We need it to automatically choose the
> optimal number of ranges that would smooth out the load generated by
> range compaction.
I obviously like the idea of dumps_per_compaction :-), but using a
heap to maintain the minimum sounds like a bit of an overkill. Is
average so much less accurate? It would be much cheaper to
maintain.
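
For illustration, the running-average bookkeeping boils down to something
like the following sketch (structs stripped down to the relevant fields;
indicative only, not the actual code):

struct vy_range {
        int dumps_per_compaction;
};

struct vy_lsm {
        int sum_dumps_per_compaction;
        int range_count; /* an LSM tree always has at least one range */
};

/* Called whenever a range is accounted in the LSM tree stats. */
static void
acct_range(struct vy_lsm *lsm, struct vy_range *range)
{
        lsm->sum_dumps_per_compaction += range->dumps_per_compaction;
}

/* Called whenever a range is unaccounted (before it is modified). */
static void
unacct_range(struct vy_lsm *lsm, struct vy_range *range)
{
        lsm->sum_dumps_per_compaction -= range->dumps_per_compaction;
}

/* O(1) read: the average over all ranges, no heap required. */
static int
dumps_per_compaction_avg(struct vy_lsm *lsm)
{
        return lsm->sum_dumps_per_compaction / lsm->range_count;
}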
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree
2019-02-05 16:58 ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-06 9:20 ` Vladimir Davydov
2019-02-06 16:54 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 9:20 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Tue, Feb 05, 2019 at 07:58:29PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> > This patch adds dumps_per_compaction metric to per index statistics. It
> > shows the number of dumps it takes to trigger a major compaction of a
> > range in a given LSM tree. We need it to automatically choose the
> > optimal number of ranges that would smooth out the load generated by
> > range compaction.
>
> I obviously like the idea of dumps_per_compaction :-), but using a
> heap to maintain the minimum sounds like a bit of an overkill. Is
> average so much less accurate? It would be much cheaper to
> maintain.
I guess we could do that. I'll try.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree
2019-02-06 9:20 ` Vladimir Davydov
@ 2019-02-06 16:54 ` Vladimir Davydov
0 siblings, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 16:54 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 12:20:03PM +0300, Vladimir Davydov wrote:
> On Tue, Feb 05, 2019 at 07:58:29PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> > > This patch adds dumps_per_compaction metric to per index statistics. It
> > > shows the number of dumps it takes to trigger a major compaction of a
> > > range in a given LSM tree. We need it to automatically choose the
> > > optimal number of ranges that would smooth out the load generated by
> > > range compaction.
> >
> > I obviously like the idea of dumps_per_compaction :-), but using a
> > heap to maintain the minimum sounds like a bit of an overkill. Is
> > average so much less accurate? It would be much cheaper to
> > maintain.
>
> I guess we could do that. I'll try.
Had to introduce vy_lsm::sum_dumps_per_compaction for that.
Here's what it looks like:
From cce2fbba22eeeed78b05f76dc9e7d22ef218f3e9 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Wed, 6 Feb 2019 19:23:58 +0300
Subject: [PATCH] vinyl: keep track of dumps per compaction for each LSM tree
This patch adds dumps_per_compaction metric to per index statistics. It
shows the number of dumps it takes to trigger a major compaction of a
range in a given LSM tree. We need it to automatically choose the
optimal number of ranges that would smooth out the load generated by
range compaction.
To calculate this metric, we assign a dump_count to each run. It shows
how many dumps it took to create the run. If a run was created by a
memory dump, it is set to 1. If a run was created by a minor compaction,
it is set to the sum of the dump counts of the compacted runs. If a run
was created by a major compaction, it is set to the sum of the dump
counts of the compacted runs minus the dump count of the last-level run.
The dump_count is stored in vylog.
This allows us to estimate the number of dumps it takes to trigger
compaction in a range as the dump_count of the last-level run stored in
the range.
Finally, we report dumps_per_compaction of an LSM tree as the average
dumps_per_compaction among all ranges constituting the tree.
Needed for #3944
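Condensed into one place, the dump_count rules look roughly like this
(a simplified sketch; in the patch below the logic is spread across
vy_task_dump_new() and vy_task_compaction_new(), and slices are assumed
to be ordered from newest to oldest):

#include <stdbool.h>

struct vy_run {
        unsigned dump_count;
};

struct vy_slice {
        struct vy_run *run;
};

/* A run written by a memory dump took exactly one dump to create. */
static void
run_created_by_dump(struct vy_run *new_run)
{
        new_run->dump_count = 1;
}

/*
 * A run written by compaction sums the dump counts of the input runs.
 * For a major compaction the last-level (oldest) run is excluded, so
 * that the last-level run always records how many dumps it took to
 * trigger the previous major compaction.
 */
static void
run_created_by_compaction(struct vy_run *new_run, struct vy_slice **slices,
                          int n_slices, bool is_major)
{
        unsigned dump_count = 0;
        for (int i = 0; i < n_slices; i++)
                dump_count += slices[i]->run->dump_count;
        if (is_major)
                dump_count -= slices[n_slices - 1]->run->dump_count;
        new_run->dump_count = dump_count;
}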
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 065a309f..e1ff65ae 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -463,6 +463,8 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
histogram_snprint(buf, sizeof(buf), lsm->run_hist);
info_append_str(h, "run_histogram", buf);
+ info_append_int(h, "dumps_per_compaction",
+ vy_lsm_dumps_per_compaction(lsm));
info_end(h);
}
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index f94b60ff..06ab7247 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -84,6 +84,7 @@ enum vy_log_key {
VY_LOG_KEY_MODIFY_LSN = 13,
VY_LOG_KEY_DROP_LSN = 14,
VY_LOG_KEY_GROUP_ID = 15,
+ VY_LOG_KEY_DUMP_COUNT = 16,
};
/** vy_log_key -> human readable name. */
@@ -104,6 +105,7 @@ static const char *vy_log_key_name[] = {
[VY_LOG_KEY_MODIFY_LSN] = "modify_lsn",
[VY_LOG_KEY_DROP_LSN] = "drop_lsn",
[VY_LOG_KEY_GROUP_ID] = "group_id",
+ [VY_LOG_KEY_DUMP_COUNT] = "dump_count",
};
/** vy_log_type -> human readable name. */
@@ -285,6 +287,10 @@ vy_log_record_snprint(char *buf, int size, const struct vy_log_record *record)
SNPRINT(total, snprintf, buf, size, "%s=%"PRIi64", ",
vy_log_key_name[VY_LOG_KEY_GC_LSN],
record->gc_lsn);
+ if (record->dump_count > 0)
+ SNPRINT(total, snprintf, buf, size, "%s=%"PRIu32", ",
+ vy_log_key_name[VY_LOG_KEY_DUMP_COUNT],
+ record->dump_count);
SNPRINT(total, snprintf, buf, size, "}");
return total;
}
@@ -411,6 +417,11 @@ vy_log_record_encode(const struct vy_log_record *record,
size += mp_sizeof_uint(record->gc_lsn);
n_keys++;
}
+ if (record->dump_count > 0) {
+ size += mp_sizeof_uint(VY_LOG_KEY_DUMP_COUNT);
+ size += mp_sizeof_uint(record->dump_count);
+ n_keys++;
+ }
size += mp_sizeof_map(n_keys);
/*
@@ -493,6 +504,10 @@ vy_log_record_encode(const struct vy_log_record *record,
pos = mp_encode_uint(pos, VY_LOG_KEY_GC_LSN);
pos = mp_encode_uint(pos, record->gc_lsn);
}
+ if (record->dump_count > 0) {
+ pos = mp_encode_uint(pos, VY_LOG_KEY_DUMP_COUNT);
+ pos = mp_encode_uint(pos, record->dump_count);
+ }
assert(pos == tuple + size);
/*
@@ -621,6 +636,9 @@ vy_log_record_decode(struct vy_log_record *record,
case VY_LOG_KEY_GC_LSN:
record->gc_lsn = mp_decode_uint(&pos);
break;
+ case VY_LOG_KEY_DUMP_COUNT:
+ record->dump_count = mp_decode_uint(&pos);
+ break;
default:
mp_next(&pos); /* unknown key, ignore */
break;
@@ -1593,6 +1611,7 @@ vy_recovery_do_create_run(struct vy_recovery *recovery, int64_t run_id)
run->id = run_id;
run->dump_lsn = -1;
run->gc_lsn = -1;
+ run->dump_count = 0;
run->is_incomplete = false;
run->is_dropped = false;
run->data = NULL;
@@ -1647,7 +1666,7 @@ vy_recovery_prepare_run(struct vy_recovery *recovery, int64_t lsm_id,
*/
static int
vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
- int64_t run_id, int64_t dump_lsn)
+ int64_t run_id, int64_t dump_lsn, uint32_t dump_count)
{
struct vy_lsm_recovery_info *lsm;
lsm = vy_recovery_lookup_lsm(recovery, lsm_id);
@@ -1672,6 +1691,7 @@ vy_recovery_create_run(struct vy_recovery *recovery, int64_t lsm_id,
return -1;
}
run->dump_lsn = dump_lsn;
+ run->dump_count = dump_count;
run->is_incomplete = false;
rlist_move_entry(&lsm->runs, run, in_lsm);
return 0;
@@ -2033,7 +2053,8 @@ vy_recovery_process_record(struct vy_recovery *recovery,
break;
case VY_LOG_CREATE_RUN:
rc = vy_recovery_create_run(recovery, record->lsm_id,
- record->run_id, record->dump_lsn);
+ record->run_id, record->dump_lsn,
+ record->dump_count);
break;
case VY_LOG_DROP_RUN:
rc = vy_recovery_drop_run(recovery, record->run_id,
@@ -2383,6 +2404,7 @@ vy_log_append_lsm(struct xlog *xlog, struct vy_lsm_recovery_info *lsm)
} else {
record.type = VY_LOG_CREATE_RUN;
record.dump_lsn = run->dump_lsn;
+ record.dump_count = run->dump_count;
}
record.lsm_id = lsm->id;
record.run_id = run->id;
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 70e25245..ee38c193 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -96,7 +96,7 @@ enum vy_log_record_type {
VY_LOG_PREPARE_RUN = 4,
/**
* Commit a vinyl run file creation.
- * Requires vy_log_record::lsm_id, run_id, dump_lsn.
+ * Requires vy_log_record::lsm_id, run_id, dump_lsn, dump_count.
*
* Written after a run file was successfully created.
*/
@@ -271,6 +271,8 @@ struct vy_log_record {
* that uses this run.
*/
int64_t gc_lsn;
+ /** For runs: number of dumps it took to create the run. */
+ uint32_t dump_count;
/** Link in vy_log::tx. */
struct stailq_entry in_tx;
};
@@ -389,6 +391,8 @@ struct vy_run_recovery_info {
* that uses this run.
*/
int64_t gc_lsn;
+ /** Number of dumps it took to create the run. */
+ uint32_t dump_count;
/**
* True if the run was not committed (there's
* VY_LOG_PREPARE_RUN, but no VY_LOG_CREATE_RUN).
@@ -710,7 +714,8 @@ vy_log_prepare_run(int64_t lsm_id, int64_t run_id)
/** Helper to log a vinyl run creation. */
static inline void
-vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
+vy_log_create_run(int64_t lsm_id, int64_t run_id,
+ int64_t dump_lsn, uint32_t dump_count)
{
struct vy_log_record record;
vy_log_record_init(&record);
@@ -718,6 +723,7 @@ vy_log_create_run(int64_t lsm_id, int64_t run_id, int64_t dump_lsn)
record.lsm_id = lsm_id;
record.run_id = run_id;
record.dump_lsn = dump_lsn;
+ record.dump_count = dump_count;
vy_log_write(&record);
}
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 570e783a..6b70cd75 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -353,6 +353,7 @@ vy_lsm_recover_run(struct vy_lsm *lsm, struct vy_run_recovery_info *run_info,
return NULL;
run->dump_lsn = run_info->dump_lsn;
+ run->dump_count = run_info->dump_count;
if (vy_run_recover(run, lsm->env->path,
lsm->space_id, lsm->index_id) != 0 &&
(!force_recovery ||
@@ -638,6 +639,7 @@ vy_lsm_recover(struct vy_lsm *lsm, struct vy_recovery *recovery,
(long long)range->id));
return -1;
}
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
}
if (prev == NULL) {
@@ -753,6 +755,7 @@ void
vy_lsm_acct_range(struct vy_lsm *lsm, struct vy_range *range)
{
histogram_collect(lsm->run_hist, range->slice_count);
+ lsm->sum_dumps_per_compaction += range->dumps_per_compaction;
vy_disk_stmt_counter_add(&lsm->stat.disk.compaction.queue,
&range->compaction_queue);
lsm->env->compaction_queue_size += range->compaction_queue.bytes;
@@ -770,6 +773,7 @@ void
vy_lsm_unacct_range(struct vy_lsm *lsm, struct vy_range *range)
{
histogram_discard(lsm->run_hist, range->slice_count);
+ lsm->sum_dumps_per_compaction -= range->dumps_per_compaction;
vy_disk_stmt_counter_sub(&lsm->stat.disk.compaction.queue,
&range->compaction_queue);
lsm->env->compaction_queue_size -= range->compaction_queue.bytes;
@@ -1078,6 +1082,7 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
}
part->needs_compaction = range->needs_compaction;
vy_range_update_compaction_priority(part, &lsm->opts);
+ vy_range_update_dumps_per_compaction(part);
}
/*
@@ -1195,6 +1200,7 @@ vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
* as it fits the configured LSM tree shape.
*/
vy_range_update_compaction_priority(result, &lsm->opts);
+ vy_range_update_dumps_per_compaction(result);
vy_lsm_acct_range(lsm, result);
vy_lsm_add_range(lsm, result);
lsm->range_tree_version++;
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 74033627..eb7cbbf0 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -251,6 +251,8 @@ struct vy_lsm {
vy_range_tree_t *tree;
/** Number of ranges in this LSM tree. */
int range_count;
+ /** Sum dumps_per_compaction across all ranges. */
+ int sum_dumps_per_compaction;
/** Heap of ranges, prioritized by compaction_priority. */
heap_t range_heap;
/**
@@ -351,6 +353,16 @@ vy_lsm_is_empty(struct vy_lsm *lsm)
}
/**
+ * Return the average number of dumps it takes to trigger major
+ * compaction of a range in this LSM tree.
+ */
+static inline int
+vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
+{
+ return lsm->sum_dumps_per_compaction / lsm->range_count;
+}
+
+/**
* Increment the reference counter of an LSM tree.
* An LSM tree cannot be deleted if its reference
* counter is elevated.
@@ -464,8 +476,8 @@ vy_lsm_remove_range(struct vy_lsm *lsm, struct vy_range *range);
* Account a range in an LSM tree.
*
* This function updates the following LSM tree statistics:
- * - vy_lsm::run_hist after a slice is added to or removed from
- * a range of the LSM tree.
+ * - vy_lsm::run_hist and vy_lsm::sum_dumps_per_compaction after
+ * a slice is added to or removed from a range of the LSM tree.
* - vy_lsm::stat::disk::compaction::queue after compaction priority
* of a range is updated.
* - vy_lsm::stat::disk::last_level_count after a range is compacted.
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 7211cfb2..19ada26b 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -411,6 +411,18 @@ vy_range_update_compaction_priority(struct vy_range *range,
}
}
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range)
+{
+ if (!rlist_empty(&range->slices)) {
+ struct vy_slice *slice = rlist_last_entry(&range->slices,
+ struct vy_slice, in_range);
+ range->dumps_per_compaction = slice->run->dump_count;
+ } else {
+ range->dumps_per_compaction = 0;
+ }
+}
+
/**
* Return true and set split_key accordingly if the range needs to be
* split in two.
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index 05195d08..0b3a78c3 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -119,6 +119,11 @@ struct vy_range {
bool needs_compaction;
/** Number of times the range was compacted. */
int n_compactions;
+ /**
+ * Number of dumps it takes to trigger major compaction in
+ * this range, see vy_run::dump_count for more details.
+ */
+ int dumps_per_compaction;
/** Link in vy_lsm->tree. */
rb_node(struct vy_range) tree_node;
/** Link in vy_lsm->range_heap. */
@@ -243,6 +248,12 @@ vy_range_update_compaction_priority(struct vy_range *range,
const struct index_opts *opts);
/**
+ * Update the value of range->dumps_per_compaction.
+ */
+void
+vy_range_update_dumps_per_compaction(struct vy_range *range);
+
+/**
* Check if a range needs to be split in two.
*
* @param range The range.
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 990daffa..28fd6a50 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -130,6 +130,21 @@ struct vy_run {
/** Max LSN stored on disk. */
int64_t dump_lsn;
/**
+ * Number of dumps it took to create this run.
+ *
+ * If the run was produced by a memory dump, it is 1.
+ * If the run was produced by a minor compaction, it
+ * is is the sum of dump counts of compacted runs.
+ * If the run was produced by a major compaction, it
+ * is is the sum of dump counts of compacted runs
+ * minus the dump count of the last (greatest) run.
+ *
+ * This way, by looking at the last level run in an LSM
+ * tree, we can tell how many dumps it took to compact
+ * it last time.
+ */
+ uint32_t dump_count;
+ /**
* Run reference counter, the run is deleted once it hits 0.
* A new run is created with the reference counter set to 1.
* A run is referenced by each slice created for it and each
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 5ec6d171..5c53b423 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1193,7 +1193,7 @@ vy_task_dump_complete(struct vy_task *task)
* Log change in metadata.
*/
vy_log_tx_begin();
- vy_log_create_run(lsm->id, new_run->id, dump_lsn);
+ vy_log_create_run(lsm->id, new_run->id, dump_lsn, new_run->dump_count);
for (range = begin_range, i = 0; range != end_range;
range = vy_range_tree_next(lsm->tree, range), i++) {
assert(i < lsm->range_count);
@@ -1226,6 +1226,7 @@ vy_task_dump_complete(struct vy_task *task)
vy_lsm_unacct_range(lsm, range);
vy_range_add_slice(range, slice);
vy_range_update_compaction_priority(range, &lsm->opts);
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
}
vy_range_heap_update_all(&lsm->range_heap);
@@ -1396,6 +1397,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
if (new_run == NULL)
goto err_run;
+ new_run->dump_count = 1;
new_run->dump_lsn = dump_lsn;
/*
@@ -1528,7 +1530,8 @@ vy_task_compaction_complete(struct vy_task *task)
rlist_foreach_entry(run, &unused_runs, in_unused)
vy_log_drop_run(run->id, gc_lsn);
if (new_slice != NULL) {
- vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn);
+ vy_log_create_run(lsm->id, new_run->id, new_run->dump_lsn,
+ new_run->dump_count);
vy_log_insert_slice(range->id, new_run->id, new_slice->id,
tuple_data_or_null(new_slice->begin),
tuple_data_or_null(new_slice->end));
@@ -1589,6 +1592,7 @@ vy_task_compaction_complete(struct vy_task *task)
}
range->n_compactions++;
vy_range_update_compaction_priority(range, &lsm->opts);
+ vy_range_update_dumps_per_compaction(range);
vy_lsm_acct_range(lsm, range);
vy_lsm_acct_compaction(lsm, compaction_time,
&compaction_input, &compaction_output);
@@ -1693,12 +1697,14 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
goto err_wi;
struct vy_slice *slice;
+ int32_t dump_count = 0;
int n = range->compaction_priority;
rlist_foreach_entry(slice, &range->slices, in_range) {
if (vy_write_iterator_new_slice(wi, slice) != 0)
goto err_wi_sub;
new_run->dump_lsn = MAX(new_run->dump_lsn,
slice->run->dump_lsn);
+ dump_count += slice->run->dump_count;
/* Remember the slices we are compacting. */
if (task->first_slice == NULL)
task->first_slice = slice;
@@ -1709,6 +1715,17 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
}
assert(n == 0);
assert(new_run->dump_lsn >= 0);
+ if (range->compaction_priority == range->slice_count)
+ dump_count -= slice->run->dump_count;
+ /*
+ * Do not update dumps_per_compaction in case compaction
+ * was triggered manually to avoid unexpected side effects,
+ * such as splitting/coalescing ranges for no good reason.
+ */
+ if (range->needs_compaction)
+ new_run->dump_count = slice->run->dump_count;
+ else
+ new_run->dump_count = dump_count;
range->needs_compaction = false;
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 3be2bb91..6d58f747 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -139,7 +139,7 @@ result
- HEADER:
type: INSERT
BODY:
- tuple: [5, {2: 8, 9: 20}]
+ tuple: [5, {2: 8, 16: 1, 9: 20}]
- HEADER:
type: INSERT
BODY:
@@ -164,7 +164,7 @@ result
- HEADER:
type: INSERT
BODY:
- tuple: [5, {0: 2, 2: 6, 9: 20}]
+ tuple: [5, {0: 2, 2: 6, 16: 1, 9: 20}]
- HEADER:
type: INSERT
BODY:
@@ -204,7 +204,7 @@ result
timestamp: <timestamp>
type: INSERT
BODY:
- tuple: [5, {0: 2, 2: 10, 9: 23}]
+ tuple: [5, {0: 2, 2: 10, 16: 1, 9: 23}]
- HEADER:
timestamp: <timestamp>
type: INSERT
@@ -224,7 +224,7 @@ result
timestamp: <timestamp>
type: INSERT
BODY:
- tuple: [5, {2: 12, 9: 23}]
+ tuple: [5, {2: 12, 16: 1, 9: 23}]
- HEADER:
timestamp: <timestamp>
type: INSERT
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index 419d3e6c..e79c32f0 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,7 +129,8 @@ test_run:cmd("setopt delimiter ''");
-- initially stats are empty
istat()
---
-- rows: 0
+- dumps_per_compaction: 0
+ rows: 0
run_avg: 0
bytes: 0
upsert:
@@ -294,10 +295,7 @@ wait(istat, st, 'disk.dump.count', 1)
...
stat_diff(istat(), st)
---
-- rows: 25
- run_avg: 1
- run_count: 1
- disk:
+- disk:
last_level:
bytes: 26049
pages: 7
@@ -321,6 +319,10 @@ stat_diff(istat(), st)
pages: 7
bytes_compressed: <bytes_compressed>
bloom_size: 70
+ rows: 25
+ run_avg: 1
+ run_count: 1
+ dumps_per_compaction: 1
bytes: 26049
put:
rows: 25
@@ -998,7 +1000,8 @@ box.stat.reset()
...
istat()
---
-- rows: 306
+- dumps_per_compaction: 1
+ rows: 306
run_avg: 1
bytes: 317731
upsert:
@@ -1732,6 +1735,138 @@ box.stat.vinyl().disk.data_compacted
---
- 0
...
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+---
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+function dump(a, b)
+ for i = a, b do
+ s:replace{i, digest.urandom(100)}
+ end
+ box.snapshot()
+end;
+---
+...
+function wait_compaction(count)
+ test_run:wait_cond(function()
+ return i:stat().disk.compaction.count == count
+ end, 10)
+end;
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
+...
+dump(1, 100)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 100) -- compaction
+---
+...
+dump(1, 100) -- split + compaction
+---
+...
+wait_compaction(3)
+---
+...
+i:stat().range_count -- 2
+---
+- 2
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(1, 10)
+---
+...
+dump(1, 40) -- compaction in range 1
+---
+...
+wait_compaction(4)
+---
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+dump(90, 100)
+---
+...
+dump(60, 100) -- compaction in range 2
+---
+...
+wait_compaction(5)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+---
+...
+i:compact()
+---
+...
+wait_compaction(7)
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+test_run:cmd('restart server test')
+fiber = require('fiber')
+---
+...
+digest = require('digest')
+---
+...
+s = box.space.test
+---
+...
+i = s.index.primary
+---
+...
+i:stat().dumps_per_compaction -- 2
+---
+- 2
+...
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+---
+- true
+...
+i:stat().dumps_per_compaction -- 1
+---
+- 1
+...
+s:drop()
+---
+...
test_run:cmd('switch default')
---
- true
diff --git a/test/vinyl/stat.test.lua b/test/vinyl/stat.test.lua
index 4a955682..4a360f33 100644
--- a/test/vinyl/stat.test.lua
+++ b/test/vinyl/stat.test.lua
@@ -528,6 +528,69 @@ s:drop()
box.stat.vinyl().disk.data_compacted
+--
+-- Number of dumps needed to trigger major compaction in
+-- an LSM tree range.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('primary', {page_size = 128, range_size = 8192, run_count_per_level = 1, run_size_ratio = 2})
+
+test_run:cmd("setopt delimiter ';'")
+function dump(a, b)
+ for i = a, b do
+ s:replace{i, digest.urandom(100)}
+ end
+ box.snapshot()
+end;
+function wait_compaction(count)
+ test_run:wait_cond(function()
+ return i:stat().disk.compaction.count == count
+ end, 10)
+end;
+test_run:cmd("setopt delimiter ''");
+
+dump(1, 100)
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 100) -- compaction
+dump(1, 100) -- split + compaction
+wait_compaction(3)
+i:stat().range_count -- 2
+i:stat().dumps_per_compaction -- 1
+
+dump(1, 10)
+dump(1, 40) -- compaction in range 1
+wait_compaction(4)
+i:stat().dumps_per_compaction -- 1
+
+dump(90, 100)
+dump(60, 100) -- compaction in range 2
+wait_compaction(5)
+i:stat().dumps_per_compaction -- 2
+
+-- Forcing compaction manually doesn't affect dumps_per_compaction.
+dump(40, 60)
+i:compact()
+wait_compaction(7)
+i:stat().dumps_per_compaction -- 2
+
+test_run:cmd('restart server test')
+
+fiber = require('fiber')
+digest = require('digest')
+
+s = box.space.test
+i = s.index.primary
+
+i:stat().dumps_per_compaction -- 2
+for i = 1, 100 do s:replace{i, digest.urandom(100)} end
+box.snapshot()
+test_run:wait_cond(function() return i:stat().disk.compaction.count == 2 end, 10)
+
+i:stat().dumps_per_compaction -- 1
+
+s:drop()
+
test_run:cmd('switch default')
test_run:cmd('stop server test')
test_run:cmd('cleanup server test')
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 6/9] vinyl: set range size automatically
2019-01-20 21:16 [PATCH 0/9] vinyl: compaction randomization and throttling Vladimir Davydov
` (4 preceding siblings ...)
2019-01-20 21:17 ` [PATCH 5/9] vinyl: keep track of dumps per compaction for each LSM tree Vladimir Davydov
@ 2019-01-20 21:17 ` Vladimir Davydov
2019-01-22 9:17 ` Vladimir Davydov
2019-02-05 17:09 ` [tarantool-patches] " Konstantin Osipov
2019-01-20 21:17 ` [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
` (2 subsequent siblings)
8 siblings, 2 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
The key space of a vinyl index consists of multiple ranges that can be
compacted independently. This design was initially invented to enable
parallel compaction, so the range size is configured statically, by the
range_size index option, which equals 1 GB by default. However, it turns
out that ranges can also be useful for smoothing IO load: if we compact
approximately the same number of ranges after each dump, we will avoid
IO bursts, which is good, because IO bursts can distort the LSM tree
shape, resulting in increased read amplification.
To achieve that, we need to maintain at least as many ranges as the
number of dumps it takes to trigger major compaction of a range. With
the default range size, this condition will hold only if the index is
huge (tens to hundreds of gigabytes). If the database isn't that big or
consists of many small indexes, the range count will never even approach
that number. So this patch makes the range size scale dynamically to
satisfy that condition.
The range size configuration options, both global and per-index, aren't
removed though. The patch just changes the box.cfg.vinyl_range_size
default value to nil, which enables automatic range sizing for all new
indexes created without passing range_size explicitly. All existing
indexes will still use the range size stored in the index options (we
don't want to alter the behavior of an existing production setup). We
are not planning to drop the range_size option altogether - it can still
be useful for testing and performance analysis.
The actual range size value is now reported in index.stat().
Needed for #3944
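Boiled down, the sizing heuristic is just this (a sketch; the full
vy_lsm_range_size() in the diff below is the authoritative version):

#include <stdint.h>

enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 }; /* never below 16 MB */
enum { MAX_RANGE_COUNT = 1000 };            /* never more than 1000 ranges */

static int64_t
range_size_auto(int64_t last_level_bytes, int dumps_per_compaction)
{
        /*
         * Keep at least as many ranges as dumps per compaction, so
         * that roughly one range gets compacted after each dump.
         */
        int range_count = dumps_per_compaction;
        if (range_count > MAX_RANGE_COUNT)
                range_count = MAX_RANGE_COUNT;
        int64_t range_size = last_level_bytes / (range_count + 1);
        if (range_size < MIN_RANGE_SIZE)
                range_size = MIN_RANGE_SIZE;
        return range_size;
}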
---
src/box/alter.cc | 8 +--
src/box/box.cc | 6 +--
src/box/index_def.c | 2 +-
src/box/lua/load_cfg.lua | 2 +-
src/box/lua/space.cc | 6 ++-
src/box/vinyl.c | 1 +
src/box/vy_lsm.c | 36 +++++++++++++-
src/box/vy_lsm.h | 4 ++
src/box/vy_range.c | 10 ++--
src/box/vy_range.h | 10 ++--
test/app-tap/init_script.result | 21 ++++----
test/box-tap/cfg.test.lua | 3 +-
test/box/admin.result | 2 -
test/box/cfg.result | 4 --
test/vinyl/ddl.result | 5 --
test/vinyl/ddl.test.lua | 1 -
test/vinyl/misc.result | 78 +++++++++++++++++++++++++++++
test/vinyl/misc.test.lua | 26 ++++++++++
test/vinyl/stat.result | 108 ++++++++++++++++++++--------------------
19 files changed, 228 insertions(+), 105 deletions(-)
diff --git a/src/box/alter.cc b/src/box/alter.cc
index 0589c967..83953a88 100644
--- a/src/box/alter.cc
+++ b/src/box/alter.cc
@@ -187,12 +187,8 @@ index_opts_decode(struct index_opts *opts, const char *map,
BOX_INDEX_FIELD_OPTS, "distance must be either "\
"'euclid' or 'manhattan'");
}
- if (opts->range_size <= 0) {
- tnt_raise(ClientError, ER_WRONG_INDEX_OPTIONS,
- BOX_INDEX_FIELD_OPTS,
- "range_size must be greater than 0");
- }
- if (opts->page_size <= 0 || opts->page_size > opts->range_size) {
+ if (opts->page_size <= 0 || (opts->range_size > 0 &&
+ opts->page_size > opts->range_size)) {
tnt_raise(ClientError, ER_WRONG_INDEX_OPTIONS,
BOX_INDEX_FIELD_OPTS,
"page_size must be greater than 0 and "
diff --git a/src/box/box.cc b/src/box/box.cc
index 9f2fd6da..b045e465 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -592,11 +592,7 @@ box_check_vinyl_options(void)
tnt_raise(ClientError, ER_CFG, "vinyl_write_threads",
"must be greater than or equal to 2");
}
- if (range_size <= 0) {
- tnt_raise(ClientError, ER_CFG, "vinyl_range_size",
- "must be greater than 0");
- }
- if (page_size <= 0 || page_size > range_size) {
+ if (page_size <= 0 || (range_size > 0 && page_size > range_size)) {
tnt_raise(ClientError, ER_CFG, "vinyl_page_size",
"must be greater than 0 and less than "
"or equal to vinyl_range_size");
diff --git a/src/box/index_def.c b/src/box/index_def.c
index 2ba57ee9..c82bc01c 100644
--- a/src/box/index_def.c
+++ b/src/box/index_def.c
@@ -40,7 +40,7 @@ const struct index_opts index_opts_default = {
/* .unique = */ true,
/* .dimension = */ 2,
/* .distance = */ RTREE_INDEX_DISTANCE_TYPE_EUCLID,
- /* .range_size = */ 1073741824,
+ /* .range_size = */ 0,
/* .page_size = */ 8192,
/* .run_count_per_level = */ 2,
/* .run_size_ratio = */ 3.5,
diff --git a/src/box/lua/load_cfg.lua b/src/box/lua/load_cfg.lua
index 6dc4a2af..fc4e560d 100644
--- a/src/box/lua/load_cfg.lua
+++ b/src/box/lua/load_cfg.lua
@@ -41,7 +41,7 @@ local default_cfg = {
vinyl_timeout = 60,
vinyl_run_count_per_level = 2,
vinyl_run_size_ratio = 3.5,
- vinyl_range_size = 1024 * 1024 * 1024,
+ vinyl_range_size = nil, -- set automatically
vinyl_page_size = 8 * 1024,
vinyl_bloom_fpr = 0.05,
log = nil,
diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index 7cae436f..abebaa87 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -334,8 +334,10 @@ lbox_fillspace(struct lua_State *L, struct space *space, int i)
lua_pushstring(L, "options");
lua_newtable(L);
- lua_pushnumber(L, index_opts->range_size);
- lua_setfield(L, -2, "range_size");
+ if (index_opts->range_size > 0) {
+ lua_pushnumber(L, index_opts->range_size);
+ lua_setfield(L, -2, "range_size");
+ }
lua_pushnumber(L, index_opts->page_size);
lua_setfield(L, -2, "page_size");
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index dc4fc830..0936932b 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -458,6 +458,7 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
info_table_end(h); /* iterator */
info_table_end(h); /* txw */
+ info_append_int(h, "range_size", vy_lsm_range_size(lsm));
info_append_int(h, "range_count", lsm->range_count);
info_append_int(h, "run_count", lsm->run_count);
info_append_int(h, "run_avg", lsm->run_count / lsm->range_count);
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 6ec86c22..d52482c6 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -691,6 +691,37 @@ vy_lsm_dumps_per_compaction(struct vy_lsm *lsm)
return range->dumps_per_compaction;
}
+int64_t
+vy_lsm_range_size(struct vy_lsm *lsm)
+{
+ /* Use the configured range size if available. */
+ if (lsm->opts.range_size > 0)
+ return lsm->opts.range_size;
+ /*
+ * It doesn't make much sense to create too small ranges.
+ * Limit the max number of ranges per index to 1000 and
+ * never create ranges smaller than 16 MB.
+ */
+ enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 };
+ enum { MAX_RANGE_COUNT = 1000 };
+ /*
+ * Ideally, we want to compact roughly the same amount of
+ * data after each dump so as to avoid IO bursts caused by
+ * simultaneous major compaction of a bunch of ranges,
+ * because such IO bursts can lead to a deviation of the
+ * LSM tree from the configured shape and, as a result,
+ * increased read amplification. To achieve that, we need
+ * to have at least as many ranges as the number of dumps
+ * it takes to trigger major compaction in a range.
+ */
+ int range_count = vy_lsm_dumps_per_compaction(lsm);
+ range_count = MIN(range_count, MAX_RANGE_COUNT);
+ int64_t range_size = lsm->stat.disk.last_level_count.bytes /
+ (range_count + 1);
+ range_size = MAX(range_size, MIN_RANGE_SIZE);
+ return range_size;
+}
+
void
vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run)
{
@@ -1055,7 +1086,8 @@ vy_lsm_split_range(struct vy_lsm *lsm, struct vy_range *range)
struct tuple_format *key_format = lsm->env->key_format;
const char *split_key_raw;
- if (!vy_range_needs_split(range, &lsm->opts, &split_key_raw))
+ if (!vy_range_needs_split(range, vy_lsm_range_size(lsm),
+ &split_key_raw))
return false;
/* Split a range in two parts. */
@@ -1163,7 +1195,7 @@ bool
vy_lsm_coalesce_range(struct vy_lsm *lsm, struct vy_range *range)
{
struct vy_range *first, *last;
- if (!vy_range_needs_coalesce(range, lsm->tree, &lsm->opts,
+ if (!vy_range_needs_coalesce(range, lsm->tree, vy_lsm_range_size(lsm),
&first, &last))
return false;
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 4df9d19a..d7cba109 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -444,6 +444,10 @@ vy_lsm_compaction_priority(struct vy_lsm *lsm);
int
vy_lsm_dumps_per_compaction(struct vy_lsm *lsm);
+/** Return the target size of a range in an LSM tree. */
+int64_t
+vy_lsm_range_size(struct vy_lsm *lsm);
+
/** Add a run to the list of runs of an LSM tree. */
void
vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run);
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 7cb1b4ba..db4a7ab0 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -418,7 +418,7 @@ vy_range_update_dumps_per_compaction(struct vy_range *range)
* 4/3 * range_size.
*/
bool
-vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
+vy_range_needs_split(struct vy_range *range, int64_t range_size,
const char **p_split_key)
{
struct vy_slice *slice;
@@ -432,7 +432,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
/* The range is too small to be split. */
- if (slice->count.bytes < opts->range_size * 4 / 3)
+ if (slice->count.bytes < range_size * 4 / 3)
return false;
/* Find the median key in the oldest run (approximately). */
@@ -488,15 +488,15 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
*/
bool
vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
- const struct index_opts *opts,
- struct vy_range **p_first, struct vy_range **p_last)
+ int64_t range_size, struct vy_range **p_first,
+ struct vy_range **p_last)
{
struct vy_range *it;
/* Size of the coalesced range. */
uint64_t total_size = range->count.bytes;
/* Coalesce ranges until total_size > max_size. */
- uint64_t max_size = opts->range_size / 2;
+ uint64_t max_size = range_size / 2;
/*
* We can't coalesce a range that was scheduled for dump
diff --git a/src/box/vy_range.h b/src/box/vy_range.h
index f19c2c6b..1df71dbf 100644
--- a/src/box/vy_range.h
+++ b/src/box/vy_range.h
@@ -280,13 +280,13 @@ vy_range_update_dumps_per_compaction(struct vy_range *range);
* Check if a range needs to be split in two.
*
* @param range The range.
- * @param opts Index options.
+ * @param range_size Target range size.
* @param[out] p_split_key Key to split the range by.
*
* @retval true If the range needs to be split.
*/
bool
-vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
+vy_range_needs_split(struct vy_range *range, int64_t range_size,
const char **p_split_key);
/**
@@ -295,7 +295,7 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
*
* @param range The range.
* @param tree The range tree.
- * @param opts Index options.
+ * @param range_size Target range size.
* @param[out] p_first The first range in the tree to coalesce.
* @param[out] p_last The last range in the tree to coalesce.
*
@@ -303,8 +303,8 @@ vy_range_needs_split(struct vy_range *range, const struct index_opts *opts,
*/
bool
vy_range_needs_coalesce(struct vy_range *range, vy_range_tree_t *tree,
- const struct index_opts *opts,
- struct vy_range **p_first, struct vy_range **p_last);
+ int64_t range_size, struct vy_range **p_first,
+ struct vy_range **p_last);
#if defined(__cplusplus)
} /* extern "C" */
diff --git a/test/app-tap/init_script.result b/test/app-tap/init_script.result
index 70a4b258..559ef521 100644
--- a/test/app-tap/init_script.result
+++ b/test/app-tap/init_script.result
@@ -39,17 +39,16 @@ box.cfg
34 vinyl_max_tuple_size:1048576
35 vinyl_memory:134217728
36 vinyl_page_size:8192
-37 vinyl_range_size:1073741824
-38 vinyl_read_threads:1
-39 vinyl_run_count_per_level:2
-40 vinyl_run_size_ratio:3.5
-41 vinyl_timeout:60
-42 vinyl_write_threads:4
-43 wal_dir:.
-44 wal_dir_rescan_delay:2
-45 wal_max_size:268435456
-46 wal_mode:write
-47 worker_pool_threads:4
+37 vinyl_read_threads:1
+38 vinyl_run_count_per_level:2
+39 vinyl_run_size_ratio:3.5
+40 vinyl_timeout:60
+41 vinyl_write_threads:4
+42 wal_dir:.
+43 wal_dir_rescan_delay:2
+44 wal_max_size:268435456
+45 wal_mode:write
+46 worker_pool_threads:4
--
-- Test insert from detached fiber
--
diff --git a/test/box-tap/cfg.test.lua b/test/box-tap/cfg.test.lua
index d8715e27..f791cc3f 100755
--- a/test/box-tap/cfg.test.lua
+++ b/test/box-tap/cfg.test.lua
@@ -6,7 +6,7 @@ local socket = require('socket')
local fio = require('fio')
local uuid = require('uuid')
local msgpack = require('msgpack')
-test:plan(103)
+test:plan(102)
--------------------------------------------------------------------------------
-- Invalid values
@@ -45,7 +45,6 @@ invalid('log', ':test:')
invalid('vinyl_memory', -1)
invalid('vinyl_read_threads', 0)
invalid('vinyl_write_threads', 1)
-invalid('vinyl_range_size', 0)
invalid('vinyl_page_size', 0)
invalid('vinyl_run_count_per_level', 0)
invalid('vinyl_run_size_ratio', 1)
diff --git a/test/box/admin.result b/test/box/admin.result
index 0b233889..e6fc1f30 100644
--- a/test/box/admin.result
+++ b/test/box/admin.result
@@ -98,8 +98,6 @@ cfg_filter(box.cfg)
- 134217728
- - vinyl_page_size
- 8192
- - - vinyl_range_size
- - 1073741824
- - vinyl_read_threads
- 1
- - vinyl_run_count_per_level
diff --git a/test/box/cfg.result b/test/box/cfg.result
index 68465669..7778f82a 100644
--- a/test/box/cfg.result
+++ b/test/box/cfg.result
@@ -86,8 +86,6 @@ cfg_filter(box.cfg)
- 134217728
- - vinyl_page_size
- 8192
- - - vinyl_range_size
- - 1073741824
- - vinyl_read_threads
- 1
- - vinyl_run_count_per_level
@@ -187,8 +185,6 @@ cfg_filter(box.cfg)
- 134217728
- - vinyl_page_size
- 8192
- - - vinyl_range_size
- - 1073741824
- - vinyl_read_threads
- 1
- - vinyl_run_count_per_level
diff --git a/test/vinyl/ddl.result b/test/vinyl/ddl.result
index 68bb6b3a..864050b3 100644
--- a/test/vinyl/ddl.result
+++ b/test/vinyl/ddl.result
@@ -8,10 +8,6 @@ test_run = require('test_run').new()
space = box.schema.space.create('test', {engine = 'vinyl' })
---
...
-space:create_index('pk', {range_size = 0})
----
-- error: 'Wrong index options (field 4): range_size must be greater than 0'
-...
space:create_index('pk', {page_size = 0})
---
- error: 'Wrong index options (field 4): page_size must be greater than 0 and less
@@ -586,7 +582,6 @@ box.space.test.index.pk
run_count_per_level: 2
run_size_ratio: 3.5
bloom_fpr: 0.05
- range_size: 1073741824
name: pk
type: TREE
...
diff --git a/test/vinyl/ddl.test.lua b/test/vinyl/ddl.test.lua
index 9b870f35..46189828 100644
--- a/test/vinyl/ddl.test.lua
+++ b/test/vinyl/ddl.test.lua
@@ -3,7 +3,6 @@ test_run = require('test_run').new()
-- sanity checks
space = box.schema.space.create('test', {engine = 'vinyl' })
-space:create_index('pk', {range_size = 0})
space:create_index('pk', {page_size = 0})
space:create_index('pk', {page_size = 8192, range_size = 4096})
space:create_index('pk', {run_count_per_level = 0})
diff --git a/test/vinyl/misc.result b/test/vinyl/misc.result
index 59492f77..5f67271e 100644
--- a/test/vinyl/misc.result
+++ b/test/vinyl/misc.result
@@ -1,3 +1,6 @@
+test_run = require('test_run').new()
+---
+...
fiber = require('fiber')
---
...
@@ -204,3 +207,78 @@ s:insert{1, 1, 2} -- error
s:drop()
---
...
+--
+-- gh-3944: automatic range size configuration.
+--
+-- Passing range_size explicitly on index creation.
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('pk', {range_size = 0})
+---
+...
+i.options.range_size -- nil
+---
+- null
+...
+i:stat().range_size -- 16 MB
+---
+- 16777216
+...
+box.space._index:get{s.id, i.id}[5].range_size -- 0
+---
+- 0
+...
+s:drop()
+---
+...
+-- Inheriting global settings.
+test_run:cmd('create server test with script = "vinyl/stat.lua"')
+---
+- true
+...
+test_run:cmd('start server test')
+---
+- true
+...
+test_run:cmd('switch test')
+---
+- true
+...
+box.cfg.vinyl_range_size -- nil
+---
+- null
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('pk')
+---
+...
+i.options.range_size -- nil
+---
+- null
+...
+i:stat().range_size -- 16 MB
+---
+- 16777216
+...
+box.space._index:get{s.id, i.id}[5].range_size -- nil
+---
+- null
+...
+s:drop()
+---
+...
+test_run:cmd('switch default')
+---
+- true
+...
+test_run:cmd('stop server test')
+---
+- true
+...
+test_run:cmd('cleanup server test')
+---
+- true
+...
diff --git a/test/vinyl/misc.test.lua b/test/vinyl/misc.test.lua
index ba7403ec..1c3a9517 100644
--- a/test/vinyl/misc.test.lua
+++ b/test/vinyl/misc.test.lua
@@ -1,3 +1,4 @@
+test_run = require('test_run').new()
fiber = require('fiber')
--
@@ -88,3 +89,28 @@ _ = s:create_index('sk', {unique = true, parts = {2, 'unsigned'}})
s:insert{1, 1, 1}
s:insert{1, 1, 2} -- error
s:drop()
+
+--
+-- gh-3944: automatic range size configuration.
+--
+-- Passing range_size explicitly on index creation.
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('pk', {range_size = 0})
+i.options.range_size -- nil
+i:stat().range_size -- 16 MB
+box.space._index:get{s.id, i.id}[5].range_size -- 0
+s:drop()
+-- Inheriting global settings.
+test_run:cmd('create server test with script = "vinyl/stat.lua"')
+test_run:cmd('start server test')
+test_run:cmd('switch test')
+box.cfg.vinyl_range_size -- nil
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('pk')
+i.options.range_size -- nil
+i:stat().range_size -- 16 MB
+box.space._index:get{s.id, i.id}[5].range_size -- nil
+s:drop()
+test_run:cmd('switch default')
+test_run:cmd('stop server test')
+test_run:cmd('cleanup server test')
diff --git a/test/vinyl/stat.result b/test/vinyl/stat.result
index b0b569ab..c1fad931 100644
--- a/test/vinyl/stat.result
+++ b/test/vinyl/stat.result
@@ -129,15 +129,10 @@ test_run:cmd("setopt delimiter ''");
-- initially stats are empty
istat()
---
-- dumps_per_compaction: 0
- rows: 0
- run_avg: 0
- bytes: 0
- upsert:
+- upsert:
squashed: 0
applied: 0
- lookup: 0
- run_count: 0
+ bytes: 0
cache:
invalidate:
rows: 0
@@ -155,10 +150,7 @@ istat()
get:
rows: 0
bytes: 0
- range_count: 1
- put:
- rows: 0
- bytes: 0
+ run_histogram: '[0]:1'
disk:
last_level:
bytes_compressed: 0
@@ -216,6 +208,12 @@ istat()
pages: 0
bytes_compressed: 0
bytes: 0
+ range_size: 16384
+ rows: 0
+ run_avg: 0
+ dumps_per_compaction: 0
+ lookup: 0
+ range_count: 1
txw:
bytes: 0
rows: 0
@@ -224,7 +222,10 @@ istat()
get:
rows: 0
bytes: 0
- run_histogram: '[0]:1'
+ run_count: 0
+ put:
+ rows: 0
+ bytes: 0
memory:
bytes: 0
index_size: 0
@@ -295,7 +296,14 @@ wait(istat, st, 'disk.dump.count', 1)
...
stat_diff(istat(), st)
---
-- disk:
+- put:
+ rows: 25
+ bytes: 26525
+ rows: 25
+ run_avg: 1
+ run_count: 1
+ dumps_per_compaction: 1
+ disk:
last_level:
bytes: 26049
pages: 7
@@ -319,14 +327,7 @@ stat_diff(istat(), st)
pages: 7
bytes_compressed: <bytes_compressed>
bloom_size: 70
- rows: 25
- run_avg: 1
- run_count: 1
- dumps_per_compaction: 1
bytes: 26049
- put:
- rows: 25
- bytes: 26525
...
-- put + dump + compaction
st = istat()
@@ -344,7 +345,12 @@ wait(istat, st, 'disk.compaction.count', 1)
...
stat_diff(istat(), st)
---
-- disk:
+- put:
+ rows: 50
+ bytes: 53050
+ rows: 25
+ bytes: 26042
+ disk:
last_level:
bytes: 26042
pages: 6
@@ -379,11 +385,6 @@ stat_diff(istat(), st)
pages: 13
bytes_compressed: <bytes_compressed>
rows: 50
- put:
- rows: 50
- bytes: 53050
- rows: 25
- bytes: 26042
...
-- point lookup from disk + cache put
st = istat()
@@ -403,7 +404,6 @@ stat_diff(istat(), st)
put:
rows: 1
bytes: 1061
- lookup: 1
disk:
iterator:
read:
@@ -415,6 +415,7 @@ stat_diff(istat(), st)
get:
rows: 1
bytes: 1061
+ lookup: 1
memory:
iterator:
lookup: 1
@@ -654,6 +655,19 @@ stat_diff(istat(), st)
put:
rows: 51
bytes: 54111
+ lookup: 1
+ txw:
+ iterator:
+ lookup: 1
+ get:
+ rows: 50
+ bytes: 53050
+ memory:
+ iterator:
+ lookup: 1
+ get:
+ rows: 100
+ bytes: 106100
disk:
iterator:
read:
@@ -665,19 +679,6 @@ stat_diff(istat(), st)
get:
rows: 100
bytes: 106100
- txw:
- iterator:
- lookup: 1
- get:
- rows: 50
- bytes: 53050
- memory:
- iterator:
- lookup: 1
- get:
- rows: 100
- bytes: 106100
- lookup: 1
get:
rows: 100
bytes: 106100
@@ -1000,15 +1001,10 @@ box.stat.reset()
...
istat()
---
-- dumps_per_compaction: 1
- rows: 306
- run_avg: 1
+- upsert:
+ squashed: 0
+ applied: 0
bytes: 317731
- upsert:
- squashed: 0
- applied: 0
- lookup: 0
- run_count: 2
cache:
invalidate:
rows: 0
@@ -1026,10 +1022,7 @@ istat()
get:
rows: 0
bytes: 0
- range_count: 2
- put:
- rows: 0
- bytes: 0
+ run_histogram: '[1]:2'
disk:
last_level:
bytes_compressed: <bytes_compressed>
@@ -1087,6 +1080,12 @@ istat()
pages: 25
bytes_compressed: <bytes_compressed>
bytes: 104300
+ range_size: 16384
+ rows: 306
+ run_avg: 1
+ dumps_per_compaction: 1
+ lookup: 0
+ range_count: 2
txw:
bytes: 0
rows: 0
@@ -1095,7 +1094,10 @@ istat()
get:
rows: 0
bytes: 0
- run_histogram: '[1]:2'
+ run_count: 2
+ put:
+ rows: 0
+ bytes: 0
memory:
bytes: 213431
index_size: 49152
--
2.11.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 6/9] vinyl: set range size automatically
2019-01-20 21:17 ` [PATCH 6/9] vinyl: set range size automatically Vladimir Davydov
@ 2019-01-22 9:17 ` Vladimir Davydov
2019-02-05 17:09 ` [tarantool-patches] " Konstantin Osipov
1 sibling, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-22 9:17 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:05AM +0300, Vladimir Davydov wrote:
> +int64_t
> +vy_lsm_range_size(struct vy_lsm *lsm)
> +{
> + /* Use the configured range size if available. */
> + if (lsm->opts.range_size > 0)
> + return lsm->opts.range_size;
> + /*
> + * It doesn't make much sense to create too small ranges.
> + * Limit the max number of ranges per index to 1000 and
> + * never create ranges smaller than 16 MB.
> + */
> + enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 };
> + enum { MAX_RANGE_COUNT = 1000 };
> + /*
> + * Ideally, we want to compact roughly the same amount of
> + * data after each dump so as to avoid IO bursts caused by
> + * simultaneous major compaction of a bunch of ranges,
> + * because such IO bursts can lead to a deviation of the
> + * LSM tree from the configured shape and, as a result,
> + * increased read amplification. To achieve that, we need
> + * to have at least as many ranges as the number of dumps
> + * it takes to trigger major compaction in a range.
> + */
> + int range_count = vy_lsm_dumps_per_compaction(lsm);
After having pondered this for a while, I'm inclined to think it isn't
such a good idea to use dumps_per_compaction for range_count: first, it
won't work for time-series-like workloads - the range_count won't scale
as the space size grows; second, after forcing compaction with
index.compact(), the dumps_per_compaction may collapse, resulting in
massive range coalescing.
Maybe we'd better use the LSM tree fanout for the target number of ranges?
But how do we calculate it reliably?
> + range_count = MIN(range_count, MAX_RANGE_COUNT);
> + int64_t range_size = lsm->stat.disk.last_level_count.bytes /
> + (range_count + 1);
> + range_size = MAX(range_size, MIN_RANGE_SIZE);
> + return range_size;
> +}
^ permalink raw reply [flat|nested] 39+ messages in thread
* [tarantool-patches] Re: [PATCH 6/9] vinyl: set range size automatically
2019-01-20 21:17 ` [PATCH 6/9] vinyl: set range size automatically Vladimir Davydov
2019-01-22 9:17 ` Vladimir Davydov
@ 2019-02-05 17:09 ` Konstantin Osipov
2019-02-06 9:23 ` Vladimir Davydov
1 sibling, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-05 17:09 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> +int64_t
> +vy_lsm_range_size(struct vy_lsm *lsm)
> +{
> + /* Use the configured range size if available. */
> + if (lsm->opts.range_size > 0)
> + return lsm->opts.range_size;
> + /*
> + * It doesn't make much sense to create too small ranges.
> + * Limit the max number of ranges per index to 1000 and
> + * never create ranges smaller than 16 MB.
> + */
> + enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 };
> + enum { MAX_RANGE_COUNT = 1000 };
> + /*
> + * Ideally, we want to compact roughly the same amount of
> + * data after each dump so as to avoid IO bursts caused by
> + * simultaneous major compaction of a bunch of ranges,
> + * because such IO bursts can lead to a deviation of the
> + * LSM tree from the configured shape and, as a result,
> + * increased read amplification. To achieve that, we need
> + * to have at least as many ranges as the number of dumps
> + * it takes to trigger major compaction in a range.
> + */
> + int range_count = vy_lsm_dumps_per_compaction(lsm);
> + range_count = MIN(range_count, MAX_RANGE_COUNT);
> + int64_t range_size = lsm->stat.disk.last_level_count.bytes /
> + (range_count + 1);
> + range_size = MAX(range_size, MIN_RANGE_SIZE);
> + return range_size;
> +}
OK, you could say the value is rarely used, so can be calculated
each time it is used, but why instead not recalculate it on each
major compaction? This would spare us from technical debt and
having to think about potential performance bottleneck in the
future.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 6/9] vinyl: set range size automatically
2019-02-05 17:09 ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-06 9:23 ` Vladimir Davydov
2019-02-06 17:04 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 9:23 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Tue, Feb 05, 2019 at 08:09:09PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> > +int64_t
> > +vy_lsm_range_size(struct vy_lsm *lsm)
> > +{
> > + /* Use the configured range size if available. */
> > + if (lsm->opts.range_size > 0)
> > + return lsm->opts.range_size;
> > + /*
> > + * It doesn't make much sense to create too small ranges.
> > + * Limit the max number of ranges per index to 1000 and
> > + * never create ranges smaller than 16 MB.
> > + */
> > + enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 };
> > + enum { MAX_RANGE_COUNT = 1000 };
> > + /*
> > + * Ideally, we want to compact roughly the same amount of
> > + * data after each dump so as to avoid IO bursts caused by
> > + * simultaneous major compaction of a bunch of ranges,
> > + * because such IO bursts can lead to a deviation of the
> > + * LSM tree from the configured shape and, as a result,
> > + * increased read amplification. To achieve that, we need
> > + * to have at least as many ranges as the number of dumps
> > + * it takes to trigger major compaction in a range.
> > + */
> > + int range_count = vy_lsm_dumps_per_compaction(lsm);
> > + range_count = MIN(range_count, MAX_RANGE_COUNT);
> > + int64_t range_size = lsm->stat.disk.last_level_count.bytes /
> > + (range_count + 1);
> > + range_size = MAX(range_size, MIN_RANGE_SIZE);
> > + return range_size;
> > +}
>
> OK, you could say the value is rarely used, so can be calculated
> each time it is used, but why instead not recalculate it on each
> major compaction? This would spare us from technical debt and
> having to think about potential performance bottleneck in the
> future.
Well, okay, I think we can do that, too.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 6/9] vinyl: set range size automatically
2019-02-06 9:23 ` Vladimir Davydov
@ 2019-02-06 17:04 ` Vladimir Davydov
0 siblings, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 17:04 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 12:23:31PM +0300, Vladimir Davydov wrote:
> On Tue, Feb 05, 2019 at 08:09:09PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> > > +int64_t
> > > +vy_lsm_range_size(struct vy_lsm *lsm)
> > > +{
> > > + /* Use the configured range size if available. */
> > > + if (lsm->opts.range_size > 0)
> > > + return lsm->opts.range_size;
> > > + /*
> > > + * It doesn't make much sense to create too small ranges.
> > > + * Limit the max number of ranges per index to 1000 and
> > > + * never create ranges smaller than 16 MB.
> > > + */
> > > + enum { MIN_RANGE_SIZE = 16 * 1024 * 1024 };
> > > + enum { MAX_RANGE_COUNT = 1000 };
> > > + /*
> > > + * Ideally, we want to compact roughly the same amount of
> > > + * data after each dump so as to avoid IO bursts caused by
> > > + * simultaneous major compaction of a bunch of ranges,
> > > + * because such IO bursts can lead to a deviation of the
> > > + * LSM tree from the configured shape and, as a result,
> > > + * increased read amplification. To achieve that, we need
> > > + * to have at least as many ranges as the number of dumps
> > > + * it takes to trigger major compaction in a range.
> > > + */
> > > + int range_count = vy_lsm_dumps_per_compaction(lsm);
> > > + range_count = MIN(range_count, MAX_RANGE_COUNT);
> > > + int64_t range_size = lsm->stat.disk.last_level_count.bytes /
> > > + (range_count + 1);
> > > + range_size = MAX(range_size, MIN_RANGE_SIZE);
> > > + return range_size;
> > > +}
> >
> > OK, you could say the value is rarely used, so can be calculated
> > each time it is used, but why instead not recalculate it on each
> > major compaction? This would spare us from technical debt and
> > having to think about potential performance bottleneck in the
> > future.
>
> Well, okay, I think we can do that, too.
Come to think of it, I'd prefer not to, because this would add one more
member to struct vy_lsm, which we would have to update on dump and
compaction - it's more error-prone IMO. Since the calculation of the
target range size is trivial and rarely called (only on split/coalesce),
I think we should leave it as is.
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-01-20 21:16 [PATCH 0/9] vinyl: compaction randomization and throttling Vladimir Davydov
` (5 preceding siblings ...)
2019-01-20 21:17 ` [PATCH 6/9] vinyl: set range size automatically Vladimir Davydov
@ 2019-01-20 21:17 ` Vladimir Davydov
2019-01-22 12:54 ` Vladimir Davydov
2019-02-05 17:14 ` Konstantin Osipov
2019-01-20 21:17 ` [PATCH 8/9] vinyl: introduce quota consumer types Vladimir Davydov
2019-01-20 21:17 ` [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
8 siblings, 2 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Since all ranges constituting an LSM tree have the same configuration,
they tend to get compacted at approximately the same time. This entails
IO load spikes, which, in turn, lead to deviation of the LSM tree from
the target shape and hence increased read amplification. To prevent this
from happening, this patch implements compaction randomization: with 10%
probability we defer compaction at each LSM tree level, i.e. if the
number of runs at a level exceeds the configured run_count_per_level,
the level will be compacted with 90% probability, but with 10%
probability it won't - compaction will be deferred until another run
is added to the level.
Our simulations show that such a simple algorithm performs fairly well:
it randomizes compaction pace among ranges, spreading IO load evenly in
time, while the write amplification is increased by not more than 5-10%,
which seems to be a reasonable price for elimination of IO load spikes.
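In essence, the per-level check added by the diff below boils down to
the following sketch (level_needs_compaction is not a real function in
the patch, it only isolates the decision; the seed is the rand() value
stored in the slice at creation time):
#include <stdlib.h>	/* rand(), RAND_MAX */
#include <stdbool.h>
static bool
level_needs_compaction(int run_count, int run_count_per_level, int seed)
{
	int max_run_count = run_count_per_level;
	/* With ~10% probability allow one extra run at this level. */
	if (seed < RAND_MAX / 10)
		max_run_count++;
	return run_count > max_run_count;
}
For example, with run_count_per_level = 2 most levels are scheduled for
compaction as soon as they accumulate 3 runs, while roughly one in ten
is allowed a 4th run, which is enough to make ranges drift apart by
about one dump and spread compaction in time.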
Closes #3944
---
src/box/vy_range.c | 20 +++++++++++++++++++-
src/box/vy_run.c | 1 +
src/box/vy_run.h | 5 +++++
test/vinyl/write_iterator.result | 16 ++++++++++++++++
test/vinyl/write_iterator.test.lua | 6 ++++++
5 files changed, 47 insertions(+), 1 deletion(-)
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index db4a7ab0..747bfae3 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -369,7 +369,25 @@ vy_range_update_compaction_priority(struct vy_range *range,
* we find an appropriate level for it.
*/
}
- if (level_run_count > opts->run_count_per_level) {
+ /*
+ * Since all ranges constituting an LSM tree have
+ * the same configuration, they tend to get compacted
+ * simultaneously, leading to IO load spikes and, as
+ * a result, distortion of the LSM tree shape and
+ * increased read amplification. To prevent this from
+ * happening, we constantly randomize compaction pace
+ * among ranges by deferring compaction at each LSM
+ * tree level with some fixed small probability.
+ *
+ * Note, we can't use rand() directly here, because
+ * this function is called on every memory dump and
+ * scans all LSM tree levels. Instead we use the
+ * value of rand() from the slice creation time.
+ */
+ uint32_t max_run_count = opts->run_count_per_level;
+ if (slice->seed < RAND_MAX / 10)
+ max_run_count++;
+ if (level_run_count > max_run_count) {
/*
* The number of runs at the current level
* exceeds the configured maximum. Arrange
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index cee90458..4c7e637c 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -377,6 +377,7 @@ vy_slice_new(int64_t id, struct vy_run *run, struct tuple *begin,
memset(slice, 0, sizeof(*slice));
slice->id = id;
slice->run = run;
+ slice->seed = rand();
vy_run_ref(run);
run->slice_count++;
if (begin != NULL)
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 28fd6a50..18ca1729 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -185,6 +185,11 @@ struct vy_slice {
struct tuple *begin;
struct tuple *end;
/**
+ * Random seed used for compaction randomization.
+ * Lies in the range [0, RAND_MAX].
+ */
+ int seed;
+ /**
* Number of async users of this slice. Slice must not
* be removed until it hits 0. Used by the iterator to
* prevent use-after-free after waiting for IO.
diff --git a/test/vinyl/write_iterator.result b/test/vinyl/write_iterator.result
index 88a1c287..433338ee 100644
--- a/test/vinyl/write_iterator.result
+++ b/test/vinyl/write_iterator.result
@@ -766,6 +766,14 @@ box.snapshot()
---
- ok
...
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
-- Generate some INSERT statements and dump them to disk.
_ = s:insert{1, 1} -- insert
---
@@ -922,6 +930,14 @@ box.snapshot()
---
- ok
...
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
-- Generate DELETE+INSERT statements and write them to disk.
s:delete{1} s:insert{1, 100}
---
diff --git a/test/vinyl/write_iterator.test.lua b/test/vinyl/write_iterator.test.lua
index 069c7f69..9d216eba 100644
--- a/test/vinyl/write_iterator.test.lua
+++ b/test/vinyl/write_iterator.test.lua
@@ -327,6 +327,9 @@ PAD2 = 10
-- Create a big run to prevent major compaction.
for i = 1001, 1000 + PAD1 do s:replace{i, i} end
box.snapshot()
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+box.snapshot()
-- Generate some INSERT statements and dump them to disk.
_ = s:insert{1, 1} -- insert
_ = s:replace{2, 2} -- replace, no old tuple
@@ -385,6 +388,9 @@ _ = s:insert{6, 6}
_ = s:insert{7, 7}
_ = s:insert{8, 8}
box.snapshot()
+-- Some padding to trigger minor compaction.
+for i = 1001, 1000 + PAD2 do s:replace{i, i} end
+box.snapshot()
-- Generate DELETE+INSERT statements and write them to disk.
s:delete{1} s:insert{1, 100}
box.begin() s:delete{2} s:insert{2, 200} box.commit()
--
2.11.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-01-20 21:17 ` [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
@ 2019-01-22 12:54 ` Vladimir Davydov
2019-02-05 17:39 ` [tarantool-patches] " Konstantin Osipov
2019-02-05 17:14 ` Konstantin Osipov
1 sibling, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-22 12:54 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:06AM +0300, Vladimir Davydov wrote:
> Since all ranges constituting an LSM tree have the same configuration,
> they tend to get compacted at approximately the same time. This entails
> IO load spikes, which, in turn, lead to deviation of the LSM tree from
> the target shape and hence increased read amplification. To prevent this
> from happening, this patch implements compaction randomization: with 10%
> probability we defer compaction at each LSM tree level, i.e. if the
> number of runs at a level exceeds the configured run_count_per_level,
> the level will be compacted with 90%-probability, but with 10%
> probability it won't - compaction will be deferred until another run
> is added to the level.
>
> Our simulations show that such a simple algorithm performs fairly well:
> it randomizes compaction pace among ranges, spreading IO load evenly in
> time, while the write amplification is increased by not more than 5-10%,
> which seems to be a reasonable price for elimination of IO load spikes.
>
> Closes #3944
> ---
> src/box/vy_range.c | 20 +++++++++++++++++++-
> src/box/vy_run.c | 1 +
> src/box/vy_run.h | 5 +++++
> test/vinyl/write_iterator.result | 16 ++++++++++++++++
> test/vinyl/write_iterator.test.lua | 6 ++++++
> 5 files changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/src/box/vy_range.c b/src/box/vy_range.c
> index db4a7ab0..747bfae3 100644
> --- a/src/box/vy_range.c
> +++ b/src/box/vy_range.c
> @@ -369,7 +369,25 @@ vy_range_update_compaction_priority(struct vy_range *range,
> * we find an appropriate level for it.
> */
> }
> - if (level_run_count > opts->run_count_per_level) {
> + /*
> + * Since all ranges constituting an LSM tree have
> + * the same configuration, they tend to get compacted
> + * simultaneously, leading to IO load spikes and, as
> + * a result, distortion of the LSM tree shape and
> + * increased read amplification. To prevent this from
> + * happening, we constantly randomize compaction pace
> + * among ranges by deferring compaction at each LSM
> + * tree level with some fixed small probability.
> + *
> + * Note, we can't use rand() directly here, because
> + * this function is called on every memory dump and
> + * scans all LSM tree levels. Instead we use the
> + * value of rand() from the slice creation time.
> + */
> + uint32_t max_run_count = opts->run_count_per_level;
> + if (slice->seed < RAND_MAX / 10)
> + max_run_count++;
> + if (level_run_count > max_run_count) {
> /*
> * The number of runs at the current level
> * exceeds the configured maximum. Arrange
I ran some tests and, surprisingly, it turned out that randomization
didn't help at all: the compaction queue size jumped up to 30% and even
40% from time to time although there was plenty of compaction power -
compaction threads were busy only half of the time. When I looked
closer, I saw that the queue size behavior looked weird - it jumped
after a dump for a very short period of time, until the next dump, which
pushed it back to 10%. This made me wonder how it could happen at all -
normally, compaction queue should only grow after a dump, not diminish.
I think I've finally found the root cause of the problem. The code
computing compaction priority (see vy_range_update_compaction_priority)
is unstable: the size of the first level equals the size of the
smallest run, so if memory dumps produce runs of varying sizes, which is
what happens in practice in contrast to simulation, the shape of the
tree will vary as well, resulting in different compaction priority and
unstable queue behavior.
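For example (numbers purely illustrative), if one dump produces an 8 MB
run and the next one a 12 MB run, the base level size used for the
computation changes from call to call, so runs may be grouped into
levels differently each time and the computed priority jumps around even
though the data on disk is the same.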
We must fix this somehow. One way to do it is to compute the first level
size based on the size of the last level run, which is constant most of
the time, and indeed, when I did it and reran the tests, I found that
the queue stayed below 10% all the time. Turning off randomization made
the queue jump up to 30%, which was expected. The diff is below. I'll
spruce it up a little, wrap it nicely, and submit it separately later.
diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 747bfae3..38a64632 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -315,28 +315,33 @@ vy_range_update_compaction_priority(struct vy_range *range,
struct vy_disk_stmt_counter total_stmt_count;
vy_disk_stmt_counter_reset(&total_stmt_count);
/* Total number of checked runs. */
- uint32_t total_run_count = 0;
+ int32_t total_run_count = 0;
/* Estimated size of a compacted run, if compaction is scheduled. */
- uint64_t est_new_run_size = 0;
+ int64_t est_new_run_size = 0;
/* The number of runs at the current level. */
- uint32_t level_run_count = 0;
+ int32_t level_run_count = 0;
/*
* The target (perfect) size of a run at the current level.
* For the first level, it's the size of the newest run.
* For lower levels it's computed as first level run size
* times run_size_ratio.
*/
- uint64_t target_run_size = 0;
+ int64_t target_run_size = 0;
+ int64_t size;
struct vy_slice *slice;
+
+ slice = rlist_last_entry(&range->slices, struct vy_slice, in_range);
+ size = slice->count.bytes;
+
+ slice = rlist_first_entry(&range->slices, struct vy_slice, in_range);
+ do {
+ target_run_size = size;
+ size = DIV_ROUND_UP(target_run_size, opts->run_size_ratio);
+ } while (size > MAX(slice->count.bytes, 1));
+
rlist_foreach_entry(slice, &range->slices, in_range) {
- uint64_t size = slice->count.bytes;
- /*
- * The size of the first level is defined by
- * the size of the most recent run.
- */
- if (target_run_size == 0)
- target_run_size = size;
+ size = slice->count.bytes;
level_run_count++;
total_run_count++;
vy_disk_stmt_counter_add(&total_stmt_count, &slice->count);
@@ -384,7 +389,7 @@ vy_range_update_compaction_priority(struct vy_range *range,
* scans all LSM tree levels. Instead we use the
* value of rand() from the slice creation time.
*/
- uint32_t max_run_count = opts->run_count_per_level;
+ int32_t max_run_count = opts->run_count_per_level;
if (slice->seed < RAND_MAX / 10)
max_run_count++;
if (level_run_count > max_run_count) {
^ permalink raw reply [flat|nested] 39+ messages in thread
* [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-01-22 12:54 ` Vladimir Davydov
@ 2019-02-05 17:39 ` Konstantin Osipov
2019-02-06 8:53 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-05 17:39 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/22 15:56]:
> I ran some tests and, surprisingly, it turned out that randomization
> didn't help at all: the compaction queue size jumped up to 30% and even
> 40% from time to time although there was plenty of compaction power -
> compaction threads were busy only half of the time. When I looked
> closer, I saw that the queue size behavior looked weird - it jumped
> after a dump for a very short period of time, until the next dump, which
> pushed it back to 10%. This made me wonder how it could happen at all -
> normally, compaction queue should only grow after a dump, not diminish.
> I think I've finally found the root cause of the problem. The code
> computing compaction priority (see vy_range_update_compaction_priority)
> is unstable: the size of the first level equals the size of the
> smallest run, so if memory dumps produce runs of varying sizes, which is
> what happens in practice in contrast to simulation, the shape of the
> tree will vary as well, resulting in different compaction priority and
> unstable queue behavior.
>
> We must fix this somehow. One way to do it is to compute the first level
> size based on the size of the last level run, which is constant most of
> the time, and indeed, when I did it and reran the tests, I found that
> the queue stayed below 10% all the time. Turning off randomization made
> the queue jump up to 30%, which was expected. The diff is below. I'll
> spruce it up a little, wrap it nicely, and submit it separately later.
Why not use a simple weighted average?
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-05 17:39 ` [tarantool-patches] " Konstantin Osipov
@ 2019-02-06 8:53 ` Vladimir Davydov
2019-02-06 10:44 ` Konstantin Osipov
0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 8:53 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Tue, Feb 05, 2019 at 08:39:58PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/22 15:56]:
> > I ran some tests and, surprisingly, it turned out that randomization
> > didn't help at all: the compaction queue size jumped up to 30% and even
> > 40% from time to time although there was plenty of compaction power -
> > compaction threads were busy only half of the time. When I looked
> > closer, I saw that the queue size behavior looked weird - it jumped
> > after a dump for a very short period of time, until the next dump, which
> > pushed it back to 10%. This made me wonder how it could happen at all -
> > normally, compaction queue should only grow after a dump, not diminish.
>
> > I think I've finally found the root cause of the problem. The code
> > computing compaction priority (see vy_range_update_compaction_priority)
> > is unstable: the size of the first level equals the size of the
> > smallest run, so if memory dumps produce runs of varying sizes, which is
> > what happens in practice in contrast to simulation, the shape of the
> > tree will vary as well, resulting in different compaction priority and
> > unstable queue behavior.
> >
> > We must fix this somehow. One way to do it is to compute the first level
> > size based on the size of the last level run, which is constant most of
> > the time, and indeed, when I did it and reran the tests, I found that
> > the queue stayed below 10% all the time. Turning off randomization made
> > the queue jump up to 30%, which was expected. The diff is below. I'll
> > spruce it up a little, wrap it nicely, and submit it separately later.
>
> Why not use a simple weighted average?
Over how many dumps? What do we do after restart, when there's no
history and perhaps even no level 1?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 8:53 ` Vladimir Davydov
@ 2019-02-06 10:44 ` Konstantin Osipov
2019-02-06 10:52 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-06 10:44 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> Over how many dumps? What do we do after restart, when there's no
> history and perhaps even no level 1?
I am thinking about something along these lines:
f(n+1) = (f(n) + x*k) / (1 + k), where k is the weight used to scale
the next input
alternatively:
f(n+1) = sqrt((f(n)^2 + x^2)/2)
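For illustration, the first variant could look like the sketch below
(the struct and function names are made up, k and the seeding policy
are open questions, and none of this is part of any patch):
#include <stdbool.h>
struct level1_est {
	double value;	/* current estimate of the first level size */
	bool seeded;
};
static void
level1_est_update(struct level1_est *e, double run_size, double k)
{
	if (!e->seeded) {
		/* No history yet - seed with the first observed run. */
		e->value = run_size;
		e->seeded = true;
		return;
	}
	/* f(n+1) = (f(n) + x*k) / (1 + k) */
	e->value = (e->value + run_size * k) / (1 + k);
}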
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 10:44 ` Konstantin Osipov
@ 2019-02-06 10:52 ` Vladimir Davydov
2019-02-06 11:06 ` Konstantin Osipov
0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 10:52 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 01:44:19PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> > Over how many dumps? What do we do after restart, when there's no
> > history and perhaps even no level 1?
>
> I am thinking about something along these lines:
> f(n+1) = (f(n) + x*k) / (1 + k), where k is the weight used to scale
> the next input
What should k be equal to?
Also, what should we do after restart when there's no history?
Why is using the last level size as reference bad?
>
> alternatively:
>
> f(n+1) = sqrt((f(n)^2 + x^2)/2)
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 10:52 ` Vladimir Davydov
@ 2019-02-06 11:06 ` Konstantin Osipov
2019-02-06 11:49 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-06 11:06 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:57]:
> On Wed, Feb 06, 2019 at 01:44:19PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> > > Over how many dumps? What do we do after restart, when there's no
> > > history and perhaps even no level 1?
> >
> > I am thinking about something along these lines:
> > f(n+1) = (f(n) + x*k) / (1 + k), where k is the weight used to scale
> > the next input
>
> What should k be equal to?
I don't know.
> Also, what should we do after restart when there's no history?
Seed the function with the top run size. If there is no top run,
seed with 0.
>
> Why is using the last level size as reference bad?
Because you don't know if it's last or not?
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 11:06 ` Konstantin Osipov
@ 2019-02-06 11:49 ` Vladimir Davydov
2019-02-06 13:43 ` Konstantin Osipov
0 siblings, 1 reply; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 11:49 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 02:06:09PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:57]:
> > On Wed, Feb 06, 2019 at 01:44:19PM +0300, Konstantin Osipov wrote:
> > > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 13:31]:
> > > > Over how many dumps? What do we do after restart, when there's no
> > > > history and perhaps even no level 1?
> > >
> > > I am thinking about something along these lines:
> > > f(n+1) = (f(n) + x*k) / (1 + k), where k is the weight used to scale
> > > the next input
> >
> > What should k be equal to?
>
> I don't know.
Neither do I. That's why I don't want to involve any kind of averaging.
>
> > Also, what should we do after restart when there's no history?
>
> Seed the function with the top run size.
> If there is no top run, seed with 0.
This way the moving average will grow very reluctantly - we'll need more
than k dumps to adapt. During that time, compaction priority calculation
will be unstable.
> >
> > Why is using the last level size as reference bad?
>
> Because you don't know if it's last or not?
We know which run is last. Provided the workload is stable, i.e. has
stopped growing its dataset, it will be roughly the same. Besides, the
last level run size changes only on major compaction, which is
infrequent. After a major compaction, it's OK to use a different first
level size - the point is that, in order not to break the LSM algorithm,
we have to maintain stable level sizing between major compactions.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 11:49 ` Vladimir Davydov
@ 2019-02-06 13:43 ` Konstantin Osipov
2019-02-06 14:00 ` Vladimir Davydov
0 siblings, 1 reply; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-06 13:43 UTC (permalink / raw)
To: Vladimir Davydov; +Cc: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 14:53]:
> We know which run is last. Provided the workload is stable, i.e. has
> stopped growing its dataset, it will be roughly the same. Besides, the
> last level run size changes only on major compaction, which is
> infrequent. After a major compaction, it's OK to use a different first
> level size - the point is that, in order not to break the LSM algorithm,
> we have to maintain stable level sizing between major compactions.
The problem is that you can't use the last level as-is. You need a
divider, e.g. dumps_per_compaction, which is meaningless until the
first major compaction has happened.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-02-06 13:43 ` Konstantin Osipov
@ 2019-02-06 14:00 ` Vladimir Davydov
0 siblings, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-02-06 14:00 UTC (permalink / raw)
To: Konstantin Osipov; +Cc: tarantool-patches
On Wed, Feb 06, 2019 at 04:43:41PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/02/06 14:53]:
> > We know which run is last. Provided the workload is stable, i.e. has
> > stopped growing its dataset, it will be roughly the same. Besides, the
> > last level run size changes only on major compaction, which is
> > infrequent. After a major compaction, it's OK to use a different first
> > level size - the point is that, in order not to break the LSM algorithm,
> > we have to maintain stable level sizing between major compactions.
>
> The problem is that you can't use the last level as-is. You need a
> divider, e.g. dumps_per_compaction, which is meaningless until the
> first major compaction has happened.
Why? Level sizing isn't tight. Sizes of runs at level L range from
X*run_size_ratio^L to X*run_size_ratio^(L+1). We can choose X as we
please without breaking LSM algorithm assumptions, can't we?
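For illustration (ratio picked arbitrarily): with run_size_ratio = 4,
choosing X = 16 MB puts level-L runs in [16 MB * 4^L, 16 MB * 4^(L+1));
choosing X = 24 MB merely shifts those boundaries. Either choice keeps
the geometric progression between levels intact, which is all the LSM
analysis relies on.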
^ permalink raw reply [flat|nested] 39+ messages in thread
* [tarantool-patches] Re: [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes
2019-01-20 21:17 ` [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
2019-01-22 12:54 ` Vladimir Davydov
@ 2019-02-05 17:14 ` Konstantin Osipov
1 sibling, 0 replies; 39+ messages in thread
From: Konstantin Osipov @ 2019-02-05 17:14 UTC (permalink / raw)
To: tarantool-patches
* Vladimir Davydov <vdavydov.dev@gmail.com> [19/01/21 06:58]:
> Since all ranges constituting an LSM tree have the same configuration,
> they tend to get compacted at approximately the same time. This entails
> IO load spikes, which, in turn, lead to deviation of the LSM tree from
> the target shape and hence increased read amplification. To prevent this
> from happening, this patch implements compaction randomization: with 10%
> probability we defer compaction at each LSM tree level, i.e. if the
> number of runs at a level exceeds the configured run_count_per_level,
> the level will be compacted with 90%-probability, but with 10%
> probability it won't - compaction will be deferred until another run
> is added to the level.
>
> Our simulations show that such a simple algorithm performs fairly well:
> it randomizes compaction pace among ranges, spreading IO load evenly in
> time, while the write amplification is increased by not more than 5-10%,
> which seems to be a reasonable price for elimination of IO load spikes.
This is very nice. Hours of debates and modeling resulting in just
a few lines of code.
--
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 8/9] vinyl: introduce quota consumer types
2019-01-20 21:16 [PATCH 0/9] vinyl: compaction randomization and throttling Vladimir Davydov
` (6 preceding siblings ...)
2019-01-20 21:17 ` [PATCH 7/9] vinyl: randomize range compaction to avoid IO load spikes Vladimir Davydov
@ 2019-01-20 21:17 ` Vladimir Davydov
2019-01-20 21:17 ` [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
8 siblings, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Currently, we only limit the quota consumption rate so that writers won't
hit the hard limit before memory dump is complete. However, this isn't
enough, because we also need to consider compaction: if it doesn't keep
up with dumps, read and space amplification will grow uncontrollably.
The problem is that compaction may be a quota consumer itself, as it may
generate deferred DELETE statements for secondary indexes. We can't
ignore quota completely there, because if we do, we may hit the memory
limit and stall all writers, which is unacceptable, but we do want to
ignore the rate limit imposed to make sure that compaction keeps up with
dumps, otherwise compaction won't benefit from such throttling.
To tackle this problem, this patch introduces the concept of quota
consumer types and resources. Now vy_quota maintains one rate limit per
resource and one wait queue per consumer type. There are two types of
consumers, compaction jobs and usual transactions, and there are two
resources managed by vy_quota, disk and memory. The memory-based rate
limit ensures that transactions won't hit the hard memory limit and
stall before memory dump is complete; it is respected by all types of
consumers. The disk-based rate limit is supposed to be set when
compaction doesn't keep up with dumps; it is only applied to usual
transactions and ignored by compaction jobs.
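Schematically, the consumer-to-resource mapping introduced below looks
like this:
  VY_QUOTA_CONSUMER_TX         -> memory limit + disk limit
  VY_QUOTA_CONSUMER_COMPACTION -> memory limit only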
Since there are now two wait queues, we need to balance wakeups between
them in case consumers in both queues are ready to proceed. To ensure
there's no starvation, we maintain a monotonically growing counter and
assign its value to each consumer put to sleep (its ticket). We use it
to wake up the consumer that has waited the longest when both queues are
ready.
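For example, if a transaction fiber went to sleep with ticket 17 and a
compaction fiber with ticket 18, and both may now proceed, the
transaction fiber is woken up first because it holds the smaller ticket.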
Note, the patch doesn't implement the logic of disk-based throttling in
the regulator module. It is still left for future work.
Needed for #3721
---
src/box/vinyl.c | 30 +++++++-----
src/box/vy_quota.c | 123 ++++++++++++++++++++++++++++++++++++++-----------
src/box/vy_quota.h | 91 ++++++++++++++++++++++++++++++------
src/box/vy_regulator.c | 12 +++--
4 files changed, 198 insertions(+), 58 deletions(-)
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 0936932b..aaef858e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -2342,7 +2342,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
* the transaction to be sent to read view or aborted, we call
* it before checking for conflicts.
*/
- if (vy_quota_use(&env->quota, tx->write_size, timeout) != 0)
+ if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+ tx->write_size, timeout) != 0)
return -1;
size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -2351,8 +2352,8 @@ vinyl_engine_prepare(struct engine *engine, struct txn *txn)
size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
- vy_quota_adjust(&env->quota, tx->write_size,
- mem_used_after - mem_used_before);
+ vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX,
+ tx->write_size, mem_used_after - mem_used_before);
vy_regulator_check_dump_watermark(&env->regulator);
return rc;
}
@@ -2377,7 +2378,8 @@ vinyl_engine_commit(struct engine *engine, struct txn *txn)
size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
/* We can't abort the transaction at this point, use force. */
- vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+ vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+ mem_used_after - mem_used_before);
vy_regulator_check_dump_watermark(&env->regulator);
txn->engine_tx = NULL;
@@ -3188,7 +3190,8 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
* quota accounting.
*/
size_t reserved = tx->write_size;
- if (vy_quota_use(&env->quota, reserved, TIMEOUT_INFINITY) != 0)
+ if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+ reserved, TIMEOUT_INFINITY) != 0)
unreachable();
size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
@@ -3207,7 +3210,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
size_t used = mem_used_after - mem_used_before;
- vy_quota_adjust(&env->quota, reserved, used);
+ vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX, reserved, used);
vy_regulator_check_dump_watermark(&env->regulator);
return rc;
}
@@ -3529,7 +3532,7 @@ vy_squash_process(struct vy_squash *squash)
* so there's no need in invalidating the cache.
*/
vy_mem_commit_stmt(mem, region_stmt);
- vy_quota_force_use(&env->quota,
+ vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
mem_used_after - mem_used_before);
vy_regulator_check_dump_watermark(&env->regulator);
}
@@ -4005,9 +4008,10 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
/* Consume memory quota. Throttle if it is exceeded. */
size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
- vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+ vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+ mem_used_after - mem_used_before);
vy_regulator_check_dump_watermark(&env->regulator);
- vy_quota_wait(&env->quota);
+ vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_TX);
return rc;
}
@@ -4133,7 +4137,8 @@ vy_build_recover(struct vy_env *env, struct vy_lsm *lsm, struct vy_lsm *pk)
mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
- vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+ vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_TX,
+ mem_used_after - mem_used_before);
return rc;
}
@@ -4349,7 +4354,7 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
*/
struct vy_env *env = vy_env(space->engine);
if (is_first_statement)
- vy_quota_wait(&env->quota);
+ vy_quota_wait(&env->quota, VY_QUOTA_CONSUMER_COMPACTION);
/* Create the deferred DELETE statement. */
struct vy_lsm *pk = vy_lsm(space->index[0]);
@@ -4436,7 +4441,8 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
}
size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
assert(mem_used_after >= mem_used_before);
- vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+ vy_quota_force_use(&env->quota, VY_QUOTA_CONSUMER_COMPACTION,
+ mem_used_after - mem_used_before);
vy_regulator_check_dump_watermark(&env->regulator);
tuple_unref(delete);
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 07cd5856..20d322de 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -52,6 +52,43 @@
static const double VY_QUOTA_TIMER_PERIOD = 0.1;
/**
+ * Bit mask of resources used by a particular consumer type.
+ */
+static unsigned
+vy_quota_consumer_resource_map[] = {
+ /**
+ * Transaction throttling pursues two goals. First, it is
+ * capping memory consumption rate so that the hard memory
+ * limit will not be hit before memory dump has completed
+ * (memory-based throttling). Second, we must make sure
+ * that compaction jobs keep up with dumps to keep read and
+ * space amplification within bounds (disk-based throttling).
+ * Transactions ought to respect them both.
+ */
+ [VY_QUOTA_CONSUMER_TX] = (1 << VY_QUOTA_RESOURCE_DISK) |
+ (1 << VY_QUOTA_RESOURCE_MEMORY),
+ /**
+ * Compaction jobs may need some quota too, because they
+ * may generate deferred DELETEs for secondary indexes.
+ * Apparently, we must not impose the rate limit that is
+ * supposed to speed up compaction on them (disk-based),
+ * however they still have to respect memory-based throttling
+ * to avoid long stalls.
+ */
+ [VY_QUOTA_CONSUMER_COMPACTION] = (1 << VY_QUOTA_RESOURCE_MEMORY),
+};
+
+/**
+ * Iterate over rate limit states that are enforced for a consumer
+ * of the given type.
+ */
+#define vy_quota_consumer_for_each_rate_limit(quota, type, rl) \
+ for (struct vy_rate_limit *rl = (quota)->rate_limit; \
+ rl - (quota)->rate_limit < vy_quota_resource_type_MAX; rl++) \
+ if (vy_quota_consumer_resource_map[type] & \
+ (1 << (rl - (quota)->rate_limit)))
+
+/**
* Return true if the requested amount of memory may be consumed
* right now, false if consumers have to wait.
*
@@ -60,7 +97,8 @@ static const double VY_QUOTA_TIMER_PERIOD = 0.1;
* it can start memory reclaim immediately.
*/
static inline bool
-vy_quota_may_use(struct vy_quota *q, size_t size)
+vy_quota_may_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size)
{
if (!q->is_enabled)
return true;
@@ -68,8 +106,10 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
q->quota_exceeded_cb(q);
return false;
}
- if (!vy_rate_limit_may_use(&q->rate_limit))
- return false;
+ vy_quota_consumer_for_each_rate_limit(q, type, rl) {
+ if (!vy_rate_limit_may_use(rl))
+ return false;
+ }
return true;
}
@@ -77,10 +117,12 @@ vy_quota_may_use(struct vy_quota *q, size_t size)
* Consume the given amount of memory without checking the limit.
*/
static inline void
-vy_quota_do_use(struct vy_quota *q, size_t size)
+vy_quota_do_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size)
{
q->used += size;
- vy_rate_limit_use(&q->rate_limit, size);
+ vy_quota_consumer_for_each_rate_limit(q, type, rl)
+ vy_rate_limit_use(rl, size);
}
/**
@@ -88,11 +130,13 @@ vy_quota_do_use(struct vy_quota *q, size_t size)
* This function is an exact opposite of vy_quota_do_use().
*/
static inline void
-vy_quota_do_unuse(struct vy_quota *q, size_t size)
+vy_quota_do_unuse(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size)
{
assert(q->used >= size);
q->used -= size;
- vy_rate_limit_unuse(&q->rate_limit, size);
+ vy_quota_consumer_for_each_rate_limit(q, type, rl)
+ vy_rate_limit_unuse(rl, size);
}
/**
@@ -112,17 +156,31 @@ vy_quota_check_limit(struct vy_quota *q)
static void
vy_quota_signal(struct vy_quota *q)
{
- if (!rlist_empty(&q->wait_queue)) {
+ /*
+ * To prevent starvation, wake up a consumer that has
+ * waited most irrespective of its type.
+ */
+ struct vy_quota_wait_node *oldest = NULL;
+ for (int i = 0; i < vy_quota_consumer_type_MAX; i++) {
+ struct rlist *wq = &q->wait_queue[i];
+ if (rlist_empty(wq))
+ continue;
+
struct vy_quota_wait_node *n;
- n = rlist_first_entry(&q->wait_queue,
- struct vy_quota_wait_node, in_wait_queue);
+ n = rlist_first_entry(wq, struct vy_quota_wait_node,
+ in_wait_queue);
/*
* No need in waking up a consumer if it will have
* to go back to sleep immediately.
*/
- if (vy_quota_may_use(q, n->size))
- fiber_wakeup(n->fiber);
+ if (!vy_quota_may_use(q, i, n->size))
+ continue;
+
+ if (oldest == NULL || oldest->ticket > n->ticket)
+ oldest = n;
}
+ if (oldest != NULL)
+ fiber_wakeup(oldest->fiber);
}
static void
@@ -133,7 +191,8 @@ vy_quota_timer_cb(ev_loop *loop, ev_timer *timer, int events)
struct vy_quota *q = timer->data;
- vy_rate_limit_refill(&q->rate_limit, VY_QUOTA_TIMER_PERIOD);
+ for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+ vy_rate_limit_refill(&q->rate_limit[i], VY_QUOTA_TIMER_PERIOD);
vy_quota_signal(q);
}
@@ -146,8 +205,11 @@ vy_quota_create(struct vy_quota *q, size_t limit,
q->used = 0;
q->too_long_threshold = TIMEOUT_INFINITY;
q->quota_exceeded_cb = quota_exceeded_cb;
- rlist_create(&q->wait_queue);
- vy_rate_limit_create(&q->rate_limit);
+ q->wait_ticket = 0;
+ for (int i = 0; i < vy_quota_consumer_type_MAX; i++)
+ rlist_create(&q->wait_queue[i]);
+ for (int i = 0; i < vy_quota_resource_type_MAX; i++)
+ vy_rate_limit_create(&q->rate_limit[i]);
ev_timer_init(&q->timer, vy_quota_timer_cb, 0, VY_QUOTA_TIMER_PERIOD);
q->timer.data = q;
}
@@ -176,15 +238,17 @@ vy_quota_set_limit(struct vy_quota *q, size_t limit)
}
void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate)
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+ size_t rate)
{
- vy_rate_limit_set(&q->rate_limit, rate);
+ vy_rate_limit_set(&q->rate_limit[type], rate);
}
void
-vy_quota_force_use(struct vy_quota *q, size_t size)
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size)
{
- vy_quota_do_use(q, size);
+ vy_quota_do_use(q, type, size);
vy_quota_check_limit(q);
}
@@ -201,7 +265,8 @@ vy_quota_release(struct vy_quota *q, size_t size)
}
int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout)
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size, double timeout)
{
/*
* Fail early if the configured memory limit never allows
@@ -212,8 +277,8 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
return -1;
}
- if (vy_quota_may_use(q, size)) {
- vy_quota_do_use(q, size);
+ if (vy_quota_may_use(q, type, size)) {
+ vy_quota_do_use(q, type, size);
return 0;
}
@@ -224,8 +289,9 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
struct vy_quota_wait_node wait_node = {
.fiber = fiber(),
.size = size,
+ .ticket = ++q->wait_ticket,
};
- rlist_add_tail_entry(&q->wait_queue, &wait_node, in_wait_queue);
+ rlist_add_tail_entry(&q->wait_queue[type], &wait_node, in_wait_queue);
do {
double now = ev_monotonic_now(loop());
@@ -235,7 +301,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
diag_set(ClientError, ER_VY_QUOTA_TIMEOUT);
return -1;
}
- } while (!vy_quota_may_use(q, size));
+ } while (!vy_quota_may_use(q, type, size));
rlist_del_entry(&wait_node, in_wait_queue);
@@ -246,7 +312,7 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
wait_time);
}
- vy_quota_do_use(q, size);
+ vy_quota_do_use(q, type, size);
/*
* Blocked consumers are awaken one by one to preserve
* the order they were put to sleep. It's a responsibility
@@ -258,14 +324,15 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout)
}
void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used)
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t reserved, size_t used)
{
if (reserved > used) {
- vy_quota_do_unuse(q, reserved - used);
+ vy_quota_do_unuse(q, type, reserved - used);
vy_quota_signal(q);
}
if (reserved < used) {
- vy_quota_do_use(q, used - reserved);
+ vy_quota_do_use(q, type, used - reserved);
vy_quota_check_limit(q);
}
}
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index 79755e89..d90922b2 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -110,6 +110,50 @@ vy_rate_limit_refill(struct vy_rate_limit *rl, double time)
typedef void
(*vy_quota_exceeded_f)(struct vy_quota *quota);
+/**
+ * Apart from memory usage accounting and limiting, vy_quota is
+ * responsible for consumption rate limiting (aka throttling).
+ * There are multiple rate limits, each of which is associated
+ * with a particular resource type. Different kinds of consumers
+ * respect different limits. The following enumeration defines
+ * the resource types for which vy_quota enables throttling.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_resource_type {
+ /**
+ * The goal of disk-based throttling is to keep LSM trees
+ * in a good shape so that read and space amplification
+ * stay within bounds. It is enabled when compaction does
+ * not keep up with dumps.
+ */
+ VY_QUOTA_RESOURCE_DISK = 0,
+ /**
+ * Memory-based throttling is needed to avoid long stalls
+ * caused by hitting the hard memory limit. It is set so
+ * that by the time the hard limit is hit, the last memory
+ * dump will have completed.
+ */
+ VY_QUOTA_RESOURCE_MEMORY = 1,
+
+ vy_quota_resource_type_MAX,
+};
+
+/**
+ * Quota consumer type determines how a quota consumer will be
+ * rate limited.
+ *
+ * See also vy_quota_consumer_resource_map.
+ */
+enum vy_quota_consumer_type {
+ /** Transaction processor. */
+ VY_QUOTA_CONSUMER_TX = 0,
+ /** Compaction job. */
+ VY_QUOTA_CONSUMER_COMPACTION = 1,
+
+ vy_quota_consumer_type_MAX,
+};
+
struct vy_quota_wait_node {
/** Link in vy_quota::wait_queue. */
struct rlist in_wait_queue;
@@ -117,6 +161,11 @@ struct vy_quota_wait_node {
struct fiber *fiber;
/** Amount of requested memory. */
size_t size;
+ /**
+ * Ticket assigned to this fiber when it was put to
+ * sleep, see vy_quota::wait_ticket for more details.
+ */
+ int64_t ticket;
};
/**
@@ -144,13 +193,23 @@ struct vy_quota {
*/
vy_quota_exceeded_f quota_exceeded_cb;
/**
- * Queue of consumers waiting for quota, linked by
- * vy_quota_wait_node::state. Newcomers are added
- * to the tail.
+ * Monotonically growing counter assigned to consumers
+ * waiting for quota. It is used for balancing wakeups
+ * among wait queues: if two fibers from different wait
+ * queues may proceed, the one with the lowest ticket
+ * will be picked.
+ *
+ * See also vy_quota_wait_node::ticket.
*/
- struct rlist wait_queue;
- /** Rate limit state. */
- struct vy_rate_limit rate_limit;
+ int64_t wait_ticket;
+ /**
+ * Queue of consumers waiting for quota, one per each
+ * consumer type, linked by vy_quota_wait_node::state.
+ * Newcomers are added to the tail.
+ */
+ struct rlist wait_queue[vy_quota_consumer_type_MAX];
+ /** Rate limit state, one per each resource type. */
+ struct vy_rate_limit rate_limit[vy_quota_resource_type_MAX];
/**
* Periodic timer that is used for refilling the rate
* limit value.
@@ -188,18 +247,20 @@ void
vy_quota_set_limit(struct vy_quota *q, size_t limit);
/**
- * Set the max rate at which quota may be consumed,
- * in bytes per second.
+ * Set the rate limit corresponding to the resource of the given
+ * type. The rate limit is given in bytes per second.
*/
void
-vy_quota_set_rate_limit(struct vy_quota *q, size_t rate);
+vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
+ size_t rate);
/**
* Consume @size bytes of memory. In contrast to vy_quota_use()
* this function does not throttle the caller.
*/
void
-vy_quota_force_use(struct vy_quota *q, size_t size);
+vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size);
/**
* Release @size bytes of memory.
@@ -242,7 +303,8 @@ vy_quota_release(struct vy_quota *q, size_t size);
* account while estimating the size of a memory allocation.
*/
int
-vy_quota_use(struct vy_quota *q, size_t size, double timeout);
+vy_quota_use(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t size, double timeout);
/**
* Adjust quota after allocating memory.
@@ -253,15 +315,16 @@ vy_quota_use(struct vy_quota *q, size_t size, double timeout);
* See also vy_quota_use().
*/
void
-vy_quota_adjust(struct vy_quota *q, size_t reserved, size_t used);
+vy_quota_adjust(struct vy_quota *q, enum vy_quota_consumer_type type,
+ size_t reserved, size_t used);
/**
* Block the caller until the quota is not exceeded.
*/
static inline void
-vy_quota_wait(struct vy_quota *q)
+vy_quota_wait(struct vy_quota *q, enum vy_quota_consumer_type type)
{
- vy_quota_use(q, 0, TIMEOUT_INFINITY);
+ vy_quota_use(q, type, 0, TIMEOUT_INFINITY);
}
#if defined(__cplusplus)
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index 2e09b93c..e14b01aa 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -101,7 +101,8 @@ vy_regulator_trigger_dump(struct vy_regulator *regulator)
size_t max_write_rate = (double)mem_left / (mem_used + 1) *
regulator->dump_bandwidth;
max_write_rate = MIN(max_write_rate, regulator->dump_bandwidth);
- vy_quota_set_rate_limit(quota, max_write_rate);
+ vy_quota_set_rate_limit(quota, VY_QUOTA_RESOURCE_MEMORY,
+ max_write_rate);
}
static void
@@ -202,7 +203,8 @@ void
vy_regulator_start(struct vy_regulator *regulator)
{
regulator->quota_used_last = regulator->quota->used;
- vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+ vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+ regulator->dump_bandwidth);
ev_timer_start(loop(), &regulator->timer);
}
@@ -253,7 +255,8 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
* limit to the dump bandwidth rather than disabling it
* completely.
*/
- vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+ vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+ regulator->dump_bandwidth);
}
void
@@ -263,5 +266,6 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
if (max > 0 && regulator->dump_bandwidth > max)
regulator->dump_bandwidth = max;
- vy_quota_set_rate_limit(regulator->quota, regulator->dump_bandwidth);
+ vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
+ regulator->dump_bandwidth);
}
--
2.11.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps
2019-01-20 21:16 [PATCH 0/9] vinyl: compaction randomization and throttling Vladimir Davydov
` (7 preceding siblings ...)
2019-01-20 21:17 ` [PATCH 8/9] vinyl: introduce quota consumer types Vladimir Davydov
@ 2019-01-20 21:17 ` Vladimir Davydov
2019-01-21 14:14 ` Vladimir Davydov
2019-01-22 9:09 ` Vladimir Davydov
8 siblings, 2 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-20 21:17 UTC (permalink / raw)
To: tarantool-patches
Every byte of data written to a vinyl database eventually gets compacted
with data written to the database earlier. The ratio of the size of data
actually written to disk to the size of data written to the database is
called write amplification. Write amplification depends on the LSM tree
configuration and the workload parameters and varies in a wide range,
from 2-3 to 10-20 or even higher in some extreme cases. If the database
engine doesn't manage to write all that extra data, the LSM tree shape
will get distorted, which will result in increased read and space
amplification, which, in turn, will slow down reads and waste disk
space. That's why it's so important to ensure the database engine has
enough compaction power.
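For example (numbers purely illustrative), if an application writes
100 GB into a space over its lifetime and dump plus compaction end up
writing 500 GB to disk, the write amplification is 5.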
One way to ensure that is to increase the number of compaction threads
by tuning the box.cfg.vinyl_write_threads configuration knob, but one
can't increase it beyond the capacity of the server running the
instance. So the database engine must throttle writes if it detects that
compaction threads are struggling to keep up. This patch implements a
very simple algorithm to achieve that: it keeps track of recently
observed write amplification and data compaction speed, uses them to
calculate the max transaction rate that the database engine can handle
while steadily maintaining the current level of write amplification, and
sets the rate limit to half that so as to give the engine enough room to
increase write amplification if needed.
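To make the arithmetic concrete (numbers purely illustrative): if 4
compaction threads have spent a total of 200 seconds compacting while
dumps consumed 4 GB of input, the estimated max transaction rate is
4 * 4 GB / 200 s, i.e. roughly 80 MB/s, and the rate limit is set to
about 40 MB/s.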
The algorithm is obviously pessimistic: it undervalues the transaction
rate the database can handle after write amplification has steadied. But
this is compensated by its simplicity and stability - there shouldn't be
any abrupt drops or peaks in RPS due to its decisions. Besides, it
adapts fairly quickly to an increase in write amplification when a database
is filled up. If one finds that the algorithm is being too cautious by
undervaluing the limit, it's easy to fix by simply increasing the number
of compaction threads - the rate limit will scale proportionately if the
system is underloaded.
The current value of the rate limit set by the algorithm is reported by
box.stat.vinyl() under the regulator.rate_limit section.
Closes #3721
---
src/box/vinyl.c | 6 ++++
src/box/vy_quota.c | 9 ++++++
src/box/vy_quota.h | 6 ++++
src/box/vy_regulator.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++--
src/box/vy_regulator.h | 27 ++++++++++++++++
5 files changed, 129 insertions(+), 3 deletions(-)
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index aaef858e..650d5c26 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -274,6 +274,8 @@ vy_info_append_regulator(struct vy_env *env, struct info_handler *h)
info_append_int(h, "write_rate", r->write_rate);
info_append_int(h, "dump_bandwidth", r->dump_bandwidth);
info_append_int(h, "dump_watermark", r->dump_watermark);
+ info_append_int(h, "rate_limit", vy_quota_get_rate_limit(r->quota,
+ VY_QUOTA_CONSUMER_TX));
info_table_end(h); /* regulator */
}
@@ -532,6 +534,7 @@ vinyl_engine_reset_stat(struct engine *engine)
memset(&xm->stat, 0, sizeof(xm->stat));
vy_scheduler_reset_stat(&env->scheduler);
+ vy_regulator_reset_stat(&env->regulator);
}
/** }}} Introspection */
@@ -2475,6 +2478,9 @@ vy_env_dump_complete_cb(struct vy_scheduler *scheduler,
*/
vy_regulator_dump_complete(&env->regulator, mem_dumped, dump_duration);
vy_quota_release(quota, mem_dumped);
+
+ vy_regulator_update_rate_limit(&env->regulator, &scheduler->stat,
+ scheduler->compaction_pool.size);
}
static struct vy_squash_queue *
diff --git a/src/box/vy_quota.c b/src/box/vy_quota.c
index 20d322de..4dd961c9 100644
--- a/src/box/vy_quota.c
+++ b/src/box/vy_quota.c
@@ -244,6 +244,15 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
vy_rate_limit_set(&q->rate_limit[type], rate);
}
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type)
+{
+ size_t rate = SIZE_MAX;
+ vy_quota_consumer_for_each_rate_limit(q, type, rl)
+ rate = MIN(rate, rl->rate);
+ return rate;
+}
+
void
vy_quota_force_use(struct vy_quota *q, enum vy_quota_consumer_type type,
size_t size)
diff --git a/src/box/vy_quota.h b/src/box/vy_quota.h
index d90922b2..7ff98cc1 100644
--- a/src/box/vy_quota.h
+++ b/src/box/vy_quota.h
@@ -255,6 +255,12 @@ vy_quota_set_rate_limit(struct vy_quota *q, enum vy_quota_resource_type type,
size_t rate);
/**
+ * Return the rate limit applied to a consumer of the given type.
+ */
+size_t
+vy_quota_get_rate_limit(struct vy_quota *q, enum vy_quota_consumer_type type);
+
+/**
* Consume @size bytes of memory. In contrast to vy_quota_use()
* this function does not throttle the caller.
*/
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index e14b01aa..b406cf97 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -34,6 +34,7 @@
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
+#include <string.h>
#include <tarantool_ev.h>
#include "fiber.h"
@@ -42,6 +43,7 @@
#include "trivia/util.h"
#include "vy_quota.h"
+#include "vy_stat.h"
/**
* Regulator timer period, in seconds.
@@ -73,6 +75,14 @@ static const size_t VY_DUMP_BANDWIDTH_DEFAULT = 10 * 1024 * 1024;
*/
static const size_t VY_DUMP_SIZE_ACCT_MIN = 1024 * 1024;
+/**
+ * Number of dumps to take into account for rate limit calculation.
+ * Shouldn't be too small to avoid uneven RPS. Shouldn't be too big
+ * either - otherwise the rate limit will adapt too slowly to workload
+ * changes. 100 feels like a good choice.
+ */
+static const int VY_RECENT_DUMP_COUNT = 100;
+
static void
vy_regulator_trigger_dump(struct vy_regulator *regulator)
{
@@ -182,6 +192,7 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
100 * MB, 200 * MB, 300 * MB, 400 * MB, 500 * MB, 600 * MB,
700 * MB, 800 * MB, 900 * MB,
};
+ memset(regulator, 0, sizeof(*regulator));
regulator->dump_bandwidth_hist = histogram_new(dump_bandwidth_buckets,
lengthof(dump_bandwidth_buckets));
if (regulator->dump_bandwidth_hist == NULL)
@@ -192,11 +203,8 @@ vy_regulator_create(struct vy_regulator *regulator, struct vy_quota *quota,
ev_timer_init(&regulator->timer, vy_regulator_timer_cb, 0,
VY_REGULATOR_TIMER_PERIOD);
regulator->timer.data = regulator;
- regulator->write_rate = 0;
- regulator->quota_used_last = 0;
regulator->dump_bandwidth = VY_DUMP_BANDWIDTH_DEFAULT;
regulator->dump_watermark = SIZE_MAX;
- regulator->dump_in_progress = false;
}
void
@@ -269,3 +277,73 @@ vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max)
vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_MEMORY,
regulator->dump_bandwidth);
}
+
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator)
+{
+ memset(&regulator->sched_stat_last, 0,
+ sizeof(regulator->sched_stat_last));
+}
+
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+ const struct vy_scheduler_stat *stat,
+ int compaction_threads)
+{
+ struct vy_scheduler_stat *last = &regulator->sched_stat_last;
+ struct vy_scheduler_stat *recent = &regulator->sched_stat_recent;
+ /*
+ * The maximal dump rate the database can handle while
+ * maintaining the current level of write amplification
+ * equals:
+ *
+ * dump_output
+ * max_dump_rate = compaction_rate * -----------------
+ * compaction_output
+ *
+ * The average compaction rate can be estimated with:
+ *
+ * compaction_output
+ * compaction_rate = compaction_threads * -----------------
+ * compaction_time
+ *
+ * Putting it all together and taking into account data
+ * compaction during memory dump, we get for the max
+ * transaction rate:
+ *
+ * dump_input
+ * max_tx_rate = max_dump_rate * ----------- =
+ * dump_output
+ *
+ * dump_input
+ * compaction_threads * ---------------
+ * compaction_time
+ *
+ * We set the rate limit to half that to leave the database
+ * engine enough room needed for growing write amplification.
+ */
+ recent->dump_count += stat->dump_count - last->dump_count;
+ recent->dump_input += stat->dump_input - last->dump_input;
+ recent->compaction_time += stat->compaction_time -
+ last->compaction_time;
+ *last = *stat;
+
+ if (recent->compaction_time == 0 ||
+ recent->dump_input < (int)VY_DUMP_SIZE_ACCT_MIN)
+ return;
+
+ double rate = 0.5 * compaction_threads * recent->dump_input /
+ recent->compaction_time;
+ vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
+ MIN(rate, SIZE_MAX));
+
+ /*
+ * Periodically rotate statistics for quicker adaptation
+ * to workload changes.
+ */
+ if (recent->dump_count > VY_RECENT_DUMP_COUNT) {
+ recent->dump_count /= 2;
+ recent->dump_input /= 2;
+ recent->compaction_time /= 2;
+ }
+}
diff --git a/src/box/vy_regulator.h b/src/box/vy_regulator.h
index 0188da26..341f41df 100644
--- a/src/box/vy_regulator.h
+++ b/src/box/vy_regulator.h
@@ -35,6 +35,8 @@
#include <stddef.h>
#include <tarantool_ev.h>
+#include "vy_stat.h"
+
#if defined(__cplusplus)
extern "C" {
#endif /* defined(__cplusplus) */
@@ -107,6 +109,16 @@ struct vy_regulator {
* but vy_regulator_dump_complete() hasn't been called yet.
*/
bool dump_in_progress;
+ /**
+ * Snapshot of scheduler statistics taken at the time of
+ * the last rate limit update.
+ */
+ struct vy_scheduler_stat sched_stat_last;
+ /**
+ * Scheduler statistics for the most recent few dumps.
+ * Used for calculating the rate limit.
+ */
+ struct vy_scheduler_stat sched_stat_recent;
};
void
@@ -146,6 +158,21 @@ vy_regulator_dump_complete(struct vy_regulator *regulator,
void
vy_regulator_reset_dump_bandwidth(struct vy_regulator *regulator, size_t max);
+/**
+ * Called when global statistics are reset by box.stat.reset().
+ */
+void
+vy_regulator_reset_stat(struct vy_regulator *regulator);
+
+/**
+ * Set transaction rate limit so as to ensure that compaction
+ * will keep up with dumps.
+ */
+void
+vy_regulator_update_rate_limit(struct vy_regulator *regulator,
+ const struct vy_scheduler_stat *stat,
+ int compaction_threads);
+
#if defined(__cplusplus)
} /* extern "C" */
#endif /* defined(__cplusplus) */
--
2.11.0
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps
2019-01-20 21:17 ` [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
@ 2019-01-21 14:14 ` Vladimir Davydov
2019-01-22 9:09 ` Vladimir Davydov
1 sibling, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-21 14:14 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:08AM +0300, Vladimir Davydov wrote:
> +void
> +vy_regulator_update_rate_limit(struct vy_regulator *regulator,
> + const struct vy_scheduler_stat *stat,
> + int compaction_threads)
> +{
> + struct vy_scheduler_stat *last = &regulator->sched_stat_last;
> + struct vy_scheduler_stat *recent = &regulator->sched_stat_recent;
> + /*
> + * The maximal dump rate the database can handle while
> + * maintaining the current level of write amplification
> + * equals:
> + *
> + *                                         dump_output
> + *   max_dump_rate = compaction_rate * -----------------
> + *                                      compaction_output
> + *
> + * The average compaction rate can be estimated with:
> + *
> + *                                           compaction_output
> + *   compaction_rate = compaction_threads * -----------------
> + *                                            compaction_time
> + *
> + * Putting it all together and taking into account data
> + * compaction during memory dump, we get for the max
> + * transaction rate:
> + *
> + *                                  dump_input
> + *   max_tx_rate = max_dump_rate * ----------- =
> + *                                 dump_output
> + *
> + *                                      dump_input
> + *            compaction_threads * ---------------
> + *                                 compaction_time
> + *
> + * We set the rate limit to half that to leave the database
> + * engine enough room needed for growing write amplification.
> + */
> + recent->dump_count += stat->dump_count - last->dump_count;
> + recent->dump_input += stat->dump_input - last->dump_input;
> + recent->compaction_time += stat->compaction_time -
> + last->compaction_time;
> + *last = *stat;
> +
> + if (recent->compaction_time == 0 ||
> + recent->dump_input < (int)VY_DUMP_SIZE_ACCT_MIN)
> + return;
> +
> + double rate = 0.5 * compaction_threads * recent->dump_input /
> + recent->compaction_time;
> + vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
> + MIN(rate, SIZE_MAX));
> +
> + /*
> + * Periodically rotate statistics for quicker adaptation
> + * to workload changes.
> + */
> + if (recent->dump_count > VY_RECENT_DUMP_COUNT) {
> + recent->dump_count /= 2;
> + recent->dump_input /= 2;
> + recent->compaction_time /= 2;
> + }
> +}
vinyl/deferred_delete.test.lua generates very small dumps (several
KBs). If the disk is slow, which is the case on Travis CI, the overhead
associated with file creation will be too high in comparison to the
amount of work done by compaction, and so the rate limit will be set to
a very small value, stalling the test. I fixed this on the branch by
completely ignoring any dump less than 1 MB, as we do when estimating
dump bandwidth. The incremental diff is below.
diff --git a/src/box/vy_regulator.c b/src/box/vy_regulator.c
index b406cf97..75f1f798 100644
--- a/src/box/vy_regulator.c
+++ b/src/box/vy_regulator.c
@@ -322,16 +322,18 @@ vy_regulator_update_rate_limit(struct vy_regulator *regulator,
* We set the rate limit to half that to leave the database
* engine enough room needed for growing write amplification.
*/
- recent->dump_count += stat->dump_count - last->dump_count;
- recent->dump_input += stat->dump_input - last->dump_input;
- recent->compaction_time += stat->compaction_time -
- last->compaction_time;
+ int32_t dump_count = stat->dump_count - last->dump_count;
+ int64_t dump_input = stat->dump_input - last->dump_input;
+ double compaction_time = stat->compaction_time - last->compaction_time;
*last = *stat;
- if (recent->compaction_time == 0 ||
- recent->dump_input < (int)VY_DUMP_SIZE_ACCT_MIN)
+ if (dump_input < (ssize_t)VY_DUMP_SIZE_ACCT_MIN)
return;
+ recent->dump_count += dump_count;
+ recent->dump_input += dump_input;
+ recent->compaction_time += compaction_time;
+
double rate = 0.5 * compaction_threads * recent->dump_input /
recent->compaction_time;
vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
^ permalink raw reply [flat|nested] 39+ messages in thread
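To put rough numbers on the failure mode (made up for illustration, not
measured on Travis): with 2 compaction threads, a couple of 8 KB dumps
whose compaction takes, say, 2 seconds each because of file-creation
overhead would yield 0.5 * 2 * 16 KB / 4 s = 4 KB/s, so even the small
writes done by the test would be throttled for seconds at a time. With
the 1 MB cut-off such dumps simply never update the limit.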
* Re: [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps
2019-01-20 21:17 ` [PATCH 9/9] vinyl: throttle tx to ensure compaction keeps up with dumps Vladimir Davydov
2019-01-21 14:14 ` Vladimir Davydov
@ 2019-01-22 9:09 ` Vladimir Davydov
1 sibling, 0 replies; 39+ messages in thread
From: Vladimir Davydov @ 2019-01-22 9:09 UTC (permalink / raw)
To: tarantool-patches
On Mon, Jan 21, 2019 at 12:17:08AM +0300, Vladimir Davydov wrote:
> +void
> +vy_regulator_update_rate_limit(struct vy_regulator *regulator,
> + const struct vy_scheduler_stat *stat,
> + int compaction_threads)
> +{
> + struct vy_scheduler_stat *last = &regulator->sched_stat_last;
> + struct vy_scheduler_stat *recent = &regulator->sched_stat_recent;
> + /*
> + * The maximal dump rate the database can handle while
> + * maintaining the current level of write amplification
> + * equals:
> + *
> + *                                         dump_output
> + *   max_dump_rate = compaction_rate * -----------------
> + *                                      compaction_output
> + *
> + * The average compaction rate can be estimated with:
> + *
> + *                                           compaction_output
> + *   compaction_rate = compaction_threads * -----------------
> + *                                            compaction_time
> + *
> + * Putting it all together and taking into account data
> + * compaction during memory dump, we get for the max
> + * transaction rate:
> + *
> + *                                  dump_input
> + *   max_tx_rate = max_dump_rate * ----------- =
> + *                                 dump_output
> + *
> + *                                      dump_input
> + *            compaction_threads * ---------------
> + *                                 compaction_time
> + *
> + * We set the rate limit to half that to leave the database
> + * engine enough room needed for growing write amplification.
> + */
> + recent->dump_count += stat->dump_count - last->dump_count;
> + recent->dump_input += stat->dump_input - last->dump_input;
> + recent->compaction_time += stat->compaction_time -
> + last->compaction_time;
> + *last = *stat;
> +
> + if (recent->compaction_time == 0 ||
> + recent->dump_input < (int)VY_DUMP_SIZE_ACCT_MIN)
> + return;
> +
> + double rate = 0.5 * compaction_threads * recent->dump_input /
> + recent->compaction_time;
^^^
I ran some tests which showed that a 0.75 down-shifting coefficient works
fine as well: the instance didn't accumulate much compaction debt while
the average compaction thread utilization grew from 50-60% to about 80%.
However, write amplification decreased from 4 to 3.6, which indicates
that the instance didn't have enough breathing space for compaction to
keep up with dumps. Setting the coefficient to a higher value, say 0.9,
results in very bad performance - the compaction queue grows to 100%
and write amplification stays at about 2.
We could probably make this coefficient dynamic - scale it from 0.5 up
to 1 depending on the compaction queue size (a rough sketch of that idea
follows the quoted code below) - but that would be tricky to implement
while maintaining a steady RPS, without spikes and drops - we'd need to
run many more tests. I think we should start with a static coefficient,
which is predictable and results in a steady RPS, and then think of
something more sophisticated if there is a demand for it. If we go with
a static coefficient, which value should we choose? 0.5 is safe, while
0.75 yields higher RPS. The user can always increase the number of
compaction threads if RPS turns out to be too low due to regulator
decisions, but still, which one is better?
> + vy_quota_set_rate_limit(regulator->quota, VY_QUOTA_RESOURCE_DISK,
> + MIN(rate, SIZE_MAX));
> +
> + /*
> + * Periodically rotate statistics for quicker adaptation
> + * to workload changes.
> + */
> + if (recent->dump_count > VY_RECENT_DUMP_COUNT) {
> + recent->dump_count /= 2;
> + recent->dump_input /= 2;
> + recent->compaction_time /= 2;
> + }
> +}
^ permalink raw reply [flat|nested] 39+ messages in thread
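For what it's worth, one possible shape for the dynamic coefficient
discussed above - purely a hypothetical sketch, not something implemented
in this series or in Tarantool, and with made-up names (dynamic_coeff,
queue_fill) - would be to interpolate between 1.0 when the compaction
queue is empty and the conservative 0.5 when it is full:

/*
 * Hypothetical only: scale the down-shifting coefficient by how full
 * the compaction queue is. Not part of the patch set.
 */
static double
dynamic_coeff(double queue_fill) /* 0.0 = queue empty, 1.0 = full */
{
	if (queue_fill < 0.0)
		queue_fill = 0.0;
	if (queue_fill > 1.0)
		queue_fill = 1.0;
	/*
	 * Empty queue: allow up to the full estimated rate (1.0).
	 * Full queue: fall back to the safe 0.5 used by the patch.
	 */
	return 1.0 - 0.5 * queue_fill;
}

/*
 * The limit would then be computed as:
 *   rate = dynamic_coeff(fill) * compaction_threads *
 *          recent->dump_input / recent->compaction_time;
 */

As noted above, the hard part is keeping RPS steady while the coefficient
moves, which is why the series sticks to a static 0.5 for now.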