Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH 0/3] Force compaction if there too many DELETEs
@ 2018-10-20 18:23 Vladimir Davydov
  2018-10-20 18:23 ` [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority Vladimir Davydov
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-20 18:23 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Normally, space amplification of a vinyl index is confined by the
configured LSM tree shape and shouldn't exceed 3-4. However, this
might not hold in case indexed fields are frequently updated, e.g.

  s = box.schema.space.create('test', {engine = 'vinyl'})
  s:create_index('primary')
  s:create_index('secondary', {parts = {2, 'unsigned'}})

  s:update(key, {{'+', 2, 1}}) -- there a lot of updates like this

Such a workload will generate tons of DELETE statement in the secondary
index, and they won't be purged until major compaction is scheduled,
which may not happen for a very very long time, resulting in space
amplification getting as high as 100 or even 1000.

This patch set fixes this issue by forcing major compaction of a range
in case the number of DELETEs plus statements that they are supposed to
purge exceeds a half of all statements in the range. This should prevent
space and read amplification from growing to insane heights.

Note, it doesn't close #3225, because that issue is also about forcing
compaction in case the observed read amplification gets too high, which
isn't addressed by this particular patch set.

https://github.com/tarantool/tarantool/issues/3225
https://github.com/tarantool/tarantool/commits/dv/gh-3225-vy-force-compaction-when-there-are-too-many-deletes

Vladimir Davydov (3):
  vinyl: remove useless local var from vy_range_update_compact_priority
  vinyl: account disk statements of each type
  vinyl: force major compaction if there are too many DELETEs

 src/box/iproto_constants.c          |   1 +
 src/box/iproto_constants.h          |   2 +
 src/box/vinyl.c                     |   6 ++
 src/box/vy_lsm.c                    |   2 +
 src/box/vy_range.c                  |  27 +++++++--
 src/box/vy_run.c                    |  82 +++++++++++++++++++++++++-
 src/box/vy_run.h                    |  10 ++++
 src/box/vy_stat.h                   |  65 +++++++++++++++++++++
 test/vinyl/compact.result           |  40 +++++++++++++
 test/vinyl/compact.test.lua         |  18 ++++++
 test/vinyl/deferred_delete.result   |  26 +++++++--
 test/vinyl/deferred_delete.test.lua |  15 ++++-
 test/vinyl/info.result              | 113 +++++++++++++++++++++++++++++++++---
 test/vinyl/info.test.lua            |  36 ++++++++++++
 test/vinyl/layout.result            |  12 ++--
 test/vinyl/update_optimize.result   |   9 ++-
 test/vinyl/update_optimize.test.lua |   5 +-
 17 files changed, 440 insertions(+), 29 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority
  2018-10-20 18:23 [PATCH 0/3] Force compaction if there too many DELETEs Vladimir Davydov
@ 2018-10-20 18:23 ` Vladimir Davydov
  2018-10-23  8:51   ` [tarantool-patches] " Konstantin Osipov
  2018-10-20 18:23 ` [PATCH 2/3] vinyl: account disk statements of each type Vladimir Davydov
  2018-10-20 18:23 ` [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs Vladimir Davydov
  2 siblings, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-20 18:23 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Local variable total_size equals total_stmt_count.bytes_compressed so we
don't really need it.
---
 src/box/vy_range.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index 0558098a..af1227ea 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -312,8 +312,6 @@ vy_range_update_compact_priority(struct vy_range *range,
 	vy_disk_stmt_counter_reset(&total_stmt_count);
 	/* Total number of checked runs. */
 	uint32_t total_run_count = 0;
-	/* The total size of runs checked so far. */
-	uint64_t total_size = 0;
 	/* Estimated size of a compacted run, if compaction is scheduled. */
 	uint64_t est_new_run_size = 0;
 	/* The number of runs at the current level. */
@@ -335,7 +333,6 @@ vy_range_update_compact_priority(struct vy_range *range,
 		 */
 		if (target_run_size == 0)
 			target_run_size = size;
-		total_size += size;
 		level_run_count++;
 		total_run_count++;
 		vy_disk_stmt_counter_add(&total_stmt_count, &slice->count);
@@ -377,7 +374,7 @@ vy_range_update_compact_priority(struct vy_range *range,
 			 */
 			range->compact_priority = total_run_count;
 			range->compact_queue = total_stmt_count;
-			est_new_run_size = total_size;
+			est_new_run_size = total_stmt_count.bytes_compressed;
 		}
 	}
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 2/3] vinyl: account disk statements of each type
  2018-10-20 18:23 [PATCH 0/3] Force compaction if there too many DELETEs Vladimir Davydov
  2018-10-20 18:23 ` [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority Vladimir Davydov
@ 2018-10-20 18:23 ` Vladimir Davydov
  2018-10-23  8:58   ` [tarantool-patches] " Konstantin Osipov
  2018-10-20 18:23 ` [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs Vladimir Davydov
  2 siblings, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-20 18:23 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch adds a new entry to per index statistics reported by
index.stat():

  disk.statement
    inserts
    replaces
    deletes
    upserts

It shows the number of statements of each type stored in run files.
The new statistics are persisted in index files. We will need this
information so that we can force major compaction when there are too
many DELETE statements accumulated in run files.

Needed for #3225
---
 src/box/iproto_constants.c |   1 +
 src/box/iproto_constants.h |   2 +
 src/box/vinyl.c            |   6 +++
 src/box/vy_lsm.c           |   2 +
 src/box/vy_run.c           |  68 ++++++++++++++++++++++++++-
 src/box/vy_run.h           |   2 +
 src/box/vy_stat.h          |  59 +++++++++++++++++++++++
 test/vinyl/info.result     | 113 ++++++++++++++++++++++++++++++++++++++++++---
 test/vinyl/info.test.lua   |  36 +++++++++++++++
 test/vinyl/layout.result   |  12 +++--
 10 files changed, 289 insertions(+), 12 deletions(-)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 4f3510e4..7fd29577 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -216,6 +216,7 @@ const char *vy_run_info_key_strs[VY_RUN_INFO_KEY_MAX] = {
 	"page count",
 	"bloom filter legacy",
 	"bloom filter",
+	"stmt stat",
 };
 
 const char *vy_row_index_key_strs[VY_ROW_INDEX_KEY_MAX] = {
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index f571375e..ab7db1f0 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -354,6 +354,8 @@ enum vy_run_info_key {
 	VY_RUN_INFO_BLOOM_LEGACY = 6,
 	/** Bloom filter for keys. */
 	VY_RUN_INFO_BLOOM = 7,
+	/** Number of statements of each type (map). */
+	VY_RUN_INFO_STMT_STAT = 8,
 	/** The last key in this enum + 1 */
 	VY_RUN_INFO_KEY_MAX
 };
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 09b66559..e9df7463 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -397,6 +397,12 @@ vinyl_index_stat(struct index *index, struct info_handler *h)
 
 	info_table_begin(h, "disk");
 	vy_info_append_disk_stmt_counter(h, NULL, &stat->disk.count);
+	info_table_begin(h, "statement");
+	info_append_int(h, "inserts", stat->disk.stmt.inserts);
+	info_append_int(h, "replaces", stat->disk.stmt.replaces);
+	info_append_int(h, "deletes", stat->disk.stmt.deletes);
+	info_append_int(h, "upserts", stat->disk.stmt.upserts);
+	info_table_end(h); /* statement */
 	info_table_begin(h, "iterator");
 	info_append_int(h, "lookup", stat->disk.iterator.lookup);
 	vy_info_append_stmt_counter(h, "get", &stat->disk.iterator.get);
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index de5ad314..7e6bc3c2 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -695,6 +695,7 @@ vy_lsm_add_run(struct vy_lsm *lsm, struct vy_run *run)
 	rlist_add_entry(&lsm->runs, run, in_lsm);
 	lsm->run_count++;
 	vy_disk_stmt_counter_add(&lsm->stat.disk.count, &run->count);
+	vy_stmt_stat_add(&lsm->stat.disk.stmt, &run->info.stmt_stat);
 
 	lsm->bloom_size += bloom_size;
 	lsm->page_index_size += page_index_size;
@@ -723,6 +724,7 @@ vy_lsm_remove_run(struct vy_lsm *lsm, struct vy_run *run)
 	rlist_del_entry(run, in_lsm);
 	lsm->run_count--;
 	vy_disk_stmt_counter_sub(&lsm->stat.disk.count, &run->count);
+	vy_stmt_stat_sub(&lsm->stat.disk.stmt, &run->info.stmt_stat);
 
 	lsm->bloom_size -= bloom_size;
 	lsm->page_index_size -= page_index_size;
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index bb5baf2c..5f2b980b 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -542,6 +542,33 @@ vy_page_info_decode(struct vy_page_info *page, const struct xrow_header *xrow,
 	return 0;
 }
 
+/** Decode statement statistics from @data and advance @data. */
+static void
+vy_stmt_stat_decode(struct vy_stmt_stat *stat, const char **data)
+{
+	uint32_t size = mp_decode_map(data);
+	for (uint32_t i = 0; i < size; i++) {
+		uint64_t key = mp_decode_uint(data);
+		uint64_t value = mp_decode_uint(data);
+		switch (key) {
+		case IPROTO_INSERT:
+			stat->inserts = value;
+			break;
+		case IPROTO_REPLACE:
+			stat->replaces = value;
+			break;
+		case IPROTO_DELETE:
+			stat->deletes = value;
+			break;
+		case IPROTO_UPSERT:
+			stat->upserts = value;
+			break;
+		default:
+			break;
+		}
+	}
+}
+
 /**
  * Decode the run metadata from xrow.
  *
@@ -603,6 +630,9 @@ vy_run_info_decode(struct vy_run_info *run_info,
 			if (run_info->bloom == NULL)
 				return -1;
 			break;
+		case VY_RUN_INFO_STMT_STAT:
+			vy_stmt_stat_decode(&run_info->stmt_stat, &pos);
+			break;
 		default:
 			diag_set(ClientError, ER_INVALID_INDEX_FILE, filename,
 				"Can't decode run info: unknown key %u",
@@ -1876,6 +1906,37 @@ vy_page_info_encode(const struct vy_page_info *page_info,
 
 /** {{{ vy_run_info */
 
+/** Return the size of encoded statement statistics. */
+static size_t
+vy_stmt_stat_sizeof(const struct vy_stmt_stat *stat)
+{
+	return mp_sizeof_map(4) +
+		mp_sizeof_uint(IPROTO_INSERT) +
+		mp_sizeof_uint(IPROTO_REPLACE) +
+		mp_sizeof_uint(IPROTO_DELETE) +
+		mp_sizeof_uint(IPROTO_UPSERT) +
+		mp_sizeof_uint(stat->inserts) +
+		mp_sizeof_uint(stat->replaces) +
+		mp_sizeof_uint(stat->deletes) +
+		mp_sizeof_uint(stat->upserts);
+}
+
+/** Encode statement statistics to @buf and return advanced @buf. */
+static char *
+vy_stmt_stat_encode(const struct vy_stmt_stat *stat, char *buf)
+{
+	buf = mp_encode_map(buf, 4);
+	buf = mp_encode_uint(buf, IPROTO_INSERT);
+	buf = mp_encode_uint(buf, stat->inserts);
+	buf = mp_encode_uint(buf, IPROTO_REPLACE);
+	buf = mp_encode_uint(buf, stat->replaces);
+	buf = mp_encode_uint(buf, IPROTO_DELETE);
+	buf = mp_encode_uint(buf, stat->deletes);
+	buf = mp_encode_uint(buf, IPROTO_UPSERT);
+	buf = mp_encode_uint(buf, stat->upserts);
+	return buf;
+}
+
 /**
  * Encode vy_run_info as xrow
  * Allocates using region alloc
@@ -1898,7 +1959,7 @@ vy_run_info_encode(const struct vy_run_info *run_info,
 	mp_next(&tmp);
 	size_t max_key_size = tmp - run_info->max_key;
 
-	uint32_t key_count = 5;
+	uint32_t key_count = 6;
 	if (run_info->bloom != NULL)
 		key_count++;
 
@@ -1914,6 +1975,8 @@ vy_run_info_encode(const struct vy_run_info *run_info,
 	if (run_info->bloom != NULL)
 		size += mp_sizeof_uint(VY_RUN_INFO_BLOOM) +
 			tuple_bloom_size(run_info->bloom);
+	size += mp_sizeof_uint(VY_RUN_INFO_STMT_STAT) +
+		vy_stmt_stat_sizeof(&run_info->stmt_stat);
 
 	char *pos = region_alloc(&fiber()->gc, size);
 	if (pos == NULL) {
@@ -1940,6 +2003,8 @@ vy_run_info_encode(const struct vy_run_info *run_info,
 		pos = mp_encode_uint(pos, VY_RUN_INFO_BLOOM);
 		pos = tuple_bloom_encode(run_info->bloom, pos);
 	}
+	pos = mp_encode_uint(pos, VY_RUN_INFO_STMT_STAT);
+	pos = vy_stmt_stat_encode(&run_info->stmt_stat, pos);
 	xrow->body->iov_len = (void *)pos - xrow->body->iov_base;
 	xrow->bodycnt = 1;
 	xrow->type = VY_INDEX_RUN_INFO;
@@ -2136,6 +2201,7 @@ vy_run_writer_write_to_page(struct vy_run_writer *writer, struct tuple *stmt)
 	int64_t lsn = vy_stmt_lsn(stmt);
 	run->info.min_lsn = MIN(run->info.min_lsn, lsn);
 	run->info.max_lsn = MAX(run->info.max_lsn, lsn);
+	vy_stmt_stat_acct(&run->info.stmt_stat, vy_stmt_type(stmt));
 	return 0;
 }
 
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 4b919a8f..990daffa 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -87,6 +87,8 @@ struct vy_run_info {
 	uint32_t page_count;
 	/** Bloom filter of all tuples in run */
 	struct tuple_bloom *bloom;
+	/** Statement statistics. */
+	struct vy_stmt_stat stmt_stat;
 };
 
 /**
diff --git a/src/box/vy_stat.h b/src/box/vy_stat.h
index 1545115a..9bb09174 100644
--- a/src/box/vy_stat.h
+++ b/src/box/vy_stat.h
@@ -36,11 +36,20 @@
 
 #include "latency.h"
 #include "tuple.h"
+#include "iproto_constants.h"
 
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
 
+/** Number of statements of each type. */
+struct vy_stmt_stat {
+	int64_t inserts;
+	int64_t replaces;
+	int64_t deletes;
+	int64_t upserts;
+};
+
 /** Used for accounting statements stored in memory. */
 struct vy_stmt_counter {
 	/** Number of statements. */
@@ -130,6 +139,8 @@ struct vy_lsm_stat {
 	struct {
 		/** Number of statements stored on disk. */
 		struct vy_disk_stmt_counter count;
+		/** Statement statistics. */
+		struct vy_stmt_stat stmt;
 		/** Run iterator statistics. */
 		struct vy_run_iterator_stat iterator;
 		/** Dump statistics. */
@@ -298,6 +309,54 @@ vy_disk_stmt_counter_sub(struct vy_disk_stmt_counter *c1,
 	c1->pages -= c2->pages;
 }
 
+/**
+ * Account a single statement of the given type in @stat.
+ */
+static inline void
+vy_stmt_stat_acct(struct vy_stmt_stat *stat, enum iproto_type type)
+{
+	switch (type) {
+	case IPROTO_INSERT:
+		stat->inserts++;
+		break;
+	case IPROTO_REPLACE:
+		stat->replaces++;
+		break;
+	case IPROTO_DELETE:
+		stat->deletes++;
+		break;
+	case IPROTO_UPSERT:
+		stat->upserts++;
+		break;
+	default:
+		break;
+	}
+}
+
+/**
+ * Add statistics accumulated in @s2 to @s1.
+ */
+static inline void
+vy_stmt_stat_add(struct vy_stmt_stat *s1, const struct vy_stmt_stat *s2)
+{
+	s1->inserts += s2->inserts;
+	s1->replaces += s2->replaces;
+	s1->deletes += s2->deletes;
+	s1->upserts += s2->upserts;
+}
+
+/**
+ * Subtract statistics accumulated in @s2 from @s1.
+ */
+static inline void
+vy_stmt_stat_sub(struct vy_stmt_stat *s1, const struct vy_stmt_stat *s2)
+{
+	s1->inserts -= s2->inserts;
+	s1->replaces -= s2->replaces;
+	s1->deletes -= s2->deletes;
+	s1->upserts -= s2->upserts;
+}
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/test/vinyl/info.result b/test/vinyl/info.result
index a47c9a1b..49c2037a 100644
--- a/test/vinyl/info.result
+++ b/test/vinyl/info.result
@@ -151,7 +151,11 @@ istat()
   disk:
     index_size: 0
     rows: 0
-    bytes: 0
+    statement:
+      inserts: 0
+      replaces: 0
+      upserts: 0
+      deletes: 0
     dump:
       in:
         rows: 0
@@ -192,9 +196,10 @@ istat()
       get:
         rows: 0
         bytes: 0
+    bloom_size: 0
     pages: 0
     bytes_compressed: 0
-    bloom_size: 0
+    bytes: 0
   txw:
     bytes: 0
     rows: 0
@@ -286,10 +291,12 @@ stat_diff(istat(), st)
         rows: 25
     index_size: 294
     rows: 25
-    bloom_size: 70
-    pages: 7
     bytes: 26049
     bytes_compressed: <bytes_compressed>
+    bloom_size: 70
+    statement:
+      replaces: 25
+    pages: 7
   bytes: 26049
   put:
     rows: 25
@@ -324,9 +331,11 @@ stat_diff(istat(), st)
         rows: 50
     index_size: 252
     rows: 25
+    bytes: 26042
     bytes_compressed: <bytes_compressed>
     pages: 6
-    bytes: 26042
+    statement:
+      replaces: 25
     compact:
       in:
         bytes: 78140
@@ -992,7 +1001,11 @@ istat()
   disk:
     index_size: 1050
     rows: 100
-    bytes: 104300
+    statement:
+      inserts: 0
+      replaces: 100
+      upserts: 0
+      deletes: 0
     dump:
       in:
         rows: 0
@@ -1033,9 +1046,10 @@ istat()
       get:
         rows: 0
         bytes: 0
+    bloom_size: 140
     pages: 25
     bytes_compressed: <bytes_compressed>
-    bloom_size: 140
+    bytes: 104300
   txw:
     bytes: 0
     rows: 0
@@ -1392,6 +1406,91 @@ gst.disk.index == 0
 ---
 - true
 ...
+--
+-- Statement statistics.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i = s:create_index('primary', {run_count_per_level = 10})
+---
+...
+i:stat().disk.statement
+---
+- inserts: 0
+  replaces: 0
+  upserts: 0
+  deletes: 0
+...
+s:insert{1, 1}
+---
+- [1, 1]
+...
+s:replace{2, 2}
+---
+- [2, 2]
+...
+box.snapshot()
+---
+- ok
+...
+i:stat().disk.statement
+---
+- inserts: 1
+  replaces: 1
+  upserts: 0
+  deletes: 0
+...
+s:upsert({1, 1}, {{'+', 2, 1}})
+---
+...
+s:delete{2}
+---
+...
+box.snapshot()
+---
+- ok
+...
+i:stat().disk.statement
+---
+- inserts: 1
+  replaces: 1
+  upserts: 1
+  deletes: 1
+...
+test_run:cmd('restart server test')
+fiber = require('fiber')
+---
+...
+s = box.space.test
+---
+...
+i = s.index.primary
+---
+...
+i:stat().disk.statement
+---
+- inserts: 1
+  replaces: 1
+  upserts: 1
+  deletes: 1
+...
+i:compact()
+---
+...
+while i:stat().disk.compact.count == 0 do fiber.sleep(0.01) end
+---
+...
+i:stat().disk.statement
+---
+- inserts: 1
+  replaces: 0
+  upserts: 0
+  deletes: 0
+...
+s:drop()
+---
+...
 test_run:cmd('switch default')
 ---
 - true
diff --git a/test/vinyl/info.test.lua b/test/vinyl/info.test.lua
index e5794a23..b11fc044 100644
--- a/test/vinyl/info.test.lua
+++ b/test/vinyl/info.test.lua
@@ -411,6 +411,42 @@ gst.memory.bloom_filter == 0
 gst.disk.data == 0
 gst.disk.index == 0
 
+--
+-- Statement statistics.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i = s:create_index('primary', {run_count_per_level = 10})
+
+i:stat().disk.statement
+
+s:insert{1, 1}
+s:replace{2, 2}
+box.snapshot()
+
+i:stat().disk.statement
+
+s:upsert({1, 1}, {{'+', 2, 1}})
+s:delete{2}
+box.snapshot()
+
+i:stat().disk.statement
+
+test_run:cmd('restart server test')
+
+fiber = require('fiber')
+
+s = box.space.test
+i = s.index.primary
+
+i:stat().disk.statement
+
+i:compact()
+while i:stat().disk.compact.count == 0 do fiber.sleep(0.01) end
+
+i:stat().disk.statement
+
+s:drop()
+
 test_run:cmd('switch default')
 test_run:cmd('stop server test')
 test_run:cmd('cleanup server test')
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index ee14cd51..989ddf64 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -236,9 +236,10 @@ result
           type: RUNINFO
         BODY:
           min_lsn: 8
+          bloom_filter: <bloom_filter>
           max_key: ['ЭЭЭ']
           page_count: 1
-          bloom_filter: <bloom_filter>
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
           max_lsn: 10
           min_key: ['ёёё']
       - HEADER:
@@ -275,9 +276,10 @@ result
           type: RUNINFO
         BODY:
           min_lsn: 11
+          bloom_filter: <bloom_filter>
           max_key: ['ЮЮЮ']
           page_count: 1
-          bloom_filter: <bloom_filter>
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
           max_lsn: 13
           min_key: ['ёёё']
       - HEADER:
@@ -317,9 +319,10 @@ result
           type: RUNINFO
         BODY:
           min_lsn: 8
+          bloom_filter: <bloom_filter>
           max_key: [null, 'ЭЭЭ']
           page_count: 1
-          bloom_filter: <bloom_filter>
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
           max_lsn: 10
           min_key: [null, 'ёёё']
       - HEADER:
@@ -356,9 +359,10 @@ result
           type: RUNINFO
         BODY:
           min_lsn: 11
+          bloom_filter: <bloom_filter>
           max_key: [789, 'ююю']
           page_count: 1
-          bloom_filter: <bloom_filter>
+          stmt_stat: {9: 0, 2: 0, 5: 0, 3: 3}
           max_lsn: 13
           min_key: [123, 'ёёё']
       - HEADER:
-- 
2.11.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs
  2018-10-20 18:23 [PATCH 0/3] Force compaction if there too many DELETEs Vladimir Davydov
  2018-10-20 18:23 ` [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority Vladimir Davydov
  2018-10-20 18:23 ` [PATCH 2/3] vinyl: account disk statements of each type Vladimir Davydov
@ 2018-10-20 18:23 ` Vladimir Davydov
  2018-10-23  9:03   ` [tarantool-patches] " Konstantin Osipov
  2 siblings, 1 reply; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-20 18:23 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Even a perfectly shaped LSM tree can accumulate a huge number of DELETE
statements over time in case indexed fields are frequently updated. This
can significantly increase read and space amplification, especially for
secondary indexes.

One way to deal with it is to propagate read amplification back to the
scheduler so that it can raise compaction priority accordingly. Although
this would probably make sense, it wouldn't be enough, because it
wouldn't deal with space amplification growth in case the workload is
write-mostly.

So this patch simply forces major compaction for a range if the number
of DELETEs plus the statements they are supposed to purge exceeds a half
of all statements in the range. This should prevent uncontrollable growth
of the number of DELETEs and confine space and read amplification to
sane values.

Part of #3225
---
 src/box/vy_range.c                  | 22 ++++++++++++++++++++
 src/box/vy_run.c                    | 14 +++++++++++++
 src/box/vy_run.h                    |  8 ++++++++
 src/box/vy_stat.h                   |  6 ++++++
 test/vinyl/compact.result           | 40 +++++++++++++++++++++++++++++++++++++
 test/vinyl/compact.test.lua         | 18 +++++++++++++++++
 test/vinyl/deferred_delete.result   | 26 +++++++++++++++++++-----
 test/vinyl/deferred_delete.test.lua | 15 +++++++++++---
 test/vinyl/update_optimize.result   |  9 ++++++---
 test/vinyl/update_optimize.test.lua |  5 +++--
 10 files changed, 150 insertions(+), 13 deletions(-)

diff --git a/src/box/vy_range.c b/src/box/vy_range.c
index af1227ea..63efcfc6 100644
--- a/src/box/vy_range.c
+++ b/src/box/vy_range.c
@@ -310,6 +310,8 @@ vy_range_update_compact_priority(struct vy_range *range,
 	/* Total number of statements in checked runs. */
 	struct vy_disk_stmt_counter total_stmt_count;
 	vy_disk_stmt_counter_reset(&total_stmt_count);
+	struct vy_stmt_stat total_stmt_stat;
+	vy_stmt_stat_reset(&total_stmt_stat);
 	/* Total number of checked runs. */
 	uint32_t total_run_count = 0;
 	/* Estimated size of a compacted run, if compaction is scheduled. */
@@ -326,6 +328,10 @@ vy_range_update_compact_priority(struct vy_range *range,
 
 	struct vy_slice *slice;
 	rlist_foreach_entry(slice, &range->slices, in_range) {
+		struct vy_stmt_stat stmt_stat;
+		vy_slice_stmt_stat(slice, &stmt_stat);
+		vy_stmt_stat_add(&total_stmt_stat, &stmt_stat);
+
 		uint64_t size = slice->count.bytes_compressed;
 		/*
 		 * The size of the first level is defined by
@@ -377,6 +383,22 @@ vy_range_update_compact_priority(struct vy_range *range,
 			est_new_run_size = total_stmt_count.bytes_compressed;
 		}
 	}
+
+	/*
+	 * Even a perfectly shaped LSM tree can accumulate a huge
+	 * number of DELETE statements over time in case indexed
+	 * fields get updated frequently. To prevent read and space
+	 * amplification from growing uncontrollably, we force
+	 * major compaction if the total number of DELETEs plus the
+	 * statements that they are supposed to purge exceeds a
+	 * half of all statements, i.e.
+	 *
+	 *   2 * deletes > total / 2
+	 */
+	if (total_stmt_stat.deletes > total_stmt_count.rows / 4) {
+		range->compact_priority = range->slice_count;
+		range->compact_queue = range->count;
+	}
 }
 
 /**
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index 5f2b980b..5c8b3700 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -476,6 +476,20 @@ vy_slice_cut(struct vy_slice *slice, int64_t id, struct tuple *begin,
 	return 0;
 }
 
+void
+vy_slice_stmt_stat(struct vy_slice *slice, struct vy_stmt_stat *stat)
+{
+	struct vy_run_info *run_info = &slice->run->info;
+	struct vy_stmt_stat *run_stmt_stat = &run_info->stmt_stat;
+	uint32_t run_pages = run_info->page_count;
+	uint32_t slice_pages = slice->last_page_no - slice->first_page_no + 1;
+
+	stat->inserts = run_stmt_stat->inserts * slice_pages / run_pages;
+	stat->replaces = run_stmt_stat->replaces * slice_pages / run_pages;
+	stat->deletes = run_stmt_stat->deletes * slice_pages / run_pages;
+	stat->upserts = run_stmt_stat->upserts * slice_pages / run_pages;
+}
+
 /**
  * Decode page information from xrow.
  *
diff --git a/src/box/vy_run.h b/src/box/vy_run.h
index 990daffa..90e83c22 100644
--- a/src/box/vy_run.h
+++ b/src/box/vy_run.h
@@ -492,6 +492,14 @@ vy_slice_cut(struct vy_slice *slice, int64_t id, struct tuple *begin,
 	     struct vy_slice **result);
 
 /**
+ * Estimate statement count of each type spanned by a slice.
+ * Basically, this function returns run_info::stmt_stat scaled
+ * down proportionately to the slice size.
+ */
+void
+vy_slice_stmt_stat(struct vy_slice *slice, struct vy_stmt_stat *stat);
+
+/**
  * Open an iterator over on-disk run.
  *
  * Note, it is the caller's responsibility to make sure the slice
diff --git a/src/box/vy_stat.h b/src/box/vy_stat.h
index 9bb09174..04fea3c3 100644
--- a/src/box/vy_stat.h
+++ b/src/box/vy_stat.h
@@ -309,6 +309,12 @@ vy_disk_stmt_counter_sub(struct vy_disk_stmt_counter *c1,
 	c1->pages -= c2->pages;
 }
 
+static inline void
+vy_stmt_stat_reset(struct vy_stmt_stat *stat)
+{
+	memset(stat, 0, sizeof(*stat));
+}
+
 /**
  * Account a single statement of the given type in @stat.
  */
diff --git a/test/vinyl/compact.result b/test/vinyl/compact.result
index a79292ee..51dca963 100644
--- a/test/vinyl/compact.result
+++ b/test/vinyl/compact.result
@@ -210,3 +210,43 @@ info() -- 4 ranges, 4 runs
 s:drop()
 ---
 ...
+--
+-- gh-3225: major compaction is triggered if key fields are
+-- frequently updated.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+i1 = s:create_index('primary', {run_count_per_level = 10})
+---
+...
+i2 = s:create_index('secondary', {run_count_per_level = 10, parts = {2, 'unsigned'}})
+---
+...
+for i = 1, 10 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+for i = 1, 10 do s:update(i, {{'+', 2, 10}}) end
+---
+...
+box.snapshot()
+---
+- ok
+...
+while i2:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
+---
+...
+i2:stat().disk.statement
+---
+- inserts: 10
+  replaces: 0
+  upserts: 0
+  deletes: 0
+...
+s:drop()
+---
+...
diff --git a/test/vinyl/compact.test.lua b/test/vinyl/compact.test.lua
index b9a1c109..f0c71640 100644
--- a/test/vinyl/compact.test.lua
+++ b/test/vinyl/compact.test.lua
@@ -95,3 +95,21 @@ compact()
 info() -- 4 ranges, 4 runs
 
 s:drop()
+
+--
+-- gh-3225: major compaction is triggered if key fields are
+-- frequently updated.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+i1 = s:create_index('primary', {run_count_per_level = 10})
+i2 = s:create_index('secondary', {run_count_per_level = 10, parts = {2, 'unsigned'}})
+
+for i = 1, 10 do s:replace{i, i} end
+box.snapshot()
+for i = 1, 10 do s:update(i, {{'+', 2, 10}}) end
+box.snapshot()
+
+while i2:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
+i2:stat().disk.statement
+
+s:drop()
diff --git a/test/vinyl/deferred_delete.result b/test/vinyl/deferred_delete.result
index 4a68c6c2..14408fc2 100644
--- a/test/vinyl/deferred_delete.result
+++ b/test/vinyl/deferred_delete.result
@@ -662,6 +662,12 @@ test_run:cmd("switch test")
 fiber = require('fiber')
 ---
 ...
+var = box.schema.space.create('var')
+---
+...
+_ = var:create_index('primary', {parts = {1, 'string'}})
+---
+...
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
 ...
@@ -711,11 +717,18 @@ sk:stat().disk.dump.count -- 1
 ---
 - 1
 ...
-sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs + 120 deferred DELETEs
+-- Excess of DELETEs triggers compaction. Wait for it to complete.
+while sk:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
+---
+...
+-- Remember the total number of rows before restart.
+_ = var:put{'rows', sk:stat().rows}
 ---
-- 360
 ...
 test_run:cmd("restart server test with args='1048576'")
+var = box.space.var
+---
+...
 s = box.space.test
 ---
 ...
@@ -725,16 +738,19 @@ pk = s.index.pk
 sk = s.index.sk
 ---
 ...
--- Should be 360, the same amount of statements as before restart.
+-- Should be the same amount of statements as before restart.
 -- If we applied all deferred DELETEs, including the dumped ones,
 -- then there would be more.
-sk:stat().rows
+sk:stat().rows == var:get('rows')[2]
 ---
-- 360
+- true
 ...
 s:drop()
 ---
 ...
+var:drop()
+---
+...
 test_run:cmd("switch default")
 ---
 - true
diff --git a/test/vinyl/deferred_delete.test.lua b/test/vinyl/deferred_delete.test.lua
index e9e0e293..dfffe22f 100644
--- a/test/vinyl/deferred_delete.test.lua
+++ b/test/vinyl/deferred_delete.test.lua
@@ -249,6 +249,9 @@ test_run:cmd("switch test")
 
 fiber = require('fiber')
 
+var = box.schema.space.create('var')
+_ = var:create_index('primary', {parts = {1, 'string'}})
+
 s = box.schema.space.create('test', {engine = 'vinyl'})
 pk = s:create_index('pk', {run_count_per_level = 10})
 sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
@@ -273,19 +276,25 @@ while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
 
 sk:stat().disk.dump.count -- 1
 
-sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs + 120 deferred DELETEs
+-- Excess of DELETEs triggers compaction. Wait for it to complete.
+while sk:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
+
+-- Remember the total number of rows before restart.
+_ = var:put{'rows', sk:stat().rows}
 
 test_run:cmd("restart server test with args='1048576'")
+var = box.space.var
 s = box.space.test
 pk = s.index.pk
 sk = s.index.sk
 
--- Should be 360, the same amount of statements as before restart.
+-- Should be the same amount of statements as before restart.
 -- If we applied all deferred DELETEs, including the dumped ones,
 -- then there would be more.
-sk:stat().rows
+sk:stat().rows == var:get('rows')[2]
 
 s:drop()
+var:drop()
 
 test_run:cmd("switch default")
 test_run:cmd("stop server test")
diff --git a/test/vinyl/update_optimize.result b/test/vinyl/update_optimize.result
index 40f34151..93aa5989 100644
--- a/test/vinyl/update_optimize.result
+++ b/test/vinyl/update_optimize.result
@@ -721,7 +721,7 @@ s = box.schema.space.create('test', {engine = 'vinyl'})
 _ = s:create_index('pk')
 ---
 ...
-_ = s:create_index('sk', {parts = {2, 'unsigned'}, run_count_per_level = 10})
+_ = s:create_index('sk', {parts = {2, 'unsigned'}})
 ---
 ...
 s:insert{1, 10}
@@ -743,9 +743,12 @@ box.snapshot()
 ---
 - ok
 ...
-s.index.sk:stat().rows -- INSERT in the first run + DELETE the second run
+while s.index.sk:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
 ---
-- 2
+...
+s.index.sk:stat().rows -- 0 (INSERT + DELETE got purged by compaction)
+---
+- 0
 ...
 s:insert{1, 20}
 ---
diff --git a/test/vinyl/update_optimize.test.lua b/test/vinyl/update_optimize.test.lua
index a214dcae..a3eed38b 100644
--- a/test/vinyl/update_optimize.test.lua
+++ b/test/vinyl/update_optimize.test.lua
@@ -241,7 +241,7 @@ space:drop()
 --
 s = box.schema.space.create('test', {engine = 'vinyl'})
 _ = s:create_index('pk')
-_ = s:create_index('sk', {parts = {2, 'unsigned'}, run_count_per_level = 10})
+_ = s:create_index('sk', {parts = {2, 'unsigned'}})
 
 s:insert{1, 10}
 box.snapshot()
@@ -250,7 +250,8 @@ s:update(1, {{'=', 2, 10}})
 s:delete(1)
 box.snapshot()
 
-s.index.sk:stat().rows -- INSERT in the first run + DELETE the second run
+while s.index.sk:stat().disk.compact.queue.rows > 0 do fiber.sleep(0.01) end
+s.index.sk:stat().rows -- 0 (INSERT + DELETE got purged by compaction)
 
 s:insert{1, 20}
 s.index.sk:select()
-- 
2.11.0

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tarantool-patches] Re: [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority
  2018-10-20 18:23 ` [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority Vladimir Davydov
@ 2018-10-23  8:51   ` Konstantin Osipov
  2018-10-24 11:19     ` Vladimir Davydov
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Osipov @ 2018-10-23  8:51 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/20 23:19]:
> Local variable total_size equals total_stmt_count.bytes_compressed so we
> don't really need it.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tarantool-patches] Re: [PATCH 2/3] vinyl: account disk statements of each type
  2018-10-20 18:23 ` [PATCH 2/3] vinyl: account disk statements of each type Vladimir Davydov
@ 2018-10-23  8:58   ` Konstantin Osipov
  2018-10-24 11:19     ` Vladimir Davydov
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Osipov @ 2018-10-23  8:58 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/20 23:19]:
> This patch adds a new entry to per index statistics reported by
> index.stat():
> 
>   disk.statement
>     inserts
>     replaces
>     deletes
>     upserts
> 
> It shows the number of statements of each type stored in run files.
> The new statistics are persisted in index files. We will need this
> information so that we can force major compaction when there are too
> many DELETE statements accumulated in run files.
> 
> Needed for #3225

Please add a documentation request to the changeset comment and
ensure there is a doc ticket - docbot had uptime issues recently.

Moreover, it turned out a number of features were not filed for
documentation in 1.10 release, I believe we need to ban external
behaviour changes in "internal" patches - any patch which changes
the server api should be assigned to an own ticket.

The patch itself is OK to push. 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [tarantool-patches] Re: [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs
  2018-10-20 18:23 ` [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs Vladimir Davydov
@ 2018-10-23  9:03   ` Konstantin Osipov
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Osipov @ 2018-10-23  9:03 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/20 23:19]:
> Even a perfectly shaped LSM tree can accumulate a huge number of DELETE
> statements over time in case indexed fields are frequently updated. This
> can significantly increase read and space amplification, especially for
> secondary indexes.
> 
> One way to deal with it is to propagate read amplification back to the
> scheduler so that it can raise compaction priority accordingly. Although
> this would probably make sense, it wouldn't be enough, because it
> wouldn't deal with space amplification growth in case the workload is
> write-mostly.

I disagree with the reasoning. We need a weighted norm of all
parameters of the lsm tree when calculating compaction priority. 
It's pretty easy to do. Imagine it's a multi-dimensional space,
where dimensions are write amplification, read
amplification, space amplification. We need to scale each
dimension and calculated a distance to the center of the space,
which stands for a perfectly shaped lsm. 

In any case reduction of read amplification and space amplification
address independent concerns: we need to ensure that space
amplification is within boundaries to not run out of disk space. 


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority
  2018-10-23  8:51   ` [tarantool-patches] " Konstantin Osipov
@ 2018-10-24 11:19     ` Vladimir Davydov
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-24 11:19 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Oct 23, 2018 at 11:51:35AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/20 23:19]:
> > Local variable total_size equals total_stmt_count.bytes_compressed so we
> > don't really need it.
> 
> OK to push.

Pushed to 1.10-features

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [tarantool-patches] Re: [PATCH 2/3] vinyl: account disk statements of each type
  2018-10-23  8:58   ` [tarantool-patches] " Konstantin Osipov
@ 2018-10-24 11:19     ` Vladimir Davydov
  0 siblings, 0 replies; 9+ messages in thread
From: Vladimir Davydov @ 2018-10-24 11:19 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Oct 23, 2018 at 11:58:45AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/10/20 23:19]:
> > This patch adds a new entry to per index statistics reported by
> > index.stat():
> > 
> >   disk.statement
> >     inserts
> >     replaces
> >     deletes
> >     upserts
> > 
> > It shows the number of statements of each type stored in run files.
> > The new statistics are persisted in index files. We will need this
> > information so that we can force major compaction when there are too
> > many DELETE statements accumulated in run files.
> > 
> > Needed for #3225
> 
> Please add a documentation request to the changeset comment and
> ensure there is a doc ticket - docbot had uptime issues recently.
> 
> Moreover, it turned out a number of features were not filed for
> documentation in 1.10 release, I believe we need to ban external
> behaviour changes in "internal" patches - any patch which changes
> the server api should be assigned to an own ticket.
> 
> The patch itself is OK to push. 

Pushed to 1.10-features and opened a doc ticket:

https://github.com/tarantool/doc/issues/645

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-10-24 11:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-20 18:23 [PATCH 0/3] Force compaction if there too many DELETEs Vladimir Davydov
2018-10-20 18:23 ` [PATCH 1/3] vinyl: remove useless local var from vy_range_update_compact_priority Vladimir Davydov
2018-10-23  8:51   ` [tarantool-patches] " Konstantin Osipov
2018-10-24 11:19     ` Vladimir Davydov
2018-10-20 18:23 ` [PATCH 2/3] vinyl: account disk statements of each type Vladimir Davydov
2018-10-23  8:58   ` [tarantool-patches] " Konstantin Osipov
2018-10-24 11:19     ` Vladimir Davydov
2018-10-20 18:23 ` [PATCH 3/3] vinyl: force major compaction if there are too many DELETEs Vladimir Davydov
2018-10-23  9:03   ` [tarantool-patches] " Konstantin Osipov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox