Tarantool development patches archive
 help / color / mirror / Atom feed
* [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE
@ 2018-08-21 11:15 Vladimir Davydov
  2018-08-21 11:15 ` [PATCH v2 1/7] vinyl: do not store meta in secondary index runs Vladimir Davydov
                   ` (7 more replies)
  0 siblings, 8 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch set optimizes REPLACE and DELETE operations in vinyl in
presence of secondary indexes: now they don't need to read the primary
key in order to delete the overwritten/deleted tuple from secondary
indexes, instead this job is handed over to primary index compaction
task, while the read iterator is tought to filter out overwritten
tuples that haven't been purged yet.

For more information about the algorithm and implementation details,
please see the final patch of the series.

https://github.com/tarantool/tarantool/issues/2129
https://github.com/tarantool/tarantool/commits/dv/gh-2129-vy-eliminate-read-on-replace-delete

Changes in v2:
 - Remove merged patches and rebase on the latest 1.10.
 - Use id 257 for _vinyl_deferred_delete space, as it doesn't depend on
   other system spaces.
 - Refactor _vinyl_deferred_delete space proto creation code.
 - Use interface instead of callback for returning deferred DELETEs from
   write iterator.
 - Get rid of vy_mem::wal_deferred_delete_max_lsn. Account deferred
   DELETE WAL LSN in vy_mem::dump_lsn.
 - Don't encode flags for secondary index statements (so that there's no
   need to pass deferred DELETE handler to the write iterator on dump).

v1: https://www.freelists.org/post/tarantool-patches/PATCH-0025-vinyl-eliminate-disk-read-on-REPLACEDELETE
v0: https://www.freelists.org/post/tarantool-patches/RFC-PATCH-0023-vinyl-eliminate-read-on-REPLACEDELETE

Vladimir Davydov (7):
  vinyl: do not store meta in secondary index runs
  vinyl: teach write iterator to return overwritten tuples
  vinyl: prepare write iterator heap comparator for deferred DELETEs
  vinyl: allow to skip certain statements on read
  Introduce _vinyl_deferred_delete system space
  vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn
  vinyl: eliminate disk read on REPLACE/DELETE

 src/box/bootstrap.snap              | Bin 1540 -> 1607 bytes
 src/box/lua/space.cc                |   2 +
 src/box/lua/upgrade.lua             |  21 ++
 src/box/schema.cc                   |  27 +-
 src/box/schema_def.h                |   2 +
 src/box/vinyl.c                     | 261 ++++++++++++--
 src/box/vy_lsm.h                    |   5 +
 src/box/vy_mem.c                    |  27 +-
 src/box/vy_mem.h                    |  16 +-
 src/box/vy_point_lookup.c           |  32 ++
 src/box/vy_point_lookup.h           |  18 +
 src/box/vy_run.c                    |   7 +-
 src/box/vy_scheduler.c              | 310 ++++++++++++++++-
 src/box/vy_stmt.c                   |   7 +-
 src/box/vy_stmt.h                   |  29 ++
 src/box/vy_tx.c                     | 133 +++++++
 src/box/vy_write_iterator.c         | 153 +++++++-
 src/box/vy_write_iterator.h         |  45 ++-
 test/app-tap/tarantoolctl.test.lua  |   2 +-
 test/box-py/bootstrap.result        |   7 +-
 test/box/access_misc.result         |   5 +-
 test/box/access_sysview.result      |   2 +-
 test/unit/vy_iterators_helper.c     |   5 +
 test/unit/vy_iterators_helper.h     |  12 +-
 test/unit/vy_mem.c                  |  11 +-
 test/unit/vy_mem.result             |  22 +-
 test/unit/vy_point_lookup.c         |   6 +-
 test/unit/vy_write_iterator.c       | 254 +++++++++++++-
 test/unit/vy_write_iterator.result  |  22 +-
 test/vinyl/deferred_delete.result   | 677 ++++++++++++++++++++++++++++++++++++
 test/vinyl/deferred_delete.test.lua | 261 ++++++++++++++
 test/vinyl/info.result              |  18 +-
 test/vinyl/info.test.lua            |   9 +-
 test/vinyl/layout.result            |  46 ++-
 test/vinyl/quota.result             |   2 +-
 test/vinyl/tx_gap_lock.result       |  16 +-
 test/vinyl/tx_gap_lock.test.lua     |  10 +-
 test/vinyl/write_iterator.result    |   5 +
 test/vinyl/write_iterator.test.lua  |   3 +
 test/wal_off/alter.result           |   2 +-
 test/xlog/upgrade.result            |   7 +-
 41 files changed, 2344 insertions(+), 155 deletions(-)
 create mode 100644 test/vinyl/deferred_delete.result
 create mode 100644 test/vinyl/deferred_delete.test.lua

-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 1/7] vinyl: do not store meta in secondary index runs
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:08   ` Konstantin Osipov
  2018-08-21 11:15 ` [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currenlty, tuple meta is only needed for storing statement flags in run
files. In the scope of #2129 two statement flags will be introduced,
VY_STMT_SKIP_READ and VY_STMT_DEFERRED_DELETE. None of them makes any
sense for secondary indexes. If we encode meta for secondary index
statements, we will have to either clear the flags on the upper level
(e.g. in the write iterator) or filter them out before encoding a
statement. Alternatively, we can skip encoding meta for secondary index
statements altogether, and this is what this patch does, because it's
the simplest and clearest method for now. If tuple meta is ever used for
storing anything else besides statement flags or a new statement flag
appears that may be used with secondary index statements, we will
recover the code and mask out those flags for secondary indexes.
---
 src/box/vy_stmt.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/box/vy_stmt.c b/src/box/vy_stmt.c
index d6ebee80..37da282d 100644
--- a/src/box/vy_stmt.c
+++ b/src/box/vy_stmt.c
@@ -615,8 +615,11 @@ vy_stmt_encode_secondary(const struct tuple *value,
 		request.key = extracted;
 		request.key_end = extracted + size;
 	}
-	if (vy_stmt_meta_encode(value, &request) != 0)
-		return -1;
+	/*
+	 * Note, all flags used with vinyl statements make sense
+	 * only for primary indexes so we do not encode meta for
+	 * secondary index statements.
+	 */
 	xrow->bodycnt = xrow_encode_dml(&request, xrow->body);
 	if (xrow->bodycnt < 0)
 		return -1;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
  2018-08-21 11:15 ` [PATCH v2 1/7] vinyl: do not store meta in secondary index runs Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:14   ` Konstantin Osipov
  2018-08-21 11:15 ` [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs Vladimir Davydov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

A REPLACE/DELETE request is supposed to delete the old tuple from all
indexes. In order to generate a DELETE statement for a secondary index,
we need to look up the old tuple in the primary index, which is costly
as it implies a random disk access. In the scope of #2129 we are
planning to optimize out the lookup by deferring generation of the
DELETE statement until primary index compaction.

To do that, we need to differentiate statements for which DELETE was
deferred from those for which it was inserted when the request was
executed (as it is the case for UPDATE). So this patch introduces a per
statement flag, VY_STMT_DEFERRED_DELETE. If set for a REPLACE or DELETE
statement, it will make the write iterator to return the overwritten
statement to the caller via a callback.

Needed for #2129
---
 src/box/vinyl.c                    |   2 +-
 src/box/vy_scheduler.c             |   4 +-
 src/box/vy_stmt.h                  |  19 +++
 src/box/vy_write_iterator.c        | 140 +++++++++++++++++++-
 src/box/vy_write_iterator.h        |  45 ++++++-
 test/unit/vy_iterators_helper.c    |   5 +
 test/unit/vy_iterators_helper.h    |  12 +-
 test/unit/vy_point_lookup.c        |   4 +-
 test/unit/vy_write_iterator.c      | 254 ++++++++++++++++++++++++++++++++++---
 test/unit/vy_write_iterator.result |  22 +++-
 10 files changed, 475 insertions(+), 32 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index f2f93736..fd14d1e7 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3007,7 +3007,7 @@ vy_send_range(struct vy_join_ctx *ctx,
 	struct rlist fake_read_views;
 	rlist_create(&fake_read_views);
 	ctx->wi = vy_write_iterator_new(ctx->key_def, ctx->format,
-					true, true, &fake_read_views);
+					true, true, &fake_read_views, NULL);
 	if (ctx->wi == NULL) {
 		rc = -1;
 		goto out;
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index b206a605..4e8b476b 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1006,7 +1006,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (lsm->run_count == 0);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views);
+				   scheduler->read_views, NULL);
 	if (wi == NULL)
 		goto err_wi;
 	rlist_foreach_entry(mem, &lsm->sealed, in_sealed) {
@@ -1273,7 +1273,7 @@ vy_task_compact_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (range->compact_priority == range->slice_count);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views);
+				   scheduler->read_views, NULL);
 	if (wi == NULL)
 		goto err_wi;
 
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index bee3c21e..8051f1e2 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -70,6 +70,25 @@ extern struct tuple_format_vtab vy_tuple_format_vtab;
  */
 extern size_t vy_max_tuple_size;
 
+/** Statement flags. */
+enum {
+	/**
+	 * A REPLACE/DELETE request is supposed to delete the old
+	 * tuple from all indexes. In order to generate a DELETE
+	 * statement for a secondary index, we need to look up the
+	 * old tuple in the primary index, which is expensive as
+	 * it implies a random disk access. We can optimize out the
+	 * lookup by deferring generation of the DELETE statement
+	 * until primary index compaction.
+	 *
+	 * The following flag is set for those REPLACE and DELETE
+	 * statements that skipped deletion of the old tuple from
+	 * secondary indexes. It makes the write iterator generate
+	 * DELETE statements for them during compaction.
+	 */
+	VY_STMT_DEFERRED_DELETE		= 1 << 0,
+};
+
 /**
  * There are two groups of statements:
  *
diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index 06ae342b..50c51f2b 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -177,7 +177,14 @@ struct vy_write_iterator {
 	 * key and its tuple format is different.
 	 */
 	bool is_primary;
-
+	/** Deferred DELETE handler. */
+	struct vy_deferred_delete_handler *deferred_delete_handler;
+	/**
+	 * Last scanned REPLACE or DELETE statement that was
+	 * inserted into the primary index without deletion
+	 * of the old tuple from secondary indexes.
+	 */
+	struct tuple *deferred_delete_stmt;
 	/** Length of the @read_views. */
 	int rv_count;
 	/**
@@ -331,11 +338,16 @@ static const struct vy_stmt_stream_iface vy_slice_stream_iface;
  */
 struct vy_stmt_stream *
 vy_write_iterator_new(const struct key_def *cmp_def,
-		      struct tuple_format *format,
-		      bool is_primary, bool is_last_level,
-		      struct rlist *read_views)
+		      struct tuple_format *format, bool is_primary,
+		      bool is_last_level, struct rlist *read_views,
+		      struct vy_deferred_delete_handler *handler)
 {
 	/*
+	 * Deferred DELETE statements can only be produced by
+	 * primary index compaction.
+	 */
+	assert(is_primary || handler == NULL);
+	/*
 	 * One is reserved for INT64_MAX - maximal read view.
 	 */
 	int count = 1;
@@ -368,6 +380,7 @@ vy_write_iterator_new(const struct key_def *cmp_def,
 	tuple_format_ref(stream->format);
 	stream->is_primary = is_primary;
 	stream->is_last_level = is_last_level;
+	stream->deferred_delete_handler = handler;
 	return &stream->base;
 }
 
@@ -406,6 +419,16 @@ vy_write_iterator_stop(struct vy_stmt_stream *vstream)
 		vy_stmt_unref_if_possible(stream->last_stmt);
 		stream->last_stmt = NULL;
 	}
+	if (stream->deferred_delete_stmt != NULL) {
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
+	struct vy_deferred_delete_handler *handler =
+			stream->deferred_delete_handler;
+	if (handler != NULL) {
+		handler->iface->destroy(handler);
+		stream->deferred_delete_handler = NULL;
+	}
 }
 
 /**
@@ -554,6 +577,60 @@ vy_write_iterator_pop_read_view_stmt(struct vy_write_iterator *stream)
 }
 
 /**
+ * Generate a DELETE statement for the given tuple if its
+ * deletion from secondary indexes was deferred.
+ *
+ * @param stream Write iterator.
+ * @param stmt Current statement.
+ *
+ * @retval  0 Success.
+ * @retval -1 Error.
+ */
+static int
+vy_write_iterator_deferred_delete(struct vy_write_iterator *stream,
+				  struct tuple *stmt)
+{
+	if (!stream->is_primary)
+		return 0;
+
+	/*
+	 * UPSERTs cannot change secondary index parts neither
+	 * can they produce deferred DELETEs, so we skip them.
+	 */
+	if (vy_stmt_type(stmt) == IPROTO_UPSERT) {
+		assert((vy_stmt_flags(stmt) & VY_STMT_DEFERRED_DELETE) == 0);
+		return 0;
+	}
+	/*
+	 * Invoke the callback to generate a deferred DELETE
+	 * in case the current tuple was overwritten.
+	 */
+	if (stream->deferred_delete_stmt != NULL) {
+		struct vy_deferred_delete_handler *handler =
+				stream->deferred_delete_handler;
+		if (handler != NULL && vy_stmt_type(stmt) != IPROTO_DELETE &&
+		    handler->iface->process(handler, stmt,
+					    stream->deferred_delete_stmt) != 0)
+			return -1;
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
+	/*
+	 * Remember the current statement if it is marked with
+	 * VY_STMT_DEFERRED_DELETE so that we can use it to
+	 * generate a DELETE for the overwritten tuple when this
+	 * function is called next time.
+	 */
+	if ((vy_stmt_flags(stmt) & VY_STMT_DEFERRED_DELETE) != 0) {
+		assert(vy_stmt_type(stmt) == IPROTO_DELETE ||
+		       vy_stmt_type(stmt) == IPROTO_REPLACE);
+		vy_stmt_ref_if_possible(stmt);
+		stream->deferred_delete_stmt = stmt;
+	}
+	return 0;
+}
+
+/**
  * Build the history of the current key.
  * Apply optimizations 1, 2 and 3 (@sa vy_write_iterator.h).
  * When building a history, some statements can be
@@ -578,6 +655,7 @@ vy_write_iterator_build_history(struct vy_write_iterator *stream,
 	*count = 0;
 	*is_first_insert = false;
 	assert(stream->stmt_i == -1);
+	assert(stream->deferred_delete_stmt == NULL);
 	struct heap_node *node = vy_source_heap_top(&stream->src_heap);
 	if (node == NULL)
 		return 0; /* no more data */
@@ -630,6 +708,10 @@ vy_write_iterator_build_history(struct vy_write_iterator *stream,
 				*is_first_insert = true;
 		}
 
+		rc = vy_write_iterator_deferred_delete(stream, src->tuple);
+		if (rc != 0)
+			break;
+
 		if (vy_stmt_lsn(src->tuple) > current_rv_lsn) {
 			/*
 			 * Skip statements invisible to the current read
@@ -710,6 +792,17 @@ next_lsn:
 			break;
 	}
 
+	/*
+	 * No point in keeping the last VY_STMT_DEFERRED_DELETE
+	 * statement around if this is major compaction, because
+	 * there's no tuple it could overwrite.
+	 */
+	if (rc == 0 && stream->is_last_level &&
+	    stream->deferred_delete_stmt != NULL) {
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
+
 	vy_source_heap_delete(&stream->src_heap, &end_of_key_src.heap_node);
 	vy_stmt_unref_if_possible(end_of_key_src.tuple);
 	return rc;
@@ -794,6 +887,23 @@ vy_read_view_merge(struct vy_write_iterator *stream, struct tuple *hint,
 	rv->history = NULL;
 	result->tuple = NULL;
 	assert(result->next == NULL);
+	/*
+	 * The write iterator generates deferred DELETEs for all
+	 * VY_STMT_DEFERRED_DELETE statements, except, may be,
+	 * the last seen one. Clear the flag for all other output
+	 * statements so as not to generate the same DELETEs on
+	 * the next compaction.
+	 */
+	uint8_t flags = vy_stmt_flags(rv->tuple);
+	if ((flags & VY_STMT_DEFERRED_DELETE) != 0 &&
+	    rv->tuple != stream->deferred_delete_stmt) {
+		if (!vy_stmt_is_refable(rv->tuple)) {
+			rv->tuple = vy_stmt_dup(rv->tuple);
+			if (rv->tuple == NULL)
+				return -1;
+		}
+		vy_stmt_set_flags(rv->tuple, flags & ~VY_STMT_DEFERRED_DELETE);
+	}
 	if (hint != NULL) {
 		/* Not the first statement. */
 		return 0;
@@ -918,6 +1028,28 @@ vy_write_iterator_next(struct vy_stmt_stream *vstream,
 	*ret = vy_write_iterator_pop_read_view_stmt(stream);
 	if (*ret != NULL)
 		return 0;
+	/*
+	 * If we didn't generate a deferred DELETE corresponding to
+	 * the last seen VY_STMT_DEFERRED_DELETE statement, we must
+	 * include it into the output, because there still might be
+	 * an overwritten tuple in an older source.
+	 */
+	if (stream->deferred_delete_stmt != NULL) {
+		if (stream->deferred_delete_stmt == stream->last_stmt) {
+			/*
+			 * The statement was returned via a read view.
+			 * Nothing to do.
+			 */
+			vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+			stream->deferred_delete_stmt = NULL;
+		} else {
+			if (stream->last_stmt != NULL)
+				vy_stmt_unref_if_possible(stream->last_stmt);
+			*ret = stream->last_stmt = stream->deferred_delete_stmt;
+			stream->deferred_delete_stmt = NULL;
+			return 0;
+		}
+	}
 
 	/* Build the next key sequence. */
 	stream->stmt_i = -1;
diff --git a/src/box/vy_write_iterator.h b/src/box/vy_write_iterator.h
index ea14b07a..5214b60c 100644
--- a/src/box/vy_write_iterator.h
+++ b/src/box/vy_write_iterator.h
@@ -213,6 +213,7 @@
  */
 
 struct vy_write_iterator;
+struct vy_deferred_delete_handler;
 struct key_def;
 struct tuple_format;
 struct tuple;
@@ -220,6 +221,41 @@ struct vy_mem;
 struct vy_slice;
 
 /**
+ * Callback invoked by the write iterator for tuples that were
+ * overwritten or deleted in the primary index without generating
+ * a DELETE statement for secondary indexes. It is supposed to
+ * produce a DELETE statement and insert it into secondary indexes.
+ *
+ * @param handler  Deferred DELETE handler.
+ * @param old_stmt Overwritten tuple.
+ * @param new_stmt Statement that overwrote @old_stmt.
+ *
+ * @retval  0 Success.
+ * @retval -1 Error.
+ *
+ * @sa VY_STMT_DEFERRED_DELETE.
+ */
+typedef int
+(*vy_deferred_delete_process_f)(struct vy_deferred_delete_handler *handler,
+				struct tuple *old_stmt, struct tuple *new_stmt);
+
+/**
+ * Callack invoked by the write iterator to destroy a deferred
+ * DELETE handler when the iteration is stopped.
+ */
+typedef void
+(*vy_deferred_delete_destroy_f)(struct vy_deferred_delete_handler *handler);
+
+struct vy_deferred_delete_handler_iface {
+	vy_deferred_delete_process_f process;
+	vy_deferred_delete_destroy_f destroy;
+};
+
+struct vy_deferred_delete_handler {
+	const struct vy_deferred_delete_handler_iface *iface;
+};
+
+/**
  * Open an empty write iterator. To add sources to the iterator
  * use vy_write_iterator_add_* functions.
  * @param cmp_def - key definition for tuple compare.
@@ -227,13 +263,16 @@ struct vy_slice;
  * @param LSM tree is_primary - set if this iterator is for a primary index.
  * @param is_last_level - there is no older level than the one we're writing to.
  * @param read_views - Opened read views.
+ * @param handler - Deferred DELETE handler or NULL if no deferred DELETEs is
+ * expected. Only relevant to primary index compaction. For secondary indexes
+ * this argument must be set to NULL.
  * @return the iterator or NULL on error (diag is set).
  */
 struct vy_stmt_stream *
 vy_write_iterator_new(const struct key_def *cmp_def,
-		      struct tuple_format *format,
-		      bool is_primary, bool is_last_level,
-		      struct rlist *read_views);
+		      struct tuple_format *format, bool is_primary,
+		      bool is_last_level, struct rlist *read_views,
+		      struct vy_deferred_delete_handler *handler);
 
 /**
  * Add a mem as a source to the iterator.
diff --git a/test/unit/vy_iterators_helper.c b/test/unit/vy_iterators_helper.c
index 642d8bf2..89603376 100644
--- a/test/unit/vy_iterators_helper.c
+++ b/test/unit/vy_iterators_helper.c
@@ -136,6 +136,7 @@ vy_new_simple_stmt(struct tuple_format *format,
 	}
 	free(buf);
 	vy_stmt_set_lsn(ret, templ->lsn);
+	vy_stmt_set_flags(ret, templ->flags);
 	if (templ->optimize_update)
 		vy_stmt_set_column_mask(ret, 0);
 	return ret;
@@ -277,6 +278,10 @@ vy_stmt_are_same(const struct tuple *actual,
 		tuple_unref(tmp);
 		return false;
 	}
+	if (vy_stmt_flags(actual) != expected->flags) {
+		tuple_unref(tmp);
+		return false;
+	}
 	bool rc = memcmp(a, b, a_len) == 0;
 	tuple_unref(tmp);
 	return rc;
diff --git a/test/unit/vy_iterators_helper.h b/test/unit/vy_iterators_helper.h
index e38ec295..24641df3 100644
--- a/test/unit/vy_iterators_helper.h
+++ b/test/unit/vy_iterators_helper.h
@@ -43,10 +43,16 @@
 #define vyend 99999999
 #define MAX_FIELDS_COUNT 100
 #define STMT_TEMPLATE(lsn, type, ...) \
-{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, false, 0, 0 }
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, false, 0, 0, 0 }
 
 #define STMT_TEMPLATE_OPTIMIZED(lsn, type, ...) \
-{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, true, 0, 0 }
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, true, 0, 0, 0 }
+
+#define STMT_TEMPLATE_FLAGS(lsn, type, flags, ...) \
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, false, flags, 0, 0 }
+
+#define STMT_TEMPLATE_DEFERRED_DELETE(lsn, type, ...) \
+STMT_TEMPLATE_FLAGS(lsn, type, VY_STMT_DEFERRED_DELETE, __VA_ARGS__)
 
 extern struct tuple_format_vtab vy_tuple_format_vtab;
 extern struct tuple_format *vy_key_format;
@@ -82,6 +88,8 @@ struct vy_stmt_template {
 	 * to skip it in the write_iterator.
 	 */
 	bool optimize_update;
+	/** Statement flags. */
+	uint8_t flags;
 	/*
 	 * In case of upsert it is possible to use only one 'add' operation.
 	 * This is the column number of the operation.
diff --git a/test/unit/vy_point_lookup.c b/test/unit/vy_point_lookup.c
index b9b7d6ff..87f26900 100644
--- a/test/unit/vy_point_lookup.c
+++ b/test/unit/vy_point_lookup.c
@@ -192,7 +192,7 @@ test_basic()
 	}
 	struct vy_stmt_stream *write_stream
 		= vy_write_iterator_new(pk->cmp_def, pk->disk_format,
-					true, true, &read_views);
+					true, true, &read_views, NULL);
 	vy_write_iterator_new_mem(write_stream, run_mem);
 	struct vy_run *run = vy_run_new(&run_env, 1);
 	isnt(run, NULL, "vy_run_new");
@@ -225,7 +225,7 @@ test_basic()
 	}
 	write_stream
 		= vy_write_iterator_new(pk->cmp_def, pk->disk_format,
-					true, true, &read_views);
+					true, true, &read_views, NULL);
 	vy_write_iterator_new_mem(write_stream, run_mem);
 	run = vy_run_new(&run_env, 2);
 	isnt(run, NULL, "vy_run_new");
diff --git a/test/unit/vy_write_iterator.c b/test/unit/vy_write_iterator.c
index 25a346af..337e27ac 100644
--- a/test/unit/vy_write_iterator.c
+++ b/test/unit/vy_write_iterator.c
@@ -3,6 +3,65 @@
 #include "vy_write_iterator.h"
 #include "vy_iterators_helper.h"
 
+enum { MAX_DEFERRED_COUNT = 32 };
+
+/** Test deferred delete handler. */
+struct test_handler {
+	struct vy_deferred_delete_handler base;
+	/** Format to use for making DELETEs. */
+	struct tuple_format *format;
+	/** Deferred DELETEs generated by the write iterator. */
+	struct tuple *stmt[MAX_DEFERRED_COUNT];
+	/** Number of elements in @stmt array. */
+	int count;
+};
+
+/**
+ * Callback passed to the write iterator for generating deferred
+ * DELETE statements.
+ */
+static int
+test_handler_process(struct vy_deferred_delete_handler *base,
+		     struct tuple *old_stmt, struct tuple *new_stmt)
+{
+	struct test_handler *handler = (struct test_handler *)base;
+
+	fail_if(vy_stmt_type(old_stmt) == IPROTO_DELETE);
+	fail_if(vy_stmt_type(new_stmt) != IPROTO_DELETE &&
+		vy_stmt_type(new_stmt) != IPROTO_REPLACE);
+
+	struct tuple *delete = vy_stmt_new_surrogate_delete(handler->format,
+							    old_stmt);
+	fail_if(delete == NULL);
+	vy_stmt_set_lsn(delete, vy_stmt_lsn(new_stmt));
+
+	fail_if(handler->count >= MAX_DEFERRED_COUNT);
+	handler->stmt[handler->count++] = delete;
+	return 0;
+}
+
+static void
+test_handler_destroy(struct vy_deferred_delete_handler *base)
+{
+	struct test_handler *handler = (struct test_handler *)base;
+	for (int i = 0; i < handler->count; i++)
+		tuple_unref(handler->stmt[i]);
+}
+
+static const struct vy_deferred_delete_handler_iface test_handler_iface = {
+	.process = test_handler_process,
+	.destroy = test_handler_destroy,
+};
+
+static void
+test_handler_create(struct test_handler *handler, struct tuple_format *format)
+{
+	memset(handler, 0, sizeof(*handler));
+	handler->base.iface = &test_handler_iface;
+	handler->format = format;
+	tuple_format_ref(format);
+}
+
 /**
  * Create a mem with the specified content, iterate over it with
  * write_iterator and compare actual result statements with the
@@ -12,6 +71,8 @@
  * @param content_count Size of the @content.
  * @param expected Expected results of the iteration.
  * @param expected_count Size of the @expected.
+ * @param deferred Expected deferred DELETEs returned by the iteration.
+ * @param deferred_count Size of @deferred.
  * @param vlsns Read view lsns for the write iterator.
  * @param vlsns_count Size of the @vlsns.
  * @param is_primary True, if the new mem belongs to the primary
@@ -23,6 +84,8 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 			       int content_count,
 			       const struct vy_stmt_template *expected,
 			       int expected_count,
+			       const struct vy_stmt_template *deferred,
+			       int deferred_count,
 			       const int *vlsns, int vlsns_count,
 			       bool is_primary, bool is_last_level)
 {
@@ -38,8 +101,13 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 	fail_if(rv_array == NULL);
 	init_read_views_list(&rv_list, rv_array, vlsns, vlsns_count);
 
-	struct vy_stmt_stream *wi = vy_write_iterator_new(key_def, mem->format,
-					is_primary, is_last_level, &rv_list);
+	struct test_handler handler;
+	test_handler_create(&handler, mem->format);
+
+	struct vy_stmt_stream *wi;
+	wi = vy_write_iterator_new(key_def, mem->format, is_primary,
+				   is_last_level, &rv_list,
+				   is_primary ? &handler.base : NULL);
 	fail_if(wi == NULL);
 	fail_if(vy_write_iterator_new_mem(wi, mem) != 0);
 
@@ -58,7 +126,19 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 	} while (ret != NULL);
 	ok(i == expected_count, "correct results count");
 
+	for (i = 0; i < handler.count; i++) {
+		fail_if(i >= deferred_count);
+		ok(vy_stmt_are_same(handler.stmt[i], &deferred[i],
+				    handler.format, NULL),
+		   "deferred stmt %d is correct", i);
+	}
+	if (deferred != NULL) {
+		ok(handler.count == deferred_count,
+		   "correct deferred stmt count");
+	}
+
 	/* Clean up */
+	wi->iface->stop(wi);
 	wi->iface->close(wi);
 	vy_mem_delete(mem);
 	box_key_def_delete(key_def);
@@ -69,7 +149,7 @@ void
 test_basic(void)
 {
 	header();
-	plan(46);
+	plan(66);
 {
 /*
  * STATEMENT: REPL REPL REPL  DEL  REPL  REPL  REPL  REPL  REPL  REPL
@@ -98,7 +178,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -132,7 +212,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -160,7 +240,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -180,7 +260,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -204,7 +284,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -227,7 +307,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -255,7 +335,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, true);
 }
 {
@@ -275,7 +355,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, false);
 }
 {
@@ -302,7 +382,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -330,7 +410,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -355,7 +435,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, false);
 }
 {
@@ -380,7 +460,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -410,7 +490,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -451,7 +531,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -491,7 +571,147 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
+				       expected, expected_count, NULL, 0,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL DEL REPL REPL DEL DEL DEL REPL DEL INS DEL INS REPL
+ * LSN:            4   5    6    7   8   9  10   11  12  13  14  15   16
+ * DEFERRED DEL:   +   +    +        +   +        +           +        +
+ * READ VIEW:          *         *                *
+ *
+ * is_last_level = true
+ *
+ * Test generation of deferred DELETEs for various combinations
+ * of input statements.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(4, REPLACE, 1, 2),
+		STMT_TEMPLATE_DEFERRED_DELETE(5, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(6, REPLACE, 1, 3),
+		STMT_TEMPLATE(7, REPLACE, 1, 4),
+		STMT_TEMPLATE_DEFERRED_DELETE(8, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(9, DELETE, 1),
+		STMT_TEMPLATE(10, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(11, REPLACE, 1, 5),
+		STMT_TEMPLATE(12, DELETE, 1),
+		STMT_TEMPLATE(13, INSERT, 1, 6),
+		STMT_TEMPLATE_DEFERRED_DELETE(14, DELETE, 1),
+		STMT_TEMPLATE(15, INSERT, 1, 7),
+		STMT_TEMPLATE_DEFERRED_DELETE(16, REPLACE, 1, 8),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(16, REPLACE, 1, 8),
+		STMT_TEMPLATE(11, REPLACE, 1, 5),
+		STMT_TEMPLATE(7, REPLACE, 1, 4),
+	};
+	const struct vy_stmt_template deferred[] = {
+		STMT_TEMPLATE(16, DELETE, 1, 7),
+		STMT_TEMPLATE(14, DELETE, 1, 6),
+		STMT_TEMPLATE(8, DELETE, 1, 4),
+		STMT_TEMPLATE(5, DELETE, 1, 2),
+	};
+	const int vlsns[] = {5, 7, 11};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, true);
+}
+{
+/*
+ * STATEMENT:    REPL REPL DEL
+ * LSN:            7    8   9
+ * DEFERRED DEL:   +
+ *
+ * is_last_level = false
+ *
+ * Check that the oldest VY_STMT_DEFERRED_DELETE statement is
+ * preserved in case it doesn't overwrite a terminal statement
+ * and this is not a major compaction.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+		STMT_TEMPLATE(8, REPLACE, 1, 2),
+		STMT_TEMPLATE(9, DELETE, 1, 3),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(9, DELETE, 1, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL REPL DEL
+ * LSN:            7    8   9
+ * DEFERRED DEL:   +
+ * READ VIEW:      *
+ *
+ * is_last_level = false
+ *
+ * Check that the oldest VY_STMT_DEFERRED_DELETE statement is
+ * not returned twice if it is referenced by a read view.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+		STMT_TEMPLATE(8, REPLACE, 1, 2),
+		STMT_TEMPLATE(9, DELETE, 1, 3),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(9, DELETE, 1, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {7};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL
+ * LSN:            7
+ * DEFERRED DEL:   +
+ *
+ * is_last_level = false
+ *
+ * Check that the oldest VY_STMT_DEFERRED_DELETE statement is
+ * not returned twice if it is the only statement in the output.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
+				       deferred, deferred_count,
 				       vlsns, vlsns_count, true, false);
 }
 	fiber_gc();
diff --git a/test/unit/vy_write_iterator.result b/test/unit/vy_write_iterator.result
index 56d8cb1f..4f95aeb9 100644
--- a/test/unit/vy_write_iterator.result
+++ b/test/unit/vy_write_iterator.result
@@ -1,5 +1,5 @@
 	*** test_basic ***
-1..46
+1..66
 ok 1 - stmt 0 is correct
 ok 2 - stmt 1 is correct
 ok 3 - stmt 2 is correct
@@ -46,4 +46,24 @@ ok 43 - stmt 0 is correct
 ok 44 - stmt 1 is correct
 ok 45 - stmt 2 is correct
 ok 46 - correct results count
+ok 47 - stmt 0 is correct
+ok 48 - stmt 1 is correct
+ok 49 - stmt 2 is correct
+ok 50 - correct results count
+ok 51 - deferred stmt 0 is correct
+ok 52 - deferred stmt 1 is correct
+ok 53 - deferred stmt 2 is correct
+ok 54 - deferred stmt 3 is correct
+ok 55 - correct deferred stmt count
+ok 56 - stmt 0 is correct
+ok 57 - stmt 1 is correct
+ok 58 - correct results count
+ok 59 - correct deferred stmt count
+ok 60 - stmt 0 is correct
+ok 61 - stmt 1 is correct
+ok 62 - correct results count
+ok 63 - correct deferred stmt count
+ok 64 - stmt 0 is correct
+ok 65 - correct results count
+ok 66 - correct deferred stmt count
 	*** test_basic: done ***
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
  2018-08-21 11:15 ` [PATCH v2 1/7] vinyl: do not store meta in secondary index runs Vladimir Davydov
  2018-08-21 11:15 ` [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:38   ` Konstantin Osipov
  2018-08-21 11:15 ` [PATCH v2 4/7] vinyl: allow to skip certain statements on read Vladimir Davydov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the scope of #2129, we won't delete the overwritten tuple from
secondary indexes immediately on REPLACE. Instead we will defer
generation of the DELETE statement until the primary index compaction.
However, it may happen that the overwritten tuple and the tuple that
overwrote it have the same secondary key parts, in which case the
deferred DELETE is not needed and should be discarded on secondary
index compaction. This patch makes the write iterator heap comparator
function discard such useless deferred DELETEs.

Note, this patch also removes the code that prioritises terminal
statements over UPSERTs in the write iterator, which, according to the
comment, may happen only during forced recovery. I don't see why we
should do that, even during forced recovery, neither have I managed to
find the reason in the commit history, so I dropped this code in order
not to overburden the write iterator logic with some esoteric cases.

Needed for #2129
---
 src/box/vy_write_iterator.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index 50c51f2b..de58012a 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -242,12 +242,15 @@ heap_less(heap_t *heap, struct heap_node *node1, struct heap_node *node2)
 	if (lsn1 != lsn2)
 		return lsn1 > lsn2;
 
-	/**
-	 * LSNs are equal. This may happen only during forced recovery.
-	 * Prioritize terminal (non-UPSERT) statements
+	/*
+	 * LSNs are equal. This may only happen if one of the statements
+	 * is a deferred DELETE and the overwritten tuple which it is
+	 * supposed to purge has the same key parts as the REPLACE that
+	 * overwrote it. Discard the deferred DELETE as the overwritten
+	 * tuple will be (or has already been) purged by the REPLACE.
 	 */
-	return (vy_stmt_type(src1->tuple) == IPROTO_UPSERT ? 1 : 0) <
-	       (vy_stmt_type(src2->tuple) == IPROTO_UPSERT ? 1 : 0);
+	return (vy_stmt_type(src1->tuple) == IPROTO_DELETE ? 1 : 0) <
+	       (vy_stmt_type(src2->tuple) == IPROTO_DELETE ? 1 : 0);
 
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 4/7] vinyl: allow to skip certain statements on read
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-08-21 11:15 ` [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:39   ` Konstantin Osipov
  2018-08-21 11:15 ` [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space Vladimir Davydov
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the scope of #2129 we will defer insertion of certain DELETE
statements into secondary indexes until primary index compaction.
However, by the time we invoke compaction, new statements might
have been inserted into the space for the same set of keys.
If that happens, insertion of a deferred DELETE will break the
invariant which the read iterator relies upon: that for any key
older sources store older statements. To avoid that, let's add
a new per statement flag, VY_STMT_SKIP_READ, and make the read
iterator ignore statements marked with it.

Needed for #2129
---
 src/box/vy_mem.c  | 19 ++++++++++++-------
 src/box/vy_run.c  |  7 ++++++-
 src/box/vy_stmt.h | 10 ++++++++++
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/box/vy_mem.c b/src/box/vy_mem.c
index 3313ae54..0c46b93c 100644
--- a/src/box/vy_mem.c
+++ b/src/box/vy_mem.c
@@ -324,7 +324,8 @@ vy_mem_iterator_find_lsn(struct vy_mem_iterator *itr,
 	assert(!vy_mem_tree_iterator_is_invalid(&itr->curr_pos));
 	assert(itr->curr_stmt == vy_mem_iterator_curr_stmt(itr));
 	const struct key_def *cmp_def = itr->mem->cmp_def;
-	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn) {
+	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn ||
+	       vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ) {
 		if (vy_mem_iterator_step(itr, iterator_type) != 0 ||
 		    (iterator_type == ITER_EQ &&
 		     vy_stmt_compare(key, itr->curr_stmt, cmp_def))) {
@@ -341,6 +342,7 @@ vy_mem_iterator_find_lsn(struct vy_mem_iterator *itr,
 				*vy_mem_tree_iterator_get_elem(&itr->mem->tree,
 							       &prev_pos);
 			if (vy_stmt_lsn(prev_stmt) > (**itr->read_view).vlsn ||
+			    vy_stmt_flags(prev_stmt) & VY_STMT_SKIP_READ ||
 			    vy_tuple_compare(itr->curr_stmt, prev_stmt,
 					     cmp_def) != 0)
 				break;
@@ -496,18 +498,21 @@ vy_mem_iterator_next_lsn(struct vy_mem_iterator *itr)
 	const struct key_def *cmp_def = itr->mem->cmp_def;
 
 	struct vy_mem_tree_iterator next_pos = itr->curr_pos;
+next:
 	vy_mem_tree_iterator_next(&itr->mem->tree, &next_pos);
 	if (vy_mem_tree_iterator_is_invalid(&next_pos))
 		return 1; /* EOF */
 
 	const struct tuple *next_stmt;
 	next_stmt = *vy_mem_tree_iterator_get_elem(&itr->mem->tree, &next_pos);
-	if (vy_tuple_compare(itr->curr_stmt, next_stmt, cmp_def) == 0) {
-		itr->curr_pos = next_pos;
-		itr->curr_stmt = next_stmt;
-		return 0;
-	}
-	return 1;
+	if (vy_tuple_compare(itr->curr_stmt, next_stmt, cmp_def) != 0)
+		return 1;
+
+	itr->curr_pos = next_pos;
+	itr->curr_stmt = next_stmt;
+	if (vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ)
+		goto next;
+	return 0;
 }
 
 /**
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index eae3e74d..f107e3a9 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -1157,7 +1157,8 @@ vy_run_iterator_find_lsn(struct vy_run_iterator *itr,
 	assert(itr->curr_stmt != NULL);
 	assert(itr->curr_pos.page_no < slice->run->info.page_count);
 
-	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn) {
+	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn ||
+	       vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ) {
 		if (vy_run_iterator_next_pos(itr, iterator_type,
 					     &itr->curr_pos) != 0) {
 			vy_run_iterator_stop(itr);
@@ -1183,6 +1184,7 @@ vy_run_iterator_find_lsn(struct vy_run_iterator *itr,
 						 &test_stmt) != 0)
 				return -1;
 			if (vy_stmt_lsn(test_stmt) > (**itr->read_view).vlsn ||
+			    vy_stmt_flags(test_stmt) & VY_STMT_SKIP_READ ||
 			    vy_tuple_compare(itr->curr_stmt, test_stmt,
 					     cmp_def) != 0) {
 				tuple_unref(test_stmt);
@@ -1478,6 +1480,7 @@ vy_run_iterator_next_lsn(struct vy_run_iterator *itr, struct tuple **ret)
 	assert(itr->curr_pos.page_no < itr->slice->run->info.page_count);
 
 	struct vy_run_iterator_pos next_pos;
+next:
 	if (vy_run_iterator_next_pos(itr, ITER_GE, &next_pos) != 0) {
 		vy_run_iterator_stop(itr);
 		return 0;
@@ -1495,6 +1498,8 @@ vy_run_iterator_next_lsn(struct vy_run_iterator *itr, struct tuple **ret)
 	tuple_unref(itr->curr_stmt);
 	itr->curr_stmt = next_key;
 	itr->curr_pos = next_pos;
+	if (vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ)
+		goto next;
 
 	vy_stmt_counter_acct_tuple(&itr->stat->get, itr->curr_stmt);
 	*ret = itr->curr_stmt;
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index 8051f1e2..273d5e84 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -87,6 +87,16 @@ enum {
 	 * DELETE statements for them during compaction.
 	 */
 	VY_STMT_DEFERRED_DELETE		= 1 << 0,
+	/**
+	 * Statements that have this flag set are ignored by the
+	 * read iterator.
+	 *
+	 * We set this flag for deferred DELETE statements, because
+	 * they may violate the invariant which the read relies upon:
+	 * the older a source, the older statements it stores for a
+	 * particular key.
+	 */
+	VY_STMT_SKIP_READ		= 1 << 1,
 };
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-08-21 11:15 ` [PATCH v2 4/7] vinyl: allow to skip certain statements on read Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:42   ` Konstantin Osipov
  2018-08-21 11:15 ` [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn Vladimir Davydov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The space is a blackhole. It will be used for writing deferred DELETE
statements generated by vinyl compaction tasks to WAL so that we can
recover deferred DELETEs that hadn't been dumped to disk before the
server was restarted.

Since the new space doesn't depend on other system spaces, let's assign
the minimal possible id to it, i.e. 257.

Needed for #2129
---
 src/box/bootstrap.snap             | Bin 1540 -> 1607 bytes
 src/box/lua/space.cc               |   2 ++
 src/box/lua/upgrade.lua            |  21 +++++++++++++++++++++
 src/box/schema.cc                  |  27 ++++++++++++++++++++++++++-
 src/box/schema_def.h               |   2 ++
 src/box/vinyl.c                    |  34 ++++++++++++++++++++++++++++++++++
 test/app-tap/tarantoolctl.test.lua |   2 +-
 test/box-py/bootstrap.result       |   7 +++++--
 test/box/access_misc.result        |   5 ++++-
 test/box/access_sysview.result     |   2 +-
 test/wal_off/alter.result          |   2 +-
 test/xlog/upgrade.result           |   7 +++++--
 12 files changed, 102 insertions(+), 9 deletions(-)

diff --git a/src/box/bootstrap.snap b/src/box/bootstrap.snap
index b610828c9c9ae9a22acdd8c150c16c6838b7a273..2c515e6e597e943f7538e24a582dc118c7569bc4 100644
GIT binary patch
delta 1602
zcmV-I2EF-&495(R7=JM>GC4ObXJcV8GcjX0FbYX-b97;DV`VxZG&y8qHa9R~Ej2S_
zFfBA<WHBu`H#BA~HZeIdVmM}EH8Ny03RXjGZ)0mZAbWiZ3e~y`y3Ga60L}%iSbe1c
z00000D77#B08k|^0NOw-4oT1$Hvs_Cg%?~QI9ps-+!jp2m4A`gL}K&+2d!A(UZS6h
z7JH|uax=vw5&7K{{qQ7QrX-P`Z)s5?q@<S9O$+>Gp{7RQs7xsZ%K*gy*Z>~CwAvy7
z?~Tt7N3x%qhq3SVDsxTNesp!&m+yOc<h5^ME&CZ#zkEkh+^d=o{ZM56@NlRS`u^d6
z_xrsc{O^V?oqx;gq}EDQgc}9eEHzylxAGootiIB}JFod<?<TPTHcQPQM{=Zi5v%`+
zUe+Y725~OiW~pfnR4rz{zoZe_EH#TN(U)%zZq4dusc9491~i>i-6_?jNFZ!(QlwLN
zV<Lend?LNghf0Ae6^1JNJ10Yr_TpM9rBhmd1XvM*C4b8mDmhXBSo#ys$$Qq00TEA=
zV0mYire(9#B>QtBo+iP2)|mxC<)u|l!N%j6V?jOt>3nBZCV`RiCaEz9HcL%1#O3+w
z_`6Fgg#(+Vrp6zNqAK)U&+@Q$_KiC~d$7)}*#6abooVBDE;FK^1oHd}RTu+2pG)rI
zwDQvGOMiimTU!KYD;$Dn*QDs(FZ>J+>AccED{^ObL}0Vj9C0hg`zXq=Cx)}p)%R;2
z@kA1&#oTciir!69T?YJ**h=T!`RoTM2<_>uZ=JaHtTQQUbqB>xjt)-6Ie}v$RU$<q
zHR3h1C=nEu+AKA<*ANt{bW^e^ojOw&q60xwqJQatXg<u;%#b!qO`sVMT!$COL5Aif
zGZV8RMng;n7RM|K7+@~ISb*VPWL_$~#8^9`P+ve$>Ih~up;if-rKZaZeE~&z5=hqM
z%;$f18EvUVL7dF!`;PbPUtOM+3=LsYcOq<-n&)@yX>#`~RCOS1mYU@2)+)|}GVd|v
zvVU1>{_4YA<~<7C?!6C{IP@=ejrWRH<~`oCq8z~XbF$Q6v(#jD62o?;55!3YqEw|+
z$%X`)ZcwDsr!tYcFT{z2yBizTmoj0fQePYnj|w}g4xBBA)!9kmDBYm3!PSVNQD0`b
zRHD<$s9Mb3(<E546~>`r6crUssW3K6&3_SnzQ4?xtm$l)ngA@VFb?%%|48z)R&e3v
zaQr<~w-z6YTju3ZMUJy-VO+;T=hlbMT32kAnk7~sYjp822L=nTgN+`~<C@XC5eCnf
ztnx_-vp8wQEKkU1KPP3KKe;+9g>jvV%~CTbM~qk%L7i*%bF!wUM^rVb6~+aDD1WHY
zXk(~EMi2l10Z0G=21hB#iXIYx;5dqdFbZH821E!jh)x&*fWTr1iU4ZCL~}MKy2Y|v
z<S`LUiXpHBgJLM?b0-QD$QBJ*36Ze4iMI6}<vGSICLpHBU>J^bl(UdTx}3KdD7L;W
ztel6Vy@Cy?hNRhkp_231+>TD=HGg;x?NWpdPHRSKpb}gn|C(ibD3utwg&La(yhJCw
z{svd9xuA&<Jt-V@{esaOCqq17j>bGH@-T$1+n@u_(ma@h(VdM@2s~OyeFGhWCsL5x
z*eTDDukK8TbMliI3kBu&wYwd~jh6D%9&Ws$`k=hOnH^_S1^k8aGzNaIynoviddCN}
zDh(7ktzP#YX>f80=J1RRTle83@yc!;k4`2i0PwZVqmpO|*wUufJ8_s1`J-SBL*NUF
z{*fY4*U~sf6XXXs2mo>}&=_hR?Mh+zXsVWS1PvTq#7U~e2h9pWND{IchE%_`P~<No
zl{DB9!s?a?gOQ`upq3-H4u7_b+K+y-*uvE|0fp&a4gW<ECyqg$FEGx8Zo_~xK0Rl=
z%z(@R(sxv#Fyv@0+)qRz93-be0b~-~7eL+yE*Ot6!!Jm#JxzJaU5m=0nRZs*9MOv-
ztz_VrywM`a4W29~iYY~Ovrn4&YDz}J1HPMLSzuy9m%_3+F`7fPqdV9=l-qF!Q~S$`
z!J8=MqZ2hy2KGVtruZ^5^BOf+XNye<Sk}KC><l71vaH<^6n#a*CkYU*1Jw|%?e`|%
As{jB1

delta 1534
zcmV<a1p)fU41^4j7=JJ=G%zh^V>dT4F*z~{Np5p=VQyn(Iv_b=V_`LAF*YqTV`gS8
zG-hTvEnzrjFfCy?W;Qf4VP!dGGGq!?Lu_wjYdRo%F*+bOeF_TIx(m9^1&9F7A|rpb
zr2qf`001bpFZ}>e{VM<lJ=RLl7I6XqUp$Ch%_7lYJ_Lj_@PFoXP~+we(A76EO7~<&
z)F3;RA|;uhw(`4WgTRw>I&<{eU?~Ai$VOI~v629dx2JaIm@kx4N&&zCxB$xlBK}6v
zM#}nreKVfmYggvF>;bcooE<&pjzw|&ydgCd;B)Xi4C|Up=iRyONm;Mu=<J|49rmyj
z!~XSg?^aVP(0>n31mId~$~K1OP2OJJrGIzc!y|v*5Q}mxHFpG|K|%#jeQ^A;E^l>*
z3*Bp}NefgxX1+gm^KmUTgDMc1Jsb?P)oZC4!mL!?OgGcvU!7uNP_0u4V%4c@$$Tde
z<JJuBKo=+bJ1@oWHsew%rA$(CR4!bhCq$aiEJ3j}Cx2ttc)M<~yh-n@9Ru2aUZxgY
zOU+j=F>OCDy|bcHe0;hLW1g*=6kJQqvOo9n=`!@rT1v3PA=sEaGtUmrgF1V*+HwF$
z%o|ck2Ck)MDdN~U>lpNk)Rh9)Qd8p>$FU`TvVVQ}JNw3+pEy`&7|(v!cb#wKcU}Xc
zztr*E^M9!-1a3aZxQyRo-mQ)X;JxvQVS5a1E-Z#;*X8(mp!hlb(b=Wn_TxWHa?e&p
z0<NVdiD5jvlVdHR&M*6U`SOA@wvN<F<4PlXstt_|t42g+g<yeLeOP%|b(q;~x1wEl
zEj90~M>ms_NyVg8C=MtXRLX=hZO}2jlujiL*MCw|r<8%iFyl-urCG}bk_A!)lFN(Z
z*9rxd2`ZdQ%ZOz}8m(9>QmHeYBvm9q6rtA9wbTrW=uAh+mpZa8Z9M*WuhEnWw8OPP
z0Ken?`duE+M1Fc0sWuO;rRJ>TPnWxUKDEZdwbU$UUQcC(VtJFR2G>&aT`%TX-lUJ$
zzkkQk8h`!+Ak42Gb5O~I9X_GL4xhR(RGq@`<8YEmeGLBa%Ed{YaV#4wH7X6(wXrob
zRx(mCOfa|?Bo}qm;#z9jPz5?Ii>k-mKV61Bw$iw6MNm)>lUj2vHAVFK{#e#!kBV!l
ziDGG`ab1f2WXaFUg5t{}`TM8d6EBXt_J3uuRg$)9X&lOg&M*+4wU)S+njl=^i<EJZ
zX2ppwLyQ}LBXiMF<cgb*y<$jG_{C2oetkYX_VcpV`IoM<QW}RMaV<4zXt3a^V}T&3
z(OhIq3CsWhfB+}}AqGb&W<?JYfWSD4qc93!7zRWLI0_9I0RX{)ke~>l77Rj$ynhqj
zxeG%W18Ui|uoin^FOJH&6BE><l!R4e54%@5i^3tJ2=RjuhoRy}$qP!PWxP;CvBe2+
zB`AQ$m7GU4)NS@llAFhZ>1-;h#Us!zr2@yPWOP}$xk=>TS*Ej6Q8ZgBx&}sXh}Wym
z;Yvs=Xm6s&oTIkS5VXc28)t^2+kc4qR4r1#;*gdpU~^0s$pG8pm?G)gVGer7p5*j@
zrQSOro_cGm&fR}<@b8h^=iXW!%4reE=RqJFW<g443Z&4$8`^`Azzc12c=p(Su(w)T
z1)jmCmtK6UOn@l~=HxoFllhK1;>sqt`1BcS0HC@As3ba#z^Ko9Pi8avYk$Oa!!S#&
zSx3_!A8xyrpfT#3Y0zk_Hi}Y>p%$pI61F+5M3N&%2xKJAC;$CDD}<6?=$#5uoTZ5Z
zCNj}EgB|fKv=$L@ag>MEJ_76DnN*W1Z`O`MZ9ptwx|K1yxMV1V1(Ltx%(4y7obk4E
zW?=^8GIQ=kv4A0`wS7N1*>`Z!opukPx!_g@c{8|RJi<`!QS!z#<&+X#nnV8V6uo^%
zFL^|hfwSZ-+|LAZ&d5i?Zt>-$tMH_1UnT!Yb{@SussOv@WfJ50JyfLCQd9Bw_N-dh
k2=V>Mm|$S01}gnuSl5<E4okR>OyKJnJxL~`9Mur5?f$>nwg3PC

diff --git a/src/box/lua/space.cc b/src/box/lua/space.cc
index 580e0ea2..25b7e36d 100644
--- a/src/box/lua/space.cc
+++ b/src/box/lua/space.cc
@@ -518,6 +518,8 @@ box_lua_space_init(struct lua_State *L)
 	lua_newtable(L);
 	lua_setfield(L, -2, "schema");
 	lua_getfield(L, -1, "schema");
+	lua_pushnumber(L, BOX_VINYL_DEFERRED_DELETE_ID);
+	lua_setfield(L, -2, "VINYL_DEFERRED_DELETE_ID");
 	lua_pushnumber(L, BOX_SCHEMA_ID);
 	lua_setfield(L, -2, "SCHEMA_ID");
 	lua_pushnumber(L, BOX_SPACE_ID);
diff --git a/src/box/lua/upgrade.lua b/src/box/lua/upgrade.lua
index 0293f6ef..754685f8 100644
--- a/src/box/lua/upgrade.lua
+++ b/src/box/lua/upgrade.lua
@@ -964,6 +964,26 @@ local function upgrade_to_1_10_0()
     create_vsequence_space()
 end
 
+--------------------------------------------------------------------------------
+--- Tarantool 1.10.2
+--------------------------------------------------------------------------------
+local function create_vinyl_deferred_delete_space()
+    local _space = box.space[box.schema.SPACE_ID]
+    local _vinyl_deferred_delete = box.space[box.schema.VINYL_DEFERRED_DELETE_ID]
+
+    local format = {}
+    format[1] = {name = 'space_id', type = 'unsigned'}
+    format[2] = {name = 'lsn', type = 'unsigned'}
+    format[3] = {name = 'tuple', type = 'array'}
+
+    log.info("create space _vinyl_deferred_delete")
+    _space:insert{_vinyl_deferred_delete.id, ADMIN, '_vinyl_deferred_delete',
+                  'blackhole', 0, {group_id = 1}, format}
+end
+
+local function upgrade_to_1_10_2()
+    create_vinyl_deferred_delete_space()
+end
 
 local function get_version()
     local version = box.space._schema:get{'version'}
@@ -991,6 +1011,7 @@ local function upgrade(options)
         {version = mkversion(1, 7, 6), func = upgrade_to_1_7_6, auto = false},
         {version = mkversion(1, 7, 7), func = upgrade_to_1_7_7, auto = true},
         {version = mkversion(1, 10, 0), func = upgrade_to_1_10_0, auto = true},
+        {version = mkversion(1, 10, 2), func = upgrade_to_1_10_2, auto = true},
     }
 
     for _, handler in ipairs(handlers) do
diff --git a/src/box/schema.cc b/src/box/schema.cc
index 433f52c0..32669c69 100644
--- a/src/box/schema.cc
+++ b/src/box/schema.cc
@@ -37,7 +37,8 @@
 #include "scoped_guard.h"
 #include "version.h"
 #include "user.h"
-#include <stdio.h>
+#include "vclock.h"
+
 /**
  * @module Data Dictionary
  *
@@ -351,6 +352,30 @@ schema_init()
 			 FIELD_TYPE_UNSIGNED, false, NULL, COLL_NONE);
 	sc_space_new(BOX_INDEX_ID, "_index", key_def,
 		     &alter_space_on_replace_index, &on_stmt_begin_index);
+
+	/*
+	 * _vinyl_deferred_delete - blackhole that is needed
+	 * for writing deferred DELETE statements generated by
+	 * vinyl compaction tasks to WAL.
+	 */
+	{
+		const char *engine = "blackhole";
+		const char *name = "_vinyl_deferred_delete";
+		struct space_opts opts = space_opts_default;
+		opts.group_id = GROUP_LOCAL;
+		struct space_def *def;
+		def = space_def_new_xc(BOX_VINYL_DEFERRED_DELETE_ID, ADMIN, 0,
+				       name, strlen(name), engine,
+				       strlen(engine), &opts, NULL, 0);
+		auto def_guard = make_scoped_guard([=] {
+			space_def_delete(def);
+		});
+		RLIST_HEAD(key_list);
+		struct space *space = space_new_xc(def, &key_list);
+		space_cache_replace(space);
+		init_system_space(space);
+		trigger_run_xc(&on_alter_space, space);
+	}
 }
 
 void
diff --git a/src/box/schema_def.h b/src/box/schema_def.h
index 2edb8d37..9beed389 100644
--- a/src/box/schema_def.h
+++ b/src/box/schema_def.h
@@ -66,6 +66,8 @@ static_assert(BOX_INVALID_NAME_MAX <= BOX_NAME_MAX,
 enum {
 	/** Start of the reserved range of system spaces. */
 	BOX_SYSTEM_ID_MIN = 256,
+	/** Space if of _vinyl_deferred_delete. */
+	BOX_VINYL_DEFERRED_DELETE_ID = 257,
 	/** Space id of _schema. */
 	BOX_SCHEMA_ID = 272,
 	/** Space id of _collation. */
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index fd14d1e7..18aa1ba5 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -65,6 +65,7 @@
 #include "engine.h"
 #include "space.h"
 #include "index.h"
+#include "schema.h"
 #include "xstream.h"
 #include "info.h"
 #include "column_mask.h"
@@ -256,6 +257,8 @@ static const struct engine_vtab vinyl_engine_vtab;
 static const struct space_vtab vinyl_space_vtab;
 static const struct index_vtab vinyl_index_vtab;
 
+static struct trigger on_replace_vinyl_deferred_delete;
+
 /**
  * A quick intro into Vinyl cosmology and file format
  * --------------------------------------------------
@@ -2771,6 +2774,20 @@ vinyl_engine_abort_checkpoint(struct engine *engine)
 
 /** {{{ Recovery */
 
+/**
+ * Install trigger on the _vinyl_deferred_delete system space.
+ * Called on bootstrap and recovery. Note, this function can't
+ * be called from engine constructor, because the latter is
+ * invoked before the schema is initialized.
+ */
+static void
+vy_set_deferred_delete_trigger(void)
+{
+	struct space *space = space_by_id(BOX_VINYL_DEFERRED_DELETE_ID);
+	assert(space != NULL);
+	trigger_add(&space->on_replace, &on_replace_vinyl_deferred_delete);
+}
+
 static int
 vinyl_engine_bootstrap(struct engine *engine)
 {
@@ -2780,6 +2797,7 @@ vinyl_engine_bootstrap(struct engine *engine)
 		return -1;
 	vy_quota_set_limit(&e->quota, e->memory);
 	e->status = VINYL_ONLINE;
+	vy_set_deferred_delete_trigger();
 	return 0;
 }
 
@@ -2802,6 +2820,7 @@ vinyl_engine_begin_initial_recovery(struct engine *engine,
 		vy_quota_set_limit(&e->quota, e->memory);
 		e->status = VINYL_INITIAL_RECOVERY_REMOTE;
 	}
+	vy_set_deferred_delete_trigger();
 	return 0;
 }
 
@@ -4265,6 +4284,21 @@ vinyl_space_build_index(struct space *src_space, struct index *new_index,
 
 /* }}} Index build */
 
+/* {{{ Deferred DELETE handling */
+
+static void
+vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
+{
+	(void)trigger;
+	(void)event;
+}
+
+static struct trigger on_replace_vinyl_deferred_delete = {
+	RLIST_LINK_INITIALIZER, vy_deferred_delete_on_replace, NULL, NULL
+};
+
+/* }}} Deferred DELETE handling */
+
 static const struct engine_vtab vinyl_engine_vtab = {
 	/* .shutdown = */ vinyl_engine_shutdown,
 	/* .create_space = */ vinyl_engine_create_space,
diff --git a/test/app-tap/tarantoolctl.test.lua b/test/app-tap/tarantoolctl.test.lua
index ac9a208c..5a4a8450 100755
--- a/test/app-tap/tarantoolctl.test.lua
+++ b/test/app-tap/tarantoolctl.test.lua
@@ -346,7 +346,7 @@ do
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 1", "\n", 3)
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 1 --replica 2", "\n", 3)
             check_ctlcat_xlog(test_i, dir, "--from=3 --to=6 --format=json --show-system --replica 2", "\n", 0)
-            check_ctlcat_snap(test_i, dir, "--space=280", "---\n", 18)
+            check_ctlcat_snap(test_i, dir, "--space=280", "---\n", 19)
             check_ctlcat_snap(test_i, dir, "--space=288", "---\n", 43)
         end)
     end)
diff --git a/test/box-py/bootstrap.result b/test/box-py/bootstrap.result
index 16c2027c..336cd449 100644
--- a/test/box-py/bootstrap.result
+++ b/test/box-py/bootstrap.result
@@ -5,7 +5,7 @@ box.space._schema:select{}
 ---
 - - ['cluster', '<cluster uuid>']
   - ['max_id', 511]
-  - ['version', 1, 10, 0]
+  - ['version', 1, 10, 2]
 ...
 box.space._cluster:select{}
 ---
@@ -13,7 +13,10 @@ box.space._cluster:select{}
 ...
 box.space._space:select{}
 ---
-- - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
+- - [257, 1, '_vinyl_deferred_delete', 'blackhole', 0, {'group_id': 1}, [{'name': 'space_id',
+        'type': 'unsigned'}, {'name': 'lsn', 'type': 'unsigned'}, {'name': 'tuple',
+        'type': 'array'}]]
+  - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
   - [276, 1, '_collation', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
         'name': 'name', 'type': 'string'}, {'name': 'owner', 'type': 'unsigned'},
       {'name': 'type', 'type': 'string'}, {'name': 'locale', 'type': 'string'}, {
diff --git a/test/box/access_misc.result b/test/box/access_misc.result
index 2d87fa2d..4054313b 100644
--- a/test/box/access_misc.result
+++ b/test/box/access_misc.result
@@ -752,7 +752,10 @@ box.space._user:select()
 ...
 box.space._space:select()
 ---
-- - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
+- - [257, 1, '_vinyl_deferred_delete', 'blackhole', 0, {'group_id': 1}, [{'name': 'space_id',
+        'type': 'unsigned'}, {'name': 'lsn', 'type': 'unsigned'}, {'name': 'tuple',
+        'type': 'array'}]]
+  - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
   - [276, 1, '_collation', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
         'name': 'name', 'type': 'string'}, {'name': 'owner', 'type': 'unsigned'},
       {'name': 'type', 'type': 'string'}, {'name': 'locale', 'type': 'string'}, {
diff --git a/test/box/access_sysview.result b/test/box/access_sysview.result
index 20efd2bb..bc5e069f 100644
--- a/test/box/access_sysview.result
+++ b/test/box/access_sysview.result
@@ -230,7 +230,7 @@ box.session.su('guest')
 ...
 #box.space._vspace:select{}
 ---
-- 19
+- 20
 ...
 #box.space._vindex:select{}
 ---
diff --git a/test/wal_off/alter.result b/test/wal_off/alter.result
index afac1e55..f4703395 100644
--- a/test/wal_off/alter.result
+++ b/test/wal_off/alter.result
@@ -28,7 +28,7 @@ end;
 ...
 #spaces;
 ---
-- 65515
+- 65513
 ...
 -- cleanup
 for k, v in pairs(spaces) do
diff --git a/test/xlog/upgrade.result b/test/xlog/upgrade.result
index f02996bb..a411e0a7 100644
--- a/test/xlog/upgrade.result
+++ b/test/xlog/upgrade.result
@@ -36,11 +36,14 @@ box.space._schema:select()
 ---
 - - ['cluster', '<server_uuid>']
   - ['max_id', 513]
-  - ['version', 1, 10, 0]
+  - ['version', 1, 10, 2]
 ...
 box.space._space:select()
 ---
-- - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
+- - [257, 1, '_vinyl_deferred_delete', 'blackhole', 0, {'group_id': 1}, [{'name': 'space_id',
+        'type': 'unsigned'}, {'name': 'lsn', 'type': 'unsigned'}, {'name': 'tuple',
+        'type': 'array'}]]
+  - [272, 1, '_schema', 'memtx', 0, {}, [{'type': 'string', 'name': 'key'}]]
   - [276, 1, '_collation', 'memtx', 0, {}, [{'name': 'id', 'type': 'unsigned'}, {
         'name': 'name', 'type': 'string'}, {'name': 'owner', 'type': 'unsigned'},
       {'name': 'type', 'type': 'string'}, {'name': 'locale', 'type': 'string'}, {
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-08-21 11:15 ` [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 15:44   ` Konstantin Osipov
  2018-08-22 13:00   ` Vladimir Davydov
  2018-08-21 11:15 ` [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
  2018-08-22 17:50 ` [PATCH v2 0/7] " Vladimir Davydov
  7 siblings, 2 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We never use vy_mem::min_lsn so let's zap it. As for max_lsn, we only
need it to update vy_lsm::dump_lsn (max LSN stored on disk). Let's
rename it appropriately. There's another reason to do that. Once we
start storing deferred DELETE statements in memory (see #2129), it won't
be the max statement LSN stored in vy_mem anymore, because we will
account WAL LSN of deferred DELETE statements there too. Renaming it to
dump_lsn will help avoid confusion.

Needed for #2129
---
 src/box/vy_mem.c        |  8 +++-----
 src/box/vy_mem.h        | 10 +++++++---
 src/box/vy_scheduler.c  |  2 +-
 test/unit/vy_mem.c      | 11 ++++-------
 test/unit/vy_mem.result | 22 ++++++++++------------
 5 files changed, 25 insertions(+), 28 deletions(-)

diff --git a/src/box/vy_mem.c b/src/box/vy_mem.c
index 0c46b93c..f9be8505 100644
--- a/src/box/vy_mem.c
+++ b/src/box/vy_mem.c
@@ -108,8 +108,7 @@ vy_mem_new(struct vy_mem_env *env, int64_t generation,
 		return NULL;
 	}
 	index->env = env;
-	index->min_lsn = INT64_MAX;
-	index->max_lsn = -1;
+	index->dump_lsn = -1;
 	index->cmp_def = cmp_def;
 	index->generation = generation;
 	index->space_cache_version = space_cache_version;
@@ -249,11 +248,10 @@ vy_mem_commit_stmt(struct vy_mem *mem, const struct tuple *stmt)
 	/*
 	 * Normally statement LSN grows monotonically,
 	 * but not in case of building an index on an
-	 * existing non-empty space. Hence use of MIN/MAX
+	 * existing non-empty space. Hence use of MAX
 	 * here.
          */
-	mem->min_lsn = MIN(mem->min_lsn, lsn);
-	mem->max_lsn = MAX(mem->max_lsn, lsn);
+	mem->dump_lsn = MAX(mem->dump_lsn, lsn);
 	/*
 	 * If we don't bump mem version after assigning LSN to
 	 * a mem statement, a read iterator which uses
diff --git a/src/box/vy_mem.h b/src/box/vy_mem.h
index 52caa316..957f549f 100644
--- a/src/box/vy_mem.h
+++ b/src/box/vy_mem.h
@@ -166,9 +166,13 @@ struct vy_mem {
 	size_t tree_extent_size;
 	/** Number of statements. */
 	struct vy_stmt_counter count;
-	/** The min and max values of stmt->lsn in this tree. */
-	int64_t min_lsn;
-	int64_t max_lsn;
+	/**
+	 * Max LSN covered by this in-memory tree.
+	 *
+	 * Once the tree is dumped to disk it will be used to update
+	 * vy_lsm::dump_lsn, see vy_task_dump_new().
+	 */
+	int64_t dump_lsn;
 	/**
 	 * Key definition for this index, extended with primary
 	 * key parts.
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 4e8b476b..7d8961c4 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -982,7 +982,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 			vy_lsm_delete_mem(lsm, mem);
 			continue;
 		}
-		dump_lsn = MAX(dump_lsn, mem->max_lsn);
+		dump_lsn = MAX(dump_lsn, mem->dump_lsn);
 	}
 
 	if (dump_lsn < 0) {
diff --git a/test/unit/vy_mem.c b/test/unit/vy_mem.c
index 6666f94c..967aabe8 100644
--- a/test/unit/vy_mem.c
+++ b/test/unit/vy_mem.c
@@ -18,21 +18,18 @@ test_basic(void)
 	assert(key_def != NULL);
 	struct vy_mem *mem = create_test_mem(key_def);
 
-	is(mem->min_lsn, INT64_MAX, "mem->min_lsn on empty mem");
-	is(mem->max_lsn, -1, "mem->max_lsn on empty mem");
+	is(mem->dump_lsn, -1, "mem->dump_lsn on empty mem");
 	const struct vy_stmt_template stmts[] = {
 		STMT_TEMPLATE(100, REPLACE, 1), STMT_TEMPLATE(101, REPLACE, 1),
 		STMT_TEMPLATE(102, REPLACE, 1), STMT_TEMPLATE(103, REPLACE, 1),
 		STMT_TEMPLATE(104, REPLACE, 1)
 	};
 
-	/* Check min/max lsn */
+	/* Check dump lsn */
 	const struct tuple *stmt = vy_mem_insert_template(mem, &stmts[0]);
-	is(mem->min_lsn, INT64_MAX, "mem->min_lsn after prepare");
-	is(mem->max_lsn, -1, "mem->max_lsn after prepare");
+	is(mem->dump_lsn, -1, "mem->dump_lsn after prepare");
 	vy_mem_commit_stmt(mem, stmt);
-	is(mem->min_lsn, 100, "mem->min_lsn after commit");
-	is(mem->max_lsn, 100, "mem->max_lsn after commit");
+	is(mem->dump_lsn, 100, "mem->dump_lsn after commit");
 
 	/* Check vy_mem_older_lsn */
 	const struct tuple *older = stmt;
diff --git a/test/unit/vy_mem.result b/test/unit/vy_mem.result
index 6212173d..5c9c7af5 100644
--- a/test/unit/vy_mem.result
+++ b/test/unit/vy_mem.result
@@ -1,17 +1,15 @@
+# Looks like you planned 12 tests but ran 9.
 	*** test_basic ***
 1..12
-ok 1 - mem->min_lsn on empty mem
-ok 2 - mem->max_lsn on empty mem
-ok 3 - mem->min_lsn after prepare
-ok 4 - mem->max_lsn after prepare
-ok 5 - mem->min_lsn after commit
-ok 6 - mem->max_lsn after commit
-ok 7 - vy_mem_older_lsn 1
-ok 8 - vy_mem_older_lsn 2
-ok 9 - vy_mem_rollback 1
-ok 10 - vy_mem_rollback 2
-ok 11 - vy_mem->version
-ok 12 - vy_mem->version
+ok 1 - mem->dump_lsn on empty mem
+ok 2 - mem->dump_lsn after prepare
+ok 3 - mem->dump_lsn after commit
+ok 4 - vy_mem_older_lsn 1
+ok 5 - vy_mem_older_lsn 2
+ok 6 - vy_mem_rollback 1
+ok 7 - vy_mem_rollback 2
+ok 8 - vy_mem->version
+ok 9 - vy_mem->version
 	*** test_basic: done ***
 	*** test_iterator_restore_after_insertion ***
 1..1
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
                   ` (5 preceding siblings ...)
  2018-08-21 11:15 ` [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn Vladimir Davydov
@ 2018-08-21 11:15 ` Vladimir Davydov
  2018-08-21 16:13   ` Konstantin Osipov
  2018-08-22 17:50 ` [PATCH v2 0/7] " Vladimir Davydov
  7 siblings, 1 reply; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 11:15 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

When executing a REPLACE or DELETE request for a vinyl space, we need to
delete the old tuple from secondary indexes if any, e.g. if there's a
space with the primary index over field 1 and a secondary index over
field 2 and there's REPLACE{1, 10} in the space, then REPLACE{1, 20} has
to generate DELETE{10, 1} in order to overwrite REPLACE{10, 1} before
inserting REPLACE{20, 1} into the secondary index. Currently, we
generate DELETEs for secondary indexes immediately on request execution,
which makes REPLACE/DELETE operations disk-bound in case the space has
secondary indexes, because in order to delete the old tuple we have to
look it up in the primary index first.

Actually, we can postpone DELETE generation and still yield correct
results. All we have to do is compare each tuple read from a secondary
index with the full tuple corresponding to it in the primary index: if
they match, then the tuple is OK to return to the user; if the don't,
then the tuple was overwritten in the primary index and we have to skip
it. This doesn't introduce any overhead, because we have to look up full
tuples in the primary index while reading a secondary index anyways.
For instance, consider the example given in the previous paragraph: if
we don't insert DELETE{10, 1} into the secondary index, then we will
encounter REPLACE{10, 1} when reading it, but the tuple corresponding to
it in the primary index is REPLACE{1, 20} != REPLACE{10, 1} so we skip
it. This is the first thing that this patch does.

However, skipping garbage tuples isn't enough. We have to purge them
sooner or later, otherwise we risk iterating over thousands of stale
tuples before encountering a fresh one, which would adversely affect
latency of SELECT requests over a secondary index. So we mark each and
every REPLACE and DELETE statement that was inserted into the primary
index without generating DELETEs for secondary index with a special per
statement flag VY_STMT_DEFERRED_DELETE and generate DELETEs for these
statements when the time comes.

The time comes when the primary index finally gets compacted. When
writing a compacted run, we iterate over all tuples in the order set by
the primary key from newer to older tuples, so each statement marked
with VY_STMT_DEFERRED_DELETE will be followed by the tuple it overwrote,
provided there's enough runs compacted. We take these tuples and send
them to the tx thread over cbus (compaction is done in a worker thread,
remember), where deferred DELETEs are generated and inserted into
secondary indexes. Well, it isn't that simple actually, but you should
have got the basic idea by now.

The first problem here is by the time we generate a deferred DELETE,
newer statements for the same key could have been inserted into the
index and dumped to disk, while the read iterator assumes that the newer
the source the newer statements it stores for the same key. In order not
to break the read iterator assumptions by inserting deferred DELETEs, we
mark them with another special per-statement flag, VY_STMT_SKIP_READ,
which renders them invisible to the read iterator. The flag doesn't
affect the write iterator though so deferred DELETEs will purge garbage
statements when the secondary index eventually gets compacted.

The second problem concerns the recovery procedure. Since we write
deferred DELETEs to the in-memory level, we need to recover them after
restart somehow in case they didn't get dumped. To do that, we write
them to WAL (along with LSN and space id) with the aid of a special
system blackhole space, _vinyl_deferred_delete. The insertion of
deferred DELETEs into in-memory trees is actually done by on_replace
trigger installed on the space so deferred DELETEs are generated and
recovered by the same code. In order not to recover statements that have
been dumped, we account LSNs of WAL rows that generates deferred DELETEs
to vy_lsm::dump_lsn and filter dumped statements with vy_is_committed(),
just like normal statements.

Finally, we may run out of memory while generating deferred DELETEs.
This is manageable if happens during compaction - we simply throttle the
compaction task until the memory level is dumped. However, we can't do
that while generating deferred DELETEs during index dump. Solution:
don't generate deferred DELETEs during dump. The thing is we can
generate a deferred DELETE during dump only if the overwritten tuple is
stored in memory, but if it is, the lookup is nearly free and so we can
generate a DELETE when the transaction gets committed. So we introduce a
special version of point lookup, vy_point_lookup_mem(), which look ups a
tuple by the full key in cache and in memory. When a transaction is
committed, we use this function to generate DELETEs.

This should outline the pivotal points of the algorithm. More details,
as usual, in the code.

Closes #2129
---
 src/box/vinyl.c                     | 227 ++++++++++--
 src/box/vy_lsm.h                    |   5 +
 src/box/vy_mem.h                    |   6 +
 src/box/vy_point_lookup.c           |  32 ++
 src/box/vy_point_lookup.h           |  18 +
 src/box/vy_scheduler.c              | 306 +++++++++++++++-
 src/box/vy_tx.c                     | 133 +++++++
 test/unit/vy_point_lookup.c         |   2 +
 test/vinyl/deferred_delete.result   | 677 ++++++++++++++++++++++++++++++++++++
 test/vinyl/deferred_delete.test.lua | 261 ++++++++++++++
 test/vinyl/info.result              |  18 +-
 test/vinyl/info.test.lua            |   9 +-
 test/vinyl/layout.result            |  46 ++-
 test/vinyl/quota.result             |   2 +-
 test/vinyl/tx_gap_lock.result       |  16 +-
 test/vinyl/tx_gap_lock.test.lua     |  10 +-
 test/vinyl/write_iterator.result    |   5 +
 test/vinyl/write_iterator.test.lua  |   3 +
 18 files changed, 1703 insertions(+), 73 deletions(-)
 create mode 100644 test/vinyl/deferred_delete.result
 create mode 100644 test/vinyl/deferred_delete.test.lua

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 18aa1ba5..1a45fac9 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1273,25 +1273,43 @@ vy_get_by_secondary_tuple(struct vy_lsm *lsm, struct vy_tx *tx,
 			  struct tuple *tuple, struct tuple **result)
 {
 	assert(lsm->index_id > 0);
-	/*
-	 * No need in vy_tx_track() as the tuple must already be
-	 * tracked in the secondary index LSM tree.
-	 */
+
 	if (vy_point_lookup(lsm->pk, tx, rv, tuple, result) != 0)
 		return -1;
 
-	if (*result == NULL) {
+	if (*result == NULL ||
+	    vy_tuple_compare(*result, tuple, lsm->key_def) != 0) {
+		/*
+		 * If a tuple read from a secondary index doesn't
+		 * match the tuple corresponding to it in the
+		 * primary index, it must have been overwritten or
+		 * deleted, but the DELETE statement hasn't been
+		 * propagated to the secondary index yet. In this
+		 * case silently skip this tuple.
+		 */
+		if (*result != NULL) {
+			tuple_unref(*result);
+			*result = NULL;
+		}
 		/*
-		 * All indexes of a space must be consistent, i.e.
-		 * if a tuple is present in one index, it must be
-		 * present in all other indexes as well, so we can
-		 * get here only if there's a bug somewhere in vinyl.
-		 * Don't abort as core dump won't really help us in
-		 * this case. Just warn the user and proceed to the
-		 * next tuple.
+		 * Invalidate the cache entry so that we won't read
+		 * the overwritten tuple again from the cache.
 		 */
-		say_warn("%s: key %s missing in primary index",
-			 vy_lsm_name(lsm), vy_stmt_str(tuple));
+		vy_cache_on_write(&lsm->cache, tuple, NULL);
+		return 0;
+	}
+
+	/*
+	 * Even though the tuple is tracked in the secondary index
+	 * read set, we still must track the full tuple read from
+	 * the primary index, otherwise the transaction won't be
+	 * aborted if this tuple is overwritten or deleted, because
+	 * the DELETE statement is not written to secondary indexes
+	 * immediately.
+	 */
+	if (tx != NULL && vy_tx_track_point(tx, lsm->pk, *result) != 0) {
+		tuple_unref(*result);
+		return -1;
 	}
 
 	if ((*rv)->vlsn == INT64_MAX)
@@ -1604,7 +1622,6 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	struct vy_lsm *lsm = vy_lsm_find_unique(space, request->index_id);
 	if (lsm == NULL)
 		return -1;
-	bool has_secondary = space->index_count > 1;
 	const char *key = request->key;
 	uint32_t part_count = mp_decode_array(&key);
 	if (vy_unique_key_validate(lsm, key, part_count))
@@ -1614,12 +1631,9 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	 * before deletion.
 	 * - if the space has on_replace triggers and need to pass
 	 *   to them the old tuple.
-	 *
-	 * - if the space has one or more secondary indexes, then
-	 *   we need to extract secondary keys from the old tuple
-	 *   and pass them to indexes for deletion.
+	 * - if deletion is done by a secondary index.
 	 */
-	if (has_secondary || !rlist_empty(&space->on_replace)) {
+	if (lsm->index_id > 0 || !rlist_empty(&space->on_replace)) {
 		if (vy_get_by_raw_key(lsm, tx, vy_tx_read_view(tx),
 				      key, part_count, &stmt->old_tuple) != 0)
 			return -1;
@@ -1628,8 +1642,7 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	}
 	int rc = 0;
 	struct tuple *delete;
-	if (has_secondary) {
-		assert(stmt->old_tuple != NULL);
+	if (stmt->old_tuple != NULL) {
 		delete = vy_stmt_new_surrogate_delete(pk->mem_format,
 						      stmt->old_tuple);
 		if (delete == NULL)
@@ -1642,12 +1655,14 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 			if (rc != 0)
 				break;
 		}
-	} else { /* Primary is the single index in the space. */
+	} else {
 		assert(lsm->index_id == 0);
 		delete = vy_stmt_new_surrogate_delete_from_key(request->key,
 						pk->key_def, pk->mem_format);
 		if (delete == NULL)
 			return -1;
+		if (space->index_count > 1)
+			vy_stmt_set_flags(delete, VY_STMT_DEFERRED_DELETE);
 		rc = vy_tx_set(tx, pk, delete);
 	}
 	tuple_unref(delete);
@@ -2166,11 +2181,9 @@ vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	/*
 	 * Get the overwritten tuple from the primary index if
 	 * the space has on_replace triggers, in which case we
-	 * need to pass the old tuple to trigger callbacks, or
-	 * if the space has secondary indexes and so we need
-	 * the old tuple to delete it from them.
+	 * need to pass the old tuple to trigger callbacks.
 	 */
-	if (space->index_count > 1 || !rlist_empty(&space->on_replace)) {
+	if (!rlist_empty(&space->on_replace)) {
 		if (vy_get(pk, tx, vy_tx_read_view(tx),
 			   stmt->new_tuple, &stmt->old_tuple) != 0)
 			return -1;
@@ -2181,6 +2194,8 @@ vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 			 */
 			vy_stmt_set_type(stmt->new_tuple, IPROTO_INSERT);
 		}
+	} else if (space->index_count > 1) {
+		vy_stmt_set_flags(stmt->new_tuple, VY_STMT_DEFERRED_DELETE);
 	}
 	/*
 	 * Replace in the primary index without explicit deletion
@@ -4286,11 +4301,167 @@ vinyl_space_build_index(struct space *src_space, struct index *new_index,
 
 /* {{{ Deferred DELETE handling */
 
+/**
+ * Callback invoked after a deferred DELETE statement has been
+ * committed to _vinyl_deferred_delete system space.
+ */
+static void
+vy_deferred_delete_on_commit(struct trigger *trigger, void *event)
+{
+	struct txn *txn = event;
+	struct vy_mem *mem = trigger->data;
+	/*
+	 * Update dump_lsn so that we can skip dumped deferred
+	 * DELETE statements on WAL recovery.
+	 */
+	assert(mem->dump_lsn <= txn->signature);
+	mem->dump_lsn = txn->signature;
+	vy_mem_unpin(mem);
+}
+
+/**
+ * Callback invoked when a deferred DELETE statement is written
+ * to _vinyl_deferred_delete system space. It extracts the
+ * deleted tuple, its LSN, and the target space id from the
+ * system space row, then generates a deferred DELETE statement
+ * and inserts it into secondary indexes of the target space.
+ *
+ * Note, this callback is also invoked during local WAL recovery
+ * to restore deferred DELETE statements that haven't been dumped
+ * to disk. To skip deferred DELETEs that have been dumped, we
+ * use the same technique we employ for normal WAL statements,
+ * i.e. we filter them by LSN, see vy_is_committed_one(). To do
+ * that, we need to account the LSN of a WAL row that generated
+ * a deferred DELETE statement to vy_lsm::dump_lsn, so we install
+ * an on_commit trigger that propagates the LSN of the WAL row to
+ * vy_mem::dump_lsn, which in will contribute to vy_lsm::dump_lsn
+ * when the in-memory tree is dumped, see vy_task_dump_new().
+ *
+ * This implies that we don't yield between statements of the
+ * same transaction, because if we did, two deferred DELETEs with
+ * the same WAL LSN could land in different in-memory trees: if
+ * one of the trees got dumped while the other didn't, we would
+ * mistakenly skip both statements on recovery.
+ */
 static void
 vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 {
 	(void)trigger;
-	(void)event;
+
+	struct txn *txn = event;
+	struct txn_stmt *stmt = txn_current_stmt(txn);
+	bool is_first_statement = txn_is_first_statement(txn);
+
+	if (stmt->new_tuple == NULL)
+		return;
+	/*
+	 * Extract space id, LSN of the deferred DELETE statement,
+	 * and the deleted tuple from the system space row.
+	 */
+	uint32_t space_id;
+	if (tuple_field_u32(stmt->new_tuple, 0, &space_id) != 0)
+		diag_raise();
+	int64_t lsn;
+	if (tuple_field_i64(stmt->new_tuple, 1, &lsn) != 0)
+		diag_raise();
+	const char *delete_data = tuple_field(stmt->new_tuple, 2);
+	if (delete_data == NULL) {
+		diag_set(ClientError, ER_NO_SUCH_FIELD, 2);
+		diag_raise();
+	}
+	const char *delete_data_end = delete_data;
+	mp_next(&delete_data_end);
+
+	/* Look up the space. */
+	struct space *space = space_cache_find(space_id);
+	if (space == NULL)
+		diag_raise();
+	if (space->index_count <= 1)
+		return;
+	/*
+	 * Wait for memory quota if necessary before starting to
+	 * process the batch (we can't yield between statements).
+	 */
+	struct vy_env *env = vy_env(space->engine);
+	if (is_first_statement)
+		vy_quota_wait(&env->quota);
+
+	/* Create the deferred DELETE statement. */
+	struct vy_lsm *pk = vy_lsm(space->index[0]);
+	struct tuple *delete = vy_stmt_new_surrogate_delete_raw(pk->mem_format,
+						delete_data, delete_data_end);
+	if (delete == NULL)
+		diag_raise();
+	vy_stmt_set_lsn(delete, lsn);
+	/*
+	 * A deferred DELETE may be generated after new statements
+	 * were committed for the deleted key while the read iterator
+	 * assumes that newer sources always store newer statements.
+	 * Mark deferred DELETEs with the VY_STMT_SKIP_READ flag so
+	 * as not to break the read iterator assumptions.
+	 */
+	vy_stmt_set_flags(delete, VY_STMT_SKIP_READ);
+
+	/* Insert the deferred DELETE into secondary indexes. */
+	int rc = 0;
+	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
+	const struct tuple *region_stmt = NULL;
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space->index[i]);
+		if (vy_is_committed_one(env, lsm))
+			continue;
+		/*
+		 * As usual, rotate the active in-memory index if
+		 * schema was changed or dump was triggered. Do it
+		 * only if processing the first statement, because
+		 * dump may be triggered by one of the statements
+		 * of this transaction (see vy_quota_force_use()
+		 * below), in which case we must not do rotation
+		 * as we want all statements to land in the same
+		 * in-memory index. This is safe, as long as we
+		 * don't yield between statements.
+		 */
+		struct vy_mem *mem = lsm->mem;
+		if (is_first_statement &&
+		    (mem->space_cache_version != space_cache_version ||
+		     mem->generation != *lsm->env->p_generation)) {
+			rc = vy_lsm_rotate_mem(lsm);
+			if (rc != 0)
+				break;
+			mem = lsm->mem;
+		}
+		rc = vy_lsm_set(lsm, mem, delete, &region_stmt);
+		if (rc != 0)
+			break;
+		vy_lsm_commit_stmt(lsm, mem, region_stmt);
+
+		if (!is_first_statement)
+			continue;
+		/*
+		 * If this is the first statement of this
+		 * transaction, install on_commit trigger
+		 * which will propagate the WAL row LSN to
+		 * the LSM tree.
+		 */
+		struct trigger *on_commit = region_alloc(&fiber()->gc,
+							 sizeof(*on_commit));
+		if (on_commit == NULL) {
+			diag_set(OutOfMemory, sizeof(*on_commit),
+				 "region", "struct trigger");
+			rc = -1;
+			break;
+		}
+		vy_mem_pin(mem);
+		trigger_create(on_commit, vy_deferred_delete_on_commit, mem, NULL);
+		txn_on_commit(txn, on_commit);
+	}
+	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
+	assert(mem_used_after >= mem_used_before);
+	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+
+	tuple_unref(delete);
+	if (rc != 0)
+		diag_raise();
 }
 
 static struct trigger on_replace_vinyl_deferred_delete = {
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index f0b7ec9c..d2aa0c43 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -50,6 +50,7 @@ extern "C" {
 #endif /* defined(__cplusplus) */
 
 struct histogram;
+struct index;
 struct tuple;
 struct tuple_format;
 struct vy_lsm;
@@ -292,6 +293,10 @@ struct vy_lsm {
 	vy_lsm_read_set_t read_set;
 };
 
+/** Extract vy_lsm from an index object. */
+struct vy_lsm *
+vy_lsm(struct index *index);
+
 /** Return LSM tree name. Used for logging. */
 const char *
 vy_lsm_name(struct vy_lsm *lsm);
diff --git a/src/box/vy_mem.h b/src/box/vy_mem.h
index 957f549f..29b60ac7 100644
--- a/src/box/vy_mem.h
+++ b/src/box/vy_mem.h
@@ -171,6 +171,12 @@ struct vy_mem {
 	 *
 	 * Once the tree is dumped to disk it will be used to update
 	 * vy_lsm::dump_lsn, see vy_task_dump_new().
+	 *
+	 * Note, we account not only original LSN (vy_stmt_lsn())
+	 * in this variable, but also WAL LSN of deferred DELETE
+	 * statements. This is needed to skip WAL recovery of both
+	 * deferred and normal statements that have been dumped to
+	 * disk. See vy_deferred_delete_on_replace() for more details.
 	 */
 	int64_t dump_lsn;
 	/**
diff --git a/src/box/vy_point_lookup.c b/src/box/vy_point_lookup.c
index 5e43340b..7b704b84 100644
--- a/src/box/vy_point_lookup.c
+++ b/src/box/vy_point_lookup.c
@@ -293,3 +293,35 @@ done:
 	}
 	return 0;
 }
+
+int
+vy_point_lookup_mem(struct vy_lsm *lsm, const struct vy_read_view **rv,
+		    struct tuple *key, struct tuple **ret)
+{
+	assert(tuple_field_count(key) >= lsm->cmp_def->part_count);
+
+	int rc;
+	struct vy_history history;
+	vy_history_create(&history, &lsm->env->history_node_pool);
+
+	rc = vy_point_lookup_scan_cache(lsm, rv, key, &history);
+	if (rc != 0 || vy_history_is_terminal(&history))
+		goto done;
+
+	rc = vy_point_lookup_scan_mems(lsm, rv, key, &history);
+	if (rc != 0 || vy_history_is_terminal(&history))
+		goto done;
+
+	*ret = NULL;
+	goto out;
+done:
+	if (rc == 0) {
+		int upserts_applied;
+		rc = vy_history_apply(&history, lsm->cmp_def, lsm->mem_format,
+				      true, &upserts_applied, ret);
+		lsm->stat.upsert.applied += upserts_applied;
+	}
+out:
+	vy_history_cleanup(&history);
+	return rc;
+}
diff --git a/src/box/vy_point_lookup.h b/src/box/vy_point_lookup.h
index 3b7c5a04..6d77ce9c 100644
--- a/src/box/vy_point_lookup.h
+++ b/src/box/vy_point_lookup.h
@@ -71,6 +71,24 @@ vy_point_lookup(struct vy_lsm *lsm, struct vy_tx *tx,
 		const struct vy_read_view **rv,
 		struct tuple *key, struct tuple **ret);
 
+/**
+ * Look up a tuple by key in memory.
+ *
+ * This function works just like vy_point_lookup() except:
+ *
+ * - It only scans in-memory level and cache and hence doesn't yield.
+ * - It doesn't turn DELETE into NULL so it returns NULL if and only
+ *   if no terminal statement matching the key is present in memory
+ *   (there still may be statements stored on disk though).
+ * - It doesn't account the lookup to LSM tree stats (as it never
+ *   descends to lower levels).
+ *
+ * The function returns 0 on success, -1 on memory allocation error.
+ */
+int
+vy_point_lookup_mem(struct vy_lsm *lsm, const struct vy_read_view **rv,
+		    struct tuple *key, struct tuple **ret);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 7d8961c4..64975f3a 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -49,6 +49,10 @@
 #include "cbus.h"
 #include "salad/stailq.h"
 #include "say.h"
+#include "txn.h"
+#include "space.h"
+#include "schema.h"
+#include "xrow.h"
 #include "vy_lsm.h"
 #include "vy_log.h"
 #include "vy_mem.h"
@@ -65,6 +69,8 @@ static int vy_worker_f(va_list);
 static int vy_scheduler_f(va_list);
 static void vy_task_execute_f(struct cmsg *);
 static void vy_task_complete_f(struct cmsg *);
+static void vy_deferred_delete_batch_process_f(struct cmsg *);
+static void vy_deferred_delete_batch_free_f(struct cmsg *);
 
 static const struct cmsg_hop vy_task_execute_route[] = {
 	{ vy_task_execute_f, NULL },
@@ -83,10 +89,42 @@ struct vy_worker {
 	struct cpipe tx_pipe;
 	/** Link in vy_scheduler::idle_workers. */
 	struct stailq_entry in_idle;
+	/** Route for sending deferred DELETEs back to tx. */
+	struct cmsg_hop deferred_delete_route[2];
 };
 
 struct vy_task;
 
+/** Max number of statements in a batch of deferred DELETEs. */
+enum { VY_DEFERRED_DELETE_BATCH_MAX = 100 };
+
+/** Deferred DELETE statement. */
+struct vy_deferred_delete_stmt {
+	/** Overwritten tuple. */
+	struct tuple *old_stmt;
+	/** Statement that overwrote @old_stmt. */
+	struct tuple *new_stmt;
+};
+
+/**
+ * Batch of deferred DELETE statements generated during
+ * a primary index compaction.
+ */
+struct vy_deferred_delete_batch {
+	/** CBus messages for sending the batch to tx. */
+	struct cmsg cmsg;
+	/** Task that generated this batch. */
+	struct vy_task *task;
+	/** Set if the tx thread failed to process the batch. */
+	bool is_failed;
+	/** In case of failure the error is stored here. */
+	struct diag diag;
+	/** Number of elements actually stored in @stmt array. */
+	int count;
+	/** Array of deferred DELETE statements. */
+	struct vy_deferred_delete_stmt stmt[VY_DEFERRED_DELETE_BATCH_MAX];
+};
+
 struct vy_task_ops {
 	/**
 	 * This function is called from a worker. It is supposed to do work
@@ -159,10 +197,26 @@ struct vy_task {
 	 */
 	double bloom_fpr;
 	int64_t page_size;
+	/**
+	 * Deferred DELETE handler passed to the write iterator.
+	 * It sends deferred DELETE statements generated during
+	 * primary index compaction back to tx.
+	 */
+	struct vy_deferred_delete_handler deferred_delete_handler;
+	/** Batch of deferred deletes generated by this task. */
+	struct vy_deferred_delete_batch *deferred_delete_batch;
+	/**
+	 * Number of batches of deferred DELETEs sent to tx
+	 * and not yet processed.
+	 */
+	int deferred_delete_in_progress;
 	/** Link in vy_scheduler::processed_tasks. */
 	struct stailq_entry in_processed;
 };
 
+static const struct vy_deferred_delete_handler_iface
+vy_task_deferred_delete_iface;
+
 /**
  * Allocate a new task to be executed by a worker thread.
  * When preparing an asynchronous task, this function must
@@ -197,6 +251,7 @@ vy_task_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	}
 	vy_lsm_ref(lsm);
 	diag_create(&task->diag);
+	task->deferred_delete_handler.iface = &vy_task_deferred_delete_iface;
 	return task;
 }
 
@@ -204,6 +259,8 @@ vy_task_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 static void
 vy_task_delete(struct vy_task *task)
 {
+	assert(task->deferred_delete_batch == NULL);
+	assert(task->deferred_delete_in_progress == 0);
 	key_def_delete(task->cmp_def);
 	key_def_delete(task->key_def);
 	vy_lsm_unref(task->lsm);
@@ -293,6 +350,12 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 		cpipe_create(&worker->worker_pipe, name);
 		stailq_add_tail_entry(&scheduler->idle_workers,
 				      worker, in_idle);
+
+		struct cmsg_hop *route = worker->deferred_delete_route;
+		route[0].f = vy_deferred_delete_batch_process_f;
+		route[0].pipe = &worker->worker_pipe;
+		route[1].f = vy_deferred_delete_batch_free_f;
+		route[1].pipe = NULL;
 	}
 }
 
@@ -652,6 +715,237 @@ vy_run_discard(struct vy_run *run)
 	vy_log_tx_try_commit();
 }
 
+/**
+ * Encode and write a single deferred DELETE statement to
+ * _vinyl_deferred_delete system space. The rest will be
+ * done by the space trigger.
+ */
+static int
+vy_deferred_delete_process_one(struct space *deferred_delete_space,
+			       uint32_t space_id, struct tuple_format *format,
+			       struct vy_deferred_delete_stmt *stmt)
+{
+	int64_t lsn = vy_stmt_lsn(stmt->new_stmt);
+
+	struct tuple *delete;
+	delete = vy_stmt_new_surrogate_delete(format, stmt->old_stmt);
+	if (delete == NULL)
+		return -1;
+
+	uint32_t delete_data_size;
+	const char *delete_data = tuple_data_range(delete, &delete_data_size);
+
+	size_t buf_size = (mp_sizeof_array(3) + mp_sizeof_uint(space_id) +
+			   mp_sizeof_uint(lsn) + delete_data_size);
+	char *data = region_alloc(&fiber()->gc, buf_size);
+	if (data == NULL) {
+		diag_set(OutOfMemory, buf_size, "region", "buf");
+		tuple_unref(delete);
+		return -1;
+	}
+
+	char *data_end = data;
+	data_end = mp_encode_array(data_end, 3);
+	data_end = mp_encode_uint(data_end, space_id);
+	data_end = mp_encode_uint(data_end, lsn);
+	memcpy(data_end, delete_data, delete_data_size);
+	data_end += delete_data_size;
+	assert(data_end <= data + buf_size);
+
+	struct request request;
+	memset(&request, 0, sizeof(request));
+	request.type = IPROTO_REPLACE;
+	request.space_id = BOX_VINYL_DEFERRED_DELETE_ID;
+	request.tuple = data;
+	request.tuple_end = data_end;
+
+	tuple_unref(delete);
+
+	struct txn *txn = txn_begin_stmt(deferred_delete_space);
+	if (txn == NULL)
+		return -1;
+
+	struct tuple *unused;
+	if (space_execute_dml(deferred_delete_space, txn,
+			      &request, &unused) != 0) {
+		txn_rollback_stmt();
+		return -1;
+	}
+	return txn_commit_stmt(txn, &request);
+}
+
+/**
+ * Callback invoked by the tx thread to process deferred DELETE
+ * statements generated during compaction. It writes deferred
+ * DELETEs to a special system space, _vinyl_deferred_delete.
+ * The system space has an on_replace trigger installed which
+ * propagates the DELETEs to secondary indexes. This way, even
+ * if a deferred DELETE isn't dumped to disk by vinyl, it still
+ * can be recovered from WAL.
+ */
+static void
+vy_deferred_delete_batch_process_f(struct cmsg *cmsg)
+{
+	struct vy_deferred_delete_batch *batch = container_of(cmsg,
+				struct vy_deferred_delete_batch, cmsg);
+	struct vy_task *task = batch->task;
+	struct vy_lsm *pk = task->lsm;
+
+	assert(pk->index_id == 0);
+	/*
+	 * A space can be dropped while a compaction task
+	 * is in progress.
+	 */
+	if (pk->is_dropped)
+		return;
+
+	struct space *deferred_delete_space;
+	deferred_delete_space = space_by_id(BOX_VINYL_DEFERRED_DELETE_ID);
+	assert(deferred_delete_space != NULL);
+
+	struct txn *txn = txn_begin(false);
+	if (txn == NULL)
+		goto fail;
+
+	for (int i = 0; i < batch->count; i++) {
+		if (vy_deferred_delete_process_one(deferred_delete_space,
+						   pk->space_id, pk->mem_format,
+						   &batch->stmt[i]) != 0)
+			goto fail;
+	}
+
+	if (txn_commit(txn) != 0)
+		goto fail;
+
+	return;
+fail:
+	batch->is_failed = true;
+	diag_move(diag_get(), &batch->diag);
+	txn_rollback();
+}
+
+/**
+ * Callback invoked by a worker thread to free processed deferred
+ * DELETE statements. It must be done on behalf the worker thread
+ * that generated those DELETEs, because a vinyl statement cannot
+ * be allocated and freed in different threads.
+ */
+static void
+vy_deferred_delete_batch_free_f(struct cmsg *cmsg)
+{
+	struct vy_deferred_delete_batch *batch = container_of(cmsg,
+				struct vy_deferred_delete_batch, cmsg);
+	struct vy_task *task = batch->task;
+	for (int i = 0; i < batch->count; i++) {
+		struct vy_deferred_delete_stmt *stmt = &batch->stmt[i];
+		vy_stmt_unref_if_possible(stmt->old_stmt);
+		vy_stmt_unref_if_possible(stmt->new_stmt);
+	}
+	/*
+	 * Abort the task if the tx thread failed to process
+	 * the batch unless it has already been aborted.
+	 */
+	if (batch->is_failed && !task->is_failed) {
+		assert(!diag_is_empty(&batch->diag));
+		diag_move(&batch->diag, &task->diag);
+		task->is_failed = true;
+		fiber_cancel(task->fiber);
+	}
+	diag_destroy(&batch->diag);
+	free(batch);
+	/* Notify the caller if this is the last batch. */
+	assert(task->deferred_delete_in_progress > 0);
+	if (--task->deferred_delete_in_progress == 0)
+		fiber_wakeup(task->fiber);
+}
+
+/**
+ * Send all deferred DELETEs accumulated by a vinyl task to
+ * the tx thread where they will be processed.
+ */
+static void
+vy_task_deferred_delete_flush(struct vy_task *task)
+{
+	struct vy_worker *worker = task->worker;
+	struct vy_deferred_delete_batch *batch = task->deferred_delete_batch;
+
+	if (batch == NULL)
+		return;
+
+	task->deferred_delete_batch = NULL;
+	task->deferred_delete_in_progress++;
+
+	cmsg_init(&batch->cmsg, worker->deferred_delete_route);
+	cpipe_push(&worker->tx_pipe, &batch->cmsg);
+}
+
+/**
+ * Add a deferred DELETE to a batch. Once the batch gets full,
+ * submit it to tx where it will get processed.
+ */
+static int
+vy_task_deferred_delete_process(struct vy_deferred_delete_handler *handler,
+				struct tuple *old_stmt, struct tuple *new_stmt)
+{
+	enum { MAX_IN_PROGRESS = 10 };
+
+	struct vy_task *task = container_of(handler, struct vy_task,
+					    deferred_delete_handler);
+	struct vy_deferred_delete_batch *batch = task->deferred_delete_batch;
+
+	/*
+	 * Throttle compaction task if there are too many batches
+	 * being processed so as to limit memory consumption.
+	 */
+	while (task->deferred_delete_in_progress >= MAX_IN_PROGRESS)
+		fiber_sleep(TIMEOUT_INFINITY);
+
+	/* Allocate a new batch on demand. */
+	if (batch == NULL) {
+		batch = malloc(sizeof(*batch));
+		if (batch == NULL) {
+			diag_set(OutOfMemory, sizeof(*batch), "malloc",
+				 "struct vy_deferred_delete_batch");
+			return -1;
+		}
+		memset(batch, 0, sizeof(*batch));
+		batch->task = task;
+		diag_create(&batch->diag);
+		task->deferred_delete_batch = batch;
+	}
+
+	assert(batch->count < VY_DEFERRED_DELETE_BATCH_MAX);
+	struct vy_deferred_delete_stmt *stmt = &batch->stmt[batch->count++];
+	stmt->old_stmt = old_stmt;
+	vy_stmt_ref_if_possible(old_stmt);
+	stmt->new_stmt = new_stmt;
+	vy_stmt_ref_if_possible(new_stmt);
+
+	if (batch->count == VY_DEFERRED_DELETE_BATCH_MAX)
+		vy_task_deferred_delete_flush(task);
+	return 0;
+}
+
+/**
+ * Wait until all pending deferred DELETE statements have been
+ * processed by tx. Called when the write iterator stops.
+ */
+static void
+vy_task_deferred_delete_destroy(struct vy_deferred_delete_handler *handler)
+{
+	struct vy_task *task = container_of(handler, struct vy_task,
+					    deferred_delete_handler);
+	vy_task_deferred_delete_flush(task);
+	while (task->deferred_delete_in_progress > 0)
+		fiber_sleep(TIMEOUT_INFINITY);
+}
+
+static const struct vy_deferred_delete_handler_iface
+vy_task_deferred_delete_iface = {
+	.process = vy_task_deferred_delete_process,
+	.destroy = vy_task_deferred_delete_destroy,
+};
+
 static int
 vy_task_write_run(struct vy_task *task)
 {
@@ -1002,6 +1296,12 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 
 	new_run->dump_lsn = dump_lsn;
 
+	/*
+	 * Note, since deferred DELETE are generated on tx commit
+	 * in case the overwritten tuple is found in-memory, no
+	 * deferred DELETE statement should be generated during
+	 * dump so we don't pass a deferred DELETE handler.
+	 */
 	struct vy_stmt_stream *wi;
 	bool is_last_level = (lsm->run_count == 0);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
@@ -1273,7 +1573,9 @@ vy_task_compact_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (range->compact_priority == range->slice_count);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views, NULL);
+				   scheduler->read_views,
+				   lsm->index_id > 0 ? NULL :
+				   &task->deferred_delete_handler);
 	if (wi == NULL)
 		goto err_wi;
 
@@ -1336,7 +1638,7 @@ static int
 vy_task_f(va_list va)
 {
 	struct vy_task *task = va_arg(va, struct vy_task *);
-	if (task->ops->execute(task) != 0) {
+	if (task->ops->execute(task) != 0 && !task->is_failed) {
 		struct diag *diag = diag_get();
 		assert(!diag_is_empty(diag));
 		task->is_failed = true;
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index c7a6d9ff..27cd9bb7 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -46,6 +46,7 @@
 #include "iterator_type.h"
 #include "salad/stailq.h"
 #include "schema.h" /* space_cache_version */
+#include "space.h"
 #include "trigger.h"
 #include "trivia/util.h"
 #include "tuple.h"
@@ -58,6 +59,7 @@
 #include "vy_history.h"
 #include "vy_read_set.h"
 #include "vy_read_view.h"
+#include "vy_point_lookup.h"
 
 int
 write_set_cmp(struct txv *a, struct txv *b)
@@ -468,6 +470,106 @@ vy_tx_write(struct vy_lsm *lsm, struct vy_mem *mem,
 	return vy_lsm_set(lsm, mem, stmt, region_stmt);
 }
 
+/**
+ * Try to generate a deferred DELETE statement on tx commit.
+ *
+ * This function is supposed to be called for a primary index
+ * statement which was executed without deletion of the overwritten
+ * tuple from secondary indexes. It looks up the overwritten tuple
+ * in memory and, if found, produces the deferred DELETEs and
+ * inserts them into the transaction log.
+ *
+ * Affects @tx->log, @v->stmt.
+ *
+ * Returns 0 on success, -1 on memory allocation error.
+ */
+static int
+vy_tx_handle_deferred_delete(struct vy_tx *tx, struct txv *v)
+{
+	struct vy_lsm *pk = v->lsm;
+	struct tuple *stmt = v->stmt;
+	uint8_t flags = vy_stmt_flags(stmt);
+
+	assert(pk->index_id == 0);
+	assert(flags & VY_STMT_DEFERRED_DELETE);
+
+	struct space *space = space_cache_find(pk->space_id);
+	if (space == NULL) {
+		/*
+		 * Space was dropped while transaction was
+		 * in progress. Nothing to do.
+		 */
+		return 0;
+	}
+
+	/* Look up the tuple overwritten by this statement. */
+	struct tuple *tuple;
+	if (vy_point_lookup_mem(pk, &tx->xm->p_global_read_view,
+				stmt, &tuple) != 0)
+		return -1;
+
+	if (tuple == NULL) {
+		/*
+		 * Nothing's found, but there still may be
+		 * matching statements stored on disk so we
+		 * have to defer generation of DELETE until
+		 * compaction.
+		 */
+		return 0;
+	}
+
+	/*
+	 * If a terminal statement is found, we can produce
+	 * DELETE right away so clear the flag now.
+	 */
+	vy_stmt_set_flags(stmt, flags & ~VY_STMT_DEFERRED_DELETE);
+
+	if (vy_stmt_type(tuple) == IPROTO_DELETE) {
+		/* The tuple's already deleted, nothing to do. */
+		tuple_unref(tuple);
+		return 0;
+	}
+
+	struct tuple *delete_stmt;
+	delete_stmt = vy_stmt_new_surrogate_delete(pk->mem_format, tuple);
+	tuple_unref(tuple);
+	if (delete_stmt == NULL)
+		return -1;
+
+	if (vy_stmt_type(stmt) == IPROTO_DELETE) {
+		/*
+		 * Since primary and secondary indexes of the
+		 * same space share in-memory statements, we
+		 * need to use the new DELETE in the primary
+		 * index, because the original DELETE doesn't
+		 * contain secondary key parts.
+		 */
+		vy_stmt_counter_acct_tuple(&pk->stat.txw.count, delete_stmt);
+		vy_stmt_counter_unacct_tuple(&pk->stat.txw.count, stmt);
+		v->stmt = delete_stmt;
+		tuple_ref(delete_stmt);
+		tuple_unref(stmt);
+	}
+
+	/*
+	 * Make DELETE statements for secondary indexes and
+	 * insert them into the transaction log.
+	 */
+	int rc = 0;
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space->index[i]);
+		struct txv *delete_txv = txv_new(tx, lsm, delete_stmt);
+		if (delete_txv == NULL) {
+			rc = -1;
+			break;
+		}
+		stailq_insert_entry(&tx->log, delete_txv, v, next_in_log);
+		vy_stmt_counter_acct_tuple(&lsm->stat.txw.count, delete_stmt);
+	}
+	tuple_unref(delete_stmt);
+	return rc;
+}
+
 int
 vy_tx_prepare(struct vy_tx *tx)
 {
@@ -521,6 +623,22 @@ vy_tx_prepare(struct vy_tx *tx)
 		if (v->is_overwritten)
 			continue;
 
+		if (lsm->index_id > 0 && repsert == NULL && delete == NULL) {
+			/*
+			 * This statement is for a secondary index,
+			 * and the statement corresponding to it in
+			 * the primary index was overwritten. This
+			 * can only happen if insertion of DELETE
+			 * into secondary indexes was postponed until
+			 * primary index compaction. In this case
+			 * the DELETE will not be generated, because
+			 * the corresponding statement never made it
+			 * to the primary index LSM tree. So we must
+			 * skip it for secondary indexes as well.
+			 */
+			continue;
+		}
+
 		enum iproto_type type = vy_stmt_type(v->stmt);
 
 		/* Optimize out INSERT + DELETE for the same key. */
@@ -535,6 +653,16 @@ vy_tx_prepare(struct vy_tx *tx)
 			 */
 			type = IPROTO_INSERT;
 			vy_stmt_set_type(v->stmt, type);
+			/*
+			 * In case of INSERT, no statement was actually
+			 * overwritten so no need to generate a deferred
+			 * DELETE for secondary indexes.
+			 */
+			uint8_t flags = vy_stmt_flags(v->stmt);
+			if (flags & VY_STMT_DEFERRED_DELETE) {
+				vy_stmt_set_flags(v->stmt, flags &
+						  ~VY_STMT_DEFERRED_DELETE);
+			}
 		}
 
 		if (!v->is_first_insert && type == IPROTO_INSERT) {
@@ -550,6 +678,11 @@ vy_tx_prepare(struct vy_tx *tx)
 			return -1;
 		assert(v->mem != NULL);
 
+		if (lsm->index_id == 0 &&
+		    vy_stmt_flags(v->stmt) & VY_STMT_DEFERRED_DELETE &&
+		    vy_tx_handle_deferred_delete(tx, v) != 0)
+			return -1;
+
 		/* In secondary indexes only REPLACE/DELETE can be written. */
 		vy_stmt_set_lsn(v->stmt, MAX_LSN + tx->psn);
 		const struct tuple **region_stmt =
diff --git a/test/unit/vy_point_lookup.c b/test/unit/vy_point_lookup.c
index 87f26900..eee25274 100644
--- a/test/unit/vy_point_lookup.c
+++ b/test/unit/vy_point_lookup.c
@@ -13,6 +13,8 @@
 
 uint32_t schema_version;
 uint32_t space_cache_version;
+struct space *space_by_id(uint32_t id) { return NULL; }
+struct vy_lsm *vy_lsm(struct index *index) { return NULL; }
 
 static int
 write_run(struct vy_run *run, const char *dir_name,
diff --git a/test/vinyl/deferred_delete.result b/test/vinyl/deferred_delete.result
new file mode 100644
index 00000000..9811b6bc
--- /dev/null
+++ b/test/vinyl/deferred_delete.result
@@ -0,0 +1,677 @@
+test_run = require('test_run').new()
+---
+...
+fiber = require('fiber')
+---
+...
+--
+-- Create a space with secondary indexes and check that REPLACE and
+-- DELETE requests do not look up the old tuple in the primary index
+-- to generate the DELETE statements for secondary indexes. Instead
+-- DELETEs are generated when the primary index is compacted (gh-2129).
+-- The optimization should work for both non-unique and unique indexes
+-- so mark one of the indexes unique.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+pk = s:create_index('pk', {run_count_per_level = 10})
+---
+...
+i1 = s:create_index('i1', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+---
+...
+i2 = s:create_index('i2', {run_count_per_level = 10, parts = {3, 'unsigned'}, unique = true})
+---
+...
+for i = 1, 10 do s:replace{i, i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+for i = 1, 10, 2 do s:delete{i} end
+---
+...
+for i = 2, 10, 2 do s:replace{i, i * 10, i * 100} end
+---
+...
+-- DELETE/REPLACE does not look up the old tuple in the primary index.
+pk:stat().lookup -- 0
+---
+- 0
+...
+-- DELETEs are not written to secondary indexes.
+pk:stat().rows -- 10 old REPLACEs + 5 new REPLACEs + 5 DELETEs
+---
+- 20
+...
+i1:stat().rows -- 10 old REPLACEs + 5 new REPLACEs
+---
+- 15
+...
+i2:stat().rows -- ditto
+---
+- 15
+...
+-- Although there are only 5 tuples in the space, we have to look up
+-- overwritten tuples in the primary index hence 15 lookups per SELECT
+-- in a secondary index.
+i1:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i1:stat().get.rows -- 15
+---
+- 15
+...
+pk:stat().lookup -- 15
+---
+- 15
+...
+i2:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i2:stat().get.rows -- 15
+---
+- 15
+...
+pk:stat().lookup -- 30
+---
+- 30
+...
+-- Overwritten/deleted tuples are not stored in the cache so calling
+-- SELECT for a second time does only 5 lookups.
+box.stat.reset()
+---
+...
+i1:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i1:stat().get.rows -- 5
+---
+- 5
+...
+pk:stat().lookup -- 5
+---
+- 5
+...
+i2:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i2:stat().get.rows -- 5
+---
+- 5
+...
+pk:stat().lookup -- 10
+---
+- 10
+...
+-- Cleanup the cache.
+vinyl_cache = box.cfg.vinyl_cache
+---
+...
+box.cfg{vinyl_cache = 0}
+---
+...
+box.cfg{vinyl_cache = vinyl_cache}
+---
+...
+-- Compact the primary index to generate deferred DELETEs.
+box.snapshot()
+---
+- ok
+...
+pk:compact()
+---
+...
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+pk:stat().rows -- 5 new REPLACEs
+---
+- 5
+...
+i1:stat().rows -- 10 old REPLACE + 5 new REPLACEs + 10 deferred DELETEs
+---
+- 25
+...
+i2:stat().rows -- ditto
+---
+- 25
+...
+-- Deferred DELETEs must be ignored by the read iterator, because
+-- they may break the read iterator invariant, so they don't reduce
+-- the number of lookups.
+box.stat.reset()
+---
+...
+i1:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i1:stat().get.rows -- 15
+---
+- 15
+...
+pk:stat().lookup -- 15
+---
+- 15
+...
+i2:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i2:stat().get.rows -- 15
+---
+- 15
+...
+pk:stat().lookup -- 30
+---
+- 30
+...
+-- Check that deferred DELETEs are not lost after restart.
+test_run:cmd("restart server default")
+fiber = require('fiber')
+---
+...
+s = box.space.test
+---
+...
+pk = s.index.pk
+---
+...
+i1 = s.index.i1
+---
+...
+i2 = s.index.i2
+---
+...
+i1:stat().rows -- 10 old REPLACEs + 5 new REPLACEs + 10 deferred DELETEs
+---
+- 25
+...
+i2:stat().rows -- ditto
+---
+- 25
+...
+-- Dump deferred DELETEs to disk and compact them.
+-- Check that they cleanup garbage statements.
+box.snapshot()
+---
+- ok
+...
+i1:compact()
+---
+...
+while i1:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+i2:compact()
+---
+...
+while i2:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+i1:stat().rows -- 5 new REPLACEs
+---
+- 5
+...
+i2:stat().rows -- ditto
+---
+- 5
+...
+box.stat.reset()
+---
+...
+i1:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i1:stat().get.rows -- 5
+---
+- 5
+...
+pk:stat().lookup -- 5
+---
+- 5
+...
+i2:select()
+---
+- - [2, 20, 200]
+  - [4, 40, 400]
+  - [6, 60, 600]
+  - [8, 80, 800]
+  - [10, 100, 1000]
+...
+i2:stat().get.rows -- 5
+---
+- 5
+...
+pk:stat().lookup -- 10
+---
+- 10
+...
+s:drop()
+---
+...
+--
+-- Check that if the old tuple is found in cache or in memory, then
+-- the DELETE for secondary indexes is generated when the statement
+-- is committed.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+pk = s:create_index('pk', {run_count_per_level = 10})
+---
+...
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+---
+...
+for i = 1, 10 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+s:count() -- add tuples to the cache
+---
+- 10
+...
+box.stat.reset()
+---
+...
+for i = 1, 10, 2 do s:delete{i} end
+---
+...
+for i = 2, 10, 2 do s:replace{i, i * 10} end
+---
+...
+pk:stat().lookup -- 0
+---
+- 0
+...
+pk:stat().cache.lookup -- 10
+---
+- 10
+...
+pk:stat().cache.get.rows -- 10
+---
+- 10
+...
+pk:stat().memory.iterator.lookup -- 0
+---
+- 0
+...
+sk:stat().rows -- 10 old REPLACEs + 10 DELETEs + 5 new REPLACEs
+---
+- 25
+...
+box.stat.reset()
+---
+...
+for i = 1, 10 do s:replace{i, i * 100} end
+---
+...
+pk:stat().lookup -- 0
+---
+- 0
+...
+pk:stat().cache.lookup -- 10
+---
+- 10
+...
+pk:stat().cache.get.rows -- 0
+---
+- 0
+...
+pk:stat().memory.iterator.lookup -- 10
+---
+- 10
+...
+pk:stat().memory.iterator.get.rows -- 10
+---
+- 10
+...
+sk:stat().rows -- 15 old REPLACEs + 15 DELETEs + 10 new REPLACEs
+---
+- 40
+...
+box.stat.reset()
+---
+...
+for i = 1, 10 do s:delete{i} end
+---
+...
+pk:stat().lookup -- 0
+---
+- 0
+...
+pk:stat().cache.lookup -- 10
+---
+- 10
+...
+pk:stat().cache.get.rows -- 0
+---
+- 0
+...
+pk:stat().memory.iterator.lookup -- 10
+---
+- 10
+...
+pk:stat().memory.iterator.get.rows -- 10
+---
+- 10
+...
+sk:stat().rows -- 25 old REPLACEs + 25 DELETEs
+---
+- 50
+...
+sk:select()
+---
+- []
+...
+pk:stat().lookup -- 0
+---
+- 0
+...
+box.snapshot()
+---
+- ok
+...
+sk:compact()
+---
+...
+while sk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+sk:stat().run_count -- 0
+---
+- 0
+...
+s:drop()
+---
+...
+--
+-- Check that a transaction is aborted if it read a tuple from
+-- a secondary index that was overwritten in the primary index.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+pk = s:create_index('pk')
+---
+...
+sk = s:create_index('sk', {parts = {2, 'unsigned'}, unique = false})
+---
+...
+s:replace{1, 1}
+---
+- [1, 1]
+...
+box.snapshot()
+---
+- ok
+...
+box.begin()
+---
+...
+sk:select{1}
+---
+- - [1, 1]
+...
+c = fiber.channel(1)
+---
+...
+_ = fiber.create(function() s:replace{1, 10} c:put(true) end)
+---
+...
+c:get()
+---
+- true
+...
+sk:select{1}
+---
+- - [1, 1]
+...
+s:replace{10, 10}
+---
+- [10, 10]
+...
+box.commit() -- error
+---
+- error: Transaction has been aborted by conflict
+...
+s:drop()
+---
+...
+--
+-- Check that if a tuple was overwritten in the transaction write set,
+-- it won't be committed to secondary indexes.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+pk = s:create_index('pk', {run_count_per_level = 10})
+---
+...
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+---
+...
+for i = 1, 10 do s:replace{i, i} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+box.begin()
+---
+...
+for i = 1, 10 do s:replace{i, i * 10} end
+---
+...
+for i = 1, 10, 2 do s:delete{i} end
+---
+...
+for i = 2, 10, 2 do s:replace{i, i * 100} end
+---
+...
+box.commit()
+---
+...
+sk:select()
+---
+- - [2, 200]
+  - [4, 400]
+  - [6, 600]
+  - [8, 800]
+  - [10, 1000]
+...
+pk:stat().rows -- 10 old REPLACEs + 5 DELETEs + 5 new REPLACEs
+---
+- 20
+...
+sk:stat().rows -- 10 old REPLACEs + 5 new REPLACEs
+---
+- 15
+...
+-- Compact the primary index to generate deferred DELETEs.
+box.snapshot()
+---
+- ok
+...
+pk:compact()
+---
+...
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+-- Compact the secondary index to cleanup garbage.
+box.snapshot()
+---
+- ok
+...
+sk:compact()
+---
+...
+while sk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+sk:select()
+---
+- - [2, 200]
+  - [4, 400]
+  - [6, 600]
+  - [8, 800]
+  - [10, 1000]
+...
+pk:stat().rows -- 5 new REPLACEs
+---
+- 5
+...
+sk:stat().rows -- ditto
+---
+- 5
+...
+s:drop()
+---
+...
+--
+-- Check that on recovery we do not apply deferred DELETEs that
+-- have been dumped to disk.
+--
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+---
+- true
+...
+test_run:cmd("start server test with args='1048576'")
+---
+- true
+...
+test_run:cmd("switch test")
+---
+- true
+...
+fiber = require('fiber')
+---
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+pk = s:create_index('pk', {run_count_per_level = 10})
+---
+...
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+---
+...
+pad = string.rep('x', 10 * 1024)
+---
+...
+for i = 1, 120 do s:replace{i, i, pad} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+pad = string.rep('y', 10 * 1024)
+---
+...
+for i = 1, 120 do s:replace{i, i, pad} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs
+---
+- 240
+...
+box.stat.reset()
+---
+...
+-- Compact the primary index to generate deferred DELETEs.
+-- Deferred DELETEs won't fit in memory and trigger dump
+-- of the secondary index.
+pk:compact()
+---
+...
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+sk:stat().disk.dump.count -- 1
+---
+- 1
+...
+sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs + 120 deferred DELETEs
+---
+- 360
+...
+test_run:cmd("restart server test with args='1048576'")
+s = box.space.test
+---
+...
+pk = s.index.pk
+---
+...
+sk = s.index.sk
+---
+...
+-- Should be 360, the same amount of statements as before restart.
+-- If we applied all deferred DELETEs, including the dumped ones,
+-- then there would be more.
+sk:stat().rows
+---
+- 360
+...
+s:drop()
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server test")
+---
+- true
+...
+test_run:cmd("cleanup server test")
+---
+- true
+...
diff --git a/test/vinyl/deferred_delete.test.lua b/test/vinyl/deferred_delete.test.lua
new file mode 100644
index 00000000..d18361a0
--- /dev/null
+++ b/test/vinyl/deferred_delete.test.lua
@@ -0,0 +1,261 @@
+test_run = require('test_run').new()
+fiber = require('fiber')
+
+--
+-- Create a space with secondary indexes and check that REPLACE and
+-- DELETE requests do not look up the old tuple in the primary index
+-- to generate the DELETE statements for secondary indexes. Instead
+-- DELETEs are generated when the primary index is compacted (gh-2129).
+-- The optimization should work for both non-unique and unique indexes
+-- so mark one of the indexes unique.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+pk = s:create_index('pk', {run_count_per_level = 10})
+i1 = s:create_index('i1', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+i2 = s:create_index('i2', {run_count_per_level = 10, parts = {3, 'unsigned'}, unique = true})
+for i = 1, 10 do s:replace{i, i, i} end
+box.snapshot()
+for i = 1, 10, 2 do s:delete{i} end
+for i = 2, 10, 2 do s:replace{i, i * 10, i * 100} end
+
+-- DELETE/REPLACE does not look up the old tuple in the primary index.
+pk:stat().lookup -- 0
+
+-- DELETEs are not written to secondary indexes.
+pk:stat().rows -- 10 old REPLACEs + 5 new REPLACEs + 5 DELETEs
+i1:stat().rows -- 10 old REPLACEs + 5 new REPLACEs
+i2:stat().rows -- ditto
+
+-- Although there are only 5 tuples in the space, we have to look up
+-- overwritten tuples in the primary index hence 15 lookups per SELECT
+-- in a secondary index.
+i1:select()
+i1:stat().get.rows -- 15
+pk:stat().lookup -- 15
+i2:select()
+i2:stat().get.rows -- 15
+pk:stat().lookup -- 30
+
+-- Overwritten/deleted tuples are not stored in the cache so calling
+-- SELECT for a second time does only 5 lookups.
+box.stat.reset()
+i1:select()
+i1:stat().get.rows -- 5
+pk:stat().lookup -- 5
+i2:select()
+i2:stat().get.rows -- 5
+pk:stat().lookup -- 10
+
+-- Cleanup the cache.
+vinyl_cache = box.cfg.vinyl_cache
+box.cfg{vinyl_cache = 0}
+box.cfg{vinyl_cache = vinyl_cache}
+
+-- Compact the primary index to generate deferred DELETEs.
+box.snapshot()
+pk:compact()
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+pk:stat().rows -- 5 new REPLACEs
+i1:stat().rows -- 10 old REPLACE + 5 new REPLACEs + 10 deferred DELETEs
+i2:stat().rows -- ditto
+
+-- Deferred DELETEs must be ignored by the read iterator, because
+-- they may break the read iterator invariant, so they don't reduce
+-- the number of lookups.
+box.stat.reset()
+i1:select()
+i1:stat().get.rows -- 15
+pk:stat().lookup -- 15
+i2:select()
+i2:stat().get.rows -- 15
+pk:stat().lookup -- 30
+
+-- Check that deferred DELETEs are not lost after restart.
+test_run:cmd("restart server default")
+fiber = require('fiber')
+s = box.space.test
+pk = s.index.pk
+i1 = s.index.i1
+i2 = s.index.i2
+i1:stat().rows -- 10 old REPLACEs + 5 new REPLACEs + 10 deferred DELETEs
+i2:stat().rows -- ditto
+
+-- Dump deferred DELETEs to disk and compact them.
+-- Check that they cleanup garbage statements.
+box.snapshot()
+i1:compact()
+while i1:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+i2:compact()
+while i2:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+i1:stat().rows -- 5 new REPLACEs
+i2:stat().rows -- ditto
+box.stat.reset()
+i1:select()
+i1:stat().get.rows -- 5
+pk:stat().lookup -- 5
+i2:select()
+i2:stat().get.rows -- 5
+pk:stat().lookup -- 10
+
+s:drop()
+
+--
+-- Check that if the old tuple is found in cache or in memory, then
+-- the DELETE for secondary indexes is generated when the statement
+-- is committed.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+pk = s:create_index('pk', {run_count_per_level = 10})
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+
+for i = 1, 10 do s:replace{i, i} end
+box.snapshot()
+s:count() -- add tuples to the cache
+
+box.stat.reset()
+for i = 1, 10, 2 do s:delete{i} end
+for i = 2, 10, 2 do s:replace{i, i * 10} end
+pk:stat().lookup -- 0
+pk:stat().cache.lookup -- 10
+pk:stat().cache.get.rows -- 10
+pk:stat().memory.iterator.lookup -- 0
+sk:stat().rows -- 10 old REPLACEs + 10 DELETEs + 5 new REPLACEs
+
+box.stat.reset()
+for i = 1, 10 do s:replace{i, i * 100} end
+pk:stat().lookup -- 0
+pk:stat().cache.lookup -- 10
+pk:stat().cache.get.rows -- 0
+pk:stat().memory.iterator.lookup -- 10
+pk:stat().memory.iterator.get.rows -- 10
+sk:stat().rows -- 15 old REPLACEs + 15 DELETEs + 10 new REPLACEs
+
+box.stat.reset()
+for i = 1, 10 do s:delete{i} end
+pk:stat().lookup -- 0
+pk:stat().cache.lookup -- 10
+pk:stat().cache.get.rows -- 0
+pk:stat().memory.iterator.lookup -- 10
+pk:stat().memory.iterator.get.rows -- 10
+sk:stat().rows -- 25 old REPLACEs + 25 DELETEs
+
+sk:select()
+pk:stat().lookup -- 0
+
+box.snapshot()
+sk:compact()
+while sk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+sk:stat().run_count -- 0
+
+s:drop()
+
+--
+-- Check that a transaction is aborted if it read a tuple from
+-- a secondary index that was overwritten in the primary index.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+pk = s:create_index('pk')
+sk = s:create_index('sk', {parts = {2, 'unsigned'}, unique = false})
+s:replace{1, 1}
+box.snapshot()
+
+box.begin()
+sk:select{1}
+c = fiber.channel(1)
+_ = fiber.create(function() s:replace{1, 10} c:put(true) end)
+c:get()
+sk:select{1}
+s:replace{10, 10}
+box.commit() -- error
+
+s:drop()
+
+--
+-- Check that if a tuple was overwritten in the transaction write set,
+-- it won't be committed to secondary indexes.
+--
+s = box.schema.space.create('test', {engine = 'vinyl'})
+pk = s:create_index('pk', {run_count_per_level = 10})
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned'}, unique = false})
+for i = 1, 10 do s:replace{i, i} end
+box.snapshot()
+
+box.begin()
+for i = 1, 10 do s:replace{i, i * 10} end
+for i = 1, 10, 2 do s:delete{i} end
+for i = 2, 10, 2 do s:replace{i, i * 100} end
+box.commit()
+
+sk:select()
+
+pk:stat().rows -- 10 old REPLACEs + 5 DELETEs + 5 new REPLACEs
+sk:stat().rows -- 10 old REPLACEs + 5 new REPLACEs
+
+-- Compact the primary index to generate deferred DELETEs.
+box.snapshot()
+pk:compact()
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+
+-- Compact the secondary index to cleanup garbage.
+box.snapshot()
+sk:compact()
+while sk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+
+sk:select()
+
+pk:stat().rows -- 5 new REPLACEs
+sk:stat().rows -- ditto
+
+s:drop()
+
+--
+-- Check that on recovery we do not apply deferred DELETEs that
+-- have been dumped to disk.
+--
+test_run:cmd("create server test with script='vinyl/low_quota.lua'")
+test_run:cmd("start server test with args='1048576'")
+test_run:cmd("switch test")
+
+fiber = require('fiber')
+
+s = box.schema.space.create('test', {engine = 'vinyl'})
+pk = s:create_index('pk', {run_count_per_level = 10})
+sk = s:create_index('sk', {run_count_per_level = 10, parts = {2, 'unsigned', 3, 'string'}, unique = false})
+
+pad = string.rep('x', 10 * 1024)
+for i = 1, 120 do s:replace{i, i, pad} end
+box.snapshot()
+
+pad = string.rep('y', 10 * 1024)
+for i = 1, 120 do s:replace{i, i, pad} end
+box.snapshot()
+
+sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs
+
+box.stat.reset()
+
+-- Compact the primary index to generate deferred DELETEs.
+-- Deferred DELETEs won't fit in memory and trigger dump
+-- of the secondary index.
+pk:compact()
+while pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+
+sk:stat().disk.dump.count -- 1
+
+sk:stat().rows -- 120 old REPLACEs + 120 new REPLACEs + 120 deferred DELETEs
+
+test_run:cmd("restart server test with args='1048576'")
+s = box.space.test
+pk = s.index.pk
+sk = s.index.sk
+
+-- Should be 360, the same amount of statements as before restart.
+-- If we applied all deferred DELETEs, including the dumped ones,
+-- then there would be more.
+sk:stat().rows
+
+s:drop()
+
+test_run:cmd("switch default")
+test_run:cmd("stop server test")
+test_run:cmd("cleanup server test")
diff --git a/test/vinyl/info.result b/test/vinyl/info.result
index 112ba85e..95e8cc60 100644
--- a/test/vinyl/info.result
+++ b/test/vinyl/info.result
@@ -1036,10 +1036,10 @@ s:bsize()
 ---
 - 0
 ...
-i1 = s:create_index('i1', {parts = {1, 'unsigned'}, run_count_per_level = 1})
+i1 = s:create_index('i1', {parts = {1, 'unsigned'}, run_count_per_level = 10})
 ---
 ...
-i2 = s:create_index('i2', {parts = {2, 'unsigned'}, run_count_per_level = 1})
+i2 = s:create_index('i2', {parts = {2, 'unsigned'}, run_count_per_level = 10})
 ---
 ...
 s:bsize()
@@ -1162,7 +1162,7 @@ s:bsize()
 i1:len(), i2:len()
 ---
 - 150
-- 150
+- 100
 ...
 i1:bsize(), i2:bsize()
 ---
@@ -1189,13 +1189,25 @@ i2:bsize() == st2.memory.index_size + st2.disk.index_size + st2.disk.bloom_size
 ---
 - true
 ...
+-- Compact the primary index first to generate deferred DELETEs.
+-- Then dump them and compact the secondary index.
 box.snapshot()
 ---
 - ok
 ...
+i1:compact()
+---
+...
 wait(function() return i1:stat() end, st1, 'disk.compact.count', 1)
 ---
 ...
+box.snapshot()
+---
+- ok
+...
+i2:compact()
+---
+...
 wait(function() return i2:stat() end, st2, 'disk.compact.count', 1)
 ---
 ...
diff --git a/test/vinyl/info.test.lua b/test/vinyl/info.test.lua
index 863a8793..5aebd0a8 100644
--- a/test/vinyl/info.test.lua
+++ b/test/vinyl/info.test.lua
@@ -322,8 +322,8 @@ s:drop()
 
 s = box.schema.space.create('test', {engine = 'vinyl'})
 s:bsize()
-i1 = s:create_index('i1', {parts = {1, 'unsigned'}, run_count_per_level = 1})
-i2 = s:create_index('i2', {parts = {2, 'unsigned'}, run_count_per_level = 1})
+i1 = s:create_index('i1', {parts = {1, 'unsigned'}, run_count_per_level = 10})
+i2 = s:create_index('i2', {parts = {2, 'unsigned'}, run_count_per_level = 10})
 s:bsize()
 i1:len(), i2:len()
 i1:bsize(), i2:bsize()
@@ -365,8 +365,13 @@ i2:len() == st2.memory.rows + st2.disk.rows
 i1:bsize() == st1.memory.index_size + st1.disk.index_size + st1.disk.bloom_size
 i2:bsize() == st2.memory.index_size + st2.disk.index_size + st2.disk.bloom_size + st2.disk.bytes
 
+-- Compact the primary index first to generate deferred DELETEs.
+-- Then dump them and compact the secondary index.
 box.snapshot()
+i1:compact()
 wait(function() return i1:stat() end, st1, 'disk.compact.count', 1)
+box.snapshot()
+i2:compact()
 wait(function() return i2:stat() end, st2, 'disk.compact.count', 1)
 st1 = i1:stat()
 st2 = i2:stat()
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 49826302..ee14cd51 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -253,17 +253,17 @@ result
   - - 00000000000000000008.run
     - - HEADER:
           lsn: 10
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ёёё', null]
       - HEADER:
           lsn: 9
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['эээ', null]
       - HEADER:
           lsn: 8
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ЭЭЭ', null]
       - HEADER:
@@ -285,8 +285,8 @@ result
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 90
-          unpacked_size: 71
+          size: 102
+          unpacked_size: 83
           row_count: 3
           min_key: ['ёёё']
   - - 00000000000000000012.run
@@ -295,20 +295,23 @@ result
           type: REPLACE
         BODY:
           tuple: ['ёёё', 123]
+          tuple_meta: {1: 1}
       - HEADER:
           lsn: 13
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ююю', 789]
+          tuple_meta: {1: 1}
       - HEADER:
           lsn: 12
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ЮЮЮ', 456]
+          tuple_meta: {1: 1}
       - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0\""
+          row_index: "\0\0\0\0\0\0\0\x14\0\0\0*"
   - - 00000000000000000006.index
     - - HEADER:
           type: RUNINFO
@@ -331,17 +334,17 @@ result
   - - 00000000000000000006.run
     - - HEADER:
           lsn: 10
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'ёёё']
       - HEADER:
           lsn: 9
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'эээ']
       - HEADER:
           lsn: 8
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'ЭЭЭ']
       - HEADER:
@@ -357,41 +360,36 @@ result
           page_count: 1
           bloom_filter: <bloom_filter>
           max_lsn: 13
-          min_key: [null, 'ёёё']
+          min_key: [123, 'ёёё']
       - HEADER:
           type: PAGEINFO
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 110
-          unpacked_size: 91
-          row_count: 4
-          min_key: [null, 'ёёё']
+          size: 90
+          unpacked_size: 71
+          row_count: 3
+          min_key: [123, 'ёёё']
   - - 00000000000000000010.run
     - - HEADER:
           lsn: 11
-          type: DELETE
-        BODY:
-          key: [null, 'ёёё']
-      - HEADER:
-          lsn: 11
           type: REPLACE
         BODY:
           tuple: [123, 'ёёё']
       - HEADER:
           lsn: 12
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [456, 'ЮЮЮ']
       - HEADER:
           lsn: 13
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [789, 'ююю']
       - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 \0\0\02"
+          row_index: "\0\0\0\0\0\0\0\x10\0\0\0\""
 ...
 test_run:cmd("clear filter")
 ---
diff --git a/test/vinyl/quota.result b/test/vinyl/quota.result
index e323bc4e..48042185 100644
--- a/test/vinyl/quota.result
+++ b/test/vinyl/quota.result
@@ -89,7 +89,7 @@ _ = space:replace{1, 1, string.rep('a', 1024 * 1024 * 5)}
 ...
 box.stat.vinyl().quota.used
 ---
-- 5341228
+- 5341267
 ...
 space:drop()
 ---
diff --git a/test/vinyl/tx_gap_lock.result b/test/vinyl/tx_gap_lock.result
index 150826cb..a456c017 100644
--- a/test/vinyl/tx_gap_lock.result
+++ b/test/vinyl/tx_gap_lock.result
@@ -1194,8 +1194,8 @@ s:drop()
 ---
 ...
 ----------------------------------------------------------------
--- gh-2534: Iterator over a secondary index doesn't double track
--- results in the primary index.
+-- Iterator over a secondary index tracks all results in the
+-- primary index. Needed for gh-2129.
 ----------------------------------------------------------------
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
@@ -1219,23 +1219,23 @@ gap_lock_count() -- 0
 _ = s.index.sk:select({}, {limit = 50})
 ---
 ...
-gap_lock_count() -- 1
+gap_lock_count() -- 51
 ---
-- 1
+- 51
 ...
 for i = 1, 100 do s.index.sk:get(i) end
 ---
 ...
-gap_lock_count() -- 51
+gap_lock_count() -- 151
 ---
-- 51
+- 151
 ...
 _ = s.index.sk:select()
 ---
 ...
-gap_lock_count() -- 1
+gap_lock_count() -- 101
 ---
-- 1
+- 101
 ...
 box.commit()
 ---
diff --git a/test/vinyl/tx_gap_lock.test.lua b/test/vinyl/tx_gap_lock.test.lua
index 4d8d21d8..4ad55860 100644
--- a/test/vinyl/tx_gap_lock.test.lua
+++ b/test/vinyl/tx_gap_lock.test.lua
@@ -380,8 +380,8 @@ c4:commit()
 
 s:drop()
 ----------------------------------------------------------------
--- gh-2534: Iterator over a secondary index doesn't double track
--- results in the primary index.
+-- Iterator over a secondary index tracks all results in the
+-- primary index. Needed for gh-2129.
 ----------------------------------------------------------------
 s = box.schema.space.create('test', {engine = 'vinyl'})
 _ = s:create_index('pk', {parts = {1, 'unsigned'}})
@@ -390,11 +390,11 @@ for i = 1, 100 do s:insert{i, i} end
 box.begin()
 gap_lock_count() -- 0
 _ = s.index.sk:select({}, {limit = 50})
-gap_lock_count() -- 1
-for i = 1, 100 do s.index.sk:get(i) end
 gap_lock_count() -- 51
+for i = 1, 100 do s.index.sk:get(i) end
+gap_lock_count() -- 151
 _ = s.index.sk:select()
-gap_lock_count() -- 1
+gap_lock_count() -- 101
 box.commit()
 gap_lock_count() -- 0
 s:drop()
diff --git a/test/vinyl/write_iterator.result b/test/vinyl/write_iterator.result
index c38de5d3..cf1e426c 100644
--- a/test/vinyl/write_iterator.result
+++ b/test/vinyl/write_iterator.result
@@ -741,6 +741,11 @@ space:drop()
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
 ...
+-- Install on_replace trigger to disable DELETE optimization
+-- in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
+---
+...
 pk = s:create_index('primary', {run_count_per_level = 1})
 ---
 ...
diff --git a/test/vinyl/write_iterator.test.lua b/test/vinyl/write_iterator.test.lua
index 73c90c42..a1de240f 100644
--- a/test/vinyl/write_iterator.test.lua
+++ b/test/vinyl/write_iterator.test.lua
@@ -317,6 +317,9 @@ space:drop()
 -- gh-2875 INSERT+DELETE pairs are annihilated on compaction
 
 s = box.schema.space.create('test', {engine = 'vinyl'})
+-- Install on_replace trigger to disable DELETE optimization
+-- in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
 pk = s:create_index('primary', {run_count_per_level = 1})
 sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned'}})
 PAD1 = 100
-- 
2.11.0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 1/7] vinyl: do not store meta in secondary index runs
  2018-08-21 11:15 ` [PATCH v2 1/7] vinyl: do not store meta in secondary index runs Vladimir Davydov
@ 2018-08-21 15:08   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:08 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> Currenlty, tuple meta is only needed for storing statement flags in run
> files. In the scope of #2129 two statement flags will be introduced,
> VY_STMT_SKIP_READ and VY_STMT_DEFERRED_DELETE. None of them makes any
> sense for secondary indexes. If we encode meta for secondary index
> statements, we will have to either clear the flags on the upper level
> (e.g. in the write iterator) or filter them out before encoding a
> statement. Alternatively, we can skip encoding meta for secondary index
> statements altogether, and this is what this patch does, because it's
> the simplest and clearest method for now. If tuple meta is ever used for
> storing anything else besides statement flags or a new statement flag
> appears that may be used with secondary index statements, we will
> recover the code and mask out those flags for secondary indexes.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples
  2018-08-21 11:15 ` [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
@ 2018-08-21 15:14   ` Konstantin Osipov
  2018-08-21 15:37     ` Vladimir Davydov
  0 siblings, 1 reply; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:14 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
>  	/*
> +	 * Deferred DELETE statements can only be produced by
> +	 * primary index compaction.
> +	 */
> +	assert(is_primary || handler == NULL);
> +	/*

With this assert, do you really need this check:

> +static int
> +vy_write_iterator_deferred_delete(struct vy_write_iterator *stream,
> +				  struct tuple *stmt)
> +{
> +	if (!stream->is_primary)
> +		return 0;

?

Otherwise lgtm.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples
  2018-08-21 15:14   ` Konstantin Osipov
@ 2018-08-21 15:37     ` Vladimir Davydov
  0 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-21 15:37 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 21, 2018 at 06:14:57PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> >  	/*
> > +	 * Deferred DELETE statements can only be produced by
> > +	 * primary index compaction.
> > +	 */
> > +	assert(is_primary || handler == NULL);
> > +	/*
> 
> With this assert, do you really need this check:
> 
> > +static int
> > +vy_write_iterator_deferred_delete(struct vy_write_iterator *stream,
> > +				  struct tuple *stmt)
> > +{
> > +	if (!stream->is_primary)
> > +		return 0;
> 
> ?

Yes. When we dump a primary index, we set 'handler' to NULL, just like
for a secondary index, but in contrast to a secondary index, we have to
include the last statement marked as VY_STMT_DEFERRED_DELETE into the
iterator output even if it is not referenced by any read view, because
we will need it to generate a deferred DELETE statement on compaction.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs
  2018-08-21 11:15 ` [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs Vladimir Davydov
@ 2018-08-21 15:38   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:38 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> In the scope of #2129, we won't delete the overwritten tuple from
> secondary indexes immediately on REPLACE. Instead we will defer
> generation of the DELETE statement until the primary index compaction.
> However, it may happen that the overwritten tuple and the tuple that
> overwrote it have the same secondary key parts, in which case the
> deferred DELETE is not needed and should be discarded on secondary
> index compaction. This patch makes the write iterator heap comparator
> function discard such useless deferred DELETEs.
> 
> Note, this patch also removes the code that prioritises terminal
> statements over UPSERTs in the write iterator, which, according to the
> comment, may happen only during forced recovery. I don't see why we
> should do that, even during forced recovery, neither have I managed to
> find the reason in the commit history, so I dropped this code in order
> not to overburden the write iterator logic with some esoteric cases.

It was hand waving on my side when we added this code, it's ok you
removed it.


OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 4/7] vinyl: allow to skip certain statements on read
  2018-08-21 11:15 ` [PATCH v2 4/7] vinyl: allow to skip certain statements on read Vladimir Davydov
@ 2018-08-21 15:39   ` Konstantin Osipov
  0 siblings, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> In the scope of #2129 we will defer insertion of certain DELETE
> statements into secondary indexes until primary index compaction.
> However, by the time we invoke compaction, new statements might
> have been inserted into the space for the same set of keys.
> If that happens, insertion of a deferred DELETE will break the
> invariant which the read iterator relies upon: that for any key
> older sources store older statements. To avoid that, let's add
> a new per statement flag, VY_STMT_SKIP_READ, and make the read
> iterator ignore statements marked with it.

It's ok to push. Perhaps we could mark such statements with
max-lsn - 1, as discussed, but we can refactor the code to do it
later.

 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space
  2018-08-21 11:15 ` [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space Vladimir Davydov
@ 2018-08-21 15:42   ` Konstantin Osipov
  2018-08-22 17:04     ` Vladimir Davydov
  0 siblings, 1 reply; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:42 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> +	/*
> +	 * _vinyl_deferred_delete - blackhole that is needed
> +	 * for writing deferred DELETE statements generated by
> +	 * vinyl compaction tasks to WAL.
> +	 */

Please expand this comment to explain why we can't create this
space purely from Lua or upgrade script: 

There is an intricate ordering dependency between creating of this
system space, recovery, and engine initialization, when we set an
on_replace trigger on the space. To resolve this dependency, we
create a space stub at schema_init(), then set a trigger in
engine_init(), which is called next, and then "alter" the space to
its final form in recovery(), which is called next.

> +	{
> +		const char *engine = "blackhole";
> +		const char *name = "_vinyl_deferred_delete";
> +		struct space_opts opts = space_opts_default;
> +		opts.group_id = GROUP_LOCAL;
> +		struct space_def *def;
> +		def = space_def_new_xc(BOX_VINYL_DEFERRED_DELETE_ID, ADMIN, 0,
> +				       name, strlen(name), engine,
> +				       strlen(engine), &opts, NULL, 0);
> +		auto def_guard = make_scoped_guard([=] {
> +			space_def_delete(def);
> +		});
> +		RLIST_HEAD(key_list);
> +		struct space *space = space_new_xc(def, &key_list);
> +		space_cache_replace(space);
> +		init_system_space(space);
> +		trigger_run_xc(&on_alter_space, space);
> +	}
>  }
>  

Otherwise OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn
  2018-08-21 11:15 ` [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn Vladimir Davydov
@ 2018-08-21 15:44   ` Konstantin Osipov
  2018-08-22 13:00   ` Vladimir Davydov
  1 sibling, 0 replies; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 15:44 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> We never use vy_mem::min_lsn so let's zap it. As for max_lsn, we only
> need it to update vy_lsm::dump_lsn (max LSN stored on disk). Let's
> rename it appropriately. There's another reason to do that. Once we
> start storing deferred DELETE statements in memory (see #2129), it won't
> be the max statement LSN stored in vy_mem anymore, because we will
> account WAL LSN of deferred DELETE statements there too. Renaming it to
> dump_lsn will help avoid confusion.

Such a simple change making things so much easier to track.

Thanks.

OK to push.

 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE
  2018-08-21 11:15 ` [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
@ 2018-08-21 16:13   ` Konstantin Osipov
  2018-08-22 17:08     ` Vladimir Davydov
  0 siblings, 1 reply; 20+ messages in thread
From: Konstantin Osipov @ 2018-08-21 16:13 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
>  		/*
> -		 * All indexes of a space must be consistent, i.e.
> -		 * if a tuple is present in one index, it must be
> -		 * present in all other indexes as well, so we can
> -		 * get here only if there's a bug somewhere in vinyl.
> -		 * Don't abort as core dump won't really help us in
> -		 * this case. Just warn the user and proceed to the
> -		 * next tuple.
> +		 * Invalidate the cache entry so that we won't read
> +		 * the overwritten tuple again from the cache.
>  		 */
> -		say_warn("%s: key %s missing in primary index",
> -			 vy_lsm_name(lsm), vy_stmt_str(tuple));
> +		vy_cache_on_write(&lsm->cache, tuple, NULL);

Please add a comment why you invalidate the cache.

> +	/*
> +	 * Extract space id, LSN of the deferred DELETE statement,
> +	 * and the deleted tuple from the system space row.
> +	 */
> +	uint32_t space_id;
> +	if (tuple_field_u32(stmt->new_tuple, 0, &space_id) != 0)
> +		diag_raise();
> +	int64_t lsn;
> +	if (tuple_field_i64(stmt->new_tuple, 1, &lsn) != 0)
> +		diag_raise();
> +	const char *delete_data = tuple_field(stmt->new_tuple, 2);
> +	if (delete_data == NULL) {
> +		diag_set(ClientError, ER_NO_SUCH_FIELD, 2);
> +		diag_raise();

Please use tuple iterator instead.

> +		diag_raise();
> +	if (space->index_count <= 1)
> +		return;

Please add a comment when this can be the case - e.g. the space
was altered after we created a deferred delete. I can't imagine
any other case.

> +	struct tuple *delete = vy_stmt_new_surrogate_delete_raw(pk->mem_format,
> +						delete_data, delete_data_end);
> +	if (delete == NULL)
> +		diag_raise();
> +	vy_stmt_set_lsn(delete, lsn);

Please say a few words why you reset the statement lsn and how this works
downstream (when processed by the write iterator).

> +	/* Insert the deferred DELETE into secondary indexes. */
> +	int rc = 0;
> +	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
> +	const struct tuple *region_stmt = NULL;
> +	for (uint32_t i = 1; i < space->index_count; i++) {
> +		struct vy_lsm *lsm = vy_lsm(space->index[i]);
> +		if (vy_is_committed_one(env, lsm))
> +			continue;
> +		/*
> +		 * As usual, rotate the active in-memory index if
> +		 * schema was changed or dump was triggered. Do it
> +		 * only if processing the first statement, because
> +		 * dump may be triggered by one of the statements
> +		 * of this transaction (see vy_quota_force_use()
> +		 * below), in which case we must not do rotation
> +		 * as we want all statements to land in the same
> +		 * in-memory index. This is safe, as long as we
> +		 * don't yield between statements.
> +		 */
> +		struct vy_mem *mem = lsm->mem;
> +		if (is_first_statement &&
> +		    (mem->space_cache_version != space_cache_version ||
> +		     mem->generation != *lsm->env->p_generation)) {
> +			rc = vy_lsm_rotate_mem(lsm);
> +			if (rc != 0)
> +				break;
> +			mem = lsm->mem;
> +		}
> +		rc = vy_lsm_set(lsm, mem, delete, &region_stmt);
> +		if (rc != 0)
> +			break;
> +		vy_lsm_commit_stmt(lsm, mem, region_stmt);

Can we share this code with vy_replace()?

> +			break;
> +		}
> +		vy_mem_pin(mem);
> +		trigger_create(on_commit, vy_deferred_delete_on_commit, mem, NULL);
> +		txn_on_commit(txn, on_commit);

What about on_rollback? If you missed it, please add a test case
:)
> +/**
> + * Try to generate a deferred DELETE statement on tx commit.
> + *
> + * This function is supposed to be called for a primary index
> + * statement which was executed without deletion of the overwritten
> + * tuple from secondary indexes. It looks up the overwritten tuple
> + * in memory and, if found, produces the deferred DELETEs and
> + * inserts them into the transaction log.
> + *

Please mention it's not only an optimization, explain 
why we logically need it (we can get out of memory error when
trying to insert during dump).


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn
  2018-08-21 11:15 ` [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn Vladimir Davydov
  2018-08-21 15:44   ` Konstantin Osipov
@ 2018-08-22 13:00   ` Vladimir Davydov
  1 sibling, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-22 13:00 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space
  2018-08-21 15:42   ` Konstantin Osipov
@ 2018-08-22 17:04     ` Vladimir Davydov
  0 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-22 17:04 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 21, 2018 at 06:42:52PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> > +	/*
> > +	 * _vinyl_deferred_delete - blackhole that is needed
> > +	 * for writing deferred DELETE statements generated by
> > +	 * vinyl compaction tasks to WAL.
> > +	 */
> 
> Please expand this comment to explain why we can't create this
> space purely from Lua or upgrade script: 
> 
> There is an intricate ordering dependency between creating of this
> system space, recovery, and engine initialization, when we set an
> on_replace trigger on the space. To resolve this dependency, we
> create a space stub at schema_init(), then set a trigger in
> engine_init(), which is called next, and then "alter" the space to
> its final form in recovery(), which is called next.

Done:

diff --git a/src/box/schema.cc b/src/box/schema.cc
index 32669c69..dd5896c5 100644
--- a/src/box/schema.cc
+++ b/src/box/schema.cc
@@ -357,6 +357,15 @@ schema_init()
 	 * _vinyl_deferred_delete - blackhole that is needed
 	 * for writing deferred DELETE statements generated by
 	 * vinyl compaction tasks to WAL.
+	 *
+	 * There is an intricate ordering dependency between
+	 * recovery of this system space and initialization of
+	 * the vinyl engine, when we set an on_replace trigger
+	 * on the space. To resolve this dependency, we create
+	 * a space stub in schema_init(), then set a trigger in
+	 * engine_begin_initial_recovery(), which is called next,
+	 * then recover WAL rows, executing the trigger for each
+	 * of them.
 	 */
 	{
 		const char *engine = "blackhole";

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE
  2018-08-21 16:13   ` Konstantin Osipov
@ 2018-08-22 17:08     ` Vladimir Davydov
  0 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-22 17:08 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 21, 2018 at 07:13:50PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/08/21 15:19]:
> >  		/*
> > -		 * All indexes of a space must be consistent, i.e.
> > -		 * if a tuple is present in one index, it must be
> > -		 * present in all other indexes as well, so we can
> > -		 * get here only if there's a bug somewhere in vinyl.
> > -		 * Don't abort as core dump won't really help us in
> > -		 * this case. Just warn the user and proceed to the
> > -		 * next tuple.
> > +		 * Invalidate the cache entry so that we won't read
> > +		 * the overwritten tuple again from the cache.
> >  		 */
> > -		say_warn("%s: key %s missing in primary index",
> > -			 vy_lsm_name(lsm), vy_stmt_str(tuple));
> > +		vy_cache_on_write(&lsm->cache, tuple, NULL);
> 
> Please add a comment why you invalidate the cache.

Done.

> > +	/*
> > +	 * Extract space id, LSN of the deferred DELETE statement,
> > +	 * and the deleted tuple from the system space row.
> > +	 */
> > +	uint32_t space_id;
> > +	if (tuple_field_u32(stmt->new_tuple, 0, &space_id) != 0)
> > +		diag_raise();
> > +	int64_t lsn;
> > +	if (tuple_field_i64(stmt->new_tuple, 1, &lsn) != 0)
> > +		diag_raise();
> > +	const char *delete_data = tuple_field(stmt->new_tuple, 2);
> > +	if (delete_data == NULL) {
> > +		diag_set(ClientError, ER_NO_SUCH_FIELD, 2);
> > +		diag_raise();
> 
> Please use tuple iterator instead.

Done. Note, I have to add a couple of tuple_next helpers to do that -
see the branch.

> > +		diag_raise();
> > +	if (space->index_count <= 1)
> > +		return;
> 
> Please add a comment when this can be the case - e.g. the space
> was altered after we created a deferred delete. I can't imagine
> any other case.

Done.

> > +	struct tuple *delete = vy_stmt_new_surrogate_delete_raw(pk->mem_format,
> > +						delete_data, delete_data_end);
> > +	if (delete == NULL)
> > +		diag_raise();
> > +	vy_stmt_set_lsn(delete, lsn);
> 
> Please say a few words why you reset the statement lsn and how this works
> downstream (when processed by the write iterator).

Done.

> > +	/* Insert the deferred DELETE into secondary indexes. */
> > +	int rc = 0;
> > +	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
> > +	const struct tuple *region_stmt = NULL;
> > +	for (uint32_t i = 1; i < space->index_count; i++) {
> > +		struct vy_lsm *lsm = vy_lsm(space->index[i]);
> > +		if (vy_is_committed_one(env, lsm))
> > +			continue;
> > +		/*
> > +		 * As usual, rotate the active in-memory index if
> > +		 * schema was changed or dump was triggered. Do it
> > +		 * only if processing the first statement, because
> > +		 * dump may be triggered by one of the statements
> > +		 * of this transaction (see vy_quota_force_use()
> > +		 * below), in which case we must not do rotation
> > +		 * as we want all statements to land in the same
> > +		 * in-memory index. This is safe, as long as we
> > +		 * don't yield between statements.
> > +		 */
> > +		struct vy_mem *mem = lsm->mem;
> > +		if (is_first_statement &&
> > +		    (mem->space_cache_version != space_cache_version ||
> > +		     mem->generation != *lsm->env->p_generation)) {
> > +			rc = vy_lsm_rotate_mem(lsm);
> > +			if (rc != 0)
> > +				break;
> > +			mem = lsm->mem;
> > +		}
> > +		rc = vy_lsm_set(lsm, mem, delete, &region_stmt);
> > +		if (rc != 0)
> > +			break;
> > +		vy_lsm_commit_stmt(lsm, mem, region_stmt);
> 
> Can we share this code with vy_replace()?

I don't think so, vy_replace() inserts tuples into tx write set.
I assume, you meant vy_tx_write(). Well, that would be difficult,
because the code is scattered there between prepare and commit phases.

> 
> > +			break;
> > +		}
> > +		vy_mem_pin(mem);
> > +		trigger_create(on_commit, vy_deferred_delete_on_commit, mem, NULL);
> > +		txn_on_commit(txn, on_commit);
> 
> What about on_rollback? If you missed it, please add a test case
> :)

Fixed. Added a test too :)

> > +/**
> > + * Try to generate a deferred DELETE statement on tx commit.
> > + *
> > + * This function is supposed to be called for a primary index
> > + * statement which was executed without deletion of the overwritten
> > + * tuple from secondary indexes. It looks up the overwritten tuple
> > + * in memory and, if found, produces the deferred DELETEs and
> > + * inserts them into the transaction log.
> > + *
> 
> Please mention it's not only an optimization, explain 
> why we logically need it (we can get out of memory error when
> trying to insert during dump).

Done.

Branch updated. Incremental diff is below.

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 1a45fac9..fb121402 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1292,8 +1292,10 @@ vy_get_by_secondary_tuple(struct vy_lsm *lsm, struct vy_tx *tx,
 			*result = NULL;
 		}
 		/*
-		 * Invalidate the cache entry so that we won't read
-		 * the overwritten tuple again from the cache.
+		 * We must purge stale tuples from the cache before
+		 * storing the resulting interval in order to avoid
+		 * chain intersections, which are not tolerated by
+		 * the tuple cache implementation.
 		 */
 		vy_cache_on_write(&lsm->cache, tuple, NULL);
 		return 0;
@@ -4301,10 +4303,6 @@ vinyl_space_build_index(struct space *src_space, struct index *new_index,
 
 /* {{{ Deferred DELETE handling */
 
-/**
- * Callback invoked after a deferred DELETE statement has been
- * committed to _vinyl_deferred_delete system space.
- */
 static void
 vy_deferred_delete_on_commit(struct trigger *trigger, void *event)
 {
@@ -4316,6 +4314,16 @@ vy_deferred_delete_on_commit(struct trigger *trigger, void *event)
 	 */
 	assert(mem->dump_lsn <= txn->signature);
 	mem->dump_lsn = txn->signature;
+	/* Unpin the mem pinned in vy_deferred_delete_on_replace(). */
+	vy_mem_unpin(mem);
+}
+
+static void
+vy_deferred_delete_on_rollback(struct trigger *trigger, void *event)
+{
+	(void)event;
+	struct vy_mem *mem = trigger->data;
+	/* Unpin the mem pinned in vy_deferred_delete_on_replace(). */
 	vy_mem_unpin(mem);
 }
 
@@ -4358,17 +4366,17 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	 * Extract space id, LSN of the deferred DELETE statement,
 	 * and the deleted tuple from the system space row.
 	 */
+	struct tuple_iterator it;
+	tuple_rewind(&it, stmt->new_tuple);
 	uint32_t space_id;
-	if (tuple_field_u32(stmt->new_tuple, 0, &space_id) != 0)
+	if (tuple_next_u32(&it, &space_id) != 0)
 		diag_raise();
-	int64_t lsn;
-	if (tuple_field_i64(stmt->new_tuple, 1, &lsn) != 0)
+	uint64_t lsn;
+	if (tuple_next_u64(&it, &lsn) != 0)
 		diag_raise();
-	const char *delete_data = tuple_field(stmt->new_tuple, 2);
-	if (delete_data == NULL) {
-		diag_set(ClientError, ER_NO_SUCH_FIELD, 2);
+	const char *delete_data = tuple_next_with_type(&it, MP_ARRAY);
+	if (delete_data == NULL)
 		diag_raise();
-	}
 	const char *delete_data_end = delete_data;
 	mp_next(&delete_data_end);
 
@@ -4376,6 +4384,11 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 	struct space *space = space_cache_find(space_id);
 	if (space == NULL)
 		diag_raise();
+	/*
+	 * All secondary indexes could have been dropped, in
+	 * which case we don't need to generate deferred DELETE
+	 * statements anymore.
+	 */
 	if (space->index_count <= 1)
 		return;
 	/*
@@ -4392,14 +4405,18 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 						delete_data, delete_data_end);
 	if (delete == NULL)
 		diag_raise();
-	vy_stmt_set_lsn(delete, lsn);
 	/*
 	 * A deferred DELETE may be generated after new statements
-	 * were committed for the deleted key while the read iterator
-	 * assumes that newer sources always store newer statements.
-	 * Mark deferred DELETEs with the VY_STMT_SKIP_READ flag so
-	 * as not to break the read iterator assumptions.
+	 * were committed for the deleted key. So we must use the
+	 * original LSN (not the one of the WAL row) when inserting
+	 * a deferred DELETE into an index to make sure that it will
+	 * purge the appropriate tuple on compaction. However, this
+	 * also breaks the read iterator invariant that states that
+	 * newer sources contain newer statements for the same key.
+	 * So we mark deferred DELETEs with the VY_STMT_SKIP_READ
+	 * flag, which makes the read iterator ignore them.
 	 */
+	vy_stmt_set_lsn(delete, lsn);
 	vy_stmt_set_flags(delete, VY_STMT_SKIP_READ);
 
 	/* Insert the deferred DELETE into secondary indexes. */
@@ -4451,9 +4468,19 @@ vy_deferred_delete_on_replace(struct trigger *trigger, void *event)
 			rc = -1;
 			break;
 		}
+		struct trigger *on_rollback = region_alloc(&fiber()->gc,
+							   sizeof(*on_commit));
+		if (on_rollback == NULL) {
+			diag_set(OutOfMemory, sizeof(*on_commit),
+				 "region", "struct trigger");
+			rc = -1;
+			break;
+		}
 		vy_mem_pin(mem);
 		trigger_create(on_commit, vy_deferred_delete_on_commit, mem, NULL);
+		trigger_create(on_rollback, vy_deferred_delete_on_rollback, mem, NULL);
 		txn_on_commit(txn, on_commit);
+		txn_on_rollback(txn, on_rollback);
 	}
 	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
 	assert(mem_used_after >= mem_used_before);
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index 27cd9bb7..590b4483 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -479,6 +479,12 @@ vy_tx_write(struct vy_lsm *lsm, struct vy_mem *mem,
  * in memory and, if found, produces the deferred DELETEs and
  * inserts them into the transaction log.
  *
+ * Generating DELETEs before committing a transaction rather than
+ * postponing it to dump isn't just an optimization. The point is
+ * that we can't generate deferred DELETEs during dump, because
+ * if we run out of memory, we won't be able to schedule another
+ * dump to free some.
+ *
  * Affects @tx->log, @v->stmt.
  *
  * Returns 0 on success, -1 on memory allocation error.
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index 28271fc9..cdffa198 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1844,3 +1844,81 @@ s.index.sk:stat().memory.rows
 s:drop()
 ---
 ...
+--
+-- Check that tarantool doesn't hang or crash if error
+-- occurs while writing a deferred DELETE to WAL.
+--
+fiber = require('fiber')
+---
+...
+errinj = box.error.injection
+---
+...
+s = box.schema.space.create('test', {engine = 'vinyl'})
+---
+...
+_ = s:create_index('pk', {run_count_per_level = 10})
+---
+...
+_ = s:create_index('sk', {unique = false, parts = {2, 'unsigned'}})
+---
+...
+s:replace{1, 10}
+---
+- [1, 10]
+...
+box.snapshot()
+---
+- ok
+...
+s:replace{1, 20}
+---
+- [1, 20]
+...
+box.snapshot()
+---
+- ok
+...
+errinj.set("ERRINJ_VY_SCHED_TIMEOUT", 0.001)
+---
+- ok
+...
+errinj.set("ERRINJ_WAL_IO", true)
+---
+- ok
+...
+errors = box.stat.ERROR.total
+---
+...
+s.index.pk:compact()
+---
+...
+while box.stat.ERROR.total - errors == 0 do fiber.sleep(0.001) end
+---
+...
+s.index.pk:stat().disk.compact.count -- 0
+---
+- 0
+...
+errinj.set("ERRINJ_WAL_IO", false)
+---
+- ok
+...
+while s.index.pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+---
+...
+s.index.pk:stat().disk.compact.count -- 1
+---
+- 1
+...
+errinj.set("ERRINJ_VY_SCHED_TIMEOUT", 0)
+---
+- ok
+...
+box.snapshot() -- ok
+---
+- ok
+...
+s:drop()
+---
+...
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 000067d3..c2332a69 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -736,3 +736,32 @@ s.index.sk:select()
 s.index.sk:stat().memory.rows
 
 s:drop()
+
+--
+-- Check that tarantool doesn't hang or crash if error
+-- occurs while writing a deferred DELETE to WAL.
+--
+fiber = require('fiber')
+errinj = box.error.injection
+
+s = box.schema.space.create('test', {engine = 'vinyl'})
+_ = s:create_index('pk', {run_count_per_level = 10})
+_ = s:create_index('sk', {unique = false, parts = {2, 'unsigned'}})
+s:replace{1, 10}
+box.snapshot()
+s:replace{1, 20}
+box.snapshot()
+
+errinj.set("ERRINJ_VY_SCHED_TIMEOUT", 0.001)
+errinj.set("ERRINJ_WAL_IO", true)
+errors = box.stat.ERROR.total
+s.index.pk:compact()
+while box.stat.ERROR.total - errors == 0 do fiber.sleep(0.001) end
+s.index.pk:stat().disk.compact.count -- 0
+errinj.set("ERRINJ_WAL_IO", false)
+while s.index.pk:stat().disk.compact.count == 0 do fiber.sleep(0.001) end
+s.index.pk:stat().disk.compact.count -- 1
+errinj.set("ERRINJ_VY_SCHED_TIMEOUT", 0)
+
+box.snapshot() -- ok
+s:drop()

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE
  2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
                   ` (6 preceding siblings ...)
  2018-08-21 11:15 ` [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
@ 2018-08-22 17:50 ` Vladimir Davydov
  7 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2018-08-22 17:50 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2018-08-22 17:50 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-21 11:15 [PATCH v2 0/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
2018-08-21 11:15 ` [PATCH v2 1/7] vinyl: do not store meta in secondary index runs Vladimir Davydov
2018-08-21 15:08   ` Konstantin Osipov
2018-08-21 11:15 ` [PATCH v2 2/7] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
2018-08-21 15:14   ` Konstantin Osipov
2018-08-21 15:37     ` Vladimir Davydov
2018-08-21 11:15 ` [PATCH v2 3/7] vinyl: prepare write iterator heap comparator for deferred DELETEs Vladimir Davydov
2018-08-21 15:38   ` Konstantin Osipov
2018-08-21 11:15 ` [PATCH v2 4/7] vinyl: allow to skip certain statements on read Vladimir Davydov
2018-08-21 15:39   ` Konstantin Osipov
2018-08-21 11:15 ` [PATCH v2 5/7] Introduce _vinyl_deferred_delete system space Vladimir Davydov
2018-08-21 15:42   ` Konstantin Osipov
2018-08-22 17:04     ` Vladimir Davydov
2018-08-21 11:15 ` [PATCH v2 6/7] vinyl: zap vy_mem::min_lsn and rename max_lsn to dump_lsn Vladimir Davydov
2018-08-21 15:44   ` Konstantin Osipov
2018-08-22 13:00   ` Vladimir Davydov
2018-08-21 11:15 ` [PATCH v2 7/7] vinyl: eliminate disk read on REPLACE/DELETE Vladimir Davydov
2018-08-21 16:13   ` Konstantin Osipov
2018-08-22 17:08     ` Vladimir Davydov
2018-08-22 17:50 ` [PATCH v2 0/7] " Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox