Tarantool development patches archive
* [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE
  2018-07-08 16:48 [RFC PATCH 02/23] vinyl: always get full tuple from pk after reading from secondary index Vladimir Davydov
@ 2018-07-08 16:48 ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request Vladimir Davydov
                     ` (21 more replies)
  0 siblings, 22 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch set optimizes REPLACE and DELETE operations in vinyl in the
presence of secondary indexes: they no longer need to read the primary
key in order to delete the overwritten/deleted tuple from secondary
indexes. Instead, this job is handed over to the primary index compaction
task, while the read iterator filters out overwritten tuples that haven't
been purged yet.

The patch set still has a few serious problems (deferred DELETEs
generated by the compaction task may be lost on restart; generation of
deferred DELETEs may cause OOM in the tx thread) and needs more work,
but it passes all functional tests and should be suitable for testing
and the first round of review. Most patches will stay the same anyway.

https://github.com/tarantool/tarantool/issues/2129
https://github.com/tarantool/tarantool/commits/dv/gh-2129-vy-eliminate-read-on-replace-delete

Vladimir Davydov (23):
  vinyl: do not turn REPLACE into INSERT when processing DML request
  vinyl: always get full tuple from pk after reading from secondary
    index
  vinyl: use vy_mem_iterator for point lookup
  vinyl: make point lookup always return the latest tuple version
  vinyl: fold vy_replace_one and vy_replace_impl
  vinyl: fold vy_delete_impl
  vinyl: refactor unique check
  vinyl: check key uniqueness before modifying tx write set
  vinyl: remove env argument of vy_check_is_unique_{primary,secondary}
  vinyl: store full tuples in secondary index cache
  xrow: allow to store flags in DML requests
  vinyl: do not pass region explicitly to write iterator functions
  vinyl: fix potential use-after-free in vy_read_view_merge
  test: unit/vy_write_iterator: minor refactoring
  vinyl: teach write iterator to return overwritten tuples
  vinyl: allow to skip certain statements on read
  vinyl: do not free pending tasks on shutdown
  vinyl: store pointer to scheduler in struct vy_task
  vinyl: rename some members of vy_scheduler and vy_task struct
  vinyl: use cbus for communication between scheduler and worker threads
  vinyl: zap vy_scheduler::is_worker_pool_running
  vinyl: rename vy_task::status to is_failed
  vinyl: eliminate read on REPLACE/DELETE

 src/box/iproto_constants.c         |   4 +-
 src/box/iproto_constants.h         |   3 +-
 src/box/vinyl.c                    | 792 +++++++++++++++++++------------------
 src/box/vy_mem.c                   |  19 +-
 src/box/vy_point_lookup.c          |  87 ++--
 src/box/vy_point_lookup.h          |   9 +-
 src/box/vy_read_iterator.c         |  61 ++-
 src/box/vy_read_iterator.h         |  24 ++
 src/box/vy_run.c                   |   7 +-
 src/box/vy_scheduler.c             | 563 ++++++++++++++++++--------
 src/box/vy_scheduler.h             |  41 +-
 src/box/vy_stmt.c                  |   4 +
 src/box/vy_stmt.h                  |  44 +++
 src/box/vy_tx.c                    |  26 ++
 src/box/vy_write_iterator.c        | 173 ++++++--
 src/box/vy_write_iterator.h        |  27 +-
 src/box/xrow.c                     |   8 +
 src/box/xrow.h                     |   2 +
 test/unit/vy_iterators_helper.c    |   5 +
 test/unit/vy_iterators_helper.h    |  12 +-
 test/unit/vy_point_lookup.c        |   4 +-
 test/unit/vy_write_iterator.c      | 319 ++++++++++++---
 test/unit/vy_write_iterator.result |  23 +-
 test/vinyl/info.result             |   5 +
 test/vinyl/info.test.lua           |   3 +
 test/vinyl/layout.result           | 166 +++++---
 test/vinyl/tx_gap_lock.result      |  16 +-
 test/vinyl/tx_gap_lock.test.lua    |  10 +-
 test/vinyl/write_iterator.result   |  11 +-
 test/vinyl/write_iterator.test.lua |   5 +-
 30 files changed, 1615 insertions(+), 858 deletions(-)

-- 
2.11.0


* [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-10 12:15     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup Vladimir Davydov
                     ` (20 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the presence of secondary indexes, we read the primary index when
processing a REPLACE request anyway, so we turn it into INSERT if no
tuple matching the new tuple is found; this way INSERT+DELETE gets
annihilated on compaction.

However, in the scope of #2129 we are planning to optimize the read out
so that this transformation won't be possible anymore. So let's remove
it now.
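
To illustrate why the INSERT type matters for compaction, here is a
minimal sketch of the rule in a toy model. The enum and the function name
are hypothetical and are not part of vinyl; the sketch only assumes that
an INSERT guarantees there is no older version of the key in deeper
levels, while a REPLACE gives no such guarantee.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical statement types for a toy model, not vinyl's IPROTO_* codes. */
enum toy_stmt_type { TOY_INSERT, TOY_REPLACE, TOY_DELETE };

/*
 * Decide whether the newest statement for a key must be written to the
 * compaction output, given the oldest statement for the same key among
 * the merged runs.
 */
static bool
toy_must_keep_newest(enum toy_stmt_type oldest, enum toy_stmt_type newest)
{
	/* Anything but a DELETE carries data and is always kept. */
	if (newest != TOY_DELETE)
		return true;
	/*
	 * A DELETE is kept only to hide older versions of the key that may
	 * live in deeper levels. If the history starts with an INSERT,
	 * there are none, so INSERT+DELETE can be dropped together.
	 */
	return oldest != TOY_INSERT;
}
```

With this rule, toy_must_keep_newest(TOY_INSERT, TOY_DELETE) is false,
whereas a history starting with TOY_REPLACE keeps its DELETE. That is the
optimization that is lost for REPLACE statements once the conversion is
removed.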

Needed for #2129
---
 src/box/vinyl.c                    |  8 --------
 test/vinyl/layout.result           | 20 ++++++++++----------
 test/vinyl/write_iterator.result   |  6 ------
 test/vinyl/write_iterator.test.lua |  2 --
 4 files changed, 10 insertions(+), 26 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 6dd22884..f9c2843e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1540,14 +1540,6 @@ vy_replace_impl(struct vy_env *env, struct vy_tx *tx, struct space *space,
 			 new_stmt, &old_stmt) != 0)
 		goto error;
 
-	if (old_stmt == NULL) {
-		/*
-		 * We can turn REPLACE into INSERT if the new key
-		 * does not have history.
-		 */
-		vy_stmt_set_type(new_stmt, IPROTO_INSERT);
-	}
-
 	/*
 	 * Replace in the primary index without explicit deletion
 	 * of the old tuple.
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 49826302..1f928a8f 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -253,17 +253,17 @@ result
   - - 00000000000000000008.run
     - - HEADER:
           lsn: 10
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ёёё', null]
       - HEADER:
           lsn: 9
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['эээ', null]
       - HEADER:
           lsn: 8
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ЭЭЭ', null]
       - HEADER:
@@ -297,12 +297,12 @@ result
           tuple: ['ёёё', 123]
       - HEADER:
           lsn: 13
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ююю', 789]
       - HEADER:
           lsn: 12
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: ['ЮЮЮ', 456]
       - HEADER:
@@ -331,17 +331,17 @@ result
   - - 00000000000000000006.run
     - - HEADER:
           lsn: 10
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'ёёё']
       - HEADER:
           lsn: 9
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'эээ']
       - HEADER:
           lsn: 8
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [null, 'ЭЭЭ']
       - HEADER:
@@ -380,12 +380,12 @@ result
           tuple: [123, 'ёёё']
       - HEADER:
           lsn: 12
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [456, 'ЮЮЮ']
       - HEADER:
           lsn: 13
-          type: INSERT
+          type: REPLACE
         BODY:
           tuple: [789, 'ююю']
       - HEADER:
diff --git a/test/vinyl/write_iterator.result b/test/vinyl/write_iterator.result
index c38de5d3..162d8463 100644
--- a/test/vinyl/write_iterator.result
+++ b/test/vinyl/write_iterator.result
@@ -765,9 +765,6 @@ box.snapshot()
 _ = s:insert{1, 1} -- insert
 ---
 ...
-_ = s:replace{2, 2} -- replace, no old tuple
----
-...
 _ = s:upsert({3, 3}, {{'!', 1, 1}}) -- upsert, no old tuple
 ---
 ...
@@ -794,9 +791,6 @@ box.snapshot()
 s:delete{1}
 ---
 ...
-s:delete{2}
----
-...
 s:delete{3}
 ---
 ...
diff --git a/test/vinyl/write_iterator.test.lua b/test/vinyl/write_iterator.test.lua
index 73c90c42..9a6cc480 100644
--- a/test/vinyl/write_iterator.test.lua
+++ b/test/vinyl/write_iterator.test.lua
@@ -326,7 +326,6 @@ for i = 1001, 1000 + PAD1 do s:replace{i, i} end
 box.snapshot()
 -- Generate some INSERT statements and dump them to disk.
 _ = s:insert{1, 1} -- insert
-_ = s:replace{2, 2} -- replace, no old tuple
 _ = s:upsert({3, 3}, {{'!', 1, 1}}) -- upsert, no old tuple
 box.begin() s:insert{4, 4} s:delete(4) box.commit()
 box.begin() s:insert{5, 5} s:replace{5, 5, 5} box.commit()
@@ -336,7 +335,6 @@ _ = s:insert{8, 8}
 box.snapshot()
 -- Delete the inserted tuples and trigger compaction.
 s:delete{1}
-s:delete{2}
 s:delete{3}
 s:delete{4}
 s:delete{5}
-- 
2.11.0


* [RFC PATCH 02/23] vinyl: always get full tuple from pk after reading from secondary index
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
@ 2018-07-08 16:48 Vladimir Davydov
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we don't always need a full tuple. Sometimes (e.g. for
checking uniqueness constraint), a partial tuple read from a secondary
index is enough. So we have vy_lsm_get() which reads a partial tuple
from an index. However, once the optimization described in #2129 is
implemented, it might happen that a tuple read from a secondary index
was overwritten or deleted in the primary index, but the DELETE statement
hasn't been propagated to the secondary index yet, i.e. we will have to
read the primary index anyway, even if we don't need a full tuple.

That said, let us:

 - Make vy_lsm_get() always fetch a full tuple, even for secondary
   indexes, and rename it to vy_get().

 - Rewrite vy_lsm_full_by_key() as a wrapper around vy_get() and rename
   it to vy_get_by_raw_key().

 - Introduce vy_get_by_secondary_tuple() which gets a full tuple given a
   tuple read from a secondary index. For now, it's basically a call to
   vy_point_lookup(), but it'll become a bit more complex once #2129 is
   implemented.

 - Prepare vy_get() for the fact that a tuple read from a secondary
   index may be absent in the primary index, in which case it should
   try the next matching one.
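
The last item can be sketched as follows in a toy model. All names here
(struct demo_tuple, demo_get_by_secondary) are hypothetical and the
indexes are plain arrays; the sketch only shows the "skip a secondary
match that is absent from the primary index and try the next one" loop,
not the actual vy_get() code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy tuple: primary key field and secondary key field. */
struct demo_tuple {
	int pk;
	int sk;
};

/*
 * Walk the tuples matching a secondary key and return the first one
 * that is still present in the primary index, or NULL if none is.
 */
static const struct demo_tuple *
demo_get_by_secondary(const struct demo_tuple *sk_matches, size_t n_matches,
		      const struct demo_tuple *pk_index, size_t n_pk)
{
	for (size_t i = 0; i < n_matches; i++) {
		for (size_t j = 0; j < n_pk; j++) {
			if (pk_index[j].pk == sk_matches[i].pk)
				return &pk_index[j];
		}
		/* Missing in the primary index: try the next match. */
	}
	return NULL;
}
```

For example, with secondary matches {1, 10} and {2, 10} and a primary
index that only contains {2, 10}, the first match is skipped and the
second one is returned as the full tuple.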

Needed for #2129
---
 src/box/vinyl.c | 204 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 118 insertions(+), 86 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index f9c2843e..64004226 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1265,7 +1265,48 @@ vy_is_committed(struct vy_env *env, struct space *space)
 }
 
 /**
- * Get a vinyl tuple from the LSM tree by the key.
+ * Get a full tuple by a tuple read from a secondary index.
+ * @param lsm         LSM tree from which the tuple was read.
+ * @param tx          Current transaction.
+ * @param rv          Read view.
+ * @param tuple       Tuple read from a secondary index.
+ * @param[out] result The found tuple is stored here. Must be
+ *                    unreferenced after usage.
+ *
+ * @retval  0 Success.
+ * @retval -1 Memory error or read error.
+ */
+static int
+vy_get_by_secondary_tuple(struct vy_lsm *lsm, struct vy_tx *tx,
+			  const struct vy_read_view **rv,
+			  struct tuple *tuple, struct tuple **result)
+{
+	assert(lsm->index_id > 0);
+	/*
+	 * No need in vy_tx_track() as the tuple must already be
+	 * tracked in the secondary index LSM tree.
+	 */
+	if (vy_point_lookup(lsm->pk, tx, rv, tuple, result) != 0)
+		return -1;
+
+	if (*result == NULL) {
+		/*
+		 * All indexes of a space must be consistent, i.e.
+		 * if a tuple is present in one index, it must be
+		 * present in all other indexes as well, so we can
+		 * get here only if there's a bug somewhere in vinyl.
+		 * Don't abort as core dump won't really help us in
+		 * this case. Just warn the user and proceed to the
+		 * next tuple.
+		 */
+		say_warn("%s: key %s missing in primary index",
+			 vy_lsm_name(lsm), vy_stmt_str(tuple));
+	}
+	return 0;
+}
+
+/**
+ * Get a tuple from a vinyl space by key.
  * @param lsm         LSM tree in which search.
  * @param tx          Current transaction.
  * @param rv          Read view.
@@ -1276,10 +1317,10 @@ vy_is_committed(struct vy_env *env, struct space *space)
  * @param  0 Success.
  * @param -1 Memory error or read error.
  */
-static inline int
-vy_lsm_get(struct vy_lsm *lsm, struct vy_tx *tx,
-	     const struct vy_read_view **rv,
-	     struct tuple *key, struct tuple **result)
+static int
+vy_get(struct vy_lsm *lsm, struct vy_tx *tx,
+       const struct vy_read_view **rv,
+       struct tuple *key, struct tuple **result)
 {
 	/*
 	 * tx can be NULL, for example, if an user calls
@@ -1287,22 +1328,75 @@ vy_lsm_get(struct vy_lsm *lsm, struct vy_tx *tx,
 	 */
 	assert(tx == NULL || tx->state == VINYL_TX_READY);
 
+	int rc;
+	struct tuple *tuple;
+
 	if (tuple_field_count(key) >= lsm->cmp_def->part_count) {
+		/*
+		 * Use point lookup for a full key.
+		 */
 		if (tx != NULL && vy_tx_track_point(tx, lsm, key) != 0)
 			return -1;
-		return vy_point_lookup(lsm, tx, rv, key, result);
+		if (vy_point_lookup(lsm, tx, rv, key, &tuple) != 0)
+			return -1;
+		if (lsm->index_id > 0 && tuple != NULL) {
+			rc = vy_get_by_secondary_tuple(lsm, tx, rv,
+						       tuple, result);
+			tuple_unref(tuple);
+			if (rc != 0)
+				return -1;
+		} else {
+			*result = tuple;
+		}
+		return 0;
 	}
 
 	struct vy_read_iterator itr;
 	vy_read_iterator_open(&itr, lsm, tx, ITER_EQ, key, rv);
-	int rc = vy_read_iterator_next(&itr, result);
-	if (*result != NULL)
-		tuple_ref(*result);
+	while ((rc = vy_read_iterator_next(&itr, &tuple)) == 0) {
+		if (lsm->index_id == 0 || tuple == NULL) {
+			*result = tuple;
+			if (tuple != NULL)
+				tuple_ref(tuple);
+			break;
+		}
+		rc = vy_get_by_secondary_tuple(lsm, tx, rv, tuple, result);
+		if (rc != 0 || *result != NULL)
+			break;
+	}
 	vy_read_iterator_close(&itr);
 	return rc;
 }
 
 /**
+ * Get a tuple from a vinyl space by raw key.
+ * @param lsm         LSM tree in which search.
+ * @param tx          Current transaction.
+ * @param rv          Read view.
+ * @param key_raw     MsgPack array of key fields.
+ * @param part_count  Count of parts in the key.
+ * @param[out] result The found tuple is stored here. Must be
+ *                    unreferenced after usage.
+ *
+ * @retval  0 Success.
+ * @retval -1 Memory error or read error.
+ */
+static int
+vy_get_by_raw_key(struct vy_lsm *lsm, struct vy_tx *tx,
+		  const struct vy_read_view **rv,
+		  const char *key_raw, uint32_t part_count,
+		  struct tuple **result)
+{
+	struct tuple *key = vy_stmt_new_select(lsm->env->key_format,
+					       key_raw, part_count);
+	if (key == NULL)
+		return -1;
+	int rc = vy_get(lsm, tx, rv, key, result);
+	tuple_unref(key);
+	return rc;
+}
+
+/**
  * Check if the LSM tree contains the key. If true, then set
  * a duplicate key error in the diagnostics area.
  * @param env        Vinyl environment.
@@ -1329,7 +1423,7 @@ vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
 	 */
 	if (env->status != VINYL_ONLINE)
 		return 0;
-	if (vy_lsm_get(lsm, tx, rv, key, &found))
+	if (vy_get(lsm, tx, rv, key, &found))
 		return -1;
 
 	if (found) {
@@ -1480,8 +1574,8 @@ vy_replace_one(struct vy_env *env, struct vy_tx *tx, struct space *space,
 	 * old tuple to pass it to the trigger.
 	 */
 	if (stmt != NULL && !rlist_empty(&space->on_replace)) {
-		if (vy_lsm_get(pk, tx, vy_tx_read_view(tx),
-			       new_tuple, &stmt->old_tuple) != 0)
+		if (vy_get(pk, tx, vy_tx_read_view(tx),
+			   new_tuple, &stmt->old_tuple) != 0)
 			goto error_unref;
 	}
 	if (vy_tx_set(tx, pk, new_tuple))
@@ -1536,8 +1630,7 @@ vy_replace_impl(struct vy_env *env, struct vy_tx *tx, struct space *space,
 		return -1;
 
 	/* Get full tuple from the primary index. */
-	if (vy_lsm_get(pk, tx, vy_tx_read_view(tx),
-			 new_stmt, &old_stmt) != 0)
+	if (vy_get(pk, tx, vy_tx_read_view(tx), new_stmt, &old_stmt) != 0)
 		goto error;
 
 	/*
@@ -1628,51 +1721,6 @@ vy_unique_key_validate(struct vy_lsm *lsm, const char *key,
 }
 
 /**
- * Find a tuple in the primary index LSM tree by the key of the
- * specified LSM tree.
- * @param lsm         LSM tree for which the key is specified.
- *                    Can be both primary and secondary.
- * @param tx          Current transaction.
- * @param rv          Read view.
- * @param key_raw     MessagePack'ed data, the array without a
- *                    header.
- * @param part_count  Count of parts in the key.
- * @param[out] result The found statement is stored here. Must be
- *                    unreferenced after usage.
- *
- * @retval  0 Success.
- * @retval -1 Memory error.
- */
-static inline int
-vy_lsm_full_by_key(struct vy_lsm *lsm, struct vy_tx *tx,
-		   const struct vy_read_view **rv,
-		   const char *key_raw, uint32_t part_count,
-		   struct tuple **result)
-{
-	int rc;
-	struct tuple *key = vy_stmt_new_select(lsm->env->key_format,
-					       key_raw, part_count);
-	if (key == NULL)
-		return -1;
-	struct tuple *found;
-	rc = vy_lsm_get(lsm, tx, rv, key, &found);
-	tuple_unref(key);
-	if (rc != 0)
-		return -1;
-	if (lsm->index_id == 0 || found == NULL) {
-		*result = found;
-		return 0;
-	}
-	/*
-	 * No need in vy_tx_track() as the tuple is already
-	 * tracked in the secondary index LSM tree.
-	 */
-	rc = vy_point_lookup(lsm->pk, tx, rv, found, result);
-	tuple_unref(found);
-	return rc;
-}
-
-/**
  * Delete the tuple from all LSM trees of the vinyl space.
  * @param env        Vinyl environment.
  * @param tx         Current transaction.
@@ -1754,8 +1802,8 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	 *   and pass them to indexes for deletion.
 	 */
 	if (has_secondary || !rlist_empty(&space->on_replace)) {
-		if (vy_lsm_full_by_key(lsm, tx, vy_tx_read_view(tx),
-				key, part_count, &stmt->old_tuple) != 0)
+		if (vy_get_by_raw_key(lsm, tx, vy_tx_read_view(tx),
+				      key, part_count, &stmt->old_tuple) != 0)
 			return -1;
 		if (stmt->old_tuple == NULL)
 			return 0;
@@ -1836,8 +1884,8 @@ vy_update(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	if (vy_unique_key_validate(lsm, key, part_count))
 		return -1;
 
-	if (vy_lsm_full_by_key(lsm, tx, vy_tx_read_view(tx),
-			       key, part_count, &stmt->old_tuple) != 0)
+	if (vy_get_by_raw_key(lsm, tx, vy_tx_read_view(tx),
+			      key, part_count, &stmt->old_tuple) != 0)
 		return -1;
 	/* Nothing to update. */
 	if (stmt->old_tuple == NULL)
@@ -2110,8 +2158,7 @@ vy_upsert(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 					pk->key_def, pk->env->key_format);
 	if (key == NULL)
 		return -1;
-	int rc = vy_lsm_get(pk, tx, vy_tx_read_view(tx),
-			      key, &stmt->old_tuple);
+	int rc = vy_get(pk, tx, vy_tx_read_view(tx), key, &stmt->old_tuple);
 	tuple_unref(key);
 	if (rc != 0)
 		return -1;
@@ -3910,28 +3957,13 @@ next:
 			fiber_sleep(0.01);
 	}
 #endif
-	/*
-	 * Get the full tuple from the primary index.
-	 * Note, there's no need in vy_tx_track() as the
-	 * tuple is already tracked in the secondary index.
-	 */
-	if (vy_point_lookup(it->lsm->pk, it->tx, vy_tx_read_view(it->tx),
-			    tuple, ret) != 0)
+	/* Get the full tuple from the primary index. */
+	if (vy_get_by_secondary_tuple(it->lsm, it->tx,
+				      vy_tx_read_view(it->tx),
+				      tuple, ret) != 0)
 		goto fail;
-	if (*ret == NULL) {
-		/*
-		 * All indexes of a space must be consistent, i.e.
-		 * if a tuple is present in one index, it must be
-		 * present in all other indexes as well, so we can
-		 * get here only if there's a bug somewhere in vinyl.
-		 * Don't abort as core dump won't really help us in
-		 * this case. Just warn the user and proceed to the
-		 * next tuple.
-		 */
-		say_warn("%s: key %s missing in primary index",
-			 vy_lsm_name(it->lsm), vy_stmt_str(tuple));
+	if (*ret == NULL)
 		goto next;
-	}
 	tuple_bless(*ret);
 	tuple_unref(*ret);
 	return 0;
@@ -4020,7 +4052,7 @@ vinyl_index_get(struct index *index, const char *key,
 	const struct vy_read_view **rv = (tx != NULL ? vy_tx_read_view(tx) :
 					  &env->xm->p_global_read_view);
 
-	if (vy_lsm_full_by_key(lsm, tx, rv, key, part_count, ret) != 0)
+	if (vy_get_by_raw_key(lsm, tx, rv, key, part_count, ret) != 0)
 		return -1;
 	if (*ret != NULL) {
 		tuple_bless(*ret);
-- 
2.11.0


* [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-17 10:14     ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version Vladimir Davydov
                     ` (19 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

vy_mem_iterator_next is as efficient as the current implementation of
vy_point_lookup_scan_mem, because it doesn't copy statements anymore
(see commit 1e1c1fdbedd vinyl: make read iterator always return newest
tuple version). Let's use it instead of open-coding vy_mem tree lookup.
---
 src/box/vy_point_lookup.c | 47 +++++++++--------------------------------------
 1 file changed, 9 insertions(+), 38 deletions(-)

diff --git a/src/box/vy_point_lookup.c b/src/box/vy_point_lookup.c
index 91dc1cca..504a8e80 100644
--- a/src/box/vy_point_lookup.c
+++ b/src/box/vy_point_lookup.c
@@ -95,44 +95,15 @@ vy_point_lookup_scan_mem(struct vy_lsm *lsm, struct vy_mem *mem,
 			 const struct vy_read_view **rv,
 			 struct tuple *key, struct vy_history *history)
 {
-	struct tree_mem_key tree_key;
-	tree_key.stmt = key;
-	tree_key.lsn = (*rv)->vlsn;
-	bool exact;
-	struct vy_mem_tree_iterator mem_itr =
-		vy_mem_tree_lower_bound(&mem->tree, &tree_key, &exact);
-	lsm->stat.memory.iterator.lookup++;
-	const struct tuple *stmt = NULL;
-	if (!vy_mem_tree_iterator_is_invalid(&mem_itr)) {
-		stmt = *vy_mem_tree_iterator_get_elem(&mem->tree, &mem_itr);
-		if (vy_stmt_compare(stmt, key, mem->cmp_def) != 0)
-			stmt = NULL;
-	}
-
-	if (stmt == NULL)
-		return 0;
-
-	while (true) {
-		if (vy_history_append_stmt(history, (struct tuple *)stmt) != 0)
-			return -1;
-
-		vy_stmt_counter_acct_tuple(&lsm->stat.memory.iterator.get,
-					   stmt);
-
-		if (vy_history_is_terminal(history))
-			break;
-
-		if (!vy_mem_tree_iterator_next(&mem->tree, &mem_itr))
-			break;
-
-		const struct tuple *prev_stmt = stmt;
-		stmt = *vy_mem_tree_iterator_get_elem(&mem->tree, &mem_itr);
-		if (vy_stmt_lsn(stmt) >= vy_stmt_lsn(prev_stmt))
-			break;
-		if (vy_stmt_compare(stmt, key, mem->cmp_def) != 0)
-			break;
-	}
-	return 0;
+	struct vy_mem_iterator mem_itr;
+	vy_mem_iterator_open(&mem_itr, &lsm->stat.memory.iterator,
+			     mem, ITER_EQ, key, rv);
+	struct vy_history mem_history;
+	vy_history_create(&mem_history, &lsm->env->history_node_pool);
+	int rc = vy_mem_iterator_next(&mem_itr, &mem_history);
+	vy_history_splice(history, &mem_history);
+	vy_mem_iterator_close(&mem_itr);
+	return rc;
 
 }
 
-- 
2.11.0


* [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-10 16:19     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl Vladimir Davydov
                     ` (18 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, vy_point_lookup(), in contrast to vy_read_iterator, doesn't
rescan the memory level after reading disk. So if the caller doesn't
track the key before calling this function, it won't be sent to a read
view if the key gets updated during a yield, and hence it will be
returned a stale tuple. This is OK now, because we always track the
key before calling vy_point_lookup(), either in the primary or in a
secondary index. However, for #2129 we need it to always return the
latest tuple version, no matter if the key is tracked or not.

The point is that, in the scope of #2129, we won't write DELETE
statements to secondary indexes for a tuple replaced in the primary
index. Instead, after reading a tuple from a secondary index, we will
check whether it matches the corresponding tuple in the primary index:
if it does not, the tuple read from the secondary index was
overwritten and should be skipped. E.g. suppose we have the
primary index over the first field and a secondary index over the second
field and the following statements in the space:

  REPLACE{1, 10}
  REPLACE{1, 20}

Then reading {10} from the secondary index will return REPLACE{1, 10}, but
lookup of {1} in the primary index will return REPLACE{1, 20}, which
doesn't match REPLACE{1, 10} read from the secondary index; hence the
latter was overwritten and should be skipped.
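
The check from the example above can be sketched in a toy model. The
names (struct toy_tuple, toy_pk_lookup, toy_sk_entry_is_live) are
hypothetical, the primary index is a plain array holding only the newest
version of each key, and the comparison is reduced to the secondary key
field; the real patch set compares vinyl statements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy tuple: field 1 is the primary key, field 2 the secondary key. */
struct toy_tuple {
	int pk;
	int sk;
};

/* Return the newest version of the tuple with the given pk, or NULL. */
static const struct toy_tuple *
toy_pk_lookup(const struct toy_tuple *pk_index, size_t n, int pk)
{
	for (size_t i = 0; i < n; i++) {
		if (pk_index[i].pk == pk)
			return &pk_index[i];
	}
	return NULL;
}

/*
 * A tuple read from the secondary index is live only if the primary
 * index still holds a tuple with the same secondary key field.
 */
static bool
toy_sk_entry_is_live(const struct toy_tuple *pk_index, size_t n,
		     struct toy_tuple sk_entry)
{
	const struct toy_tuple *full = toy_pk_lookup(pk_index, n, sk_entry.pk);
	return full != NULL && full->sk == sk_entry.sk;
}
```

With the primary index holding {1, 20}, the secondary entry {1, 10} is
reported as overwritten, while {1, 20} is reported as live.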

The problem is that, in the example above, we don't want to track key {1}
in the primary index before lookup, because we don't actually read its
value. So for the check to work correctly, we need the point lookup to
guarantee that the returned tuple is always the newest one. It's fairly
easy to do: we just need to rescan the memory level after yielding on
disk if its version changed.
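
A minimal sketch of the rescan idea, under simplifying assumptions: a
single key, a single in-memory level with a version counter bumped on
every write, and a callback standing in for whatever runs while the disk
read yields. All names (struct toy_mem, toy_point_lookup, and so on) are
hypothetical and do not mirror the vinyl data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy in-memory level holding at most one statement for the key. */
struct toy_mem {
	uint32_t version;	/* bumped on every write */
	bool has_key;
	int value;
};

static void
toy_mem_set(struct toy_mem *mem, int value)
{
	mem->has_key = true;
	mem->value = value;
	mem->version++;
}

/* Stands in for code that runs while the disk read yields. */
typedef void (*toy_yield_f)(struct toy_mem *mem);

/* Sample concurrent writer: replaces the key with value 20. */
static void
toy_concurrent_replace(struct toy_mem *mem)
{
	toy_mem_set(mem, 20);
}

/* Look up the key in memory, then on "disk"; return true if found. */
static bool
toy_point_lookup(struct toy_mem *mem, bool disk_has_key, int disk_value,
		 toy_yield_f on_disk_read, int *result)
{
	if (mem->has_key) {
		*result = mem->value;
		return true;
	}
	/* Save the version before the yield. */
	uint32_t mem_version = mem->version;
	if (on_disk_read != NULL)
		on_disk_read(mem);
	/* Rescan memory if it changed while we were reading disk. */
	if (mem_version != mem->version && mem->has_key) {
		*result = mem->value;
		return true;
	}
	if (disk_has_key)
		*result = disk_value;
	return disk_has_key;
}
```

If the disk holds value 10 and toy_concurrent_replace() runs during the
read, the lookup returns 20 rather than the stale 10; without a
concurrent write it returns the disk value.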

Needed for #2129
---
 src/box/vy_point_lookup.c | 35 +++++++++++++++++++++++++++++------
 src/box/vy_point_lookup.h |  9 +++------
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/src/box/vy_point_lookup.c b/src/box/vy_point_lookup.c
index 504a8e80..f2261fdf 100644
--- a/src/box/vy_point_lookup.c
+++ b/src/box/vy_point_lookup.c
@@ -203,10 +203,13 @@ vy_point_lookup(struct vy_lsm *lsm, struct vy_tx *tx,
 	int rc = 0;
 
 	lsm->stat.lookup++;
+
 	/* History list */
-	struct vy_history history;
+	struct vy_history history, mem_history, disk_history;
 	vy_history_create(&history, &lsm->env->history_node_pool);
-restart:
+	vy_history_create(&mem_history, &lsm->env->history_node_pool);
+	vy_history_create(&disk_history, &lsm->env->history_node_pool);
+
 	rc = vy_point_lookup_scan_txw(lsm, tx, key, &history);
 	if (rc != 0 || vy_history_is_terminal(&history))
 		goto done;
@@ -215,14 +218,16 @@ restart:
 	if (rc != 0 || vy_history_is_terminal(&history))
 		goto done;
 
-	rc = vy_point_lookup_scan_mems(lsm, rv, key, &history);
-	if (rc != 0 || vy_history_is_terminal(&history))
+restart:
+	rc = vy_point_lookup_scan_mems(lsm, rv, key, &mem_history);
+	if (rc != 0 || vy_history_is_terminal(&mem_history))
 		goto done;
 
 	/* Save version before yield */
+	uint32_t mem_version = lsm->mem->version;
 	uint32_t mem_list_version = lsm->mem_list_version;
 
-	rc = vy_point_lookup_scan_slices(lsm, rv, key, &history);
+	rc = vy_point_lookup_scan_slices(lsm, rv, key, &disk_history);
 	if (rc != 0)
 		goto done;
 
@@ -241,11 +246,29 @@ restart:
 		 * This in unnecessary in case of rotation but since we
 		 * cannot distinguish these two cases we always restart.
 		 */
-		vy_history_cleanup(&history);
+		vy_history_cleanup(&mem_history);
+		vy_history_cleanup(&disk_history);
 		goto restart;
 	}
 
+	if (mem_version != lsm->mem->version) {
+		/*
+		 * Rescan the memory level if its version changed while we
+		 * were reading disk, because there may be new statements
+		 * matching the search key.
+		 */
+		vy_history_cleanup(&mem_history);
+		rc = vy_point_lookup_scan_mems(lsm, rv, key, &mem_history);
+		if (rc != 0)
+			goto done;
+		if (vy_history_is_terminal(&mem_history))
+			vy_history_cleanup(&disk_history);
+	}
+
 done:
+	vy_history_splice(&history, &mem_history);
+	vy_history_splice(&history, &disk_history);
+
 	if (rc == 0) {
 		int upserts_applied;
 		rc = vy_history_apply(&history, lsm->cmp_def, lsm->mem_format,
diff --git a/src/box/vy_point_lookup.h b/src/box/vy_point_lookup.h
index d74be9a9..3b7c5a04 100644
--- a/src/box/vy_point_lookup.h
+++ b/src/box/vy_point_lookup.h
@@ -62,12 +62,9 @@ struct tuple;
  * tuple in the LSM tree. The tuple is returned in @ret with its
  * reference counter elevated.
  *
- * The caller must guarantee that if the tuple looked up by this
- * function is modified, the transaction will be sent to read view.
- * This is needed to avoid inserting a stale value into the cache.
- * In other words, vy_tx_track() must be called for the search key
- * before calling this function unless this is a primary index and
- * the tuple is already tracked in a secondary index.
+ * Note, this function doesn't track the result in the transaction
+ * read set, i.e. it is up to the caller to call vy_tx_track() if
+ * necessary.
  */
 int
 vy_point_lookup(struct vy_lsm *lsm, struct vy_tx *tx,
-- 
2.11.0


* [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (2 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:28     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 06/23] vinyl: fold vy_delete_impl Vladimir Davydov
                     ` (17 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

There's no point in separating the REPLACE path between the case when
the space has secondary indexes and the case when it only has the
primary index, because they are quite similar. Let's fold vy_replace_one
and vy_replace_impl into vy_replace to remove code duplication.
---
 src/box/vinyl.c | 219 +++++++++++++++++---------------------------------------
 1 file changed, 67 insertions(+), 152 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 64004226..b93232a3 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1539,152 +1539,6 @@ vy_insert_secondary(struct vy_env *env, struct vy_tx *tx, struct space *space,
 }
 
 /**
- * Execute REPLACE in a space with a single index, possibly with
- * lookup for an old tuple if the space has at least one
- * on_replace trigger.
- * @param env     Vinyl environment.
- * @param tx      Current transaction.
- * @param space   Space in which replace.
- * @param request Request with the tuple data.
- * @param stmt    Statement for triggers is filled with old
- *                statement.
- *
- * @retval  0 Success.
- * @retval -1 Memory error OR duplicate key error OR the primary
- *            index is not found OR a tuple reference increment
- *            error.
- */
-static inline int
-vy_replace_one(struct vy_env *env, struct vy_tx *tx, struct space *space,
-	       struct request *request, struct txn_stmt *stmt)
-{
-	(void)env;
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	struct vy_lsm *pk = vy_lsm(space->index[0]);
-	assert(pk->index_id == 0);
-	if (tuple_validate_raw(pk->mem_format, request->tuple))
-		return -1;
-	struct tuple *new_tuple =
-		vy_stmt_new_replace(pk->mem_format, request->tuple,
-				    request->tuple_end);
-	if (new_tuple == NULL)
-		return -1;
-	/**
-	 * If the space has triggers, then we need to fetch the
-	 * old tuple to pass it to the trigger.
-	 */
-	if (stmt != NULL && !rlist_empty(&space->on_replace)) {
-		if (vy_get(pk, tx, vy_tx_read_view(tx),
-			   new_tuple, &stmt->old_tuple) != 0)
-			goto error_unref;
-	}
-	if (vy_tx_set(tx, pk, new_tuple))
-		goto error_unref;
-
-	if (stmt != NULL)
-		stmt->new_tuple = new_tuple;
-	else
-		tuple_unref(new_tuple);
-	return 0;
-
-error_unref:
-	tuple_unref(new_tuple);
-	return -1;
-}
-
-/**
- * Execute REPLACE in a space with multiple indexes and lookup for
- * an old tuple, that should has been set in \p stmt->old_tuple if
- * the space has at least one on_replace trigger.
- * @param env     Vinyl environment.
- * @param tx      Current transaction.
- * @param space   Vinyl space.
- * @param request Request with the tuple data.
- * @param stmt    Statement for triggers filled with old
- *                statement.
- *
- * @retval  0 Success
- * @retval -1 Memory error OR duplicate key error OR the primary
- *            index is not found OR a tuple reference increment
- *            error.
- */
-static inline int
-vy_replace_impl(struct vy_env *env, struct vy_tx *tx, struct space *space,
-		struct request *request, struct txn_stmt *stmt)
-{
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	struct tuple *old_stmt = NULL;
-	struct tuple *new_stmt = NULL;
-	struct tuple *delete = NULL;
-	struct vy_lsm *pk = vy_lsm_find(space, 0);
-	if (pk == NULL) /* space has no primary key */
-		return -1;
-	/* Primary key is dumped last. */
-	assert(!vy_is_committed_one(env, pk));
-	assert(pk->index_id == 0);
-	if (tuple_validate_raw(pk->mem_format, request->tuple))
-		return -1;
-	new_stmt = vy_stmt_new_replace(pk->mem_format, request->tuple,
-				       request->tuple_end);
-	if (new_stmt == NULL)
-		return -1;
-
-	/* Get full tuple from the primary index. */
-	if (vy_get(pk, tx, vy_tx_read_view(tx), new_stmt, &old_stmt) != 0)
-		goto error;
-
-	/*
-	 * Replace in the primary index without explicit deletion
-	 * of the old tuple.
-	 */
-	if (vy_tx_set(tx, pk, new_stmt) != 0)
-		goto error;
-
-	if (space->index_count > 1 && old_stmt != NULL) {
-		delete = vy_stmt_new_surrogate_delete(pk->mem_format, old_stmt);
-		if (delete == NULL)
-			goto error;
-	}
-
-	/* Update secondary keys, avoid duplicates. */
-	for (uint32_t iid = 1; iid < space->index_count; ++iid) {
-		struct vy_lsm *lsm = vy_lsm(space->index[iid]);
-		if (vy_is_committed_one(env, lsm))
-			continue;
-		/*
-		 * Delete goes first, so if old and new keys
-		 * fully match, there is no look up beyond the
-		 * transaction index.
-		 */
-		if (old_stmt != NULL) {
-			if (vy_tx_set(tx, lsm, delete) != 0)
-				goto error;
-		}
-		if (vy_insert_secondary(env, tx, space, lsm, new_stmt) != 0)
-			goto error;
-	}
-	if (delete != NULL)
-		tuple_unref(delete);
-	/*
-	 * The old tuple is used if there is an on_replace
-	 * trigger.
-	 */
-	if (stmt != NULL) {
-		stmt->new_tuple = new_stmt;
-		stmt->old_tuple = old_stmt;
-	}
-	return 0;
-error:
-	if (delete != NULL)
-		tuple_unref(delete);
-	if (old_stmt != NULL)
-		tuple_unref(old_stmt);
-	if (new_stmt != NULL)
-		tuple_unref(new_stmt);
-	return -1;
-}
-
-/**
  * Check that the key can be used for search in a unique index
  * LSM tree.
  * @param  lsm        LSM tree for checking.
@@ -2307,18 +2161,79 @@ static int
 vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	   struct space *space, struct request *request)
 {
+	assert(tx != NULL && tx->state == VINYL_TX_READY);
 	if (vy_is_committed(env, space))
 		return 0;
 	if (request->type == IPROTO_INSERT)
 		return vy_insert(env, tx, stmt, space, request);
 
-	if (space->index_count == 1) {
-		/* Replace in a space with a single index. */
-		return vy_replace_one(env, tx, space, request, stmt);
-	} else {
-		/* Replace in a space with secondary indexes. */
-		return vy_replace_impl(env, tx, space, request, stmt);
+	struct vy_lsm *pk = vy_lsm_find(space, 0);
+	if (pk == NULL)
+		return -1;
+	/* Primary key is dumped last. */
+	assert(!vy_is_committed_one(env, pk));
+
+	/* Validate and create a statement for the new tuple. */
+	if (tuple_validate_raw(pk->mem_format, request->tuple))
+		return -1;
+	stmt->new_tuple = vy_stmt_new_replace(pk->mem_format, request->tuple,
+					      request->tuple_end);
+	if (stmt->new_tuple == NULL)
+		return -1;
+	/*
+	 * Get the overwritten tuple from the primary index if
+	 * the space has on_replace triggers, in which case we
+	 * need to pass the old tuple to trigger callbacks, or
+	 * if the space has secondary indexes and so we need
+	 * the old tuple to delete it from them.
+	 */
+	if (space->index_count > 1 || !rlist_empty(&space->on_replace)) {
+		if (vy_get(pk, tx, vy_tx_read_view(tx),
+			   stmt->new_tuple, &stmt->old_tuple) != 0)
+			return -1;
 	}
+	/*
+	 * Replace in the primary index without explicit deletion
+	 * of the old tuple.
+	 */
+	if (vy_tx_set(tx, pk, stmt->new_tuple) != 0)
+		return -1;
+	if (space->index_count == 1)
+		return 0;
+	/*
+	 * Replace in secondary indexes with explicit deletion
+	 * of the old tuple, if any.
+	 */
+	int rc = 0;
+	struct tuple *delete = NULL;
+	if (stmt->old_tuple != NULL) {
+		delete = vy_stmt_new_surrogate_delete(pk->mem_format,
+						      stmt->old_tuple);
+		if (delete == NULL)
+			return -1;
+	}
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space->index[i]);
+		if (vy_is_committed_one(env, lsm))
+			continue;
+		/*
+		 * DELETE goes first, so if old and new keys
+		 * fully match, there is no look up beyond the
+		 * transaction write set.
+		 */
+		if (delete != NULL) {
+			rc = vy_tx_set(tx, lsm, delete);
+			if (rc != 0)
+				break;
+		}
+		rc = vy_insert_secondary(env, tx, space, lsm,
+					 stmt->new_tuple);
+		if (rc != 0)
+			break;
+	}
+	if (delete != NULL)
+		tuple_unref(delete);
+	return rc;
 }
 
 static int
-- 
2.11.0


* [RFC PATCH 06/23] vinyl: fold vy_delete_impl
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (3 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:28     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 07/23] vinyl: refactor unique check Vladimir Davydov
                     ` (16 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The vy_delete_impl helper is used only once, in vy_delete, and it is
rather small, so inlining it won't hurt. On the contrary, it will
consolidate the DELETE logic in one place, making the code easier to
follow.
---
 src/box/vinyl.c | 68 ++++++++++++++++-----------------------------------------
 1 file changed, 19 insertions(+), 49 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index b93232a3..44238450 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1575,47 +1575,6 @@ vy_unique_key_validate(struct vy_lsm *lsm, const char *key,
 }
 
 /**
- * Delete the tuple from all LSM trees of the vinyl space.
- * @param env        Vinyl environment.
- * @param tx         Current transaction.
- * @param space      Vinyl space.
- * @param tuple      Tuple to delete.
- *
- * @retval  0 Success
- * @retval -1 Memory error or the index is not found.
- */
-static inline int
-vy_delete_impl(struct vy_env *env, struct vy_tx *tx, struct space *space,
-	       const struct tuple *tuple)
-{
-	struct vy_lsm *pk = vy_lsm_find(space, 0);
-	if (pk == NULL)
-		return -1;
-	/* Primary key is dumped last. */
-	assert(!vy_is_committed_one(env, pk));
-	struct tuple *delete =
-		vy_stmt_new_surrogate_delete(pk->mem_format, tuple);
-	if (delete == NULL)
-		return -1;
-	if (vy_tx_set(tx, pk, delete) != 0)
-		goto error;
-
-	/* At second, delete from seconary indexes. */
-	for (uint32_t i = 1; i < space->index_count; ++i) {
-		struct vy_lsm *lsm = vy_lsm(space->index[i]);
-		if (vy_is_committed_one(env, lsm))
-			continue;
-		if (vy_tx_set(tx, lsm, delete) != 0)
-			goto error;
-	}
-	tuple_unref(delete);
-	return 0;
-error:
-	tuple_unref(delete);
-	return -1;
-}
-
-/**
  * Execute DELETE in a vinyl space.
  * @param env     Vinyl environment.
  * @param tx      Current transaction.
@@ -1662,21 +1621,32 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 		if (stmt->old_tuple == NULL)
 			return 0;
 	}
+	int rc = 0;
+	struct tuple *delete;
 	if (has_secondary) {
 		assert(stmt->old_tuple != NULL);
-		return vy_delete_impl(env, tx, space, stmt->old_tuple);
+		delete = vy_stmt_new_surrogate_delete(pk->mem_format,
+						      stmt->old_tuple);
+		if (delete == NULL)
+			return -1;
+		for (uint32_t i = 0; i < space->index_count; i++) {
+			struct vy_lsm *lsm = vy_lsm(space->index[i]);
+			if (vy_is_committed_one(env, lsm))
+				continue;
+			rc = vy_tx_set(tx, lsm, delete);
+			if (rc != 0)
+				break;
+		}
 	} else { /* Primary is the single index in the space. */
 		assert(lsm->index_id == 0);
-		struct tuple *delete =
-			vy_stmt_new_surrogate_delete_from_key(request->key,
-							      pk->key_def,
-							      pk->mem_format);
+		delete = vy_stmt_new_surrogate_delete_from_key(request->key,
+						pk->key_def, pk->mem_format);
 		if (delete == NULL)
 			return -1;
-		int rc = vy_tx_set(tx, pk, delete);
-		tuple_unref(delete);
-		return rc;
+		rc = vy_tx_set(tx, pk, delete);
 	}
+	tuple_unref(delete);
+	return rc;
 }
 
 /**
-- 
2.11.0


* [RFC PATCH 07/23] vinyl: refactor unique check
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (4 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 06/23] vinyl: fold vy_delete_impl Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:28     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set Vladimir Davydov
                     ` (15 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

For the sake of further patches, let's do some refactoring:
 - Rename vy_check_is_unique to vy_check_is_unique_primary and use it
   only for checking the unique constraint of primary indexes. Also,
   make it return immediately if the primary index doesn't need
   uniqueness check, like vy_check_is_unique_secondary does.
 - Open-code uniqueness check in vy_check_is_unique_secondary instead of
   using vy_check_is_unique.
 - Reduce indentation level of vy_check_is_unique_secondary by inverting
   the if statement.
---
 src/box/vinyl.c | 82 +++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 31 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 44238450..e4563cb6 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1397,36 +1397,39 @@ vy_get_by_raw_key(struct vy_lsm *lsm, struct vy_tx *tx,
 }
 
 /**
- * Check if the LSM tree contains the key. If true, then set
- * a duplicate key error in the diagnostics area.
+ * Check if insertion of a new tuple violates unique constraint
+ * of the primary index.
  * @param env        Vinyl environment.
  * @param tx         Current transaction.
  * @param rv         Read view.
  * @param space_name Space name.
  * @param index_name Index name.
- * @param lsm        LSM tree in which to search.
- * @param key        Key statement.
+ * @param lsm        LSM tree corresponding to the index.
+ * @param stmt       New tuple.
  *
- * @retval  0 Success, the key isn't found.
- * @retval -1 Memory error or the key is found.
+ * @retval  0 Success, unique constraint is satisfied.
+ * @retval -1 Duplicate is found or read error occurred.
  */
 static inline int
-vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
-		   const struct vy_read_view **rv,
-		   const char *space_name, const char *index_name,
-		   struct vy_lsm *lsm, struct tuple *key)
+vy_check_is_unique_primary(struct vy_env *env, struct vy_tx *tx,
+			   const struct vy_read_view **rv,
+			   const char *space_name, const char *index_name,
+			   struct vy_lsm *lsm, struct tuple *stmt)
 {
-	struct tuple *found;
+	assert(lsm->index_id == 0);
+	assert(vy_stmt_type(stmt) == IPROTO_INSERT);
 	/*
 	 * During recovery we apply rows that were successfully
 	 * applied before restart so no conflict is possible.
 	 */
 	if (env->status != VINYL_ONLINE)
 		return 0;
-	if (vy_get(lsm, tx, rv, key, &found))
+	if (!lsm->check_is_unique)
+		return 0;
+	struct tuple *found;
+	if (vy_get(lsm, tx, rv, stmt, &found))
 		return -1;
-
-	if (found) {
+	if (found != NULL) {
 		tuple_unref(found);
 		diag_set(ClientError, ER_TUPLE_FOUND,
 			 index_name, space_name);
@@ -1456,19 +1459,36 @@ vy_check_is_unique_secondary(struct vy_env *env, struct vy_tx *tx,
 			     struct vy_lsm *lsm, const struct tuple *stmt)
 {
 	assert(lsm->index_id > 0);
-	struct key_def *def = lsm->key_def;
-	if (lsm->check_is_unique &&
-	    !key_update_can_be_skipped(def->column_mask,
-				       vy_stmt_column_mask(stmt)) &&
-	    (!def->is_nullable || !vy_tuple_key_contains_null(stmt, def))) {
-		struct tuple *key = vy_stmt_extract_key(stmt, def,
-							lsm->env->key_format);
-		if (key == NULL)
-			return -1;
-		int rc = vy_check_is_unique(env, tx, rv, space_name,
-					    index_name, lsm, key);
-		tuple_unref(key);
-		return rc;
+	assert(vy_stmt_type(stmt) == IPROTO_INSERT ||
+	       vy_stmt_type(stmt) == IPROTO_REPLACE);
+	/*
+	 * During recovery we apply rows that were successfully
+	 * applied before restart so no conflict is possible.
+	 */
+	if (env->status != VINYL_ONLINE)
+		return 0;
+	if (!lsm->check_is_unique)
+		return 0;
+	if (key_update_can_be_skipped(lsm->key_def->column_mask,
+				      vy_stmt_column_mask(stmt)))
+		return 0;
+	if (lsm->key_def->is_nullable &&
+	    vy_tuple_key_contains_null(stmt, lsm->key_def))
+		return 0;
+	struct tuple *key = vy_stmt_extract_key(stmt, lsm->key_def,
+						lsm->env->key_format);
+	if (key == NULL)
+		return -1;
+	struct tuple *found;
+	int rc = vy_get(lsm, tx, rv, key, &found);
+	tuple_unref(key);
+	if (rc != 0)
+		return -1;
+	if (found != NULL) {
+		tuple_unref(found);
+		diag_set(ClientError, ER_TUPLE_FOUND,
+			 index_name, space_name);
+		return -1;
 	}
 	return 0;
 }
@@ -1495,10 +1515,10 @@ vy_insert_primary(struct vy_env *env, struct vy_tx *tx, struct space *space,
 	 * A primary index is always unique and the new tuple must not
 	 * conflict with existing tuples.
 	 */
-	if (pk->check_is_unique &&
-	    vy_check_is_unique(env, tx, vy_tx_read_view(tx), space_name(space),
-			       index_name_by_id(space, pk->index_id),
-			       pk, stmt) != 0)
+	if (vy_check_is_unique_primary(env, tx, vy_tx_read_view(tx),
+				       space_name(space),
+				       index_name_by_id(space, pk->index_id),
+				       pk, stmt) != 0)
 		return -1;
 	return vy_tx_set(tx, pk, stmt);
 }
-- 
2.11.0


* [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (5 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 07/23] vinyl: refactor unique check Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:34     ` Konstantin Osipov
  2018-08-09 20:26     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 09/23] vinyl: remove env argument of vy_check_is_unique_{primary,secondary} Vladimir Davydov
                     ` (14 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we handle INSERT/REPLACE/UPDATE requests by iterating over
all space indexes starting from the primary and inserting the
corresponding statements into the tx write set, checking key uniqueness
if necessary. This means that by the time we write a REPLACE to the
write set of a secondary index, it has already been written to the
primary index write set. This is OK, and vy_tx_prepare() relies on that
to implement the common memory level. However, this also means that when
we check uniqueness of a secondary index, the new REPLACE can be found
via the primary index. This is OK now, because all indexes are fully
independent, but it isn't going to fly after #2129 is implemented. The
problem is that in order to check if a tuple is present in a secondary
index, we will have to look up the corresponding full tuple in the
primary index. To illustrate the problem, consider the following
situation:

  Primary index covers field 1.
  Secondary index covers field 2.

  Committed statements:

    REPLACE{1, 10, lsn=1} - present in both indexes
    DELETE{1, lsn=2} - present only in the primary index

  Transaction:

    REPLACE{1, 10}

When we check uniqueness of the secondary index, we find committed
statement REPLACE{1, 10, lsn=1}, then look up the corresponding full
tuple in the primary index and find REPLACE{1, 10}. Since the two tuples
match, we mistakenly assume that there's a conflict.

To avoid a situation like that, let's check uniqueness before modifying
the write set of any index.
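
The ordering problem can be reproduced with a toy model. The sketch
below is not vinyl code: every toy_* name is hypothetical, the model
tracks a single key, and it deliberately leaves out all other checks
done by vy_check_is_unique_secondary. It only assumes, as described
above, that a secondary index hit is confirmed by looking up the full
tuple in the primary index, with the tx write set shadowing committed
data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT vinyl code: all names here are hypothetical. */
enum toy_op { TOY_NONE, TOY_REPLACE, TOY_DELETE };

struct toy_stmt {
	enum toy_op op;
	int pk; /* field 1, covered by the primary index */
	int sk; /* field 2, covered by the secondary index */
};

struct toy_state {
	struct toy_stmt committed_pk; /* latest committed stmt, primary */
	struct toy_stmt committed_sk; /* latest committed stmt, secondary */
	struct toy_stmt tx_pk;        /* tx write set of the primary index */
};

/* The situation from the commit message. */
static struct toy_state
toy_scenario(void)
{
	struct toy_state s = {
		.committed_pk = { TOY_DELETE, 1, 0 },   /* DELETE{1, lsn=2} */
		.committed_sk = { TOY_REPLACE, 1, 10 }, /* REPLACE{1, 10, lsn=1} */
		.tx_pk = { TOY_NONE, 0, 0 },
	};
	return s;
}

/* Primary index lookup: the tx write set shadows committed data. */
static const struct toy_stmt *
toy_pk_get(const struct toy_state *s, int pk)
{
	if (s->tx_pk.op != TOY_NONE && s->tx_pk.pk == pk)
		return s->tx_pk.op == TOY_REPLACE ? &s->tx_pk : NULL;
	if (s->committed_pk.op == TOY_REPLACE && s->committed_pk.pk == pk)
		return &s->committed_pk;
	return NULL;
}

/* Return true if the secondary key of new_stmt looks taken. */
static bool
toy_sk_conflict(const struct toy_state *s, const struct toy_stmt *new_stmt)
{
	const struct toy_stmt *hit = &s->committed_sk;
	if (hit->op != TOY_REPLACE || hit->sk != new_stmt->sk)
		return false;
	/* Confirm the hit by reading the full tuple from the primary. */
	const struct toy_stmt *full = toy_pk_get(s, hit->pk);
	return full != NULL && full->sk == hit->sk;
}

/* Old order: write to the primary write set, then check. */
static bool
toy_replace_write_then_check(struct toy_state *s,
			     const struct toy_stmt *new_stmt)
{
	assert(new_stmt->op == TOY_REPLACE);
	s->tx_pk = *new_stmt;
	return toy_sk_conflict(s, new_stmt);
}

/* New order: check first, touch the write set only on success. */
static bool
toy_replace_check_then_write(struct toy_state *s,
			     const struct toy_stmt *new_stmt)
{
	assert(new_stmt->op == TOY_REPLACE);
	if (toy_sk_conflict(s, new_stmt))
		return true;
	s->tx_pk = *new_stmt;
	return false;
}
```

In this model, REPLACE{1, 10} applied to toy_scenario() reports a
conflict with the old order, because the primary lookup returns the
transaction's own statement, and no conflict with the new order,
because the lookup sees the committed DELETE.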

Needed for #2129
---
 src/box/vinyl.c | 128 +++++++++++++++++++++++++++-----------------------------
 1 file changed, 62 insertions(+), 66 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index e4563cb6..c3ac7d68 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1484,6 +1484,17 @@ vy_check_is_unique_secondary(struct vy_env *env, struct vy_tx *tx,
 	tuple_unref(key);
 	if (rc != 0)
 		return -1;
+	if (found != NULL && vy_tuple_compare(stmt, found,
+					      lsm->pk->key_def) == 0) {
+		/*
+		 * If the old and new tuples are the same in
+		 * terms of the primary key definition, the
+		 * statement doesn't modify the secondary key
+		 * and so there's actually no conflict.
+		 */
+		tuple_unref(found);
+		return 0;
+	}
 	if (found != NULL) {
 		tuple_unref(found);
 		diag_set(ClientError, ER_TUPLE_FOUND,
@@ -1494,68 +1505,51 @@ vy_check_is_unique_secondary(struct vy_env *env, struct vy_tx *tx,
 }
 
 /**
- * Insert a tuple in a primary index LSM tree.
- * @param env   Vinyl environment.
- * @param tx    Current transaction.
- * @param space Target space.
- * @param pk    Primary index LSM tree.
- * @param stmt  Tuple to insert.
- *
- * @retval  0 Success.
- * @retval -1 Memory error or duplicate key error.
- */
-static inline int
-vy_insert_primary(struct vy_env *env, struct vy_tx *tx, struct space *space,
-		  struct vy_lsm *pk, struct tuple *stmt)
-{
-	assert(vy_stmt_type(stmt) == IPROTO_INSERT);
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	assert(pk->index_id == 0);
-	/*
-	 * A primary index is always unique and the new tuple must not
-	 * conflict with existing tuples.
-	 */
-	if (vy_check_is_unique_primary(env, tx, vy_tx_read_view(tx),
-				       space_name(space),
-				       index_name_by_id(space, pk->index_id),
-				       pk, stmt) != 0)
-		return -1;
-	return vy_tx_set(tx, pk, stmt);
-}
-
-/**
- * Insert a tuple in a secondary index LSM tree.
- * @param env       Vinyl environment.
- * @param tx        Current transaction.
- * @param space     Target space.
- * @param lsm       Secondary index LSM tree.
- * @param stmt      Tuple to replace.
+ * Check if insertion of a new tuple violates unique constraint
+ * of any index of the space.
+ * @param env        Vinyl environment.
+ * @param tx         Current transaction.
+ * @param space      Space to check.
+ * @param stmt       New tuple.
  *
- * @retval  0 Success.
- * @retval -1 Memory error or duplicate key error.
+ * @retval  0 Success, unique constraint is satisfied.
+ * @retval -1 Duplicate is found or read error occurred.
  */
 static int
-vy_insert_secondary(struct vy_env *env, struct vy_tx *tx, struct space *space,
-		    struct vy_lsm *lsm, struct tuple *stmt)
+vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
+		   struct space *space, struct tuple *stmt)
 {
+	assert(space->index_count > 0);
 	assert(vy_stmt_type(stmt) == IPROTO_INSERT ||
 	       vy_stmt_type(stmt) == IPROTO_REPLACE);
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	assert(lsm->index_id > 0);
 
-	if (vy_check_is_unique_secondary(env, tx, vy_tx_read_view(tx),
-					 space_name(space),
-					 index_name_by_id(space, lsm->index_id),
-					 lsm, stmt) != 0)
-		return -1;
+	const struct vy_read_view **rv = vy_tx_read_view(tx);
 
 	/*
-	 * We must always append the statement to transaction write set
-	 * of each LSM tree, even if operation itself does not update
-	 * the LSM tree, e.g. it's an UPDATE, to ensure we read our
-	 * own writes.
+	 * We only need to check the uniqueness of the primary index
+	 * if this is INSERT, because REPLACE will silently overwrite
+	 * the existing tuple, if any.
 	 */
-	return vy_tx_set(tx, lsm, stmt);
+	if (vy_stmt_type(stmt) == IPROTO_INSERT) {
+		struct vy_lsm *lsm = vy_lsm(space->index[0]);
+		if (vy_check_is_unique_primary(env, tx, rv, space_name(space),
+					       index_name_by_id(space, 0),
+					       lsm, stmt) != 0)
+			return -1;
+	}
+
+	/*
+	 * For secondary indexes, uniqueness must be checked on both
+	 * INSERT and REPLACE.
+	 */
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space->index[i]);
+		if (vy_check_is_unique_secondary(env, tx, rv, space_name(space),
+						 index_name_by_id(space, i),
+						 lsm, stmt) != 0)
+			return -1;
+	}
+	return 0;
 }
 
 /**
@@ -1776,6 +1770,8 @@ vy_update(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	if (vy_check_update(space, pk, stmt->old_tuple, stmt->new_tuple,
 			    column_mask) != 0)
 		return -1;
+	if (vy_check_is_unique(env, tx, space, stmt->new_tuple) != 0)
+		return -1;
 
 	/*
 	 * In the primary index the tuple can be replaced without
@@ -1798,7 +1794,7 @@ vy_update(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 			continue;
 		if (vy_tx_set(tx, lsm, delete) != 0)
 			goto error;
-		if (vy_insert_secondary(env, tx, space, lsm, stmt->new_tuple))
+		if (vy_tx_set(tx, lsm, stmt->new_tuple) != 0)
 			goto error;
 	}
 	tuple_unref(delete);
@@ -1826,13 +1822,15 @@ vy_insert_first_upsert(struct vy_env *env, struct vy_tx *tx,
 	assert(tx != NULL && tx->state == VINYL_TX_READY);
 	assert(space->index_count > 0);
 	assert(vy_stmt_type(stmt) == IPROTO_INSERT);
+	if (vy_check_is_unique(env, tx, space, stmt) != 0)
+		return -1;
 	struct vy_lsm *pk = vy_lsm(space->index[0]);
 	assert(pk->index_id == 0);
 	if (vy_tx_set(tx, pk, stmt) != 0)
 		return -1;
 	for (uint32_t i = 1; i < space->index_count; ++i) {
 		struct vy_lsm *lsm = vy_lsm(space->index[i]);
-		if (vy_insert_secondary(env, tx, space, lsm, stmt) != 0)
+		if (vy_tx_set(tx, lsm, stmt) != 0)
 			return -1;
 	}
 	return 0;
@@ -2057,6 +2055,8 @@ vy_upsert(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 		 */
 		return 0;
 	}
+	if (vy_check_is_unique(env, tx, space, stmt->new_tuple) != 0)
+		return -1;
 	if (vy_tx_set(tx, pk, stmt->new_tuple))
 		return -1;
 	if (space->index_count == 1)
@@ -2075,8 +2075,7 @@ vy_upsert(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 			continue;
 		if (vy_tx_set(tx, lsm, delete) != 0)
 			goto error;
-		if (vy_insert_secondary(env, tx, space, lsm,
-					stmt->new_tuple) != 0)
+		if (vy_tx_set(tx, lsm, stmt->new_tuple) != 0)
 			goto error;
 	}
 	tuple_unref(delete);
@@ -2119,15 +2118,16 @@ vy_insert(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 					     request->tuple_end);
 	if (stmt->new_tuple == NULL)
 		return -1;
-	if (vy_insert_primary(env, tx, space, pk, stmt->new_tuple) != 0)
+	if (vy_check_is_unique(env, tx, space, stmt->new_tuple) != 0)
+		return -1;
+	if (vy_tx_set(tx, pk, stmt->new_tuple) != 0)
 		return -1;
 
 	for (uint32_t iid = 1; iid < space->index_count; ++iid) {
 		struct vy_lsm *lsm = vy_lsm(space->index[iid]);
 		if (vy_is_committed_one(env, lsm))
 			continue;
-		if (vy_insert_secondary(env, tx, space, lsm,
-					stmt->new_tuple) != 0)
+		if (vy_tx_set(tx, lsm, stmt->new_tuple) != 0)
 			return -1;
 	}
 	return 0;
@@ -2170,6 +2170,8 @@ vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 					      request->tuple_end);
 	if (stmt->new_tuple == NULL)
 		return -1;
+	if (vy_check_is_unique(env, tx, space, stmt->new_tuple) != 0)
+		return -1;
 	/*
 	 * Get the overwritten tuple from the primary index if
 	 * the space has on_replace triggers, in which case we
@@ -2206,18 +2208,12 @@ vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 		struct vy_lsm *lsm = vy_lsm(space->index[i]);
 		if (vy_is_committed_one(env, lsm))
 			continue;
-		/*
-		 * DELETE goes first, so if old and new keys
-		 * fully match, there is no look up beyond the
-		 * transaction write set.
-		 */
 		if (delete != NULL) {
 			rc = vy_tx_set(tx, lsm, delete);
 			if (rc != 0)
 				break;
 		}
-		rc = vy_insert_secondary(env, tx, space, lsm,
-					 stmt->new_tuple);
+		rc = vy_tx_set(tx, lsm, stmt->new_tuple);
 		if (rc != 0)
 			break;
 	}
-- 
2.11.0


* [RFC PATCH 09/23] vinyl: remove env argument of vy_check_is_unique_{primary,secondary}
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (6 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 10/23] vinyl: store full tuples in secondary index cache Vladimir Davydov
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Apart from vy_check_is_unique, all callers of vy_check_is_unique_primary
and vy_check_is_unique_secondary run only when the vinyl engine is
online. So let's move the optimization that skips the uniqueness check
on recovery to vy_check_is_unique and remove the env argument.
---
 src/box/vinyl.c | 42 +++++++++++++++---------------------------
 1 file changed, 15 insertions(+), 27 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index c3ac7d68..d1b6839e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1399,7 +1399,6 @@ vy_get_by_raw_key(struct vy_lsm *lsm, struct vy_tx *tx,
 /**
  * Check if insertion of a new tuple violates unique constraint
  * of the primary index.
- * @param env        Vinyl environment.
  * @param tx         Current transaction.
  * @param rv         Read view.
  * @param space_name Space name.
@@ -1411,19 +1410,13 @@ vy_get_by_raw_key(struct vy_lsm *lsm, struct vy_tx *tx,
  * @retval -1 Duplicate is found or read error occurred.
  */
 static inline int
-vy_check_is_unique_primary(struct vy_env *env, struct vy_tx *tx,
-			   const struct vy_read_view **rv,
+vy_check_is_unique_primary(struct vy_tx *tx, const struct vy_read_view **rv,
 			   const char *space_name, const char *index_name,
 			   struct vy_lsm *lsm, struct tuple *stmt)
 {
 	assert(lsm->index_id == 0);
 	assert(vy_stmt_type(stmt) == IPROTO_INSERT);
-	/*
-	 * During recovery we apply rows that were successfully
-	 * applied before restart so no conflict is possible.
-	 */
-	if (env->status != VINYL_ONLINE)
-		return 0;
+
 	if (!lsm->check_is_unique)
 		return 0;
 	struct tuple *found;
@@ -1441,7 +1434,6 @@ vy_check_is_unique_primary(struct vy_env *env, struct vy_tx *tx,
 /**
  * Check if insertion of a new tuple violates unique constraint
  * of a secondary index.
- * @param env        Vinyl environment.
  * @param tx         Current transaction.
  * @param rv         Read view.
  * @param space_name Space name.
@@ -1453,20 +1445,14 @@ vy_check_is_unique_primary(struct vy_env *env, struct vy_tx *tx,
  * @retval -1 Duplicate is found or read error occurred.
  */
 static int
-vy_check_is_unique_secondary(struct vy_env *env, struct vy_tx *tx,
-			     const struct vy_read_view **rv,
+vy_check_is_unique_secondary(struct vy_tx *tx, const struct vy_read_view **rv,
 			     const char *space_name, const char *index_name,
 			     struct vy_lsm *lsm, const struct tuple *stmt)
 {
 	assert(lsm->index_id > 0);
 	assert(vy_stmt_type(stmt) == IPROTO_INSERT ||
 	       vy_stmt_type(stmt) == IPROTO_REPLACE);
-	/*
-	 * During recovery we apply rows that were successfully
-	 * applied before restart so no conflict is possible.
-	 */
-	if (env->status != VINYL_ONLINE)
-		return 0;
+
 	if (!lsm->check_is_unique)
 		return 0;
 	if (key_update_can_be_skipped(lsm->key_def->column_mask,
@@ -1522,6 +1508,12 @@ vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
 	assert(space->index_count > 0);
 	assert(vy_stmt_type(stmt) == IPROTO_INSERT ||
 	       vy_stmt_type(stmt) == IPROTO_REPLACE);
+	/*
+	 * During recovery we apply rows that were successfully
+	 * applied before restart so no conflict is possible.
+	 */
+	if (env->status != VINYL_ONLINE)
+		return 0;
 
 	const struct vy_read_view **rv = vy_tx_read_view(tx);
 
@@ -1532,7 +1524,7 @@ vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
 	 */
 	if (vy_stmt_type(stmt) == IPROTO_INSERT) {
 		struct vy_lsm *lsm = vy_lsm(space->index[0]);
-		if (vy_check_is_unique_primary(env, tx, rv, space_name(space),
+		if (vy_check_is_unique_primary(tx, rv, space_name(space),
 					       index_name_by_id(space, 0),
 					       lsm, stmt) != 0)
 			return -1;
@@ -1544,7 +1536,7 @@ vy_check_is_unique(struct vy_env *env, struct vy_tx *tx,
 	 */
 	for (uint32_t i = 1; i < space->index_count; i++) {
 		struct vy_lsm *lsm = vy_lsm(space->index[i]);
-		if (vy_check_is_unique_secondary(env, tx, rv, space_name(space),
+		if (vy_check_is_unique_secondary(tx, rv, space_name(space),
 						 index_name_by_id(space, i),
 						 lsm, stmt) != 0)
 			return -1;
@@ -3968,8 +3960,6 @@ vinyl_index_get(struct index *index, const char *key,
 
 /** Argument passed to vy_build_on_replace(). */
 struct vy_build_ctx {
-	/** Vinyl environment. */
-	struct vy_env *env;
 	/** LSM tree under construction. */
 	struct vy_lsm *lsm;
 	/** Format to check new tuples against. */
@@ -4010,7 +4000,7 @@ vy_build_on_replace(struct trigger *trigger, void *event)
 
 	/* Check key uniqueness if necessary. */
 	if (stmt->new_tuple != NULL &&
-	    vy_check_is_unique_secondary(ctx->env, tx, vy_tx_read_view(tx),
+	    vy_check_is_unique_secondary(tx, vy_tx_read_view(tx),
 					 ctx->space_name, ctx->index_name,
 					 lsm, stmt->new_tuple) != 0)
 		goto err;
@@ -4096,9 +4086,8 @@ vy_build_insert_tuple(struct vy_env *env, struct vy_lsm *lsm,
 	 * into it after the yield.
 	 */
 	vy_mem_pin(mem);
-	rc = vy_check_is_unique_secondary(env, NULL,
-			&env->xm->p_committed_read_view,
-			space_name, index_name, lsm, tuple);
+	rc = vy_check_is_unique_secondary(NULL, &env->xm->p_committed_read_view,
+					  space_name, index_name, lsm, tuple);
 	vy_mem_unpin(mem);
 	if (rc != 0)
 		return -1;
@@ -4291,7 +4280,6 @@ vinyl_space_build_index(struct space *src_space, struct index *new_index,
 
 	struct trigger on_replace;
 	struct vy_build_ctx ctx;
-	ctx.env = env;
 	ctx.lsm = new_lsm;
 	ctx.format = new_format;
 	ctx.space_name = space_name(src_space);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 10/23] vinyl: store full tuples in secondary index cache
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (7 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 09/23] vinyl: remove env argument of vy_check_is_unique_{primary,secondary} Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 11/23] xrow: allow to store flags in DML requests Vladimir Davydov
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, both vy_read_iterator_next() and vy_point_lookup() add the
returned tuple to the tuple cache. As a result, we store partial tuples
in a secondary index tuple cache although we could store full tuples
(we have to retrieve them anyway when reading a secondary index). This
means wasting memory. Besides, when #2129 gets implemented, there
will be tuples in a secondary index that have to be skipped as they have
been overwritten in the primary index. Caching them would be inefficient
and error-prone. So let's call vy_cache_add() from the upper level and
add only full tuples to the cache.
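
To illustrate the idea of letting the caller populate the cache, here is
a minimal standalone sketch. All toy_* names are hypothetical stand-ins
and not the real vinyl types; the toy cache only counts what it is given.
It shows the iterator remembering the last cached statement so that the
previous and current tuples can be linked into a chain, and dropping that
statement when the read view is not the latest one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for tuples and the tuple cache. */
struct toy_tuple {
	int key;
	int refs;
};

struct toy_cache {
	int added;	/* statements added to the cache */
	int chained;	/* additions that had a predecessor to link to */
};

struct toy_iterator {
	int64_t vlsn;	/* INT64_MAX means "reading the latest data" */
	struct toy_tuple *last_cached_stmt;
	struct toy_cache *cache;
};

static void
toy_ref(struct toy_tuple *t)
{
	t->refs++;
}

static void
toy_unref(struct toy_tuple *t)
{
	assert(t->refs > 0);
	t->refs--;
}

/* Toy cache: only counts additions and chain links. */
static void
toy_cache_add(struct toy_cache *c, struct toy_tuple *stmt,
	      struct toy_tuple *prev)
{
	if (stmt != NULL)
		c->added++;
	if (prev != NULL)
		c->chained++;
}

/*
 * Called by the upper level with the full tuple (or NULL on EOF).
 * The cache is populated only when reading the latest data.
 */
static void
toy_iterator_cache_add(struct toy_iterator *itr, struct toy_tuple *stmt)
{
	if (itr->vlsn != INT64_MAX) {
		if (itr->last_cached_stmt != NULL)
			toy_unref(itr->last_cached_stmt);
		itr->last_cached_stmt = NULL;
		return;
	}
	toy_cache_add(itr->cache, stmt, itr->last_cached_stmt);
	if (stmt != NULL)
		toy_ref(stmt);
	if (itr->last_cached_stmt != NULL)
		toy_unref(itr->last_cached_stmt);
	itr->last_cached_stmt = stmt;
}
```

In this sketch the caller would invoke toy_iterator_cache_add() once per
returned tuple, and once more with NULL at EOF to close the chain.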

Closes #3478
Needed for #2129
---
 src/box/vinyl.c            | 11 +++++++++
 src/box/vy_point_lookup.c  |  5 +---
 src/box/vy_read_iterator.c | 61 ++++++++++++++++++++++------------------------
 src/box/vy_read_iterator.h | 24 ++++++++++++++++++
 4 files changed, 65 insertions(+), 36 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d1b6839e..f05a4a0e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1302,6 +1302,10 @@ vy_get_by_secondary_tuple(struct vy_lsm *lsm, struct vy_tx *tx,
 		say_warn("%s: key %s missing in primary index",
 			 vy_lsm_name(lsm), vy_stmt_str(tuple));
 	}
+
+	if ((*rv)->vlsn == INT64_MAX)
+		vy_cache_add(&lsm->pk->cache, *result, NULL, tuple, ITER_EQ);
+
 	return 0;
 }
 
@@ -1348,6 +1352,8 @@ vy_get(struct vy_lsm *lsm, struct vy_tx *tx,
 		} else {
 			*result = tuple;
 		}
+		if ((*rv)->vlsn == INT64_MAX)
+			vy_cache_add(&lsm->cache, *result, NULL, key, ITER_EQ);
 		return 0;
 	}
 
@@ -1364,6 +1370,8 @@ vy_get(struct vy_lsm *lsm, struct vy_tx *tx,
 		if (rc != 0 || *result != NULL)
 			break;
 	}
+	if (rc == 0)
+		vy_read_iterator_cache_add(&itr, *result);
 	vy_read_iterator_close(&itr);
 	return rc;
 }
@@ -3803,6 +3811,7 @@ vinyl_iterator_primary_next(struct iterator *base, struct tuple **ret)
 
 	if (vy_read_iterator_next(&it->iterator, ret) != 0)
 		goto fail;
+	vy_read_iterator_cache_add(&it->iterator, *ret);
 	if (*ret == NULL) {
 		/* EOF. Close the iterator immediately. */
 		vinyl_iterator_close(it);
@@ -3838,6 +3847,7 @@ next:
 
 	if (tuple == NULL) {
 		/* EOF. Close the iterator immediately. */
+		vy_read_iterator_cache_add(&it->iterator, NULL);
 		vinyl_iterator_close(it);
 		*ret = NULL;
 		return 0;
@@ -3857,6 +3867,7 @@ next:
 		goto fail;
 	if (*ret == NULL)
 		goto next;
+	vy_read_iterator_cache_add(&it->iterator, *ret);
 	tuple_bless(*ret);
 	tuple_unref(*ret);
 	return 0;
diff --git a/src/box/vy_point_lookup.c b/src/box/vy_point_lookup.c
index f2261fdf..5e43340b 100644
--- a/src/box/vy_point_lookup.c
+++ b/src/box/vy_point_lookup.c
@@ -280,11 +280,8 @@ done:
 	if (rc != 0)
 		return -1;
 
-	if (*ret != NULL) {
+	if (*ret != NULL)
 		vy_stmt_counter_acct_tuple(&lsm->stat.get, *ret);
-		if ((*rv)->vlsn == INT64_MAX)
-			vy_cache_add(&lsm->cache, *ret, NULL, key, ITER_EQ);
-	}
 
 	double latency = ev_monotonic_now(loop()) - start_time;
 	latency_collect(&lsm->stat.latency, latency);
diff --git a/src/box/vy_read_iterator.c b/src/box/vy_read_iterator.c
index 160bb899..954fc0df 100644
--- a/src/box/vy_read_iterator.c
+++ b/src/box/vy_read_iterator.c
@@ -845,24 +845,17 @@ vy_read_iterator_next(struct vy_read_iterator *itr, struct tuple **result)
 	ev_tstamp start_time = ev_monotonic_now(loop());
 
 	struct vy_lsm *lsm = itr->lsm;
-	struct tuple *stmt, *prev_stmt;
+	struct tuple *stmt;
 
-	/*
-	 * Remember the statement returned by the last iteration.
-	 * We will need it to update the cache.
-	 */
-	prev_stmt = itr->last_stmt;
-	if (prev_stmt != NULL)
-		tuple_ref(prev_stmt);
-	else /* first iteration */
-		lsm->stat.lookup++;
+	if (itr->last_stmt == NULL)
+		lsm->stat.lookup++; /* first iteration */
 next_key:
 	if (vy_read_iterator_advance(itr) != 0)
-		goto err;
+		return -1;
 	if (vy_read_iterator_apply_history(itr, &stmt) != 0)
-		goto err;
+		return -1;
 	if (vy_read_iterator_track_read(itr, stmt) != 0)
-		goto err;
+		return -1;
 
 	if (itr->last_stmt != NULL)
 		tuple_unref(itr->last_stmt);
@@ -877,9 +870,9 @@ next_key:
 		 * previous + current tuple as an unbroken chain.
 		 */
 		if (vy_stmt_lsn(stmt) == INT64_MAX) {
-			if (prev_stmt != NULL)
-				tuple_unref(prev_stmt);
-			prev_stmt = NULL;
+			if (itr->last_cached_stmt != NULL)
+				tuple_unref(itr->last_cached_stmt);
+			itr->last_cached_stmt = NULL;
 		}
 		goto next_key;
 	}
@@ -887,18 +880,6 @@ next_key:
 	       vy_stmt_type(stmt) == IPROTO_INSERT ||
 	       vy_stmt_type(stmt) == IPROTO_REPLACE);
 
-	/*
-	 * Store the result in the cache provided we are reading
-	 * the latest data.
-	 */
-	if ((**itr->read_view).vlsn == INT64_MAX) {
-		vy_cache_add(&lsm->cache, stmt, prev_stmt,
-			     itr->key, itr->iterator_type);
-	}
-	if (prev_stmt != NULL)
-		tuple_unref(prev_stmt);
-
-	/* Update LSM tree stats. */
 	if (stmt != NULL)
 		vy_stmt_counter_acct_tuple(&lsm->stat.get, stmt);
 
@@ -914,10 +895,24 @@ next_key:
 
 	*result = stmt;
 	return 0;
-err:
-	if (prev_stmt != NULL)
-		tuple_unref(prev_stmt);
-	return -1;
+}
+
+void
+vy_read_iterator_cache_add(struct vy_read_iterator *itr, struct tuple *stmt)
+{
+	if ((**itr->read_view).vlsn != INT64_MAX) {
+		if (itr->last_cached_stmt != NULL)
+			tuple_unref(itr->last_cached_stmt);
+		itr->last_cached_stmt = NULL;
+		return;
+	}
+	vy_cache_add(&itr->lsm->cache, stmt, itr->last_cached_stmt,
+		     itr->key, itr->iterator_type);
+	if (stmt != NULL)
+		tuple_ref(stmt);
+	if (itr->last_cached_stmt != NULL)
+		tuple_unref(itr->last_cached_stmt);
+	itr->last_cached_stmt = stmt;
 }
 
 /**
@@ -928,6 +923,8 @@ vy_read_iterator_close(struct vy_read_iterator *itr)
 {
 	if (itr->last_stmt != NULL)
 		tuple_unref(itr->last_stmt);
+	if (itr->last_cached_stmt != NULL)
+		tuple_unref(itr->last_cached_stmt);
 	vy_read_iterator_cleanup(itr);
 	free(itr->src);
 	TRASH(itr);
diff --git a/src/box/vy_read_iterator.h b/src/box/vy_read_iterator.h
index 2cac1087..baab8859 100644
--- a/src/box/vy_read_iterator.h
+++ b/src/box/vy_read_iterator.h
@@ -65,6 +65,11 @@ struct vy_read_iterator {
 	/** Last statement returned by vy_read_iterator_next(). */
 	struct tuple *last_stmt;
 	/**
+	 * Last statement added to the tuple cache by
+	 * vy_read_iterator_cache_add().
+	 */
+	struct tuple *last_cached_stmt;
+	/**
 	 * Copy of lsm->range_tree_version.
 	 * Used for detecting range tree changes.
 	 */
@@ -142,6 +147,25 @@ NODISCARD int
 vy_read_iterator_next(struct vy_read_iterator *itr, struct tuple **result);
 
 /**
+ * Add the last tuple returned by the read iterator to the cache.
+ * @param itr  Read iterator
+ * @param stmt Last tuple returned by the iterator.
+ *
+ * We use a separate function for populating the cache rather than
+ * doing that right in vy_read_iterator_next() so that we can store
+ * full tuples in a secondary index cache, thus saving some memory.
+ *
+ * Usage pattern:
+ * - Call vy_read_iterator_next() to get a partial tuple.
+ * - Call vy_point_lookup() to get the full tuple corresponding
+ *   to the partial tuple returned by the iterator.
+ * - Call vy_read_iterator_cache_add() on the full tuple to add
+ *   the result to the cache.
+ */
+void
+vy_read_iterator_cache_add(struct vy_read_iterator *itr, struct tuple *stmt);
+
+/**
  * Close the iterator and free resources.
  */
 void
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 11/23] xrow: allow to store flags in DML requests
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (8 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 10/23] vinyl: store full tuples in secondary index cache Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:36     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions Vladimir Davydov
                     ` (11 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the scope of #2129 we need to mark REPLACE statements for which we
generated DELETE in secondary indexes so that we don't generate DELETE
again on compaction. We also need to mark DELETE statements that were
generated on compaction so that we can skip them on SELECT.

Let's add a flags field to struct vy_stmt. Flags are stored both in memory
and on disk, so to encode/decode them we also need to add a new iproto
key (IPROTO_FLAGS) and the corresponding field to struct request.
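
A rough sketch of the encode/decode round trip follows. It is not the
xrow code and does not use MessagePack: the toy format (a pair count
followed by key/value bytes) and all toy_* names are assumptions made
for illustration. What it is meant to show is that the flags key is
written only when flags are non-zero, and that a decoder sees flags as
zero when the key is absent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Key values mirror the iproto keys; the wire format here is a toy. */
enum {
	TOY_KEY_SPACE_ID = 0x10,
	TOY_KEY_FLAGS = 0x16,
};

struct toy_request {
	uint8_t space_id;
	uint8_t flags;
};

/*
 * Encode as: buf[0] = number of pairs, then (key, value) byte pairs.
 * Returns the number of bytes written.
 */
static size_t
toy_encode(const struct toy_request *req, uint8_t *buf)
{
	uint8_t *pos = buf + 1;
	uint8_t map_size = 0;
	*pos++ = TOY_KEY_SPACE_ID;
	*pos++ = req->space_id;
	map_size++;
	if (req->flags) {
		/* Omitted when zero, so flagless rows take no extra space. */
		*pos++ = TOY_KEY_FLAGS;
		*pos++ = req->flags;
		map_size++;
	}
	buf[0] = map_size;
	return (size_t)(pos - buf);
}

/* Decode a buffer produced by toy_encode(); unknown keys are ignored. */
static void
toy_decode(const uint8_t *buf, struct toy_request *req)
{
	req->space_id = 0;
	req->flags = 0;
	uint8_t count = buf[0];
	const uint8_t *pos = buf + 1;
	for (uint8_t i = 0; i < count; i++) {
		uint8_t key = *pos++;
		uint8_t value = *pos++;
		switch (key) {
		case TOY_KEY_SPACE_ID:
			req->space_id = value;
			break;
		case TOY_KEY_FLAGS:
			req->flags = value;
			break;
		default:
			break;
		}
	}
}
```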

Needed for #2129
---
 src/box/iproto_constants.c |  4 ++--
 src/box/iproto_constants.h |  3 ++-
 src/box/vy_stmt.c          |  4 ++++
 src/box/vy_stmt.h          | 15 +++++++++++++++
 src/box/xrow.c             |  8 ++++++++
 src/box/xrow.h             |  2 ++
 6 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 3adb7cd4..651e07b3 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -61,10 +61,10 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 		/* 0x13 */	MP_UINT, /* IPROTO_OFFSET */
 		/* 0x14 */	MP_UINT, /* IPROTO_ITERATOR */
 		/* 0x15 */	MP_UINT, /* IPROTO_INDEX_BASE */
+		/* 0x16 */	MP_UINT, /* IPROTO_FLAGS */
 	/* }}} */
 
 	/* {{{ unused */
-		/* 0x16 */	MP_UINT,
 		/* 0x17 */	MP_UINT,
 		/* 0x18 */	MP_UINT,
 		/* 0x19 */	MP_UINT,
@@ -148,7 +148,7 @@ const char *iproto_key_strs[IPROTO_KEY_MAX] = {
 	"offset",           /* 0x13 */
 	"iterator",         /* 0x14 */
 	"index base",       /* 0x15 */
-	NULL,               /* 0x16 */
+	"flags",            /* 0x16 */
 	NULL,               /* 0x17 */
 	NULL,               /* 0x18 */
 	NULL,               /* 0x19 */
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index d1320de7..f11c7fa9 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -65,6 +65,7 @@ enum iproto_key {
 	IPROTO_OFFSET = 0x13,
 	IPROTO_ITERATOR = 0x14,
 	IPROTO_INDEX_BASE = 0x15,
+	IPROTO_FLAGS = 0x16,
 	/* Leave a gap between integer values and other keys */
 	IPROTO_KEY = 0x20,
 	IPROTO_TUPLE = 0x21,
@@ -89,7 +90,7 @@ enum iproto_key {
 			  bit(LSN) | bit(SCHEMA_VERSION))
 #define IPROTO_DML_BODY_BMAP (bit(SPACE_ID) | bit(INDEX_ID) | bit(LIMIT) |\
 			      bit(OFFSET) | bit(ITERATOR) | bit(INDEX_BASE) |\
-			      bit(KEY) | bit(TUPLE) | bit(OPS))
+			      bit(KEY) | bit(TUPLE) | bit(OPS) | bit(FLAGS))
 
 static inline bool
 xrow_header_has_key(const char *pos, const char *end)
diff --git a/src/box/vy_stmt.c b/src/box/vy_stmt.c
index a4b7975b..09daa7f4 100644
--- a/src/box/vy_stmt.c
+++ b/src/box/vy_stmt.c
@@ -112,6 +112,7 @@ vy_stmt_alloc(struct tuple_format *format, uint32_t bsize)
 	tuple->data_offset = sizeof(struct vy_stmt) + meta_size;;
 	vy_stmt_set_lsn(tuple, 0);
 	vy_stmt_set_type(tuple, 0);
+	vy_stmt_set_flags(tuple, 0);
 	return tuple;
 }
 
@@ -498,6 +499,7 @@ vy_stmt_encode_primary(const struct tuple *value,
 	struct request request;
 	memset(&request, 0, sizeof(request));
 	request.type = type;
+	request.flags = vy_stmt_flags(value);
 	request.space_id = space_id;
 	uint32_t size;
 	const char *extracted = NULL;
@@ -544,6 +546,7 @@ vy_stmt_encode_secondary(const struct tuple *value,
 	struct request request;
 	memset(&request, 0, sizeof(request));
 	request.type = type;
+	request.flags = vy_stmt_flags(value);
 	uint32_t size;
 	const char *extracted = tuple_extract_key(value, cmp_def, &size);
 	if (extracted == NULL)
@@ -614,6 +617,7 @@ vy_stmt_decode(struct xrow_header *xrow, const struct key_def *key_def,
 		return NULL; /* OOM */
 
 	vy_stmt_set_lsn(stmt, xrow->lsn);
+	vy_stmt_set_flags(stmt, request.flags);
 	return stmt;
 }
 
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index e53f98ce..bcf855dd 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -103,6 +103,7 @@ struct vy_stmt {
 	struct tuple base;
 	int64_t lsn;
 	uint8_t  type; /* IPROTO_SELECT/REPLACE/UPSERT/DELETE */
+	uint8_t flags;
 	/**
 	 * Offsets array concatenated with MessagePack fields
 	 * array.
@@ -138,6 +139,20 @@ vy_stmt_set_type(struct tuple *stmt, enum iproto_type type)
 	((struct vy_stmt *) stmt)->type = type;
 }
 
+/** Get flags of the vinyl statement. */
+static inline uint8_t
+vy_stmt_flags(const struct tuple *stmt)
+{
+	return ((const struct vy_stmt *)stmt)->flags;
+}
+
+/** Set flags of the vinyl statement. */
+static inline void
+vy_stmt_set_flags(struct tuple *stmt, uint8_t flags)
+{
+	((struct vy_stmt *)stmt)->flags = flags;
+}
+
 /**
  * Get upserts count of the vinyl statement.
  * Only for UPSERT statements allocated on lsregion.
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 48fbff27..b74925e0 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -454,6 +454,9 @@ error:
 		case IPROTO_INDEX_BASE:
 			request->index_base = mp_decode_uint(&value);
 			break;
+		case IPROTO_FLAGS:
+			request->flags = mp_decode_uint(&value);
+			break;
 		case IPROTO_LIMIT:
 			request->limit = mp_decode_uint(&value);
 			break;
@@ -547,6 +550,11 @@ xrow_encode_dml(const struct request *request, struct iovec *iov)
 		pos = mp_encode_uint(pos, request->index_base);
 		map_size++;
 	}
+	if (request->flags) {
+		pos = mp_encode_uint(pos, IPROTO_FLAGS);
+		pos = mp_encode_uint(pos, request->flags);
+		map_size++;
+	}
 	if (request->key) {
 		pos = mp_encode_uint(pos, IPROTO_KEY);
 		memcpy(pos, request->key, key_len);
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 1bb5f103..c69e450c 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -128,6 +128,8 @@ struct request {
 	const char *ops_end;
 	/** Base field offset for UPDATE/UPSERT, e.g. 0 for C and 1 for Lua. */
 	int index_base;
+	/** Engine-specific statement flags. */
+	uint8_t flags;
 };
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (9 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 11/23] xrow: allow to store flags in DML requests Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-17 10:16     ` Vladimir Davydov
  2018-07-31 20:38     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge Vladimir Davydov
                     ` (10 subsequent siblings)
  21 siblings, 2 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This is not necessary, as we can use fiber()->gc, as we usually do.
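
For readers unfamiliar with the pattern, here is a small sketch of an
implicit per-context allocator. The toy_* names are hypothetical and the
single static region merely stands in for fiber()->gc; the real region
allocator is more elaborate. The point is that helpers fetch the region
themselves instead of receiving it as a parameter, while the outermost
caller remembers the used size and truncates back to it.

```c
#include <assert.h>
#include <stddef.h>

/* Toy bump allocator standing in for a fiber's gc region. */
struct toy_region {
	size_t used;
	unsigned char buf[256];
};

static struct toy_region toy_current_region;

/* Analogue of &fiber()->gc: returns the implicit region. */
static struct toy_region *
toy_gc(void)
{
	return &toy_current_region;
}

static void *
toy_region_alloc(struct toy_region *r, size_t size)
{
	size_t align = sizeof(void *);
	size_t start = (r->used + align - 1) / align * align;
	if (start + size > sizeof(r->buf))
		return NULL;
	r->used = start + size;
	return r->buf + start;
}

struct toy_history {
	int value;
	struct toy_history *next;
};

/* No region argument: the helper uses the implicit one. */
static struct toy_history *
toy_history_new(int value, struct toy_history *next)
{
	struct toy_history *h;
	h = toy_region_alloc(toy_gc(), sizeof(*h));
	if (h == NULL)
		return NULL;
	h->value = value;
	h->next = next;
	return h;
}
```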
---
 src/box/vy_write_iterator.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index 52b28aca..7d2ec955 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -91,7 +91,6 @@ struct vy_write_history {
  * reverses key LSN order from newest first to oldest first, i.e.
  * orders statements on the same key chronologically.
  *
- * @param region Allocator for the object.
  * @param tuple Key version.
  * @param next Next version of the key.
  *
@@ -99,11 +98,10 @@ struct vy_write_history {
  * @retval NULL     Memory error.
  */
 static inline struct vy_write_history *
-vy_write_history_new(struct region *region, struct tuple *tuple,
-		     struct vy_write_history *next)
+vy_write_history_new(struct tuple *tuple, struct vy_write_history *next)
 {
-	struct vy_write_history *h =
-		region_alloc_object(region, struct vy_write_history);
+	struct vy_write_history *h;
+	h = region_alloc_object(&fiber()->gc, struct vy_write_history);
 	if (h == NULL)
 		return NULL;
 	h->tuple = tuple;
@@ -499,15 +497,14 @@ vy_write_iterator_get_vlsn(struct vy_write_iterator *stream, int rv_i)
  * @retval -1 Memory error.
  */
 static inline int
-vy_write_iterator_push_rv(struct region *region,
-			  struct vy_write_iterator *stream,
+vy_write_iterator_push_rv(struct vy_write_iterator *stream,
 			  struct tuple *tuple, int current_rv_i)
 {
 	assert(current_rv_i < stream->rv_count);
 	struct vy_read_view_stmt *rv = &stream->read_views[current_rv_i];
 	assert(rv->vlsn >= vy_stmt_lsn(tuple));
 	struct vy_write_history *h =
-		vy_write_history_new(region, tuple, rv->history);
+		vy_write_history_new(tuple, rv->history);
 	if (h == NULL)
 		return -1;
 	rv->history = h;
@@ -560,7 +557,6 @@ vy_write_iterator_pop_read_view_stmt(struct vy_write_iterator *stream)
  * This is why there is a special "merge" step which applies
  * UPSERTs and builds a tuple for each read view.
  *
- * @param region History objects allocator.
  * @param stream Write iterator.
  * @param[out] count Count of statements saved in the history.
  * @param[out] is_first_insert Set if the oldest statement for
@@ -570,8 +566,7 @@ vy_write_iterator_pop_read_view_stmt(struct vy_write_iterator *stream)
  * @retval -1 Memory error.
  */
 static NODISCARD int
-vy_write_iterator_build_history(struct region *region,
-				struct vy_write_iterator *stream,
+vy_write_iterator_build_history(struct vy_write_iterator *stream,
 				int *count, bool *is_first_insert)
 {
 	*count = 0;
@@ -678,8 +673,7 @@ vy_write_iterator_build_history(struct region *region,
 			    key_update_can_be_skipped(key_mask, stmt_mask))
 				goto next_lsn;
 
-			rc = vy_write_iterator_push_rv(region, stream,
-						       src->tuple,
+			rc = vy_write_iterator_push_rv(stream, src->tuple,
 						       current_rv_i);
 			if (rc != 0)
 				break;
@@ -693,7 +687,7 @@ vy_write_iterator_build_history(struct region *region,
 		}
 
 		assert(vy_stmt_type(src->tuple) == IPROTO_UPSERT);
-		rc = vy_write_iterator_push_rv(region, stream, src->tuple,
+		rc = vy_write_iterator_push_rv(stream, src->tuple,
 					       current_rv_i);
 		if (rc != 0)
 			break;
@@ -857,7 +851,7 @@ vy_write_iterator_build_read_views(struct vy_write_iterator *stream, int *count)
 	struct region *region = &fiber()->gc;
 	size_t used = region_used(region);
 	stream->rv_used_count = 0;
-	if (vy_write_iterator_build_history(region, stream, &raw_count,
+	if (vy_write_iterator_build_history(stream, &raw_count,
 					    &is_first_insert) != 0)
 		goto error;
 	if (raw_count == 0) {
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (10 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-17 10:16     ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring Vladimir Davydov
                     ` (9 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If is_first_insert flag is set and vy_stmt_type(rv->tuple) equals
IPROTO_DELETE, we free rv->tuple, but then we dereference it via
an on-stack variable to check if we need to turn a REPLACE into an
INSERT or vice versa. Fix this.
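
The shape of the fix can be sketched in isolation as follows. The toy_*
types are hypothetical and use plain malloc/free rather than vinyl's
reference counting; the sketch only shows why the second check must be
an "else if" that reads through rv->tuple rather than through a stale
local copy of the pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

enum toy_type { TOY_INSERT, TOY_REPLACE, TOY_DELETE };

struct toy_stmt {
	enum toy_type type;
	long lsn;
};

struct toy_rv {
	struct toy_stmt *tuple;
};

static struct toy_stmt *
toy_stmt_new(enum toy_type type, long lsn)
{
	struct toy_stmt *stmt = malloc(sizeof(*stmt));
	if (stmt == NULL)
		return NULL;
	stmt->type = type;
	stmt->lsn = lsn;
	return stmt;
}

/*
 * Either discard a leading DELETE or swap REPLACE/INSERT.
 * The statement is never touched after it has been freed.
 */
static int
toy_fixup_first(struct toy_rv *rv, bool is_first_insert)
{
	if (is_first_insert && rv->tuple->type == TOY_DELETE) {
		free(rv->tuple);
		rv->tuple = NULL;
	} else if ((is_first_insert && rv->tuple->type == TOY_REPLACE) ||
		   (!is_first_insert && rv->tuple->type == TOY_INSERT)) {
		struct toy_stmt *copy =
			toy_stmt_new(is_first_insert ? TOY_INSERT : TOY_REPLACE,
				     rv->tuple->lsn);
		if (copy == NULL)
			return -1;
		free(rv->tuple);
		rv->tuple = copy;
	}
	return 0;
}
```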
---
 src/box/vy_write_iterator.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index 7d2ec955..4e758be8 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -792,8 +792,7 @@ vy_read_view_merge(struct vy_write_iterator *stream, struct tuple *hint,
 		/* Not the first statement. */
 		return 0;
 	}
-	struct tuple *tuple = rv->tuple;
-	if (is_first_insert && vy_stmt_type(tuple) == IPROTO_DELETE) {
+	if (is_first_insert && vy_stmt_type(rv->tuple) == IPROTO_DELETE) {
 		/*
 		 * Optimization 6: discard the first DELETE if
 		 * the oldest statement for the current key among
@@ -801,11 +800,12 @@ vy_read_view_merge(struct vy_write_iterator *stream, struct tuple *hint,
 		 * statements for this key in older runs or the
 		 * last statement is a DELETE.
 		 */
-		vy_stmt_unref_if_possible(tuple);
+		vy_stmt_unref_if_possible(rv->tuple);
 		rv->tuple = NULL;
-	}
-	if ((is_first_insert && vy_stmt_type(tuple) == IPROTO_REPLACE) ||
-	    (!is_first_insert && vy_stmt_type(tuple) == IPROTO_INSERT)) {
+	} else if ((is_first_insert &&
+		    vy_stmt_type(rv->tuple) == IPROTO_REPLACE) ||
+		   (!is_first_insert &&
+		    vy_stmt_type(rv->tuple) == IPROTO_INSERT)) {
 		/*
 		 * If the oldest statement among all sources is an
 		 * INSERT, convert the first REPLACE to an INSERT
@@ -818,14 +818,14 @@ vy_read_view_merge(struct vy_write_iterator *stream, struct tuple *hint,
 		 * compaction.
 		 */
 		uint32_t size;
-		const char *data = tuple_data_range(tuple, &size);
+		const char *data = tuple_data_range(rv->tuple, &size);
 		struct tuple *copy = is_first_insert ?
 			vy_stmt_new_insert(stream->format, data, data + size) :
 			vy_stmt_new_replace(stream->format, data, data + size);
 		if (copy == NULL)
 			return -1;
-		vy_stmt_set_lsn(copy, vy_stmt_lsn(tuple));
-		vy_stmt_unref_if_possible(tuple);
+		vy_stmt_set_lsn(copy, vy_stmt_lsn(rv->tuple));
+		vy_stmt_unref_if_possible(rv->tuple);
 		rv->tuple = copy;
 	}
 	return 0;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (11 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-17 10:17     ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 15/23] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
                     ` (8 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Move key_def creation to compare_write_iterator_results as it is the
same for all test cases. Performance is not an issue here, obviously, so
we can close our eyes to the fact that now we create a new key def for
each test case.
---
 test/unit/vy_write_iterator.c | 56 +++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 31 deletions(-)

diff --git a/test/unit/vy_write_iterator.c b/test/unit/vy_write_iterator.c
index 6a112028..25a346af 100644
--- a/test/unit/vy_write_iterator.c
+++ b/test/unit/vy_write_iterator.c
@@ -4,11 +4,10 @@
 #include "vy_iterators_helper.h"
 
 /**
- * Create the mem with the specified key_def and content, iterate
- * over it with write_iterator and compare actual result
- * statements with the expected ones.
+ * Create a mem with the specified content, iterate over it with
+ * write_iterator and compare actual result statements with the
+ * expected ones.
  *
- * @param key_def Key definition for the mem.
  * @param content Mem content statements.
  * @param content_count Size of the @content.
  * @param expected Expected results of the iteration.
@@ -20,14 +19,17 @@
  * @param is_last_level True, if the new mem is the last level.
  */
 void
-compare_write_iterator_results(struct key_def *key_def,
-			       const struct vy_stmt_template *content,
+compare_write_iterator_results(const struct vy_stmt_template *content,
 			       int content_count,
 			       const struct vy_stmt_template *expected,
 			       int expected_count,
 			       const int *vlsns, int vlsns_count,
 			       bool is_primary, bool is_last_level)
 {
+	uint32_t fields[] = { 0 };
+	uint32_t types[] = { FIELD_TYPE_UNSIGNED };
+	struct key_def *key_def = box_key_def_new(fields, types, 1);
+	fail_if(key_def == NULL);
 	struct vy_mem *mem = create_test_mem(key_def);
 	for (int i = 0; i < content_count; ++i)
 		vy_mem_insert_template(mem, &content[i]);
@@ -59,7 +61,7 @@ compare_write_iterator_results(struct key_def *key_def,
 	/* Clean up */
 	wi->iface->close(wi);
 	vy_mem_delete(mem);
-
+	box_key_def_delete(key_def);
 	free(rv_array);
 }
 
@@ -68,13 +70,7 @@ test_basic(void)
 {
 	header();
 	plan(46);
-
-	/* Create key_def */
-	uint32_t fields[] = { 0 };
-	uint32_t types[] = { FIELD_TYPE_UNSIGNED };
-	struct key_def *key_def = box_key_def_new(fields, types, 1);
-	assert(key_def != NULL);
-
+{
 /*
  * STATEMENT: REPL REPL REPL  DEL  REPL  REPL  REPL  REPL  REPL  REPL
  * LSN:        5     6   7     8    9     10    11    12    13    14
@@ -82,7 +78,6 @@ test_basic(void)
  *            \____________/\________/\_________________/\___________/
  *                 merge       merge          merge           merge
  */
-{
 	const struct vy_stmt_template content[] = {
 		STMT_TEMPLATE(5, REPLACE, 1, 1),
 		STMT_TEMPLATE(6, REPLACE, 1, 2),
@@ -102,7 +97,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, true);
 }
@@ -136,7 +131,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -164,7 +159,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, true);
 }
@@ -184,7 +179,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, true);
 }
@@ -208,7 +203,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, true);
 }
@@ -231,7 +226,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -259,7 +254,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, false, true);
 }
@@ -279,7 +274,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, false, false);
 }
@@ -306,7 +301,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -334,7 +329,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, true);
 }
@@ -359,7 +354,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, false, false);
 }
@@ -384,7 +379,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -414,7 +409,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -455,7 +450,7 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
@@ -495,11 +490,10 @@ test_basic(void)
 	int content_count = sizeof(content) / sizeof(content[0]);
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
-	compare_write_iterator_results(key_def, content, content_count,
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
 				       vlsns, vlsns_count, true, false);
 }
-	key_def_delete(key_def);
 	fiber_gc();
 	footer();
 	check_plan();
-- 
2.11.0

* [RFC PATCH 15/23] vinyl: teach write iterator to return overwritten tuples
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (12 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 16/23] vinyl: allow to skip certain statements on read Vladimir Davydov
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

A REPLACE/DELETE request is supposed to delete the old tuple from all
indexes. In order to generate a DELETE statement for a secondary index,
we need to look up the old tuple in the primary index, which is costly
as it implies a random disk access. In the scope of #2129 we are
planning to optimize out the lookup by deferring generation of the
DELETE statement until primary index compaction.

To do that, we need to differentiate statements for which DELETE was
deferred from those for which it was inserted when the request was
executed (as is the case for UPDATE). So this patch introduces a
per-statement flag, VY_STMT_DEFERRED_DELETE. If set for a REPLACE or
DELETE statement, it makes the write iterator return the overwritten
statement to the caller via a callback.
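
The flag/callback interplay described above can be modelled in isolation.
The sketch below is not the vinyl code: struct stmt, scan_key_history()
and count_cb() are hypothetical stand-ins that assume the history of a
single key is scanned from the newest statement to the oldest, the way
the write iterator does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the vinyl types; all names are hypothetical. */
enum stmt_type { STMT_REPLACE, STMT_DELETE, STMT_INSERT };
enum { STMT_DEFERRED_DELETE = 1 << 0 };

struct stmt {
	enum stmt_type type;
	int64_t lsn;
	uint8_t flags;
	int value; /* models a secondary index key part */
};

typedef int (*deferred_delete_f)(const struct stmt *old_stmt,
				 const struct stmt *new_stmt, void *ctx);

/*
 * Walk the history of one key from the newest statement to the
 * oldest and report every tuple overwritten by a flagged statement.
 * Returns 0 on success, -1 if the callback fails.
 */
static int
scan_key_history(const struct stmt *history, size_t count,
		 deferred_delete_f cb, void *ctx)
{
	const struct stmt *pending = NULL;
	for (size_t i = 0; i < count; i++) {
		const struct stmt *cur = &history[i];
		if (pending != NULL) {
			/* A DELETE carries no tuple, nothing to report. */
			if (cur->type != STMT_DELETE &&
			    cb(cur, pending, ctx) != 0)
				return -1;
			pending = NULL;
		}
		if ((cur->flags & STMT_DEFERRED_DELETE) != 0) {
			assert(cur->type == STMT_REPLACE ||
			       cur->type == STMT_DELETE);
			/* Remember it until the next older statement. */
			pending = cur;
		}
	}
	return 0;
}

/* Example callback: count reported tuples, remember the last one. */
struct counter {
	int calls;
	int last_old_value;
};

static int
count_cb(const struct stmt *old_stmt, const struct stmt *new_stmt, void *ctx)
{
	(void)new_stmt;
	struct counter *c = ctx;
	c->calls++;
	c->last_old_value = old_stmt->value;
	return 0;
}
```

In this model a flagged statement that directly follows a DELETE produces
no callback invocation, since there is no old tuple to remove from the
secondary indexes.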

Needed for #2129
---
 src/box/vinyl.c                    |   3 +-
 src/box/vy_scheduler.c             |   4 +-
 src/box/vy_stmt.h                  |  19 +++
 src/box/vy_write_iterator.c        | 131 +++++++++++++++++-
 src/box/vy_write_iterator.h        |  27 +++-
 test/unit/vy_iterators_helper.c    |   5 +
 test/unit/vy_iterators_helper.h    |  12 +-
 test/unit/vy_point_lookup.c        |   4 +-
 test/unit/vy_write_iterator.c      | 263 ++++++++++++++++++++++++++++++++++---
 test/unit/vy_write_iterator.result |  23 +++-
 10 files changed, 458 insertions(+), 33 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index f05a4a0e..7e23dd93 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3040,7 +3040,8 @@ vy_send_range(struct vy_join_ctx *ctx,
 	struct rlist fake_read_views;
 	rlist_create(&fake_read_views);
 	ctx->wi = vy_write_iterator_new(ctx->key_def, ctx->format,
-					true, true, &fake_read_views);
+					true, true, &fake_read_views,
+					NULL, NULL);
 	if (ctx->wi == NULL) {
 		rc = -1;
 		goto out;
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index a82fe9f2..4d1f3474 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -1012,7 +1012,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (lsm->run_count == 0);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views);
+				   scheduler->read_views, NULL, NULL);
 	if (wi == NULL)
 		goto err_wi;
 	rlist_foreach_entry(mem, &lsm->sealed, in_sealed) {
@@ -1283,7 +1283,7 @@ vy_task_compact_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (range->compact_priority == range->slice_count);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views);
+				   scheduler->read_views, NULL, NULL);
 	if (wi == NULL)
 		goto err_wi;
 
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index bcf855dd..8de8aa84 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -70,6 +70,25 @@ extern struct tuple_format_vtab vy_tuple_format_vtab;
  */
 extern size_t vy_max_tuple_size;
 
+/** Statement flags. */
+enum {
+	/**
+	 * A REPLACE/DELETE request is supposed to delete the old
+	 * tuple from all indexes. In order to generate a DELETE
+	 * statement for a secondary index, we need to look up the
+	 * old tuple in the primary index, which is expensive as
+	 * it implies a random disk access. We can optimize out the
+	 * lookup by deferring generation of the DELETE statement
+	 * until primary index compaction.
+	 *
+	 * The following flag is set for those REPLACE and DELETE
+	 * statements that skipped deletion of the old tuple from
+	 * secondary indexes. It makes the write iterator generate
+	 * DELETE statements for them during compaction.
+	 */
+	VY_STMT_DEFERRED_DELETE		= 1 << 0,
+};
+
 /**
  * There are two groups of statements:
  *
diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index 4e758be8..2ed125fb 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -177,7 +177,16 @@ struct vy_write_iterator {
 	 * key and its tuple format is different.
 	 */
 	bool is_primary;
-
+	/** Callback for generating deferred DELETE statements. */
+	vy_deferred_delete_f deferred_delete_cb;
+	/** Context passed to @deferred_delete_cb. */
+	void *deferred_delete_ctx;
+	/**
+	 * Last scanned REPLACE or DELETE statement that was
+	 * inserted into the primary index without deletion
+	 * of the old tuple from secondary indexes.
+	 */
+	struct tuple *deferred_delete_stmt;
 	/** Length of the @read_views. */
 	int rv_count;
 	/**
@@ -327,9 +336,10 @@ static const struct vy_stmt_stream_iface vy_slice_stream_iface;
  */
 struct vy_stmt_stream *
 vy_write_iterator_new(const struct key_def *cmp_def,
-		      struct tuple_format *format,
-		      bool is_primary, bool is_last_level,
-		      struct rlist *read_views)
+		      struct tuple_format *format, bool is_primary,
+		      bool is_last_level, struct rlist *read_views,
+		      vy_deferred_delete_f deferred_delete_cb,
+		      void *deferred_delete_ctx)
 {
 	/*
 	 * One is reserved for INT64_MAX - maximal read view.
@@ -364,6 +374,8 @@ vy_write_iterator_new(const struct key_def *cmp_def,
 	tuple_format_ref(stream->format);
 	stream->is_primary = is_primary;
 	stream->is_last_level = is_last_level;
+	stream->deferred_delete_cb = deferred_delete_cb;
+	stream->deferred_delete_ctx = deferred_delete_ctx;
 	return &stream->base;
 }
 
@@ -398,6 +410,10 @@ vy_write_iterator_stop(struct vy_stmt_stream *vstream)
 	struct vy_write_src *src, *tmp;
 	rlist_foreach_entry_safe(src, &stream->src_list, in_src_list, tmp)
 		vy_write_iterator_delete_src(stream, src);
+	if (stream->deferred_delete_stmt != NULL) {
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
 }
 
 /**
@@ -548,6 +564,62 @@ vy_write_iterator_pop_read_view_stmt(struct vy_write_iterator *stream)
 }
 
 /**
+ * Generate a DELETE statement for the given tuple if its
+ * deletion from secondary indexes was deferred.
+ *
+ * @param stream Write iterator.
+ * @param stmt Current statement.
+ *
+ * @retval  0 Success.
+ * @retval -1 Error.
+ */
+static int
+vy_write_iterator_deferred_delete(struct vy_write_iterator *stream,
+				  struct tuple *stmt)
+{
+	/*
+	 * Nothing to do if the caller isn't interested in
+	 * deferred DELETEs.
+	 */
+	if (stream->deferred_delete_cb == NULL)
+		return 0;
+	/*
+	 * UPSERTs cannot change secondary index parts, nor can
+	 * they produce deferred DELETEs, so we skip them.
+	 */
+	if (vy_stmt_type(stmt) == IPROTO_UPSERT) {
+		assert((vy_stmt_flags(stmt) & VY_STMT_DEFERRED_DELETE) == 0);
+		return 0;
+	}
+	/*
+	 * Invoke the callback to generate a deferred DELETE
+	 * in case the current tuple was overwritten.
+	 */
+	if (stream->deferred_delete_stmt != NULL) {
+		if (vy_stmt_type(stmt) != IPROTO_DELETE &&
+		    stream->deferred_delete_cb(stmt,
+				stream->deferred_delete_stmt,
+				stream->deferred_delete_ctx) != 0)
+			return -1;
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
+	/*
+	 * Remember the current statement if it is marked with
+	 * VY_STMT_DEFERRED_DELETE so that we can use it to
+	 * generate a DELETE for the overwritten tuple when this
+	 * function is called next time.
+	 */
+	if ((vy_stmt_flags(stmt) & VY_STMT_DEFERRED_DELETE) != 0) {
+		assert(vy_stmt_type(stmt) == IPROTO_DELETE ||
+		       vy_stmt_type(stmt) == IPROTO_REPLACE);
+		vy_stmt_ref_if_possible(stmt);
+		stream->deferred_delete_stmt = stmt;
+	}
+	return 0;
+}
+
+/**
  * Build the history of the current key.
  * Apply optimizations 1, 2 and 3 (@sa vy_write_iterator.h).
  * When building a history, some statements can be
@@ -572,6 +644,7 @@ vy_write_iterator_build_history(struct vy_write_iterator *stream,
 	*count = 0;
 	*is_first_insert = false;
 	assert(stream->stmt_i == -1);
+	assert(stream->deferred_delete_stmt == NULL);
 	struct heap_node *node = vy_source_heap_top(&stream->src_heap);
 	if (node == NULL)
 		return 0; /* no more data */
@@ -624,6 +697,10 @@ vy_write_iterator_build_history(struct vy_write_iterator *stream,
 				*is_first_insert = true;
 		}
 
+		rc = vy_write_iterator_deferred_delete(stream, src->tuple);
+		if (rc != 0)
+			break;
+
 		if (vy_stmt_lsn(src->tuple) > current_rv_lsn) {
 			/*
 			 * Skip statements invisible to the current read
@@ -704,6 +781,17 @@ next_lsn:
 			break;
 	}
 
+	/*
+	 * No point in keeping the last VY_STMT_DEFERRED_DELETE
+	 * statement around if this is major compaction, because
+	 * there's no tuple it could overwrite.
+	 */
+	if (rc == 0 && stream->is_last_level &&
+	    stream->deferred_delete_stmt != NULL) {
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
+
 	vy_source_heap_delete(&stream->src_heap, &end_of_key_src.heap_node);
 	vy_stmt_unref_if_possible(end_of_key_src.tuple);
 	return rc;
@@ -788,6 +876,23 @@ vy_read_view_merge(struct vy_write_iterator *stream, struct tuple *hint,
 	rv->history = NULL;
 	result->tuple = NULL;
 	assert(result->next == NULL);
+	/*
+	 * The write iterator generates deferred DELETEs for all
+	 * VY_STMT_DEFERRED_DELETE statements, except, maybe,
+	 * the last seen one. Clear the flag for all other output
+	 * statements so as not to generate the same DELETEs on
+	 * the next compaction.
+	 */
+	uint8_t flags = vy_stmt_flags(rv->tuple);
+	if (rv->tuple != stream->deferred_delete_stmt &&
+	    (flags & VY_STMT_DEFERRED_DELETE) != 0) {
+		if (!vy_stmt_is_refable(rv->tuple)) {
+			rv->tuple = vy_stmt_dup(rv->tuple);
+			if (rv->tuple == NULL)
+				return -1;
+		}
+		vy_stmt_set_flags(rv->tuple, flags & ~VY_STMT_DEFERRED_DELETE);
+	}
 	if (hint != NULL) {
 		/* Not the first statement. */
 		return 0;
@@ -912,6 +1017,24 @@ vy_write_iterator_next(struct vy_stmt_stream *vstream,
 	*ret = vy_write_iterator_pop_read_view_stmt(stream);
 	if (*ret != NULL)
 		return 0;
+	/*
+	 * If we didn't generate a deferred DELETE corresponding to
+	 * the last seen VY_STMT_DEFERRED_DELETE statement, we must
+	 * include it into the output, because there still might be
+	 * an overwritten tuple in an older source.
+	 */
+	if (stream->stmt_i >= 0 &&
+	    stream->deferred_delete_stmt != NULL &&
+	    vy_stmt_lsn(stream->deferred_delete_stmt) !=
+	    vy_write_iterator_get_vlsn(stream, stream->stmt_i)) {
+		stream->stmt_i = -1;
+		*ret = stream->deferred_delete_stmt;
+		return 0;
+	}
+	if (stream->deferred_delete_stmt != NULL) {
+		vy_stmt_unref_if_possible(stream->deferred_delete_stmt);
+		stream->deferred_delete_stmt = NULL;
+	}
 
 	/* Build the next key sequence. */
 	stream->stmt_i = -1;
diff --git a/src/box/vy_write_iterator.h b/src/box/vy_write_iterator.h
index ea14b07a..3430bbd2 100644
--- a/src/box/vy_write_iterator.h
+++ b/src/box/vy_write_iterator.h
@@ -220,6 +220,24 @@ struct vy_mem;
 struct vy_slice;
 
 /**
+ * Callback invoked by the write iterator for tuples that were
+ * overwritten or deleted without generating a DELETE statement
+ * for secondary indexes.
+ *
+ * @param old_stmt Overwritten tuple.
+ * @param new_stmt Statement that overwrote @old_stmt.
+ * @param ctx Callback context.
+ *
+ * @retval  0 Success.
+ * @retval -1 Error.
+ *
+ * @sa VY_STMT_DEFERRED_DELETE.
+ */
+typedef int
+(*vy_deferred_delete_f)(struct tuple *old_stmt,
+			struct tuple *new_stmt, void *ctx);
+
+/**
  * Open an empty write iterator. To add sources to the iterator
  * use vy_write_iterator_add_* functions.
  * @param cmp_def - key definition for tuple compare.
@@ -227,13 +245,16 @@ struct vy_slice;
  * @param LSM tree is_primary - set if this iterator is for a primary index.
  * @param is_last_level - there is no older level than the one we're writing to.
  * @param read_views - Opened read views.
+ * @param deferred_delete_cb - Callback for generating deferred DELETEs.
+ * @param deferred_delete_ctx - Context passed to @deferred_delete_cb.
  * @return the iterator or NULL on error (diag is set).
  */
 struct vy_stmt_stream *
 vy_write_iterator_new(const struct key_def *cmp_def,
-		      struct tuple_format *format,
-		      bool is_primary, bool is_last_level,
-		      struct rlist *read_views);
+		      struct tuple_format *format, bool is_primary,
+		      bool is_last_level, struct rlist *read_views,
+		      vy_deferred_delete_f deferred_delete_cb,
+		      void *deferred_delete_ctx);
 
 /**
  * Add a mem as a source to the iterator.
diff --git a/test/unit/vy_iterators_helper.c b/test/unit/vy_iterators_helper.c
index 642d8bf2..89603376 100644
--- a/test/unit/vy_iterators_helper.c
+++ b/test/unit/vy_iterators_helper.c
@@ -136,6 +136,7 @@ vy_new_simple_stmt(struct tuple_format *format,
 	}
 	free(buf);
 	vy_stmt_set_lsn(ret, templ->lsn);
+	vy_stmt_set_flags(ret, templ->flags);
 	if (templ->optimize_update)
 		vy_stmt_set_column_mask(ret, 0);
 	return ret;
@@ -277,6 +278,10 @@ vy_stmt_are_same(const struct tuple *actual,
 		tuple_unref(tmp);
 		return false;
 	}
+	if (vy_stmt_flags(actual) != expected->flags) {
+		tuple_unref(tmp);
+		return false;
+	}
 	bool rc = memcmp(a, b, a_len) == 0;
 	tuple_unref(tmp);
 	return rc;
diff --git a/test/unit/vy_iterators_helper.h b/test/unit/vy_iterators_helper.h
index e38ec295..2fe1a26a 100644
--- a/test/unit/vy_iterators_helper.h
+++ b/test/unit/vy_iterators_helper.h
@@ -43,10 +43,16 @@
 #define vyend 99999999
 #define MAX_FIELDS_COUNT 100
 #define STMT_TEMPLATE(lsn, type, ...) \
-{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, false, 0, 0 }
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, false, 0, 0, 0 }
 
 #define STMT_TEMPLATE_OPTIMIZED(lsn, type, ...) \
-{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, true, 0, 0 }
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, true, 0, 0, 0 }
+
+#define STMT_TEMPLATE_FLAGS(lsn, type, flags, ...) \
+{ { __VA_ARGS__, vyend }, IPROTO_##type, lsn, true, flags, 0, 0 }
+
+#define STMT_TEMPLATE_DEFERRED_DELETE(lsn, type, ...) \
+STMT_TEMPLATE_FLAGS(lsn, type, VY_STMT_DEFERRED_DELETE, __VA_ARGS__)
 
 extern struct tuple_format_vtab vy_tuple_format_vtab;
 extern struct tuple_format *vy_key_format;
@@ -82,6 +88,8 @@ struct vy_stmt_template {
 	 * to skip it in the write_iterator.
 	 */
 	bool optimize_update;
+	/** Statement flags. */
+	uint8_t flags;
 	/*
 	 * In case of upsert it is possible to use only one 'add' operation.
 	 * This is the column number of the operation.
diff --git a/test/unit/vy_point_lookup.c b/test/unit/vy_point_lookup.c
index 0e63ac69..db83dd33 100644
--- a/test/unit/vy_point_lookup.c
+++ b/test/unit/vy_point_lookup.c
@@ -191,7 +191,7 @@ test_basic()
 	}
 	struct vy_stmt_stream *write_stream
 		= vy_write_iterator_new(pk->cmp_def, pk->disk_format,
-					true, true, &read_views);
+					true, true, &read_views, NULL, NULL);
 	vy_write_iterator_new_mem(write_stream, run_mem);
 	struct vy_run *run = vy_run_new(&run_env, 1);
 	isnt(run, NULL, "vy_run_new");
@@ -224,7 +224,7 @@ test_basic()
 	}
 	write_stream
 		= vy_write_iterator_new(pk->cmp_def, pk->disk_format,
-					true, true, &read_views);
+					true, true, &read_views, NULL, NULL);
 	vy_write_iterator_new_mem(write_stream, run_mem);
 	run = vy_run_new(&run_env, 2);
 	isnt(run, NULL, "vy_run_new");
diff --git a/test/unit/vy_write_iterator.c b/test/unit/vy_write_iterator.c
index 25a346af..43929a15 100644
--- a/test/unit/vy_write_iterator.c
+++ b/test/unit/vy_write_iterator.c
@@ -3,6 +3,63 @@
 #include "vy_write_iterator.h"
 #include "vy_iterators_helper.h"
 
+enum { MAX_DEFERRED_COUNT = 32 };
+
+/** Argument passed to @make_deferred_delete. */
+struct deferred_ctx {
+	/** Key definitions of the index to generate DELETEs for. */
+	struct key_def *key_def;
+	/** Format to use for making DELETEs. */
+	struct tuple_format *format;
+	/** Deferred DELETEs generated by the write iterator. */
+	struct tuple *stmt[MAX_DEFERRED_COUNT];
+	/** Number of elements in @stmt array. */
+	int count;
+};
+
+/**
+ * Callback passed to the write iterator for generating deferred
+ * DELETE statements.
+ */
+static int
+make_deferred_delete(struct tuple *old_stmt,
+		     struct tuple *new_stmt, void *arg)
+{
+	struct deferred_ctx *ctx = arg;
+
+	fail_if(vy_stmt_type(old_stmt) == IPROTO_DELETE);
+	fail_if(vy_stmt_type(new_stmt) != IPROTO_DELETE &&
+		vy_stmt_type(new_stmt) != IPROTO_REPLACE);
+
+	/* Create key definition and format on demand. */
+	if (ctx->key_def == NULL) {
+		uint32_t fields[] = { 0, 1 };
+		uint32_t types[] = { FIELD_TYPE_UNSIGNED, FIELD_TYPE_UNSIGNED };
+		ctx->key_def = box_key_def_new(fields, types, 2);
+		fail_if(ctx->key_def == NULL);
+	}
+	if (ctx->format == NULL) {
+		ctx->format = tuple_format_new(&vy_tuple_format_vtab,
+					       &ctx->key_def, 1,
+					       0, NULL, 0, NULL);
+		fail_if(ctx->format == NULL);
+		tuple_format_ref(ctx->format);
+	}
+
+	/* No need to make a DELETE if the value didn't change. */
+	if (vy_tuple_compare(old_stmt, new_stmt, ctx->key_def) == 0)
+		return 0;
+
+	struct tuple *delete = vy_stmt_new_surrogate_delete(ctx->format,
+							    old_stmt);
+	fail_if(delete == NULL);
+	vy_stmt_set_lsn(delete, vy_stmt_lsn(new_stmt));
+
+	fail_if(ctx->count >= MAX_DEFERRED_COUNT);
+	ctx->stmt[ctx->count++] = delete;
+	return 0;
+}
+
 /**
  * Create a mem with the specified content, iterate over it with
  * write_iterator and compare actual result statements with the
@@ -12,6 +69,8 @@
  * @param content_count Size of the @content.
  * @param expected Expected results of the iteration.
  * @param expected_count Size of the @expected.
+ * @param deferred Expected deferred DELETEs returned by the iteration.
+ * @param deferred_count Size of @deferred.
  * @param vlsns Read view lsns for the write iterator.
  * @param vlsns_count Size of the @vlsns.
  * @param is_primary True, if the new mem belongs to the primary
@@ -23,6 +82,8 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 			       int content_count,
 			       const struct vy_stmt_template *expected,
 			       int expected_count,
+			       const struct vy_stmt_template *deferred,
+			       int deferred_count,
 			       const int *vlsns, int vlsns_count,
 			       bool is_primary, bool is_last_level)
 {
@@ -37,9 +98,13 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 	struct vy_read_view *rv_array = malloc(sizeof(*rv_array) * vlsns_count);
 	fail_if(rv_array == NULL);
 	init_read_views_list(&rv_list, rv_array, vlsns, vlsns_count);
-
-	struct vy_stmt_stream *wi = vy_write_iterator_new(key_def, mem->format,
-					is_primary, is_last_level, &rv_list);
+	struct deferred_ctx deferred_ctx;
+	memset(&deferred_ctx, 0, sizeof(deferred_ctx));
+	struct vy_stmt_stream *wi;
+	wi = vy_write_iterator_new(key_def, mem->format, is_primary,
+				   is_last_level, &rv_list,
+				   deferred == NULL ? NULL :
+				   make_deferred_delete, &deferred_ctx);
 	fail_if(wi == NULL);
 	fail_if(vy_write_iterator_new_mem(wi, mem) != 0);
 
@@ -57,11 +122,27 @@ compare_write_iterator_results(const struct vy_stmt_template *content,
 		++i;
 	} while (ret != NULL);
 	ok(i == expected_count, "correct results count");
+	wi->iface->stop(wi);
+
+	for (i = 0; i < MIN(deferred_ctx.count, deferred_count); i++) {
+		ok(vy_stmt_are_same(deferred_ctx.stmt[i], &deferred[i],
+				    deferred_ctx.format, NULL),
+		   "deferred stmt %d is correct", i);
+		tuple_unref(deferred_ctx.stmt[i]);
+	}
+	if (deferred != NULL) {
+		ok(deferred_ctx.count == deferred_count,
+		   "correct deferred stmt count");
+	}
 
 	/* Clean up */
 	wi->iface->close(wi);
 	vy_mem_delete(mem);
 	box_key_def_delete(key_def);
+	if (deferred_ctx.format != NULL)
+		tuple_format_unref(deferred_ctx.format);
+	if (deferred_ctx.key_def != NULL)
+		box_key_def_delete(deferred_ctx.key_def);
 	free(rv_array);
 }
 
@@ -69,7 +150,7 @@ void
 test_basic(void)
 {
 	header();
-	plan(46);
+	plan(80);
 {
 /*
  * STATEMENT: REPL REPL REPL  DEL  REPL  REPL  REPL  REPL  REPL  REPL
@@ -98,7 +179,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -132,7 +213,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -160,7 +241,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -180,7 +261,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -204,7 +285,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -227,7 +308,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -255,7 +336,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, true);
 }
 {
@@ -275,7 +356,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, false);
 }
 {
@@ -302,7 +383,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -330,7 +411,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, true);
 }
 {
@@ -355,7 +436,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, false, false);
 }
 {
@@ -380,7 +461,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -410,7 +491,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -451,7 +532,7 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
-				       expected, expected_count,
+				       expected, expected_count, NULL, 0,
 				       vlsns, vlsns_count, true, false);
 }
 {
@@ -491,7 +572,153 @@ test_basic(void)
 	int expected_count = sizeof(expected) / sizeof(expected[0]);
 	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
 	compare_write_iterator_results(content, content_count,
+				       expected, expected_count, NULL, 0,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL DEL REPL REPL DEL DEL DEL REPL DEL INS DEL INS REPL
+ * LSN:            4   5    6    7   8   9  10   11  12  13  14  15   16
+ * DEFERRED DEL:   +   +    +        +   +        +           +        +
+ * READ VIEW:          *         *                *
+ *
+ * is_last_level = true
+ *
+ * Test generation of deferred DELETEs for various combinations
+ * of input statements.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(4, REPLACE, 1, 2),
+		STMT_TEMPLATE_DEFERRED_DELETE(5, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(6, REPLACE, 1, 3),
+		STMT_TEMPLATE(7, REPLACE, 1, 4),
+		STMT_TEMPLATE_DEFERRED_DELETE(8, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(9, DELETE, 1),
+		STMT_TEMPLATE(10, DELETE, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(11, REPLACE, 1, 5),
+		STMT_TEMPLATE(12, DELETE, 1),
+		STMT_TEMPLATE(13, INSERT, 1, 6),
+		STMT_TEMPLATE_DEFERRED_DELETE(14, DELETE, 1),
+		STMT_TEMPLATE(15, INSERT, 1, 7),
+		STMT_TEMPLATE_DEFERRED_DELETE(16, REPLACE, 1, 8),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(16, REPLACE, 1, 8),
+		STMT_TEMPLATE(11, REPLACE, 1, 5),
+		STMT_TEMPLATE(7, REPLACE, 1, 4),
+	};
+	const struct vy_stmt_template deferred[] = {
+		STMT_TEMPLATE(16, DELETE, 1, 7),
+		STMT_TEMPLATE(14, DELETE, 1, 6),
+		STMT_TEMPLATE(8, DELETE, 1, 4),
+		STMT_TEMPLATE(5, DELETE, 1, 2),
+	};
+	const int vlsns[] = {5, 7, 11};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, true);
+}
+{
+/*
+ * STATEMENT:    REPL REPL DEL INS REPL REPL
+ * LSN:            3    4   5   6    7    8
+ * DEFERRED DEL:        +            +    +
+ *
+ * is_last_level = false
+ *
+ * Check that a deferred DELETE is not generated in case the
+ * overwritten tuple equals the new one in terms of the secondary
+ * index key parts.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE(3, REPLACE, 1, 1, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(4, REPLACE, 1, 1, 2),
+		STMT_TEMPLATE(5, DELETE, 1),
+		STMT_TEMPLATE(6, INSERT, 1, 2, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 2, 2),
+		STMT_TEMPLATE_DEFERRED_DELETE(8, REPLACE, 1, 2, 3),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(8, REPLACE, 1, 2, 3),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL REPL DEL
+ * LSN:            7    8   9
+ * DEFERRED DEL:   +
+ *
+ * is_last_level = false
+ *
+ * Check that the oldest VY_STMT_DEFERRED_DELETE statement is
+ * preserved in case it doesn't overwrite a terminal statement
+ * and this is not a major compaction.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+		STMT_TEMPLATE(8, REPLACE, 1, 2),
+		STMT_TEMPLATE(9, DELETE, 1, 3),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(9, DELETE, 1, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
+				       expected, expected_count,
+				       deferred, deferred_count,
+				       vlsns, vlsns_count, true, false);
+}
+{
+/*
+ * STATEMENT:    REPL REPL DEL
+ * LSN:            7    8   9
+ * DEFERRED DEL:   +
+ * READ VIEW:      *
+ *
+ * is_last_level = false
+ *
+ * Check that the oldest VY_STMT_DEFERRED_DELETE statement is
+ * not returned twice if it is referenced by a read view.
+ */
+	const struct vy_stmt_template content[] = {
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+		STMT_TEMPLATE(8, REPLACE, 1, 2),
+		STMT_TEMPLATE(9, DELETE, 1, 3),
+	};
+	const struct vy_stmt_template expected[] = {
+		STMT_TEMPLATE(9, DELETE, 1, 1),
+		STMT_TEMPLATE_DEFERRED_DELETE(7, REPLACE, 1, 1),
+	};
+	const struct vy_stmt_template deferred[] = {};
+	const int vlsns[] = {7};
+	int content_count = sizeof(content) / sizeof(content[0]);
+	int expected_count = sizeof(expected) / sizeof(expected[0]);
+	int deferred_count = sizeof(deferred) / sizeof(deferred[0]);
+	int vlsns_count = sizeof(vlsns) / sizeof(vlsns[0]);
+	compare_write_iterator_results(content, content_count,
 				       expected, expected_count,
+				       deferred, deferred_count,
 				       vlsns, vlsns_count, true, false);
 }
 	fiber_gc();
diff --git a/test/unit/vy_write_iterator.result b/test/unit/vy_write_iterator.result
index 56d8cb1f..79f23d8d 100644
--- a/test/unit/vy_write_iterator.result
+++ b/test/unit/vy_write_iterator.result
@@ -1,5 +1,6 @@
+# Looks like you planned 80 tests but ran 66.
 	*** test_basic ***
-1..46
+1..80
 ok 1 - stmt 0 is correct
 ok 2 - stmt 1 is correct
 ok 3 - stmt 2 is correct
@@ -46,4 +47,24 @@ ok 43 - stmt 0 is correct
 ok 44 - stmt 1 is correct
 ok 45 - stmt 2 is correct
 ok 46 - correct results count
+ok 47 - stmt 0 is correct
+ok 48 - stmt 1 is correct
+ok 49 - stmt 2 is correct
+ok 50 - correct results count
+ok 51 - deferred stmt 0 is correct
+ok 52 - deferred stmt 1 is correct
+ok 53 - deferred stmt 2 is correct
+ok 54 - deferred stmt 3 is correct
+ok 55 - correct deferred stmt count
+ok 56 - stmt 0 is correct
+ok 57 - correct results count
+ok 58 - correct deferred stmt count
+ok 59 - stmt 0 is correct
+ok 60 - stmt 1 is correct
+ok 61 - correct results count
+ok 62 - correct deferred stmt count
+ok 63 - stmt 0 is correct
+ok 64 - stmt 1 is correct
+ok 65 - correct results count
+ok 66 - correct deferred stmt count
 	*** test_basic: done ***
-- 
2.11.0

* [RFC PATCH 16/23] vinyl: allow to skip certain statements on read
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (13 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 15/23] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 17/23] vinyl: do not free pending tasks on shutdown Vladimir Davydov
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the scope of #2129 we will defer insertion of certain DELETE
statements into secondary indexes until primary index compaction.
However, by the time we invoke compaction, new statements might
have been inserted into the space for the same set of keys.
If that happens, insertion of a deferred DELETE will break the
invariant which the read iterator relies upon: for any key, older
sources store older statements. To avoid that, let's add a new
per-statement flag, VY_STMT_SKIP_READ, and make the read
iterator ignore statements marked with it.
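
The skipping rule can be illustrated with a minimal standalone sketch.
It is not the vinyl code: the toy_* names and the flat array are
hypothetical stand-ins, and the statements are assumed to be versions
of a single key sorted newest first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the vinyl flags; not the real definitions. */
enum {
	TOY_STMT_DEFERRED_DELETE	= 1 << 0,
	TOY_STMT_SKIP_READ		= 1 << 1,
};

/* Hypothetical statement: only the fields the lookup needs. */
struct toy_stmt {
	int64_t lsn;
	uint8_t flags;
};

/*
 * Return the first statement in stmts[0..count) that is visible
 * from a read view with the given vlsn: its LSN must not exceed
 * vlsn and it must not be marked TOY_STMT_SKIP_READ. Statements
 * are assumed to be versions of one key, sorted newest first.
 * Returns NULL if no statement is visible.
 */
static const struct toy_stmt *
toy_find_visible(const struct toy_stmt *stmts, size_t count, int64_t vlsn)
{
	assert(stmts != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (stmts[i].lsn > vlsn ||
		    (stmts[i].flags & TOY_STMT_SKIP_READ) != 0)
			continue;
		return &stmts[i];
	}
	return NULL;
}
```

A flagged statement is treated the same way as one that is too new
for the read view: the lookup steps over it and continues with the
next older version.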

Needed for #2129
---
 src/box/vy_mem.c  | 19 ++++++++++++-------
 src/box/vy_run.c  |  7 ++++++-
 src/box/vy_stmt.h | 10 ++++++++++
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/box/vy_mem.c b/src/box/vy_mem.c
index 7c9690ef..dadd73cb 100644
--- a/src/box/vy_mem.c
+++ b/src/box/vy_mem.c
@@ -323,7 +323,8 @@ vy_mem_iterator_find_lsn(struct vy_mem_iterator *itr,
 	assert(!vy_mem_tree_iterator_is_invalid(&itr->curr_pos));
 	assert(itr->curr_stmt == vy_mem_iterator_curr_stmt(itr));
 	const struct key_def *cmp_def = itr->mem->cmp_def;
-	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn) {
+	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn ||
+	       vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ) {
 		if (vy_mem_iterator_step(itr, iterator_type) != 0 ||
 		    (iterator_type == ITER_EQ &&
 		     vy_stmt_compare(key, itr->curr_stmt, cmp_def))) {
@@ -340,6 +341,7 @@ vy_mem_iterator_find_lsn(struct vy_mem_iterator *itr,
 				*vy_mem_tree_iterator_get_elem(&itr->mem->tree,
 							       &prev_pos);
 			if (vy_stmt_lsn(prev_stmt) > (**itr->read_view).vlsn ||
+			    vy_stmt_flags(prev_stmt) & VY_STMT_SKIP_READ ||
 			    vy_tuple_compare(itr->curr_stmt, prev_stmt,
 					     cmp_def) != 0)
 				break;
@@ -495,18 +497,21 @@ vy_mem_iterator_next_lsn(struct vy_mem_iterator *itr)
 	const struct key_def *cmp_def = itr->mem->cmp_def;
 
 	struct vy_mem_tree_iterator next_pos = itr->curr_pos;
+next:
 	vy_mem_tree_iterator_next(&itr->mem->tree, &next_pos);
 	if (vy_mem_tree_iterator_is_invalid(&next_pos))
 		return 1; /* EOF */
 
 	const struct tuple *next_stmt;
 	next_stmt = *vy_mem_tree_iterator_get_elem(&itr->mem->tree, &next_pos);
-	if (vy_tuple_compare(itr->curr_stmt, next_stmt, cmp_def) == 0) {
-		itr->curr_pos = next_pos;
-		itr->curr_stmt = next_stmt;
-		return 0;
-	}
-	return 1;
+	if (vy_tuple_compare(itr->curr_stmt, next_stmt, cmp_def) != 0)
+		return 1;
+
+	itr->curr_pos = next_pos;
+	itr->curr_stmt = next_stmt;
+	if (vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ)
+		goto next;
+	return 0;
 }
 
 /**
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index dc837c2b..6f7fb82a 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -1157,7 +1157,8 @@ vy_run_iterator_find_lsn(struct vy_run_iterator *itr,
 	assert(itr->curr_stmt != NULL);
 	assert(itr->curr_pos.page_no < slice->run->info.page_count);
 
-	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn) {
+	while (vy_stmt_lsn(itr->curr_stmt) > (**itr->read_view).vlsn ||
+	       vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ) {
 		if (vy_run_iterator_next_pos(itr, iterator_type,
 					     &itr->curr_pos) != 0) {
 			vy_run_iterator_stop(itr);
@@ -1183,6 +1184,7 @@ vy_run_iterator_find_lsn(struct vy_run_iterator *itr,
 						 &test_stmt) != 0)
 				return -1;
 			if (vy_stmt_lsn(test_stmt) > (**itr->read_view).vlsn ||
+			    vy_stmt_flags(test_stmt) & VY_STMT_SKIP_READ ||
 			    vy_tuple_compare(itr->curr_stmt, test_stmt,
 					     cmp_def) != 0) {
 				tuple_unref(test_stmt);
@@ -1478,6 +1480,7 @@ vy_run_iterator_next_lsn(struct vy_run_iterator *itr, struct tuple **ret)
 	assert(itr->curr_pos.page_no < itr->slice->run->info.page_count);
 
 	struct vy_run_iterator_pos next_pos;
+next:
 	if (vy_run_iterator_next_pos(itr, ITER_GE, &next_pos) != 0) {
 		vy_run_iterator_stop(itr);
 		return 0;
@@ -1495,6 +1498,8 @@ vy_run_iterator_next_lsn(struct vy_run_iterator *itr, struct tuple **ret)
 	tuple_unref(itr->curr_stmt);
 	itr->curr_stmt = next_key;
 	itr->curr_pos = next_pos;
+	if (vy_stmt_flags(itr->curr_stmt) & VY_STMT_SKIP_READ)
+		goto next;
 
 	vy_stmt_counter_acct_tuple(&itr->stat->get, itr->curr_stmt);
 	*ret = itr->curr_stmt;
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index 8de8aa84..878a27f7 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -87,6 +87,16 @@ enum {
 	 * DELETE statements for them during compaction.
 	 */
 	VY_STMT_DEFERRED_DELETE		= 1 << 0,
+	/**
+	 * Statements that have this flag set are ignored by the
+	 * read iterator.
+	 *
+	 * We set this flag for deferred DELETE statements, because
+	 * they may violate the invariant the read iterator relies upon:
+	 * the older a source, the older statements it stores for a
+	 * particular key.
+	 */
+	VY_STMT_SKIP_READ		= 1 << 1,
 };
 
 /**
-- 
2.11.0

* [RFC PATCH 17/23] vinyl: do not free pending tasks on shutdown
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (14 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 16/23] vinyl: allow to skip certain statements on read Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task Vladimir Davydov
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This is a prerequisite for switching scheduler-worker communication from
pthread mutex/cond to cbus, which in turn is needed to generate and send
deferred DELETEs from workers back to tx (#2129).

After this patch, pending tasks will be leaked on shutdown. This is OK,
as we leak a lot of objects on shutdown anyway. The proper way of fixing
this leak would be to rework shutdown without atexit() so that we can
use cbus till the very end.

Needed for #2129
---
 src/box/vy_scheduler.c | 47 ++++++++++-------------------------------------
 1 file changed, 10 insertions(+), 37 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 4d1f3474..c175bea8 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -84,12 +84,8 @@ struct vy_task_ops {
 	 * This function is called by the scheduler if either ->execute
 	 * or ->complete failed. It may be used to undo changes done to
 	 * the LSM tree when preparing the task.
-	 *
-	 * If @in_shutdown is set, the callback is invoked from the
-	 * engine destructor.
 	 */
-	void (*abort)(struct vy_scheduler *scheduler, struct vy_task *task,
-		      bool in_shutdown);
+	void (*abort)(struct vy_scheduler *scheduler, struct vy_task *task);
 };
 
 struct vy_task {
@@ -279,15 +275,11 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 static void
 vy_scheduler_stop_workers(struct vy_scheduler *scheduler)
 {
-	struct stailq task_queue;
-	stailq_create(&task_queue);
-
 	assert(scheduler->is_worker_pool_running);
 	scheduler->is_worker_pool_running = false;
 
-	/* Clear the input queue and wake up worker threads. */
+	/* Wake up worker threads. */
 	tt_pthread_mutex_lock(&scheduler->mutex);
-	stailq_concat(&task_queue, &scheduler->input_queue);
 	pthread_cond_broadcast(&scheduler->worker_cond);
 	tt_pthread_mutex_unlock(&scheduler->mutex);
 
@@ -298,15 +290,6 @@ vy_scheduler_stop_workers(struct vy_scheduler *scheduler)
 
 	free(scheduler->worker_pool);
 	scheduler->worker_pool = NULL;
-
-	/* Abort all pending tasks. */
-	struct vy_task *task, *next;
-	stailq_concat(&task_queue, &scheduler->output_queue);
-	stailq_foreach_entry_safe(task, next, &task_queue, link) {
-		if (task->ops->abort != NULL)
-			task->ops->abort(scheduler, task, true);
-		vy_task_delete(&scheduler->task_pool, task);
-	}
 }
 
 void
@@ -888,8 +871,7 @@ fail:
 }
 
 static void
-vy_task_dump_abort(struct vy_scheduler *scheduler, struct vy_task *task,
-		   bool in_shutdown)
+vy_task_dump_abort(struct vy_scheduler *scheduler, struct vy_task *task)
 {
 	struct vy_lsm *lsm = task->lsm;
 
@@ -902,17 +884,13 @@ vy_task_dump_abort(struct vy_scheduler *scheduler, struct vy_task *task,
 	 * It's no use alerting the user if the server is
 	 * shutting down or the LSM tree was dropped.
 	 */
-	if (!in_shutdown && !lsm->is_dropped) {
+	if (!lsm->is_dropped) {
 		struct error *e = diag_last_error(&task->diag);
 		error_log(e);
 		say_error("%s: dump failed", vy_lsm_name(lsm));
 	}
 
-	/* The metadata log is unavailable on shutdown. */
-	if (!in_shutdown)
-		vy_run_discard(task->new_run);
-	else
-		vy_run_unref(task->new_run);
+	vy_run_discard(task->new_run);
 
 	lsm->is_dumping = false;
 	vy_scheduler_update_lsm(scheduler, lsm);
@@ -1213,8 +1191,7 @@ vy_task_compact_complete(struct vy_scheduler *scheduler, struct vy_task *task)
 }
 
 static void
-vy_task_compact_abort(struct vy_scheduler *scheduler, struct vy_task *task,
-		      bool in_shutdown)
+vy_task_compact_abort(struct vy_scheduler *scheduler, struct vy_task *task)
 {
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_range *range = task->range;
@@ -1226,18 +1203,14 @@ vy_task_compact_abort(struct vy_scheduler *scheduler, struct vy_task *task,
 	 * It's no use alerting the user if the server is
 	 * shutting down or the LSM tree was dropped.
 	 */
-	if (!in_shutdown && !lsm->is_dropped) {
+	if (!lsm->is_dropped) {
 		struct error *e = diag_last_error(&task->diag);
 		error_log(e);
 		say_error("%s: failed to compact range %s",
 			  vy_lsm_name(lsm), vy_range_str(range));
 	}
 
-	/* The metadata log is unavailable on shutdown. */
-	if (!in_shutdown)
-		vy_run_discard(task->new_run);
-	else
-		vy_run_unref(task->new_run);
+	vy_run_discard(task->new_run);
 
 	assert(range->heap_node.pos == UINT32_MAX);
 	vy_range_heap_insert(&lsm->range_heap, &range->heap_node);
@@ -1476,7 +1449,7 @@ vy_scheduler_complete_task(struct vy_scheduler *scheduler,
 {
 	if (task->lsm->is_dropped) {
 		if (task->ops->abort)
-			task->ops->abort(scheduler, task, false);
+			task->ops->abort(scheduler, task);
 		return 0;
 	}
 
@@ -1499,7 +1472,7 @@ vy_scheduler_complete_task(struct vy_scheduler *scheduler,
 	return 0;
 fail:
 	if (task->ops->abort)
-		task->ops->abort(scheduler, task, false);
+		task->ops->abort(scheduler, task);
 	diag_move(diag, &scheduler->diag);
 	return -1;
 }
-- 
2.11.0

* [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (15 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 17/23] vinyl: do not free pending tasks on shutdown Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:39     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct Vladimir Davydov
                     ` (4 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we don't really need it, but once we switch the communication
channel between the scheduler and workers from pthread mutex/cond to
cbus (needed for #2129), tasks won't be completed on behalf of the
scheduler fiber and hence we will need a back pointer from vy_task to
vy_scheduler.
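
The shape of the change can be sketched in isolation as follows. This
is only an illustration with hypothetical toy_* types, not the actual
vy_task or vy_scheduler definitions: the callbacks take the task
alone, and the scheduler is reached through the back pointer.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task;

/* Callbacks take only the task; the owner is reachable through it. */
struct toy_task_ops {
	int (*execute)(struct toy_task *task);
};

/* Hypothetical scheduler with a single counter for illustration. */
struct toy_scheduler {
	int tasks_executed;
};

struct toy_task {
	/* Virtual method table. */
	const struct toy_task_ops *ops;
	/* Back pointer to the scheduler that created the task. */
	struct toy_scheduler *scheduler;
};

/* Example callback: updates scheduler state via the back pointer. */
static int
toy_execute(struct toy_task *task)
{
	task->scheduler->tasks_executed++;
	return 0;
}

static const struct toy_task_ops toy_ops = { toy_execute };

/* Bind a task to its scheduler and method table. */
static void
toy_task_init(struct toy_task *task, struct toy_scheduler *scheduler,
	      const struct toy_task_ops *ops)
{
	assert(task != NULL && scheduler != NULL && ops != NULL);
	task->ops = ops;
	task->scheduler = scheduler;
}
```

With this layout a caller that holds only a task pointer, such as a
message handler, can still update scheduler state.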

Needed for #2129
---
 src/box/vy_scheduler.c | 74 ++++++++++++++++++++++++++------------------------
 1 file changed, 39 insertions(+), 35 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index c175bea8..5684f4d4 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -72,24 +72,27 @@ struct vy_task_ops {
 	 * which is too heavy for the tx thread (like IO or compression).
 	 * Returns 0 on success. On failure returns -1 and sets diag.
 	 */
-	int (*execute)(struct vy_scheduler *scheduler, struct vy_task *task);
+	int (*execute)(struct vy_task *task);
 	/**
 	 * This function is called by the scheduler upon task completion.
 	 * It may be used to finish the task from the tx thread context.
 	 *
 	 * Returns 0 on success. On failure returns -1 and sets diag.
 	 */
-	int (*complete)(struct vy_scheduler *scheduler, struct vy_task *task);
+	int (*complete)(struct vy_task *task);
 	/**
 	 * This function is called by the scheduler if either ->execute
 	 * or ->complete failed. It may be used to undo changes done to
 	 * the LSM tree when preparing the task.
 	 */
-	void (*abort)(struct vy_scheduler *scheduler, struct vy_task *task);
+	void (*abort)(struct vy_task *task);
 };
 
 struct vy_task {
+	/** Virtual method table. */
 	const struct vy_task_ops *ops;
+	/** Pointer to the scheduler. */
+	struct vy_scheduler *scheduler;
 	/** Return code of ->execute. */
 	int status;
 	/** If ->execute fails, the error is stored here. */
@@ -138,10 +141,10 @@ struct vy_task {
  * does not free it from under us.
  */
 static struct vy_task *
-vy_task_new(struct mempool *pool, struct vy_lsm *lsm,
+vy_task_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	    const struct vy_task_ops *ops)
 {
-	struct vy_task *task = mempool_alloc(pool);
+	struct vy_task *task = mempool_alloc(&scheduler->task_pool);
 	if (task == NULL) {
 		diag_set(OutOfMemory, sizeof(*task),
 			 "mempool", "struct vy_task");
@@ -149,16 +152,17 @@ vy_task_new(struct mempool *pool, struct vy_lsm *lsm,
 	}
 	memset(task, 0, sizeof(*task));
 	task->ops = ops;
+	task->scheduler = scheduler;
 	task->lsm = lsm;
 	task->cmp_def = key_def_dup(lsm->cmp_def);
 	if (task->cmp_def == NULL) {
-		mempool_free(pool, task);
+		mempool_free(&scheduler->task_pool, task);
 		return NULL;
 	}
 	task->key_def = key_def_dup(lsm->key_def);
 	if (task->key_def == NULL) {
 		key_def_delete(task->cmp_def);
-		mempool_free(pool, task);
+		mempool_free(&scheduler->task_pool, task);
 		return NULL;
 	}
 	vy_lsm_ref(lsm);
@@ -168,14 +172,13 @@ vy_task_new(struct mempool *pool, struct vy_lsm *lsm,
 
 /** Free a task allocated with vy_task_new(). */
 static void
-vy_task_delete(struct mempool *pool, struct vy_task *task)
+vy_task_delete(struct vy_task *task)
 {
 	key_def_delete(task->cmp_def);
 	key_def_delete(task->key_def);
 	vy_lsm_unref(task->lsm);
 	diag_destroy(&task->diag);
-	TRASH(task);
-	mempool_free(pool, task);
+	mempool_free(&task->scheduler->task_pool, task);
 }
 
 static bool
@@ -643,7 +646,7 @@ vy_run_discard(struct vy_run *run)
 }
 
 static int
-vy_task_write_run(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_write_run(struct vy_task *task)
 {
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_stmt_stream *wi = task->wi;
@@ -676,7 +679,7 @@ vy_task_write_run(struct vy_scheduler *scheduler, struct vy_task *task)
 		if (rc != 0)
 			break;
 
-		if (!scheduler->is_worker_pool_running) {
+		if (!task->scheduler->is_worker_pool_running) {
 			diag_set(FiberIsCancelled);
 			rc = -1;
 			break;
@@ -698,14 +701,15 @@ fail:
 }
 
 static int
-vy_task_dump_execute(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_dump_execute(struct vy_task *task)
 {
-	return vy_task_write_run(scheduler, task);
+	return vy_task_write_run(task);
 }
 
 static int
-vy_task_dump_complete(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_dump_complete(struct vy_task *task)
 {
+	struct vy_scheduler *scheduler = task->scheduler;
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_run *new_run = task->new_run;
 	int64_t dump_lsn = new_run->dump_lsn;
@@ -871,8 +875,9 @@ fail:
 }
 
 static void
-vy_task_dump_abort(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_dump_abort(struct vy_task *task)
 {
+	struct vy_scheduler *scheduler = task->scheduler;
 	struct vy_lsm *lsm = task->lsm;
 
 	assert(lsm->is_dumping);
@@ -975,8 +980,7 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 		return 0;
 	}
 
-	struct vy_task *task = vy_task_new(&scheduler->task_pool,
-					   lsm, &dump_ops);
+	struct vy_task *task = vy_task_new(scheduler, lsm, &dump_ops);
 	if (task == NULL)
 		goto err;
 
@@ -1031,7 +1035,7 @@ err_wi_sub:
 err_wi:
 	vy_run_discard(new_run);
 err_run:
-	vy_task_delete(&scheduler->task_pool, task);
+	vy_task_delete(task);
 err:
 	diag_log();
 	say_error("%s: could not start dump", vy_lsm_name(lsm));
@@ -1039,14 +1043,15 @@ err:
 }
 
 static int
-vy_task_compact_execute(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_compact_execute(struct vy_task *task)
 {
-	return vy_task_write_run(scheduler, task);
+	return vy_task_write_run(task);
 }
 
 static int
-vy_task_compact_complete(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_compact_complete(struct vy_task *task)
 {
+	struct vy_scheduler *scheduler = task->scheduler;
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_range *range = task->range;
 	struct vy_run *new_run = task->new_run;
@@ -1191,8 +1196,9 @@ vy_task_compact_complete(struct vy_scheduler *scheduler, struct vy_task *task)
 }
 
 static void
-vy_task_compact_abort(struct vy_scheduler *scheduler, struct vy_task *task)
+vy_task_compact_abort(struct vy_task *task)
 {
+	struct vy_scheduler *scheduler = task->scheduler;
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_range *range = task->range;
 
@@ -1243,8 +1249,7 @@ vy_task_compact_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 		return 0;
 	}
 
-	struct vy_task *task = vy_task_new(&scheduler->task_pool,
-					   lsm, &compact_ops);
+	struct vy_task *task = vy_task_new(scheduler, lsm, &compact_ops);
 	if (task == NULL)
 		goto err_task;
 
@@ -1303,7 +1308,7 @@ err_wi_sub:
 err_wi:
 	vy_run_discard(new_run);
 err_run:
-	vy_task_delete(&scheduler->task_pool, task);
+	vy_task_delete(task);
 err_task:
 	diag_log();
 	say_error("%s: could not start compacting range %s: %s",
@@ -1444,12 +1449,11 @@ fail:
 }
 
 static int
-vy_scheduler_complete_task(struct vy_scheduler *scheduler,
-			   struct vy_task *task)
+vy_task_complete(struct vy_task *task)
 {
 	if (task->lsm->is_dropped) {
 		if (task->ops->abort)
-			task->ops->abort(scheduler, task);
+			task->ops->abort(task);
 		return 0;
 	}
 
@@ -1464,7 +1468,7 @@ vy_scheduler_complete_task(struct vy_scheduler *scheduler,
 			diag_move(diag_get(), diag);
 			goto fail; });
 	if (task->ops->complete &&
-	    task->ops->complete(scheduler, task) != 0) {
+	    task->ops->complete(task) != 0) {
 		assert(!diag_is_empty(diag_get()));
 		diag_move(diag_get(), diag);
 		goto fail;
@@ -1472,8 +1476,8 @@ vy_scheduler_complete_task(struct vy_scheduler *scheduler,
 	return 0;
 fail:
 	if (task->ops->abort)
-		task->ops->abort(scheduler, task);
-	diag_move(diag, &scheduler->diag);
+		task->ops->abort(task);
+	diag_move(diag, &task->scheduler->diag);
 	return -1;
 }
 
@@ -1510,11 +1514,11 @@ vy_scheduler_f(va_list va)
 
 		/* Complete and delete all processed tasks. */
 		stailq_foreach_entry_safe(task, next, &output_queue, link) {
-			if (vy_scheduler_complete_task(scheduler, task) != 0)
+			if (vy_task_complete(task) != 0)
 				tasks_failed++;
 			else
 				tasks_done++;
-			vy_task_delete(&scheduler->task_pool, task);
+			vy_task_delete(task);
 			scheduler->workers_available++;
 			assert(scheduler->workers_available <=
 			       scheduler->worker_pool_size);
@@ -1615,7 +1619,7 @@ vy_worker_f(void *arg)
 		assert(task != NULL);
 
 		/* Execute task */
-		task->status = task->ops->execute(scheduler, task);
+		task->status = task->ops->execute(task);
 		if (task->status != 0) {
 			struct diag *diag = diag_get();
 			assert(!diag_is_empty(diag));
-- 
2.11.0

* [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (16 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:40     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads Vladimir Davydov
                     ` (3 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

I'm planning to add some new members and remove some old members from
those structs. For this to play nicely, let's do some renames:

  vy_scheduler::workers_available => idle_worker_count
  vy_scheduler::input_queue       => pending_tasks
  vy_scheduler::output_queue      => processed_tasks
  vy_task::link                   => in_pending, in_processed
---
 src/box/vy_scheduler.c | 50 ++++++++++++++++++++++++++------------------------
 src/box/vy_scheduler.h | 10 +++++-----
 2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 5684f4d4..4d84f9bc 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -120,17 +120,16 @@ struct vy_task {
 	 */
 	struct vy_slice *first_slice, *last_slice;
 	/**
-	 * Link in the list of pending or processed tasks.
-	 * See vy_scheduler::input_queue, output_queue.
-	 */
-	struct stailq_entry link;
-	/**
 	 * Index options may be modified while a task is in
 	 * progress so we save them here to safely access them
 	 * from another thread.
 	 */
 	double bloom_fpr;
 	int64_t page_size;
+	/** Link in vy_scheduler::pending_tasks. */
+	struct stailq_entry in_pending;
+	/** Link in vy_scheduler::processed_tasks. */
+	struct stailq_entry in_processed;
 };
 
 /**
@@ -259,7 +258,7 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 	assert(scheduler->worker_pool_size >= 2);
 
 	scheduler->is_worker_pool_running = true;
-	scheduler->workers_available = scheduler->worker_pool_size;
+	scheduler->idle_worker_count = scheduler->worker_pool_size;
 	scheduler->worker_pool = calloc(scheduler->worker_pool_size,
 					sizeof(struct cord));
 	if (scheduler->worker_pool == NULL)
@@ -318,8 +317,8 @@ vy_scheduler_create(struct vy_scheduler *scheduler, int write_threads,
 	scheduler->worker_pool_size = write_threads;
 	mempool_create(&scheduler->task_pool, cord_slab_cache(),
 		       sizeof(struct vy_task));
-	stailq_create(&scheduler->input_queue);
-	stailq_create(&scheduler->output_queue);
+	stailq_create(&scheduler->pending_tasks);
+	stailq_create(&scheduler->processed_tasks);
 
 	tt_pthread_cond_init(&scheduler->worker_cond, NULL);
 	tt_pthread_mutex_init(&scheduler->mutex, NULL);
@@ -1422,7 +1421,7 @@ vy_schedule(struct vy_scheduler *scheduler, struct vy_task **ptask)
 	if (*ptask != NULL)
 		return 0;
 
-	if (scheduler->workers_available <= 1) {
+	if (scheduler->idle_worker_count <= 1) {
 		/*
 		 * If all worker threads are busy doing compaction
 		 * when we run out of quota, ongoing transactions will
@@ -1501,26 +1500,27 @@ vy_scheduler_f(va_list va)
 	vy_scheduler_start_workers(scheduler);
 
 	while (scheduler->scheduler_fiber != NULL) {
-		struct stailq output_queue;
+		struct stailq processed_tasks;
 		struct vy_task *task, *next;
 		int tasks_failed = 0, tasks_done = 0;
 		bool was_empty;
 
 		/* Get the list of processed tasks. */
-		stailq_create(&output_queue);
+		stailq_create(&processed_tasks);
 		tt_pthread_mutex_lock(&scheduler->mutex);
-		stailq_concat(&output_queue, &scheduler->output_queue);
+		stailq_concat(&processed_tasks, &scheduler->processed_tasks);
 		tt_pthread_mutex_unlock(&scheduler->mutex);
 
 		/* Complete and delete all processed tasks. */
-		stailq_foreach_entry_safe(task, next, &output_queue, link) {
+		stailq_foreach_entry_safe(task, next, &processed_tasks,
+					  in_processed) {
 			if (vy_task_complete(task) != 0)
 				tasks_failed++;
 			else
 				tasks_done++;
 			vy_task_delete(task);
-			scheduler->workers_available++;
-			assert(scheduler->workers_available <=
+			scheduler->idle_worker_count++;
+			assert(scheduler->idle_worker_count <=
 			       scheduler->worker_pool_size);
 		}
 		/*
@@ -1534,7 +1534,7 @@ vy_scheduler_f(va_list va)
 			 * opens a time window for a worker to submit
 			 * a processed task and wake up the scheduler
 			 * (via scheduler_async). Hence we should go
-			 * and recheck the output_queue in order not
+			 * and recheck the processed_tasks in order not
 			 * to lose a wakeup event and hang for good.
 			 */
 			continue;
@@ -1543,7 +1543,7 @@ vy_scheduler_f(va_list va)
 		if (tasks_failed > 0)
 			goto error;
 		/* All worker threads are busy. */
-		if (scheduler->workers_available == 0)
+		if (scheduler->idle_worker_count == 0)
 			goto wait;
 		/* Get a task to schedule. */
 		if (vy_schedule(scheduler, &task) != 0)
@@ -1554,13 +1554,14 @@ vy_scheduler_f(va_list va)
 
 		/* Queue the task and notify workers if necessary. */
 		tt_pthread_mutex_lock(&scheduler->mutex);
-		was_empty = stailq_empty(&scheduler->input_queue);
-		stailq_add_tail_entry(&scheduler->input_queue, task, link);
+		was_empty = stailq_empty(&scheduler->pending_tasks);
+		stailq_add_tail_entry(&scheduler->pending_tasks,
+				      task, in_pending);
 		if (was_empty)
 			tt_pthread_cond_signal(&scheduler->worker_cond);
 		tt_pthread_mutex_unlock(&scheduler->mutex);
 
-		scheduler->workers_available--;
+		scheduler->idle_worker_count--;
 		fiber_reschedule();
 		continue;
 error:
@@ -1605,7 +1606,7 @@ vy_worker_f(void *arg)
 	tt_pthread_mutex_lock(&scheduler->mutex);
 	while (scheduler->is_worker_pool_running) {
 		/* Wait for a task */
-		if (stailq_empty(&scheduler->input_queue)) {
+		if (stailq_empty(&scheduler->pending_tasks)) {
 			/* Wake scheduler up if there are no more tasks */
 			ev_async_send(scheduler->scheduler_loop,
 				      &scheduler->scheduler_async);
@@ -1613,8 +1614,8 @@ vy_worker_f(void *arg)
 					     &scheduler->mutex);
 			continue;
 		}
-		task = stailq_shift_entry(&scheduler->input_queue,
-					  struct vy_task, link);
+		task = stailq_shift_entry(&scheduler->pending_tasks,
+					  struct vy_task, in_pending);
 		tt_pthread_mutex_unlock(&scheduler->mutex);
 		assert(task != NULL);
 
@@ -1628,7 +1629,8 @@ vy_worker_f(void *arg)
 
 		/* Return processed task to scheduler */
 		tt_pthread_mutex_lock(&scheduler->mutex);
-		stailq_add_tail_entry(&scheduler->output_queue, task, link);
+		stailq_add_tail_entry(&scheduler->processed_tasks,
+				      task, in_processed);
 	}
 	tt_pthread_mutex_unlock(&scheduler->mutex);
 	return NULL;
diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
index 777756c0..284f666e 100644
--- a/src/box/vy_scheduler.h
+++ b/src/box/vy_scheduler.h
@@ -77,13 +77,13 @@ struct vy_scheduler {
 	/** Total number of worker threads. */
 	int worker_pool_size;
 	/** Number worker threads that are currently idle. */
-	int workers_available;
+	int idle_worker_count;
 	/** Memory pool used for allocating vy_task objects. */
 	struct mempool task_pool;
-	/** Queue of pending tasks, linked by vy_task::link. */
-	struct stailq input_queue;
-	/** Queue of processed tasks, linked by vy_task::link. */
-	struct stailq output_queue;
+	/** Queue of pending tasks, linked by vy_task::in_pending. */
+	struct stailq pending_tasks;
+	/** Queue of processed tasks, linked by vy_task::in_processed. */
+	struct stailq processed_tasks;
 	/**
 	 * Signaled to wake up a worker when there is
 	 * a pending task in the input queue. Also used
-- 
2.11.0

* [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (17 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:43     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running Vladimir Davydov
                     ` (2 subsequent siblings)
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We need cbus for forwarding deferred DELETE statements generated in a
worker thread during primary index compaction to the tx thread where
they can be inserted into secondary indexes. Since pthread mutex/cond
and cbus are incompatible by their nature, let's rework the communication
channel between the tx and worker threads to use cbus.
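
The pattern of embedding a message in the task and recovering the
task inside the handler can be sketched without cbus. Everything
below is a hypothetical single-threaded toy (toy_msg, toy_deliver and
the handlers are not cbus API); a direct call stands in for the
queue between threads.

```c
#include <assert.h>
#include <stddef.h>

/* Toy message: a callback invoked by whichever side receives it. */
struct toy_msg {
	void (*handler)(struct toy_msg *msg);
};

struct toy_task {
	/* Embedded message used to pass the task between threads. */
	struct toy_msg msg;
	/* Return code of the execute step. */
	int status;
	/* Set once the completion step has run. */
	int completed;
};

/* Recover the enclosing task from a pointer to its embedded message. */
static struct toy_task *
toy_task_from_msg(struct toy_msg *msg)
{
	return (struct toy_task *)((char *)msg -
				   offsetof(struct toy_task, msg));
}

/* Would run in the worker thread: do the heavy work. */
static void
toy_execute_f(struct toy_msg *msg)
{
	struct toy_task *task = toy_task_from_msg(msg);
	task->status = 0;
}

/* Would run in the tx thread: finish the task. */
static void
toy_complete_f(struct toy_msg *msg)
{
	struct toy_task *task = toy_task_from_msg(msg);
	task->completed = 1;
}

/* Deliver a message; a direct call stands in for a real queue. */
static void
toy_deliver(struct toy_msg *msg, void (*handler)(struct toy_msg *))
{
	assert(msg != NULL && handler != NULL);
	msg->handler = handler;
	msg->handler(msg);
}
```

The same task object travels in both directions, so no separate
allocation is needed for the message itself.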

Needed for #2129
---
 src/box/vy_scheduler.c | 215 ++++++++++++++++++++++++++++++-------------------
 src/box/vy_scheduler.h |  25 +-----
 2 files changed, 134 insertions(+), 106 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 4d84f9bc..bd3ad4be 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -46,6 +46,7 @@
 #include "errinj.h"
 #include "fiber.h"
 #include "fiber_cond.h"
+#include "cbus.h"
 #include "salad/stailq.h"
 #include "say.h"
 #include "vy_lsm.h"
@@ -55,14 +56,34 @@
 #include "vy_run.h"
 #include "vy_write_iterator.h"
 #include "trivia/util.h"
-#include "tt_pthread.h"
 
 /* Min and max values for vy_scheduler::timeout. */
 #define VY_SCHEDULER_TIMEOUT_MIN	1
 #define VY_SCHEDULER_TIMEOUT_MAX	60
 
-static void *vy_worker_f(void *);
+static int vy_worker_f(va_list);
 static int vy_scheduler_f(va_list);
+static void vy_task_execute_f(struct cmsg *);
+static void vy_task_complete_f(struct cmsg *);
+
+static const struct cmsg_hop vy_task_execute_route[] = {
+	{ vy_task_execute_f, NULL },
+};
+
+static const struct cmsg_hop vy_task_complete_route[] = {
+	{ vy_task_complete_f, NULL },
+};
+
+/** Vinyl worker thread. */
+struct vy_worker {
+	struct cord cord;
+	/** Pipe from tx to the worker thread. */
+	struct cpipe worker_pipe;
+	/** Pipe from the worker thread to tx. */
+	struct cpipe tx_pipe;
+	/** Link in vy_scheduler::idle_workers. */
+	struct stailq_entry in_idle;
+};
 
 struct vy_task;
 
@@ -89,10 +110,22 @@ struct vy_task_ops {
 };
 
 struct vy_task {
+	/**
+	 * CBus message used for sending the task to/from
+	 * a worker thread.
+	 */
+	struct cmsg cmsg;
 	/** Virtual method table. */
 	const struct vy_task_ops *ops;
 	/** Pointer to the scheduler. */
 	struct vy_scheduler *scheduler;
+	/** Worker thread this task is assigned to. */
+	struct vy_worker *worker;
+	/**
+	 * Fiber that is currently executing this task in
+	 * a worker thread.
+	 */
+	struct fiber *fiber;
 	/** Return code of ->execute. */
 	int status;
 	/** If ->execute fails, the error is stored here. */
@@ -126,8 +159,6 @@ struct vy_task {
 	 */
 	double bloom_fpr;
 	int64_t page_size;
-	/** Link in vy_scheduler::pending_tasks. */
-	struct stailq_entry in_pending;
 	/** Link in vy_scheduler::processed_tasks. */
 	struct stailq_entry in_processed;
 };
@@ -241,16 +272,6 @@ vy_compact_heap_less(struct heap_node *a, struct heap_node *b)
 #undef HEAP_NAME
 
 static void
-vy_scheduler_async_cb(ev_loop *loop, struct ev_async *watcher, int events)
-{
-	(void)loop;
-	(void)events;
-	struct vy_scheduler *scheduler = container_of(watcher,
-			struct vy_scheduler, scheduler_async);
-	fiber_cond_signal(&scheduler->scheduler_cond);
-}
-
-static void
 vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 {
 	assert(!scheduler->is_worker_pool_running);
@@ -260,17 +281,19 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 	scheduler->is_worker_pool_running = true;
 	scheduler->idle_worker_count = scheduler->worker_pool_size;
 	scheduler->worker_pool = calloc(scheduler->worker_pool_size,
-					sizeof(struct cord));
+					sizeof(*scheduler->worker_pool));
 	if (scheduler->worker_pool == NULL)
 		panic("failed to allocate vinyl worker pool");
 
-	ev_async_start(scheduler->scheduler_loop, &scheduler->scheduler_async);
 	for (int i = 0; i < scheduler->worker_pool_size; i++) {
 		char name[FIBER_NAME_MAX];
 		snprintf(name, sizeof(name), "vinyl.writer.%d", i);
-		if (cord_start(&scheduler->worker_pool[i], name,
-				 vy_worker_f, scheduler) != 0)
+		struct vy_worker *worker = &scheduler->worker_pool[i];
+		if (cord_costart(&worker->cord, name, vy_worker_f, worker) != 0)
 			panic("failed to start vinyl worker thread");
+		cpipe_create(&worker->worker_pipe, name);
+		stailq_add_tail_entry(&scheduler->idle_workers,
+				      worker, in_idle);
 	}
 }
 
@@ -280,16 +303,12 @@ vy_scheduler_stop_workers(struct vy_scheduler *scheduler)
 	assert(scheduler->is_worker_pool_running);
 	scheduler->is_worker_pool_running = false;
 
-	/* Wake up worker threads. */
-	tt_pthread_mutex_lock(&scheduler->mutex);
-	pthread_cond_broadcast(&scheduler->worker_cond);
-	tt_pthread_mutex_unlock(&scheduler->mutex);
-
-	/* Wait for worker threads to exit. */
-	for (int i = 0; i < scheduler->worker_pool_size; i++)
-		cord_join(&scheduler->worker_pool[i]);
-	ev_async_stop(scheduler->scheduler_loop, &scheduler->scheduler_async);
-
+	for (int i = 0; i < scheduler->worker_pool_size; i++) {
+		struct vy_worker *worker = &scheduler->worker_pool[i];
+		cbus_stop_loop(&worker->worker_pipe);
+		cpipe_destroy(&worker->worker_pipe);
+		cord_join(&worker->cord);
+	}
 	free(scheduler->worker_pool);
 	scheduler->worker_pool = NULL;
 }
@@ -310,19 +329,14 @@ vy_scheduler_create(struct vy_scheduler *scheduler, int write_threads,
 	if (scheduler->scheduler_fiber == NULL)
 		panic("failed to allocate vinyl scheduler fiber");
 
-	scheduler->scheduler_loop = loop();
 	fiber_cond_create(&scheduler->scheduler_cond);
-	ev_async_init(&scheduler->scheduler_async, vy_scheduler_async_cb);
 
 	scheduler->worker_pool_size = write_threads;
 	mempool_create(&scheduler->task_pool, cord_slab_cache(),
 		       sizeof(struct vy_task));
-	stailq_create(&scheduler->pending_tasks);
+	stailq_create(&scheduler->idle_workers);
 	stailq_create(&scheduler->processed_tasks);
 
-	tt_pthread_cond_init(&scheduler->worker_cond, NULL);
-	tt_pthread_mutex_init(&scheduler->mutex, NULL);
-
 	vy_dump_heap_create(&scheduler->dump_heap);
 	vy_compact_heap_create(&scheduler->compact_heap);
 
@@ -344,9 +358,6 @@ vy_scheduler_destroy(struct vy_scheduler *scheduler)
 	if (scheduler->is_worker_pool_running)
 		vy_scheduler_stop_workers(scheduler);
 
-	tt_pthread_cond_destroy(&scheduler->worker_cond);
-	tt_pthread_mutex_destroy(&scheduler->mutex);
-
 	diag_destroy(&scheduler->diag);
 	mempool_destroy(&scheduler->task_pool);
 	fiber_cond_destroy(&scheduler->dump_cond);
@@ -647,6 +658,8 @@ vy_run_discard(struct vy_run *run)
 static int
 vy_task_write_run(struct vy_task *task)
 {
+	enum { YIELD_LOOPS = 32 };
+
 	struct vy_lsm *lsm = task->lsm;
 	struct vy_stmt_stream *wi = task->wi;
 
@@ -668,6 +681,7 @@ vy_task_write_run(struct vy_task *task)
 	if (wi->iface->start(wi) != 0)
 		goto fail_abort_writer;
 	int rc;
+	int loops = 0;
 	struct tuple *stmt = NULL;
 	while ((rc = wi->iface->next(wi, &stmt)) == 0 && stmt != NULL) {
 		inj = errinj(ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT, ERRINJ_DOUBLE);
@@ -678,7 +692,9 @@ vy_task_write_run(struct vy_task *task)
 		if (rc != 0)
 			break;
 
-		if (!task->scheduler->is_worker_pool_running) {
+		if (++loops % YIELD_LOOPS == 0)
+			fiber_sleep(0);
+		if (fiber_is_cancelled()) {
 			diag_set(FiberIsCancelled);
 			rc = -1;
 			break;
@@ -1316,6 +1332,62 @@ err_task:
 }
 
 /**
+ * Fiber function that actually executes a vinyl task.
+ * After finishing a task, it sends it back to tx.
+ */
+static int
+vy_task_f(va_list va)
+{
+	struct vy_task *task = va_arg(va, struct vy_task *);
+	task->status = task->ops->execute(task);
+	if (task->status != 0) {
+		struct diag *diag = diag_get();
+		assert(!diag_is_empty(diag));
+		diag_move(diag, &task->diag);
+	}
+	cmsg_init(&task->cmsg, vy_task_complete_route);
+	cpipe_push(&task->worker->tx_pipe, &task->cmsg);
+	task->fiber = NULL;
+	return 0;
+}
+
+/**
+ * Callback invoked by a worker thread upon receiving a task.
+ * It schedules a fiber which actually executes the task, so
+ * as not to block the event loop.
+ */
+static void
+vy_task_execute_f(struct cmsg *cmsg)
+{
+	struct vy_task *task = container_of(cmsg, struct vy_task, cmsg);
+	assert(task->fiber == NULL);
+	task->fiber = fiber_new("task", vy_task_f);
+	if (task->fiber == NULL) {
+		task->status = -1;
+		diag_move(diag_get(), &task->diag);
+		cmsg_init(&task->cmsg, vy_task_complete_route);
+		cpipe_push(&task->worker->tx_pipe, &task->cmsg);
+	} else {
+		fiber_start(task->fiber, task);
+	}
+}
+
+/**
+ * Callback invoked by the tx thread upon receiving an executed
+ * task from a worker thread. It adds the task to the processed
+ * task queue and wakes up the scheduler so that it can complete
+ * it.
+ */
+static void
+vy_task_complete_f(struct cmsg *cmsg)
+{
+	struct vy_task *task = container_of(cmsg, struct vy_task, cmsg);
+	stailq_add_tail_entry(&task->scheduler->processed_tasks,
+			      task, in_processed);
+	fiber_cond_signal(&task->scheduler->scheduler_cond);
+}
+
+/**
  * Create a task for dumping an LSM tree. The new task is returned
  * in @ptask. If there's no LSM tree that needs to be dumped @ptask
  * is set to NULL.
@@ -1503,13 +1575,10 @@ vy_scheduler_f(va_list va)
 		struct stailq processed_tasks;
 		struct vy_task *task, *next;
 		int tasks_failed = 0, tasks_done = 0;
-		bool was_empty;
 
 		/* Get the list of processed tasks. */
 		stailq_create(&processed_tasks);
-		tt_pthread_mutex_lock(&scheduler->mutex);
 		stailq_concat(&processed_tasks, &scheduler->processed_tasks);
-		tt_pthread_mutex_unlock(&scheduler->mutex);
 
 		/* Complete and delete all processed tasks. */
 		stailq_foreach_entry_safe(task, next, &processed_tasks,
@@ -1518,6 +1587,8 @@ vy_scheduler_f(va_list va)
 				tasks_failed++;
 			else
 				tasks_done++;
+			stailq_add_entry(&scheduler->idle_workers,
+					 task->worker, in_idle);
 			vy_task_delete(task);
 			scheduler->idle_worker_count++;
 			assert(scheduler->idle_worker_count <=
@@ -1553,15 +1624,13 @@ vy_scheduler_f(va_list va)
 			goto wait;
 
 		/* Queue the task and notify workers if necessary. */
-		tt_pthread_mutex_lock(&scheduler->mutex);
-		was_empty = stailq_empty(&scheduler->pending_tasks);
-		stailq_add_tail_entry(&scheduler->pending_tasks,
-				      task, in_pending);
-		if (was_empty)
-			tt_pthread_cond_signal(&scheduler->worker_cond);
-		tt_pthread_mutex_unlock(&scheduler->mutex);
-
+		assert(!stailq_empty(&scheduler->idle_workers));
+		task->worker = stailq_shift_entry(&scheduler->idle_workers,
+						  struct vy_worker, in_idle);
 		scheduler->idle_worker_count--;
+		cmsg_init(&task->cmsg, vy_task_execute_route);
+		cpipe_push(&task->worker->worker_pipe, &task->cmsg);
+
 		fiber_reschedule();
 		continue;
 error:
@@ -1597,41 +1666,17 @@ wait:
 	return 0;
 }
 
-static void *
-vy_worker_f(void *arg)
+static int
+vy_worker_f(va_list ap)
 {
-	struct vy_scheduler *scheduler = arg;
-	struct vy_task *task = NULL;
-
-	tt_pthread_mutex_lock(&scheduler->mutex);
-	while (scheduler->is_worker_pool_running) {
-		/* Wait for a task */
-		if (stailq_empty(&scheduler->pending_tasks)) {
-			/* Wake scheduler up if there are no more tasks */
-			ev_async_send(scheduler->scheduler_loop,
-				      &scheduler->scheduler_async);
-			tt_pthread_cond_wait(&scheduler->worker_cond,
-					     &scheduler->mutex);
-			continue;
-		}
-		task = stailq_shift_entry(&scheduler->pending_tasks,
-					  struct vy_task, in_pending);
-		tt_pthread_mutex_unlock(&scheduler->mutex);
-		assert(task != NULL);
-
-		/* Execute task */
-		task->status = task->ops->execute(task);
-		if (task->status != 0) {
-			struct diag *diag = diag_get();
-			assert(!diag_is_empty(diag));
-			diag_move(diag, &task->diag);
-		}
-
-		/* Return processed task to scheduler */
-		tt_pthread_mutex_lock(&scheduler->mutex);
-		stailq_add_tail_entry(&scheduler->processed_tasks,
-				      task, in_processed);
-	}
-	tt_pthread_mutex_unlock(&scheduler->mutex);
-	return NULL;
+	struct vy_worker *worker = va_arg(ap, struct vy_worker *);
+	struct cbus_endpoint endpoint;
+
+	cpipe_create(&worker->tx_pipe, "tx");
+	cbus_endpoint_create(&endpoint, cord_name(&worker->cord),
+			     fiber_schedule_cb, fiber());
+	cbus_loop(&endpoint);
+	cbus_endpoint_destroy(&endpoint, cbus_process);
+	cpipe_destroy(&worker->tx_pipe);
+	return 0;
 }
diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
index 284f666e..a235aa6f 100644
--- a/src/box/vy_scheduler.h
+++ b/src/box/vy_scheduler.h
@@ -42,16 +42,15 @@
 #define HEAP_FORWARD_DECLARATION
 #include "salad/heap.h"
 #include "salad/stailq.h"
-#include "tt_pthread.h"
 
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
 
-struct cord;
 struct fiber;
 struct vy_lsm;
 struct vy_run_env;
+struct vy_worker;
 struct vy_scheduler;
 
 typedef void
@@ -61,42 +60,26 @@ typedef void
 struct vy_scheduler {
 	/** Scheduler fiber. */
 	struct fiber *scheduler_fiber;
-	/** Scheduler event loop. */
-	struct ev_loop *scheduler_loop;
 	/** Used to wake up the scheduler fiber from TX. */
 	struct fiber_cond scheduler_cond;
-	/** Used to wake up the scheduler from a worker thread. */
-	struct ev_async scheduler_async;
 	/**
 	 * Array of worker threads used for performing
 	 * dump/compaction tasks.
 	 */
-	struct cord *worker_pool;
+	struct vy_worker *worker_pool;
 	/** Set if the worker threads are running. */
 	bool is_worker_pool_running;
 	/** Total number of worker threads. */
 	int worker_pool_size;
 	/** Number worker threads that are currently idle. */
 	int idle_worker_count;
+	/** List of idle workers, linked by vy_worker::in_idle. */
+	struct stailq idle_workers;
 	/** Memory pool used for allocating vy_task objects. */
 	struct mempool task_pool;
-	/** Queue of pending tasks, linked by vy_task::in_pending. */
-	struct stailq pending_tasks;
 	/** Queue of processed tasks, linked by vy_task::in_processed. */
 	struct stailq processed_tasks;
 	/**
-	 * Signaled to wake up a worker when there is
-	 * a pending task in the input queue. Also used
-	 * to stop worker threads on shutdown.
-	 */
-	pthread_cond_t worker_cond;
-	/**
-	 * Mutex protecting input and output queues and
-	 * the condition variable used to wake up worker
-	 * threads.
-	 */
-	pthread_mutex_t mutex;
-	/**
 	 * Heap of LSM trees, ordered by dump priority,
 	 * linked by vy_lsm::in_dump.
 	 */
-- 
2.11.0


* [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (18 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:43     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed Vladimir Davydov
  2018-07-08 16:48   ` [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This flag is set iff worker_pool != NULL, hence it is redundant.
---
 src/box/vy_scheduler.c | 9 +++------
 src/box/vy_scheduler.h | 2 --
 2 files changed, 3 insertions(+), 8 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index bd3ad4be..ac6b1f47 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -274,11 +274,10 @@ vy_compact_heap_less(struct heap_node *a, struct heap_node *b)
 static void
 vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 {
-	assert(!scheduler->is_worker_pool_running);
+	assert(scheduler->worker_pool == NULL);
 	/* One thread is reserved for dumps, see vy_schedule(). */
 	assert(scheduler->worker_pool_size >= 2);
 
-	scheduler->is_worker_pool_running = true;
 	scheduler->idle_worker_count = scheduler->worker_pool_size;
 	scheduler->worker_pool = calloc(scheduler->worker_pool_size,
 					sizeof(*scheduler->worker_pool));
@@ -300,9 +299,7 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 static void
 vy_scheduler_stop_workers(struct vy_scheduler *scheduler)
 {
-	assert(scheduler->is_worker_pool_running);
-	scheduler->is_worker_pool_running = false;
-
+	assert(scheduler->worker_pool != NULL);
 	for (int i = 0; i < scheduler->worker_pool_size; i++) {
 		struct vy_worker *worker = &scheduler->worker_pool[i];
 		cbus_stop_loop(&worker->worker_pipe);
@@ -355,7 +352,7 @@ vy_scheduler_destroy(struct vy_scheduler *scheduler)
 	fiber_cond_signal(&scheduler->dump_cond);
 	fiber_cond_signal(&scheduler->scheduler_cond);
 
-	if (scheduler->is_worker_pool_running)
+	if (scheduler->worker_pool != NULL)
 		vy_scheduler_stop_workers(scheduler);
 
 	diag_destroy(&scheduler->diag);
diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
index a235aa6f..deefacd7 100644
--- a/src/box/vy_scheduler.h
+++ b/src/box/vy_scheduler.h
@@ -67,8 +67,6 @@ struct vy_scheduler {
 	 * dump/compaction tasks.
 	 */
 	struct vy_worker *worker_pool;
-	/** Set if the worker threads are running. */
-	bool is_worker_pool_running;
 	/** Total number of worker threads. */
 	int worker_pool_size;
 	/** Number worker threads that are currently idle. */
-- 
2.11.0


* [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (19 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-31 20:44     ` Konstantin Osipov
  2018-07-08 16:48   ` [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

vy_task::status stores the return code of the ->execute method. There
are only two codes in use: 0 for success and -1 for failure. So let's
change this to a boolean flag.
---
 src/box/vy_scheduler.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index ac6b1f47..06dbb1f8 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -126,9 +126,9 @@ struct vy_task {
 	 * a worker thread.
 	 */
 	struct fiber *fiber;
-	/** Return code of ->execute. */
-	int status;
-	/** If ->execute fails, the error is stored here. */
+	/** Set if the task failed. */
+	bool is_failed;
+	/** In case of task failure the error is stored here. */
 	struct diag diag;
 	/** LSM tree this task is for. */
 	struct vy_lsm *lsm;
@@ -1336,10 +1336,10 @@ static int
 vy_task_f(va_list va)
 {
 	struct vy_task *task = va_arg(va, struct vy_task *);
-	task->status = task->ops->execute(task);
-	if (task->status != 0) {
+	if (task->ops->execute(task) != 0) {
 		struct diag *diag = diag_get();
 		assert(!diag_is_empty(diag));
+		task->is_failed = true;
 		diag_move(diag, &task->diag);
 	}
 	cmsg_init(&task->cmsg, vy_task_complete_route);
@@ -1360,7 +1360,7 @@ vy_task_execute_f(struct cmsg *cmsg)
 	assert(task->fiber == NULL);
 	task->fiber = fiber_new("task", vy_task_f);
 	if (task->fiber == NULL) {
-		task->status = -1;
+		task->is_failed = true;
 		diag_move(diag_get(), &task->diag);
 		cmsg_init(&task->cmsg, vy_task_complete_route);
 		cpipe_push(&task->worker->tx_pipe, &task->cmsg);
@@ -1526,7 +1526,7 @@ vy_task_complete(struct vy_task *task)
 	}
 
 	struct diag *diag = &task->diag;
-	if (task->status != 0) {
+	if (task->is_failed) {
 		assert(!diag_is_empty(diag));
 		goto fail; /* ->execute fialed */
 	}
-- 
2.11.0


* [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE
  2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
                     ` (20 preceding siblings ...)
  2018-07-08 16:48   ` [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed Vladimir Davydov
@ 2018-07-08 16:48   ` Vladimir Davydov
  2018-07-13 10:53     ` Vladimir Davydov
  21 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-08 16:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, in the presence of secondary indexes we have to read the
primary key on REPLACE and DELETE in order to delete the overwritten
tuple from secondary indexes. This patch eliminates the reads by handing
this job over to the primary index dump/compaction task.

In progress...

Closes #2129
---
 src/box/vinyl.c                    | 133 ++++++++++++++++++------
 src/box/vy_scheduler.c             | 200 ++++++++++++++++++++++++++++++++++++-
 src/box/vy_scheduler.h             |   8 ++
 src/box/vy_tx.c                    |  26 +++++
 test/vinyl/info.result             |   5 +
 test/vinyl/info.test.lua           |   3 +
 test/vinyl/layout.result           | 146 +++++++++++++++++----------
 test/vinyl/tx_gap_lock.result      |  16 +--
 test/vinyl/tx_gap_lock.test.lua    |  10 +-
 test/vinyl/write_iterator.result   |   5 +
 test/vinyl/write_iterator.test.lua |   3 +
 11 files changed, 461 insertions(+), 94 deletions(-)
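
The batching scheme the compaction task uses to hand deferred DELETEs
over to tx can be sketched in isolation as follows. This is only an
illustration, not the code from the patch: all names here (struct
batcher, batcher_add, batcher_flush, batcher_ack, tally_consume) are
hypothetical, statements are modelled as plain ints, BATCH_MAX stands
in for VY_DEFERRED_DELETE_BATCH_MAX, and the consumer callback is
invoked synchronously instead of going through cbus.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical batch size; kept small for illustration. */
enum { BATCH_MAX = 4 };

/* Overwritten statement and the statement that overwrote it. */
struct pair {
	int old_stmt;
	int new_stmt;
};

struct batch {
	/* Number of elements actually stored in the stmt array. */
	int count;
	struct pair stmt[BATCH_MAX];
};

/* Consumer callback (assumed interface); it must ack the batch. */
typedef void (*flush_f)(struct batch *batch, void *arg);

struct batcher {
	/* Batch being filled, NULL if none is allocated. */
	struct batch *cur;
	/* Batches handed to the consumer and not yet acked. */
	int in_progress;
	flush_f flush_cb;
	void *flush_arg;
};

static void
batcher_init(struct batcher *b, flush_f cb, void *arg)
{
	b->cur = NULL;
	b->in_progress = 0;
	b->flush_cb = cb;
	b->flush_arg = arg;
}

/* Hand the current batch, if any, over to the consumer. */
static void
batcher_flush(struct batcher *b)
{
	struct batch *batch = b->cur;
	if (batch == NULL)
		return;
	b->cur = NULL;
	b->in_progress++;
	b->flush_cb(batch, b->flush_arg);
}

/* Release a processed batch on the producer side. */
static void
batcher_ack(struct batcher *b, struct batch *batch)
{
	assert(b->in_progress > 0);
	free(batch);
	b->in_progress--;
}

/* Add a pair; allocate on demand, flush once the batch is full. */
static int
batcher_add(struct batcher *b, int old_stmt, int new_stmt)
{
	if (b->cur == NULL) {
		b->cur = calloc(1, sizeof(*b->cur));
		if (b->cur == NULL)
			return -1;
	}
	struct batch *batch = b->cur;
	assert(batch->count < BATCH_MAX);
	batch->stmt[batch->count].old_stmt = old_stmt;
	batch->stmt[batch->count].new_stmt = new_stmt;
	batch->count++;
	if (batch->count == BATCH_MAX)
		batcher_flush(b);
	return 0;
}

/* Example consumer: counts what it receives and acks at once. */
struct tally {
	struct batcher *owner;
	int batches;
	int stmts;
};

static void
tally_consume(struct batch *batch, void *arg)
{
	struct tally *t = arg;
	t->batches++;
	t->stmts += batch->count;
	batcher_ack(t->owner, batch);
}
```

A caller would add pairs with batcher_add() while iterating and call
batcher_flush() once at the end to push out a partially filled batch;
in_progress dropping back to zero plays the role of the condition the
task waits on before it completes.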

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 7e23dd93..5acba436 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -65,6 +65,7 @@
 #include "engine.h"
 #include "space.h"
 #include "index.h"
+#include "schema.h"
 #include "xstream.h"
 #include "info.h"
 #include "column_mask.h"
@@ -1282,25 +1283,39 @@ vy_get_by_secondary_tuple(struct vy_lsm *lsm, struct vy_tx *tx,
 			  struct tuple *tuple, struct tuple **result)
 {
 	assert(lsm->index_id > 0);
-	/*
-	 * No need in vy_tx_track() as the tuple must already be
-	 * tracked in the secondary index LSM tree.
-	 */
+
 	if (vy_point_lookup(lsm->pk, tx, rv, tuple, result) != 0)
 		return -1;
 
-	if (*result == NULL) {
+	if (*result == NULL ||
+	    vy_tuple_compare(*result, tuple, lsm->key_def) != 0) {
 		/*
-		 * All indexes of a space must be consistent, i.e.
-		 * if a tuple is present in one index, it must be
-		 * present in all other indexes as well, so we can
-		 * get here only if there's a bug somewhere in vinyl.
-		 * Don't abort as core dump won't really help us in
-		 * this case. Just warn the user and proceed to the
-		 * next tuple.
+		 * If a tuple read from a secondary index doesn't
+		 * match the tuple corresponding to it in the
+		 * primary index, it must have been overwritten or
+		 * deleted, but the DELETE statement hasn't been
+		 * propagated to the secondary index yet. In this
+		 * case silently skip this tuple.
 		 */
-		say_warn("%s: key %s missing in primary index",
-			 vy_lsm_name(lsm), vy_stmt_str(tuple));
+		if (*result != NULL) {
+			tuple_unref(*result);
+			*result = NULL;
+		}
+		vy_cache_on_write(&lsm->cache, tuple, NULL);
+		return 0;
+	}
+
+	/*
+	 * Even though the tuple is tracked in the secondary index
+	 * read set, we still must track the full tuple read from
+	 * the primary index, otherwise the transaction won't be
+	 * aborted if this tuple is overwritten or deleted, because
+	 * the DELETE statement is not written to secondary indexes
+	 * immediately.
+	 */
+	if (tx != NULL && vy_tx_track_point(tx, lsm->pk, *result) != 0) {
+		tuple_unref(*result);
+		return -1;
 	}
 
 	if ((*rv)->vlsn == INT64_MAX)
@@ -1613,7 +1628,6 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	struct vy_lsm *lsm = vy_lsm_find_unique(space, request->index_id);
 	if (lsm == NULL)
 		return -1;
-	bool has_secondary = space->index_count > 1;
 	const char *key = request->key;
 	uint32_t part_count = mp_decode_array(&key);
 	if (vy_unique_key_validate(lsm, key, part_count))
@@ -1623,12 +1637,9 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	 * before deletion.
 	 * - if the space has on_replace triggers and need to pass
 	 *   to them the old tuple.
-	 *
-	 * - if the space has one or more secondary indexes, then
-	 *   we need to extract secondary keys from the old tuple
-	 *   and pass them to indexes for deletion.
+	 * - if deletion is done by a secondary index.
 	 */
-	if (has_secondary || !rlist_empty(&space->on_replace)) {
+	if (lsm->index_id > 0 || !rlist_empty(&space->on_replace)) {
 		if (vy_get_by_raw_key(lsm, tx, vy_tx_read_view(tx),
 				      key, part_count, &stmt->old_tuple) != 0)
 			return -1;
@@ -1637,8 +1648,7 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	}
 	int rc = 0;
 	struct tuple *delete;
-	if (has_secondary) {
-		assert(stmt->old_tuple != NULL);
+	if (stmt->old_tuple != NULL) {
 		delete = vy_stmt_new_surrogate_delete(pk->mem_format,
 						      stmt->old_tuple);
 		if (delete == NULL)
@@ -1651,12 +1661,14 @@ vy_delete(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 			if (rc != 0)
 				break;
 		}
-	} else { /* Primary is the single index in the space. */
+	} else {
 		assert(lsm->index_id == 0);
 		delete = vy_stmt_new_surrogate_delete_from_key(request->key,
 						pk->key_def, pk->mem_format);
 		if (delete == NULL)
 			return -1;
+		if (space->index_count > 1)
+			vy_stmt_set_flags(delete, VY_STMT_DEFERRED_DELETE);
 		rc = vy_tx_set(tx, pk, delete);
 	}
 	tuple_unref(delete);
@@ -2175,14 +2187,14 @@ vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	/*
 	 * Get the overwritten tuple from the primary index if
 	 * the space has on_replace triggers, in which case we
-	 * need to pass the old tuple to trigger callbacks, or
-	 * if the space has secondary indexes and so we need
-	 * the old tuple to delete it from them.
+	 * need to pass the old tuple to trigger callbacks.
 	 */
-	if (space->index_count > 1 || !rlist_empty(&space->on_replace)) {
+	if (!rlist_empty(&space->on_replace)) {
 		if (vy_get(pk, tx, vy_tx_read_view(tx),
 			   stmt->new_tuple, &stmt->old_tuple) != 0)
 			return -1;
+	} else if (space->index_count > 1) {
+		vy_stmt_set_flags(stmt->new_tuple, VY_STMT_DEFERRED_DELETE);
 	}
 	/*
 	 * Replace in the primary index without explicit deletion
@@ -2454,6 +2466,71 @@ vinyl_engine_rollback_statement(struct engine *engine, struct txn *txn,
 
 /* }}} Public API of transaction control */
 
+/* {{{ Deferred DELETE handling */
+
+static int
+vy_deferred_delete_one(struct vy_lsm *lsm, struct tuple *delete,
+		       struct tuple *old_stmt, struct tuple *new_stmt,
+		       const struct tuple **region_stmt)
+{
+	if (vy_stmt_type(new_stmt) == IPROTO_REPLACE &&
+	    vy_tuple_compare(old_stmt, new_stmt, lsm->key_def) == 0)
+		return 0;
+	if (unlikely(lsm->mem->schema_version != schema_version ||
+		     lsm->mem->generation != *lsm->env->p_generation)) {
+		if (vy_lsm_rotate_mem(lsm) != 0)
+			return -1;
+	}
+	if (vy_lsm_set(lsm, lsm->mem, delete, region_stmt) != 0)
+		return -1;
+	vy_lsm_commit_stmt(lsm, lsm->mem, *region_stmt);
+	return 0;
+}
+
+static int
+vy_deferred_delete(struct tuple *old_stmt, struct tuple *new_stmt, void *arg)
+{
+	struct vy_lsm *pk = arg;
+	assert(pk->index_id == 0);
+	if (pk->is_dropped)
+		return 0;
+	struct space *space = space_by_id(pk->space_id);
+	if (space == NULL)
+		return 0;
+	if (space->index_count <= 1)
+		return 0;
+
+	struct tuple *delete;
+	delete = vy_stmt_new_surrogate_delete(pk->mem_format, old_stmt);
+	if (delete == NULL)
+		return -1;
+
+	vy_stmt_set_lsn(delete, vy_stmt_lsn(new_stmt));
+	vy_stmt_set_flags(delete, VY_STMT_SKIP_READ);
+
+	int rc = 0;
+	struct vy_env *env = container_of(pk->env, struct vy_env, lsm_env);
+	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
+
+	const struct tuple *region_stmt = NULL;
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space_index(space, i));
+		rc = vy_deferred_delete_one(lsm, delete, old_stmt, new_stmt,
+					    &region_stmt);
+		if (rc != 0)
+			break;
+	}
+
+	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
+	assert(mem_used_after >= mem_used_before);
+	vy_quota_force_use(&env->quota, mem_used_after - mem_used_before);
+
+	tuple_unref(delete);
+	return rc;
+}
+
+/* }}} Deferred DELETE handling */
+
 /** {{{ Environment */
 
 static void
@@ -2615,7 +2692,7 @@ vy_env_new(const char *path, size_t memory,
 
 	vy_mem_env_create(&e->mem_env, e->memory);
 	vy_scheduler_create(&e->scheduler, e->write_threads,
-			    vy_env_dump_complete_cb,
+			    vy_env_dump_complete_cb, vy_deferred_delete,
 			    &e->run_env, &e->xm->read_views);
 
 	if (vy_lsm_env_create(&e->lsm_env, e->path,
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 06dbb1f8..a070b46f 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -65,6 +65,8 @@ static int vy_worker_f(va_list);
 static int vy_scheduler_f(va_list);
 static void vy_task_execute_f(struct cmsg *);
 static void vy_task_complete_f(struct cmsg *);
+static void vy_deferred_delete_batch_process_f(struct cmsg *);
+static void vy_deferred_delete_batch_free_f(struct cmsg *);
 
 static const struct cmsg_hop vy_task_execute_route[] = {
 	{ vy_task_execute_f, NULL },
@@ -83,10 +85,42 @@ struct vy_worker {
 	struct cpipe tx_pipe;
 	/** Link in vy_scheduler::idle_workers. */
 	struct stailq_entry in_idle;
+	/** Route for sending deferred DELETEs back to tx. */
+	struct cmsg_hop deferred_delete_route[2];
 };
 
 struct vy_task;
 
+/** Max number of statements in a batch of deferred DELETEs. */
+enum { VY_DEFERRED_DELETE_BATCH_MAX = 100 };
+
+/** Deferred DELETE statement. */
+struct vy_deferred_delete_stmt {
+	/** Overwritten tuple. */
+	struct tuple *old_stmt;
+	/** Statement that overwrote @old_stmt. */
+	struct tuple *new_stmt;
+};
+
+/**
+ * Batch of deferred DELETE statements generated during
+ * a primary index compaction.
+ */
+struct vy_deferred_delete_batch {
+	/** CBus messages for sending the batch to tx. */
+	struct cmsg cmsg;
+	/** Task that generated this batch. */
+	struct vy_task *task;
+	/** Set if the tx thread failed to process the batch. */
+	bool is_failed;
+	/** In case of failure the error is stored here. */
+	struct diag diag;
+	/** Number of elements actually stored in @stmt array. */
+	int count;
+	/** Array of deferred DELETE statements. */
+	struct vy_deferred_delete_stmt stmt[VY_DEFERRED_DELETE_BATCH_MAX];
+};
+
 struct vy_task_ops {
 	/**
 	 * This function is called from a worker. It is supposed to do work
@@ -159,6 +193,13 @@ struct vy_task {
 	 */
 	double bloom_fpr;
 	int64_t page_size;
+	/** Batch of deferred deletes generated by this task. */
+	struct vy_deferred_delete_batch *deferred_delete_batch;
+	/**
+	 * Number of batches of deferred DELETEs sent to tx
+	 * and not yet processed.
+	 */
+	int deferred_delete_in_progress;
 	/** Link in vy_scheduler::processed_tasks. */
 	struct stailq_entry in_processed;
 };
@@ -204,6 +245,8 @@ vy_task_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 static void
 vy_task_delete(struct vy_task *task)
 {
+	assert(task->deferred_delete_batch == NULL);
+	assert(task->deferred_delete_in_progress == 0);
 	key_def_delete(task->cmp_def);
 	key_def_delete(task->key_def);
 	vy_lsm_unref(task->lsm);
@@ -293,6 +336,12 @@ vy_scheduler_start_workers(struct vy_scheduler *scheduler)
 		cpipe_create(&worker->worker_pipe, name);
 		stailq_add_tail_entry(&scheduler->idle_workers,
 				      worker, in_idle);
+
+		struct cmsg_hop *route = worker->deferred_delete_route;
+		route[0].f = vy_deferred_delete_batch_process_f;
+		route[0].pipe = &worker->worker_pipe;
+		route[1].f = vy_deferred_delete_batch_free_f;
+		route[1].pipe = NULL;
 	}
 }
 
@@ -313,11 +362,13 @@ vy_scheduler_stop_workers(struct vy_scheduler *scheduler)
 void
 vy_scheduler_create(struct vy_scheduler *scheduler, int write_threads,
 		    vy_scheduler_dump_complete_f dump_complete_cb,
+		    vy_deferred_delete_f deferred_delete_cb,
 		    struct vy_run_env *run_env, struct rlist *read_views)
 {
 	memset(scheduler, 0, sizeof(*scheduler));
 
 	scheduler->dump_complete_cb = dump_complete_cb;
+	scheduler->deferred_delete_cb = deferred_delete_cb;
 	scheduler->read_views = read_views;
 	scheduler->run_env = run_env;
 
@@ -652,6 +703,136 @@ vy_run_discard(struct vy_run *run)
 	vy_log_tx_try_commit();
 }
 
+/**
+ * Callback invoked by the tx thread to process deferred DELETEs
+ * generated during compaction.
+ */
+static void
+vy_deferred_delete_batch_process_f(struct cmsg *cmsg)
+{
+	struct vy_deferred_delete_batch *batch = container_of(cmsg,
+				struct vy_deferred_delete_batch, cmsg);
+	struct vy_task *task = batch->task;
+	struct vy_scheduler *scheduler = task->scheduler;
+
+	for (int i = 0; i < batch->count; i++) {
+		struct vy_deferred_delete_stmt *stmt = &batch->stmt[i];
+		if (scheduler->deferred_delete_cb(stmt->old_stmt,
+						  stmt->new_stmt,
+						  task->lsm) != 0) {
+			struct diag *diag = diag_get();
+			assert(!diag_is_empty(diag));
+			batch->is_failed = true;
+			diag_move(diag, &batch->diag);
+			return;
+		}
+	}
+}
+
+/**
+ * Callback invoked by a worker thread to free processed deferred
+ * DELETE statements. It must be done on behalf of the worker thread
+ * that generated those DELETEs, because a vinyl statement cannot
+ * be allocated and freed in different threads.
+ */
+static void
+vy_deferred_delete_batch_free_f(struct cmsg *cmsg)
+{
+	struct vy_deferred_delete_batch *batch = container_of(cmsg,
+				struct vy_deferred_delete_batch, cmsg);
+	struct vy_task *task = batch->task;
+	for (int i = 0; i < batch->count; i++) {
+		struct vy_deferred_delete_stmt *stmt = &batch->stmt[i];
+		vy_stmt_unref_if_possible(stmt->old_stmt);
+		vy_stmt_unref_if_possible(stmt->new_stmt);
+	}
+	/*
+	 * Abort the task if the tx thread failed to process
+	 * the batch unless it has already been aborted.
+	 */
+	if (batch->is_failed && !task->is_failed) {
+		assert(!diag_is_empty(&batch->diag));
+		diag_move(&batch->diag, &task->diag);
+		task->is_failed = true;
+		fiber_cancel(task->fiber);
+	}
+	diag_destroy(&batch->diag);
+	free(batch);
+	/* Notify the caller if this is the last batch. */
+	assert(task->deferred_delete_in_progress > 0);
+	if (--task->deferred_delete_in_progress == 0)
+		fiber_wakeup(task->fiber);
+}
+
+/**
+ * Send all deferred DELETEs accumulated by a vinyl task to
+ * the tx thread where they will be processed.
+ */
+static void
+vy_task_deferred_delete_flush(struct vy_task *task)
+{
+	struct vy_worker *worker = task->worker;
+	struct vy_deferred_delete_batch *batch = task->deferred_delete_batch;
+
+	if (batch == NULL)
+		return;
+
+	task->deferred_delete_batch = NULL;
+	task->deferred_delete_in_progress++;
+
+	cmsg_init(&batch->cmsg, worker->deferred_delete_route);
+	cpipe_push(&worker->tx_pipe, &batch->cmsg);
+}
+
+/**
+ * Wait for all deferred DELETE statements sent to tx to
+ * be processed and returned back to the worker.
+ */
+static void
+vy_task_deferred_delete_wait(struct vy_task *task)
+{
+	while (task->deferred_delete_in_progress > 0)
+		fiber_sleep(TIMEOUT_INFINITY);
+}
+
+/**
+ * Callback invoked by the write iterator during compaction to
+ * generate deferred DELETE statements. It adds a deferred DELETE
+ * to a batch. Once the batch gets full, it submits it to tx.
+ */
+static int
+vy_task_deferred_delete(struct tuple *old_stmt,
+			struct tuple *new_stmt, void *arg)
+{
+	struct vy_task *task = arg;
+	struct vy_deferred_delete_batch *batch = task->deferred_delete_batch;
+
+	/* Allocate a new batch on demand. */
+	if (batch == NULL) {
+		batch = malloc(sizeof(*batch));
+		if (batch == NULL) {
+			diag_set(OutOfMemory, sizeof(*batch), "malloc",
+				 "struct vy_deferred_delete_batch");
+			return -1;
+		}
+		memset(batch, 0, sizeof(*batch));
+		batch->task = task;
+		diag_create(&batch->diag);
+		task->deferred_delete_batch = batch;
+	}
+
+	assert(batch->count < VY_DEFERRED_DELETE_BATCH_MAX);
+	struct vy_deferred_delete_stmt *stmt = &batch->stmt[batch->count++];
+	stmt->old_stmt = old_stmt;
+	vy_stmt_ref_if_possible(old_stmt);
+	stmt->new_stmt = new_stmt;
+	vy_stmt_ref_if_possible(new_stmt);
+
+	if (batch->count == VY_DEFERRED_DELETE_BATCH_MAX)
+		vy_task_deferred_delete_flush(task);
+	return 0;
+}
+
 static int
 vy_task_write_run(struct vy_task *task)
 {
@@ -1006,7 +1187,9 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (lsm->run_count == 0);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views, NULL, NULL);
+				   scheduler->read_views,
+				   lsm->index_id > 0 ? NULL :
+				   vy_task_deferred_delete, task);
 	if (wi == NULL)
 		goto err_wi;
 	rlist_foreach_entry(mem, &lsm->sealed, in_sealed) {
@@ -1273,7 +1456,9 @@ vy_task_compact_new(struct vy_scheduler *scheduler, struct vy_lsm *lsm,
 	bool is_last_level = (range->compact_priority == range->slice_count);
 	wi = vy_write_iterator_new(task->cmp_def, lsm->disk_format,
 				   lsm->index_id == 0, is_last_level,
-				   scheduler->read_views, NULL, NULL);
+				   scheduler->read_views,
+				   lsm->index_id > 0 ? NULL :
+				   vy_task_deferred_delete, task);
 	if (wi == NULL)
 		goto err_wi;
 
@@ -1336,12 +1521,21 @@ static int
 vy_task_f(va_list va)
 {
 	struct vy_task *task = va_arg(va, struct vy_task *);
-	if (task->ops->execute(task) != 0) {
+	if (task->ops->execute(task) != 0 && !task->is_failed) {
 		struct diag *diag = diag_get();
 		assert(!diag_is_empty(diag));
 		task->is_failed = true;
 		diag_move(diag, &task->diag);
 	}
+
+	/*
+	 * We must not complete the task until we make sure that
+	 * all deferred DELETEs generated during task execution
+	 * have been successfully processed.
+	 */
+	vy_task_deferred_delete_flush(task);
+	vy_task_deferred_delete_wait(task);
+
 	cmsg_init(&task->cmsg, vy_task_complete_route);
 	cpipe_push(&task->worker->tx_pipe, &task->cmsg);
 	task->fiber = NULL;
diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
index deefacd7..f056eeff 100644
--- a/src/box/vy_scheduler.h
+++ b/src/box/vy_scheduler.h
@@ -43,6 +43,8 @@
 #include "salad/heap.h"
 #include "salad/stailq.h"
 
+#include "vy_write_iterator.h" /* vy_deferred_delete_f */
+
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
@@ -144,6 +146,11 @@ struct vy_scheduler {
 	 * by the dump.
 	 */
 	vy_scheduler_dump_complete_f dump_complete_cb;
+	/**
+	 * Callback invoked in the tx thread for each deferred DELETE
+	 * statement generated during compaction.
+	 */
+	vy_deferred_delete_f deferred_delete_cb;
 	/** List of read views, see tx_manager::read_views. */
 	struct rlist *read_views;
 	/** Context needed for writing runs. */
@@ -156,6 +163,7 @@ struct vy_scheduler {
 void
 vy_scheduler_create(struct vy_scheduler *scheduler, int write_threads,
 		    vy_scheduler_dump_complete_f dump_complete_cb,
+		    vy_deferred_delete_f deferred_delete_cb,
 		    struct vy_run_env *run_env, struct rlist *read_views);
 
 /**
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index f5bb624f..8100deef 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -536,6 +536,22 @@ vy_tx_prepare(struct vy_tx *tx)
 		if (v->is_overwritten)
 			continue;
 
+		if (lsm->index_id > 0 && repsert == NULL && delete == NULL) {
+			/*
+			 * This statement is for a secondary index,
+			 * and the statement corresponding to it in
+			 * the primary index was overwritten. This
+			 * can only happen if insertion of DELETE
+			 * into secondary indexes was postponed until
+			 * primary index compaction. In this case
+			 * the DELETE will not be propagated, because
+			 * the corresponding statement never made it
+			 * to the primary index LSM tree. So we must
+			 * skip it for secondary indexes as well.
+			 */
+			continue;
+		}
+
 		enum iproto_type type = vy_stmt_type(v->stmt);
 
 		/* Optimize out INSERT + DELETE for the same key. */
@@ -550,6 +566,16 @@ vy_tx_prepare(struct vy_tx *tx)
 			 */
 			type = IPROTO_INSERT;
 			vy_stmt_set_type(v->stmt, type);
+			/*
+			 * In case of INSERT, no statement was actually
+			 * overwritten so no need to generate a deferred
+			 * DELETE for secondary indexes.
+			 */
+			uint8_t flags = vy_stmt_flags(v->stmt);
+			if (flags & VY_STMT_DEFERRED_DELETE) {
+				vy_stmt_set_flags(v->stmt, flags &
+						  ~VY_STMT_DEFERRED_DELETE);
+			}
 		}
 
 		if (!v->is_first_insert && type == IPROTO_INSERT) {
diff --git a/test/vinyl/info.result b/test/vinyl/info.result
index 112ba85e..950a56cf 100644
--- a/test/vinyl/info.result
+++ b/test/vinyl/info.result
@@ -1032,6 +1032,11 @@ s:drop()
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
 ...
+-- Install on_replace trigger to disable REPLACE/DELETE
+-- optimization in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
+---
+...
 s:bsize()
 ---
 - 0
diff --git a/test/vinyl/info.test.lua b/test/vinyl/info.test.lua
index 863a8793..867415c9 100644
--- a/test/vinyl/info.test.lua
+++ b/test/vinyl/info.test.lua
@@ -321,6 +321,9 @@ s:drop()
 --
 
 s = box.schema.space.create('test', {engine = 'vinyl'})
+-- Install on_replace trigger to disable REPLACE/DELETE
+-- optimization in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
 s:bsize()
 i1 = s:create_index('i1', {parts = {1, 'unsigned'}, run_count_per_level = 1})
 i2 = s:create_index('i2', {parts = {2, 'unsigned'}, run_count_per_level = 1})
diff --git a/test/vinyl/layout.result b/test/vinyl/layout.result
index 1f928a8f..33f7e4b9 100644
--- a/test/vinyl/layout.result
+++ b/test/vinyl/layout.result
@@ -135,15 +135,15 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {2: 8, 9: 10}]
+          tuple: [5, {2: 9, 9: 10}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [4, {2: 5}]
+          tuple: [4, {2: 6}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [6, {2: 5}]
+          tuple: [6, {2: 6}]
       - HEADER:
           type: INSERT
         BODY:
@@ -151,7 +151,7 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [8, {1: 1, 2: 8, 8: 9}]
+          tuple: [8, {1: 1, 2: 9, 8: 10}]
       - HEADER:
           type: INSERT
         BODY:
@@ -160,15 +160,11 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 6, 9: 10}]
+          tuple: [5, {0: 2, 2: 7, 9: 10}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [4, {0: 2, 2: 4}]
-      - HEADER:
-          type: INSERT
-        BODY:
-          tuple: [6, {2: 4}]
+          tuple: [5, {0: 2, 2: 4, 9: 5}]
       - HEADER:
           type: INSERT
         BODY:
@@ -176,36 +172,35 @@ result
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [8, {1: 3, 2: 6, 8: 7}]
+          tuple: [8, {1: 3, 2: 4, 8: 5}]
       - HEADER:
           type: INSERT
         BODY:
-          tuple: [11, {}]
+          tuple: [8, {1: 3, 2: 7, 8: 8}]
       - HEADER:
-          timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [7, {2: 5}]
+          tuple: [11, {}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [7, {2: 4}]
+          tuple: [7, {2: 6}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [4, {0: 2, 2: 10}]
+          tuple: [4, {0: 2, 2: 11}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {0: 2, 2: 10, 9: 13}]
+          tuple: [5, {0: 2, 2: 11, 9: 13}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [8, {1: 3, 2: 10, 8: 11}]
+          tuple: [8, {1: 3, 2: 11, 8: 12}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
@@ -215,23 +210,23 @@ result
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [4, {2: 12}]
+          tuple: [4, {2: 13}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [5, {2: 12, 9: 13}]
+          tuple: [5, {2: 13, 9: 13}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
-          tuple: [8, {1: 1, 2: 12, 8: 13}]
+          tuple: [8, {1: 1, 2: 13, 8: 14}]
       - HEADER:
           timestamp: <timestamp>
           type: INSERT
         BODY:
           tuple: [10, {9: 13}]
-  - - 00000000000000000008.index
+  - - 00000000000000000009.index
     - - HEADER:
           type: RUNINFO
         BODY:
@@ -250,7 +245,7 @@ result
           unpacked_size: 67
           row_count: 3
           min_key: ['ёёё']
-  - - 00000000000000000008.run
+  - - 00000000000000000009.run
     - - HEADER:
           lsn: 10
           type: REPLACE
@@ -270,7 +265,7 @@ result
           type: ROWINDEX
         BODY:
           row_index: "\0\0\0\0\0\0\0\x10\0\0\0 "
-  - - 00000000000000000012.index
+  - - 00000000000000000013.index
     - - HEADER:
           type: RUNINFO
         BODY:
@@ -285,36 +280,86 @@ result
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 90
-          unpacked_size: 71
-          row_count: 3
+          size: 166
+          unpacked_size: 147
+          row_count: 6
           min_key: ['ёёё']
-  - - 00000000000000000012.run
+  - - 00000000000000000013.run
     - - HEADER:
           lsn: 11
           type: REPLACE
         BODY:
           tuple: ['ёёё', 123]
+          flags: 1
+      - HEADER:
+          lsn: 11
+          type: REPLACE
+        BODY:
+          tuple: ['ёёё', 123]
+          flags: 1
       - HEADER:
           lsn: 13
           type: REPLACE
         BODY:
           tuple: ['ююю', 789]
+          flags: 1
+      - HEADER:
+          lsn: 13
+          type: REPLACE
+        BODY:
+          tuple: ['ююю', 789]
+          flags: 1
+      - HEADER:
+          lsn: 12
+          type: REPLACE
+        BODY:
+          tuple: ['ЮЮЮ', 456]
+          flags: 1
       - HEADER:
           lsn: 12
           type: REPLACE
         BODY:
           tuple: ['ЮЮЮ', 456]
+          flags: 1
       - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0\""
-  - - 00000000000000000006.index
+          row_index: "\0\0\0\0\0\0\0\x12\0\0\0$\0\0\08\0\0\0L\0\0\0`"
+  - - 00000000000000000004.index
     - - HEADER:
           type: RUNINFO
         BODY:
-          min_lsn: 8
-          max_key: [null, 'ЭЭЭ']
+          min_lsn: 5
+          max_key: [777, 'ЁЁЁ']
+          page_count: 1
+          bloom_filter: <bloom_filter>
+          max_lsn: 5
+          min_key: [777, 'ЁЁЁ']
+      - HEADER:
+          type: PAGEINFO
+        BODY:
+          row_index_offset: <offset>
+          offset: <offset>
+          size: 48
+          unpacked_size: 29
+          row_count: 1
+          min_key: [777, 'ЁЁЁ']
+  - - 00000000000000000004.run
+    - - HEADER:
+          lsn: 5
+          type: INSERT
+        BODY:
+          tuple: [777, 'ЁЁЁ']
+      - HEADER:
+          type: ROWINDEX
+        BODY:
+          row_index: "\0\0\0\0"
+  - - 00000000000000000007.index
+    - - HEADER:
+          type: RUNINFO
+        BODY:
+          min_lsn: 6
+          max_key: [777, 'ЁЁЁ']
           page_count: 1
           bloom_filter: <bloom_filter>
           max_lsn: 10
@@ -324,11 +369,11 @@ result
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 86
-          unpacked_size: 67
-          row_count: 3
+          size: 110
+          unpacked_size: 91
+          row_count: 4
           min_key: [null, 'ёёё']
-  - - 00000000000000000006.run
+  - - 00000000000000000007.run
     - - HEADER:
           lsn: 10
           type: REPLACE
@@ -345,10 +390,16 @@ result
         BODY:
           tuple: [null, 'ЭЭЭ']
       - HEADER:
+          lsn: 6
+          type: DELETE
+        BODY:
+          key: [777, 'ЁЁЁ']
+          flags: 2
+      - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 "
-  - - 00000000000000000010.index
+          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 \0\0\00"
+  - - 00000000000000000011.index
     - - HEADER:
           type: RUNINFO
         BODY:
@@ -357,24 +408,19 @@ result
           page_count: 1
           bloom_filter: <bloom_filter>
           max_lsn: 13
-          min_key: [null, 'ёёё']
+          min_key: [123, 'ёёё']
       - HEADER:
           type: PAGEINFO
         BODY:
           row_index_offset: <offset>
           offset: <offset>
-          size: 110
-          unpacked_size: 91
-          row_count: 4
-          min_key: [null, 'ёёё']
-  - - 00000000000000000010.run
+          size: 90
+          unpacked_size: 71
+          row_count: 3
+          min_key: [123, 'ёёё']
+  - - 00000000000000000011.run
     - - HEADER:
           lsn: 11
-          type: DELETE
-        BODY:
-          key: [null, 'ёёё']
-      - HEADER:
-          lsn: 11
           type: REPLACE
         BODY:
           tuple: [123, 'ёёё']
@@ -391,7 +437,7 @@ result
       - HEADER:
           type: ROWINDEX
         BODY:
-          row_index: "\0\0\0\0\0\0\0\x10\0\0\0 \0\0\02"
+          row_index: "\0\0\0\0\0\0\0\x10\0\0\0\""
 ...
 test_run:cmd("clear filter")
 ---
diff --git a/test/vinyl/tx_gap_lock.result b/test/vinyl/tx_gap_lock.result
index 150826cb..a456c017 100644
--- a/test/vinyl/tx_gap_lock.result
+++ b/test/vinyl/tx_gap_lock.result
@@ -1194,8 +1194,8 @@ s:drop()
 ---
 ...
 ----------------------------------------------------------------
--- gh-2534: Iterator over a secondary index doesn't double track
--- results in the primary index.
+-- Iterator over a secondary index tracks all results in the
+-- primary index. Needed for gh-2129.
 ----------------------------------------------------------------
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
@@ -1219,23 +1219,23 @@ gap_lock_count() -- 0
 _ = s.index.sk:select({}, {limit = 50})
 ---
 ...
-gap_lock_count() -- 1
+gap_lock_count() -- 51
 ---
-- 1
+- 51
 ...
 for i = 1, 100 do s.index.sk:get(i) end
 ---
 ...
-gap_lock_count() -- 51
+gap_lock_count() -- 151
 ---
-- 51
+- 151
 ...
 _ = s.index.sk:select()
 ---
 ...
-gap_lock_count() -- 1
+gap_lock_count() -- 101
 ---
-- 1
+- 101
 ...
 box.commit()
 ---
diff --git a/test/vinyl/tx_gap_lock.test.lua b/test/vinyl/tx_gap_lock.test.lua
index 4d8d21d8..4ad55860 100644
--- a/test/vinyl/tx_gap_lock.test.lua
+++ b/test/vinyl/tx_gap_lock.test.lua
@@ -380,8 +380,8 @@ c4:commit()
 
 s:drop()
 ----------------------------------------------------------------
--- gh-2534: Iterator over a secondary index doesn't double track
--- results in the primary index.
+-- Iterator over a secondary index tracks all results in the
+-- primary index. Needed for gh-2129.
 ----------------------------------------------------------------
 s = box.schema.space.create('test', {engine = 'vinyl'})
 _ = s:create_index('pk', {parts = {1, 'unsigned'}})
@@ -390,11 +390,11 @@ for i = 1, 100 do s:insert{i, i} end
 box.begin()
 gap_lock_count() -- 0
 _ = s.index.sk:select({}, {limit = 50})
-gap_lock_count() -- 1
-for i = 1, 100 do s.index.sk:get(i) end
 gap_lock_count() -- 51
+for i = 1, 100 do s.index.sk:get(i) end
+gap_lock_count() -- 151
 _ = s.index.sk:select()
-gap_lock_count() -- 1
+gap_lock_count() -- 101
 box.commit()
 gap_lock_count() -- 0
 s:drop()
diff --git a/test/vinyl/write_iterator.result b/test/vinyl/write_iterator.result
index 162d8463..8ccd125a 100644
--- a/test/vinyl/write_iterator.result
+++ b/test/vinyl/write_iterator.result
@@ -741,6 +741,11 @@ space:drop()
 s = box.schema.space.create('test', {engine = 'vinyl'})
 ---
 ...
+-- Install on_replace trigger to disable REPLACE/DELETE
+-- optimization in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
+---
+...
 pk = s:create_index('primary', {run_count_per_level = 1})
 ---
 ...
diff --git a/test/vinyl/write_iterator.test.lua b/test/vinyl/write_iterator.test.lua
index 9a6cc480..82d92649 100644
--- a/test/vinyl/write_iterator.test.lua
+++ b/test/vinyl/write_iterator.test.lua
@@ -317,6 +317,9 @@ space:drop()
 -- gh-2875 INSERT+DELETE pairs are annihilated on compaction
 
 s = box.schema.space.create('test', {engine = 'vinyl'})
+-- Install on_replace trigger to disable REPLACE/DELETE
+-- optimization in the secondary index (gh-2129).
+_ = s:on_replace(function() end)
 pk = s:create_index('primary', {run_count_per_level = 1})
 sk = s:create_index('secondary', {run_count_per_level = 1, parts = {2, 'unsigned'}})
 PAD1 = 100
-- 
2.11.0
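
The batching scheme used by vy_task_deferred_delete() and
vy_task_deferred_delete_flush() in the patch above can be sketched in
isolation. This is a minimal single-threaded model, not the patch's code:
all toy_* names and the capacity of 4 are hypothetical, and a plain linked
list stands in for the cpipe to the tx thread. It only shows the shape of
the logic: allocate a batch on demand, flush it when it fills up, flush the
tail at the end, and count batches in flight until the consumer returns
them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical capacity; the patch uses VY_DEFERRED_DELETE_BATCH_MAX. */
enum { TOY_BATCH_MAX = 4 };

struct toy_batch {
	struct toy_batch *next;	/* link in the list of sent batches */
	int count;		/* number of items stored so far */
	int items[TOY_BATCH_MAX];
};

struct toy_task {
	struct toy_batch *batch;	/* batch being filled, or NULL */
	struct toy_batch *sent;		/* batches handed to the consumer */
	int in_progress;		/* sent batches not yet completed */
};

/* Hand the current batch (if any) over to the consumer. */
static void
toy_flush(struct toy_task *task)
{
	struct toy_batch *batch = task->batch;
	if (batch == NULL)
		return;
	task->batch = NULL;
	task->in_progress++;
	batch->next = task->sent;
	task->sent = batch;
}

/* Add an item, allocating a batch on demand and flushing it once full. */
static int
toy_add(struct toy_task *task, int item)
{
	struct toy_batch *batch = task->batch;
	if (batch == NULL) {
		batch = calloc(1, sizeof(*batch));
		if (batch == NULL)
			return -1;
		task->batch = batch;
	}
	assert(batch->count < TOY_BATCH_MAX);
	batch->items[batch->count++] = item;
	if (batch->count == TOY_BATCH_MAX)
		toy_flush(task);
	return 0;
}

/* Consumer side: process and free every sent batch, return item count. */
static int
toy_complete_all(struct toy_task *task)
{
	int processed = 0;
	while (task->sent != NULL) {
		struct toy_batch *batch = task->sent;
		task->sent = batch->next;
		processed += batch->count;
		free(batch);
		task->in_progress--;
	}
	return processed;
}

/* Feed n_items through the task; return how many batches were sent. */
static int
toy_batch_demo(int n_items)
{
	struct toy_task task = { NULL, NULL, 0 };
	int rc = 0;
	for (int i = 0; i < n_items && rc == 0; i++)
		rc = toy_add(&task, i);
	toy_flush(&task);	/* send the partially filled tail */
	int sent = task.in_progress;
	int processed = toy_complete_all(&task);
	if (rc != 0)
		return -1;
	assert(processed == n_items);
	assert(task.in_progress == 0);
	return sent;
}
```

Calling toy_batch_demo(10) from a main() sends two full batches plus one
partial tail. In the real patch the completion step runs as a cbus callback
in the worker thread and the task fiber sleeps until the counter drops to
zero; the sketch collapses that into a synchronous loop.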

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-08 16:48   ` [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request Vladimir Davydov
@ 2018-07-10 12:15     ` Konstantin Osipov
  2018-07-10 12:19       ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-10 12:15 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> Since in presence of secondary indexes we read the primary index when
> processing a REPLACE request anyway, we turn it into INSERT if no tuple
> matching the new tuple is found so that INSERT+DELETE gets annihilated
> on compaction.
> 
> However, in the scope of #2129 we are planning to optimize the read out
> so that this transformation won't be possible anymore. So let's remove
> it now.

Ugh. What if we deal with a space which has a unique secondary
key, so optimization B is not applicable. You removed optimization
A for all spaces. 

Can we keep it? 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-10 12:15     ` Konstantin Osipov
@ 2018-07-10 12:19       ` Vladimir Davydov
  2018-07-10 18:39         ` Konstantin Osipov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-10 12:19 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 10, 2018 at 03:15:27PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > Since in presence of secondary indexes we read the primary index when
> > processing a REPLACE request anyway, we turn it into INSERT if no tuple
> > matching the new tuple is found so that INSERT+DELETE gets annihilated
> > on compaction.
> > 
> > However, in the scope of #2129 we are planning to optimize the read out
> > so that this transformation won't be possible anymore. So let's remove
> > it now.
> 
> Ugh. What if we deal with a space which has a unique secondary
> key, so optimization B is not applicable. You removed optimization
> A for all spaces. 

The optimization works even if secondary indexes are unique.

* Re: [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version
  2018-07-08 16:48   ` [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version Vladimir Davydov
@ 2018-07-10 16:19     ` Konstantin Osipov
  2018-07-10 16:43       ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-10 16:19 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> Currently, vy_point_lookup(), in contrast to vy_read_iterator, doesn't
> rescan the memory level after reading disk, so if the caller doesn't
> track the key before calling this function, the caller won't be sent to
> a read view in case the key gets updated during yield and hence will
> be returned a stale tuple. This is OK now, because we always track the
> key before calling vy_point_lookup(), either in the primary or in a
> secondary index. However, for #2129 we need it to always return the
> latest tuple version, no matter if the key is tracked or not.
> 
> The point is in the scope of #2129 we won't write DELETE statements to
> secondary indexes corresponding to a tuple replaced in the primary
> index. Instead after reading a tuple from a secondary index we will
> check whether it matches the tuple corresponding to it in the primary
> index: if it is not, it means that the tuple read from the secondary
> index was overwritten and should be skipped. E.g. suppose we have the
> primary index over the first field and a secondary index over the second
> field and the following statements in the space:
> 
>   REPLACE{1, 10}
>   REPLACE{1, 20}
> 
> Then reading {10} from the secondary index will return REPLACE{1, 10}, but
> lookup of {1} in the primary index will return REPLACE{1, 20} which
> doesn't match REPLACE{1, 10} read from the secondary index hence the
> latter was overwritten and should be skipped.
> 
> The problem is in the example above we don't want to track key {1} in
> the primary index before lookup, because we don't actually read its
> value. So for the check to work correctly, we need the point lookup to
> guarantee that the returned tuple is always the newest one. It's fairly
> easy to do - we just need to rescan the memory level after yielding on
> disk if its version changed.

Thank you for the explanation. I haven't read the patch itself
yet. But aren't you complicating things more than necessary? All
we need to do when looking up a match in the primary index is to
compare the match LSN and the secondary index tuple LSN. If there
is a mismatch, then we need to skip the secondary key tuple: it's
garbage. The mismatch does not need to take into account new
tuples which appeared during yield, since a mismatch can not
appear during yield.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version
  2018-07-10 16:19     ` Konstantin Osipov
@ 2018-07-10 16:43       ` Vladimir Davydov
  2018-07-11 16:33         ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-10 16:43 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 10, 2018 at 07:19:26PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > Currently, vy_point_lookup(), in contrast to vy_read_iterator, doesn't
> > rescan the memory level after reading disk, so if the caller doesn't
> > track the key before calling this function, the caller won't be sent to
> > a read view in case the key gets updated during yield and hence will
> > be returned a stale tuple. This is OK now, because we always track the
> > key before calling vy_point_lookup(), either in the primary or in a
> > secondary index. However, for #2129 we need it to always return the
> > latest tuple version, no matter if the key is tracked or not.
> > 
> > The point is in the scope of #2129 we won't write DELETE statements to
> > secondary indexes corresponding to a tuple replaced in the primary
> > index. Instead after reading a tuple from a secondary index we will
> > check whether it matches the tuple corresponding to it in the primary
> > index: if it is not, it means that the tuple read from the secondary
> > index was overwritten and should be skipped. E.g. suppose we have the
> > primary index over the first field and a secondary index over the second
> > field and the following statements in the space:
> > 
> >   REPLACE{1, 10}
> >   REPLACE{1, 20}
> > 
> > Then reading {10} from the secondary index will return REPLACE{1, 10}, but
> > lookup of {1} in the primary index will return REPLACE{1, 20} which
> > doesn't match REPLACE{1, 10} read from the secondary index hence the
> > latter was overwritten and should be skipped.
> > 
> > The problem is in the example above we don't want to track key {1} in
> > the primary index before lookup, because we don't actually read its
> > value. So for the check to work correctly, we need the point lookup to
> > guarantee that the returned tuple is always the newest one. It's fairly
> > easy to do - we just need to rescan the memory level after yielding on
> > disk if its version changed.
> 
> Thank you for the explanation. I haven't read the patch itself
> yet. But aren't you complicating things more than necessary? All
> we need to do when looking up a match in the primary index is to
> compare the match LSN and the secondary index tuple LSN. If there
> is a mismatch, then we need to skip the secondary key tuple: it's
> garbage. The mismatch does not need to take into account new
> tuples which appeared during yield, since a mismatch can not
> appear during yield.

Detecting a mismatch by LSN alone is complicated by prepared and txn
statements, but even if we put those aside, there's an optimization in
the write iterator that excludes a statement from the output if it
doesn't modify key parts - see

  https://github.com/tarantool/tarantool/blob/f64f46199e19542fa60eede939d62cd115abb83a/src/box/vy_write_iterator.c#L674

This optimization makes detection by LSN impossible.
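
A minimal sketch of the difference between the two checks, under stated
assumptions. It is not vinyl code: toy_stmt, toy_stale_by_lsn,
toy_stale_by_value and toy_check are hypothetical names. The scenario
assumes the secondary index still returns an older statement for the key
while the primary index holds a newer one, which is one reading of the
effect of the write iterator optimization linked above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy statement: primary key part, secondary key part, LSN. */
struct toy_stmt {
	int pk;
	int sk;
	long lsn;
};

/* LSN-based check: any LSN difference is treated as "overwritten". */
static bool
toy_stale_by_lsn(const struct toy_stmt *from_sk,
		 const struct toy_stmt *from_pk)
{
	return from_sk->lsn != from_pk->lsn;
}

/* Value-based check: stale only if the indexed parts differ. */
static bool
toy_stale_by_value(const struct toy_stmt *from_sk,
		   const struct toy_stmt *from_pk)
{
	return from_sk->pk != from_pk->pk || from_sk->sk != from_pk->sk;
}

/*
 * {1, 10} was committed at lsn 1, then REPLACE{1, new_sk} at lsn 2.
 * The secondary index lookup still returns the lsn 1 statement.
 * Return true if the chosen check calls that statement stale.
 */
static bool
toy_check(int new_sk, bool use_lsn)
{
	struct toy_stmt from_sk = { 1, 10, 1 };
	struct toy_stmt from_pk = { 1, new_sk, 2 };
	return use_lsn ? toy_stale_by_lsn(&from_sk, &from_pk) :
			 toy_stale_by_value(&from_sk, &from_pk);
}
```

With new_sk = 20 both checks report the secondary statement as stale. With
new_sk = 10 (the REPLACE did not touch the secondary key part) only the
value check accepts the statement, while the LSN check rejects a tuple that
is still valid.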

Anyway, this particular patch is needed no matter if we detect mismatch
by LSN or by value. Example:

  Let primary index be over part 1 and secondary index be over part 2.
  Let the following statement be committed to both indexes and written
  to disk:

  REPLACE{1, 10, lsn = 123}

  Now let us consider the following race condition:

  Fiber 1                               Fiber 2
  -------                               -------
  look up {10} in the secondary index
  get REPLACE{1, 10, lsn = 123}
  look up {1} in the primary index to check for mismatch
  yields on disk read

                                        commits REPLACE{1, 20, lsn = 456}

  ( skips the new statement, because point
    lookup doesn't rescan the memory level )
  gets REPLACE{1, 10, lsn = 123}

  LSNs are equal, values are equal too,
  hence no mismatch, return to the user

This behavior would be incorrect, because the transaction wouldn't
be sent to a read view in this case since secondary key {10} is not
modified.

We could track primary key {1} before the lookup to make sure the
transaction is sent to a read view in such a case, but that wouldn't be
quite right: if there was no {1} in the primary index, we would track
a value we didn't actually read.

Hope this explains the problem I'm coping with here.
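
The race above can be modelled with a toy single-key LSM. This is a minimal
sketch, not the patch's implementation: all toy_* names are hypothetical,
the "yield" is simulated by committing from inside the disk read, and the
memory level holds at most one statement. It only illustrates the idea of
remembering the memory level version before the disk read and rescanning
memory if the version changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy statement: primary key part, secondary key part, LSN. */
struct toy_stmt {
	int pk;
	int sk;
	long lsn;
};

/* Toy LSM tree for a single key: one memory slot over one disk slot. */
struct toy_lsm {
	bool mem_has_stmt;
	struct toy_stmt mem_stmt;
	unsigned mem_version;	/* bumped on every commit to memory */
	struct toy_stmt disk_stmt;
};

/* Commit a statement to the memory level. */
static void
toy_commit(struct toy_lsm *lsm, struct toy_stmt stmt)
{
	lsm->mem_stmt = stmt;
	lsm->mem_has_stmt = true;
	lsm->mem_version++;
}

/*
 * Read the disk level. If concurrent_commit is set, another "fiber"
 * commits REPLACE{1, 20, lsn = 456} while this read is in progress.
 */
static struct toy_stmt
toy_disk_read(struct toy_lsm *lsm, bool concurrent_commit)
{
	struct toy_stmt found = lsm->disk_stmt;
	if (concurrent_commit)
		toy_commit(lsm, (struct toy_stmt){ 1, 20, 456 });
	return found;
}

/* Point lookup; with rescan set, memory is rechecked after the read. */
static struct toy_stmt
toy_point_lookup(struct toy_lsm *lsm, bool rescan, bool concurrent_commit)
{
	if (lsm->mem_has_stmt)
		return lsm->mem_stmt;
	unsigned version = lsm->mem_version;
	struct toy_stmt found = toy_disk_read(lsm, concurrent_commit);
	if (rescan && lsm->mem_version != version && lsm->mem_has_stmt)
		return lsm->mem_stmt;
	return found;
}

/* Run the race from the diagram; return the secondary part we got. */
static int
toy_race(bool rescan)
{
	struct toy_lsm lsm = { .disk_stmt = { 1, 10, 123 } };
	struct toy_stmt result = toy_point_lookup(&lsm, rescan, true);
	assert(lsm.mem_version == 1);	/* exactly one concurrent commit */
	return result.sk;
}
```

Without the rescan, toy_race() returns the stale secondary part 10 read
from disk; with the rescan it returns 20, the value committed during the
simulated yield.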

* Re: [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-10 12:19       ` Vladimir Davydov
@ 2018-07-10 18:39         ` Konstantin Osipov
  2018-07-11  7:57           ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-10 18:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/10 15:22]:
> On Tue, Jul 10, 2018 at 03:15:27PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > > Since in presence of secondary indexes we read the primary index when
> > > processing a REPLACE request anyway, we turn it into INSERT if no tuple
> > > matching the new tuple is found so that INSERT+DELETE gets annihilated
> > > on compaction.
> > > 
> > > However, in the scope of #2129 we are planning to optimize the read out
> > > so that this transformation won't be possible anymore. So let's remove
> > > it now.
> > 
> > Ugh. What if we deal with a space which has a unique secondary
> > key, so optimization B is not applicable. You removed optimization
> > A for all spaces. 
> 
> The optimization works even if secondary indexes are unique.

Only for deletes. But not for replace/upsert.

I thought you had the opinion that this optimization is
controversial - so ideally it should be optional. What do you
think now? Or do you generally think that it's so controversial it's
not worth it, and if we do it, we should not make it optional?

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

* Re: [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-10 18:39         ` Konstantin Osipov
@ 2018-07-11  7:57           ` Vladimir Davydov
  2018-07-11 10:25             ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-11  7:57 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 10, 2018 at 09:39:09PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/10 15:22]:
> > On Tue, Jul 10, 2018 at 03:15:27PM +0300, Konstantin Osipov wrote:
> > > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > > > Since in presence of secondary indexes we read the primary index when
> > > > processing a REPLACE request anyway, we turn it into INSERT if no tuple
> > > > matching the new tuple is found so that INSERT+DELETE gets annihilated
> > > > on compaction.
> > > > 
> > > > However, in the scope of #2129 we are planning to optimize the read out
> > > > so that this transformation won't be possible anymore. So let's remove
> > > > it now.
> > > 
> > > Ugh. What if we deal with a space which has a unique secondary
> > > key, so optimization B is not applicable. You removed optimization
> > > A for all spaces. 
> > 
> > The optimization works even if secondary indexes are unique.
> 
> Only for deletes. But not for replace/upsert.

It works for REPLACE as well.

> 
> I thought you had the opinion that this optimization is
> controversial - so ideally it should be optional. What do you
> think now? Or you generally think that it's so controversial it's
> not worth it, and if we do, we should not make it optional?

I made it mandatory for the sake of testing. We might want to make it
optional one day - it isn't going to be a problem.

* Re: [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request
  2018-07-11  7:57           ` Vladimir Davydov
@ 2018-07-11 10:25             ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-11 10:25 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Jul 11, 2018 at 10:57:36AM +0300, Vladimir Davydov wrote:
> On Tue, Jul 10, 2018 at 09:39:09PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/10 15:22]:
> > > On Tue, Jul 10, 2018 at 03:15:27PM +0300, Konstantin Osipov wrote:
> > > > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > > > > Since in presence of secondary indexes we read the primary index when
> > > > > processing a REPLACE request anyway, we turn it into INSERT if no tuple
> > > > > matching the new tuple is found so that INSERT+DELETE gets annihilated
> > > > > on compaction.
> > > > > 
> > > > > However, in the scope of #2129 we are planning to optimize the read out
> > > > > so that this transformation won't be possible anymore. So let's remove
> > > > > it now.
> > > > 
> > > > Ugh. What if we deal with a space which has a unique secondary
> > > > key, so optimization B is not applicable. You removed optimization
> > > > A for all spaces. 
> > > 
> > > The optimization works even if secondary indexes are unique.
> > 
> > Only for deletes. But not for replace/upsert.
> 
> It works for REPLACE as well.
> 
> > 
> > I thought you had the opinion that this optimization is
> > controversial - so ideally it should be optional. What do you
> > think now? Or you generally think that it's so controversial it's
> > not worth it, and if we do, we should not make it optional?
> 
> I made it mandatory for the sake of testing. We might want to make it
> optional one day - it isn't going to be a problem.

I removed this patch from the series and updated the branch. The only
patch that required manual rebase was

  [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl

Posting the new version here:

From e0782b9d1b1d2bfc68aa2d460603281caaaaca17 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Sat, 23 Jun 2018 21:47:25 +0300
Subject: [PATCH] vinyl: fold vy_replace_one and vy_replace_impl

There's no point in separating REPLACE path between the cases when
the space has secondary indexes and when it only has the primary
index, because they are quite similar. Let's fold vy_replace_one
and vy_replace_impl into vy_replace to remove code duplication.

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index c95d05c1..44f8afaa 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1540,160 +1540,6 @@ vy_insert_secondary(struct vy_env *env, struct vy_tx *tx, struct space *space,
 }
 
 /**
- * Execute REPLACE in a space with a single index, possibly with
- * lookup for an old tuple if the space has at least one
- * on_replace trigger.
- * @param env     Vinyl environment.
- * @param tx      Current transaction.
- * @param space   Space in which replace.
- * @param request Request with the tuple data.
- * @param stmt    Statement for triggers is filled with old
- *                statement.
- *
- * @retval  0 Success.
- * @retval -1 Memory error OR duplicate key error OR the primary
- *            index is not found OR a tuple reference increment
- *            error.
- */
-static inline int
-vy_replace_one(struct vy_env *env, struct vy_tx *tx, struct space *space,
-	       struct request *request, struct txn_stmt *stmt)
-{
-	(void)env;
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	struct vy_lsm *pk = vy_lsm(space->index[0]);
-	assert(pk->index_id == 0);
-	if (tuple_validate_raw(pk->mem_format, request->tuple))
-		return -1;
-	struct tuple *new_tuple =
-		vy_stmt_new_replace(pk->mem_format, request->tuple,
-				    request->tuple_end);
-	if (new_tuple == NULL)
-		return -1;
-	/**
-	 * If the space has triggers, then we need to fetch the
-	 * old tuple to pass it to the trigger.
-	 */
-	if (stmt != NULL && !rlist_empty(&space->on_replace)) {
-		if (vy_get(pk, tx, vy_tx_read_view(tx),
-			   new_tuple, &stmt->old_tuple) != 0)
-			goto error_unref;
-	}
-	if (vy_tx_set(tx, pk, new_tuple))
-		goto error_unref;
-
-	if (stmt != NULL)
-		stmt->new_tuple = new_tuple;
-	else
-		tuple_unref(new_tuple);
-	return 0;
-
-error_unref:
-	tuple_unref(new_tuple);
-	return -1;
-}
-
-/**
- * Execute REPLACE in a space with multiple indexes and lookup for
- * an old tuple, that should has been set in \p stmt->old_tuple if
- * the space has at least one on_replace trigger.
- * @param env     Vinyl environment.
- * @param tx      Current transaction.
- * @param space   Vinyl space.
- * @param request Request with the tuple data.
- * @param stmt    Statement for triggers filled with old
- *                statement.
- *
- * @retval  0 Success
- * @retval -1 Memory error OR duplicate key error OR the primary
- *            index is not found OR a tuple reference increment
- *            error.
- */
-static inline int
-vy_replace_impl(struct vy_env *env, struct vy_tx *tx, struct space *space,
-		struct request *request, struct txn_stmt *stmt)
-{
-	assert(tx != NULL && tx->state == VINYL_TX_READY);
-	struct tuple *old_stmt = NULL;
-	struct tuple *new_stmt = NULL;
-	struct tuple *delete = NULL;
-	struct vy_lsm *pk = vy_lsm_find(space, 0);
-	if (pk == NULL) /* space has no primary key */
-		return -1;
-	/* Primary key is dumped last. */
-	assert(!vy_is_committed_one(env, pk));
-	assert(pk->index_id == 0);
-	if (tuple_validate_raw(pk->mem_format, request->tuple))
-		return -1;
-	new_stmt = vy_stmt_new_replace(pk->mem_format, request->tuple,
-				       request->tuple_end);
-	if (new_stmt == NULL)
-		return -1;
-
-	/* Get full tuple from the primary index. */
-	if (vy_get(pk, tx, vy_tx_read_view(tx), new_stmt, &old_stmt) != 0)
-		goto error;
-
-	if (old_stmt == NULL) {
-		/*
-		 * We can turn REPLACE into INSERT if the new key
-		 * does not have history.
-		 */
-		vy_stmt_set_type(new_stmt, IPROTO_INSERT);
-	}
-
-	/*
-	 * Replace in the primary index without explicit deletion
-	 * of the old tuple.
-	 */
-	if (vy_tx_set(tx, pk, new_stmt) != 0)
-		goto error;
-
-	if (space->index_count > 1 && old_stmt != NULL) {
-		delete = vy_stmt_new_surrogate_delete(pk->mem_format, old_stmt);
-		if (delete == NULL)
-			goto error;
-	}
-
-	/* Update secondary keys, avoid duplicates. */
-	for (uint32_t iid = 1; iid < space->index_count; ++iid) {
-		struct vy_lsm *lsm = vy_lsm(space->index[iid]);
-		if (vy_is_committed_one(env, lsm))
-			continue;
-		/*
-		 * Delete goes first, so if old and new keys
-		 * fully match, there is no look up beyond the
-		 * transaction index.
-		 */
-		if (old_stmt != NULL) {
-			if (vy_tx_set(tx, lsm, delete) != 0)
-				goto error;
-		}
-		if (vy_insert_secondary(env, tx, space, lsm, new_stmt) != 0)
-			goto error;
-	}
-	if (delete != NULL)
-		tuple_unref(delete);
-	/*
-	 * The old tuple is used if there is an on_replace
-	 * trigger.
-	 */
-	if (stmt != NULL) {
-		stmt->new_tuple = new_stmt;
-		stmt->old_tuple = old_stmt;
-	}
-	return 0;
-error:
-	if (delete != NULL)
-		tuple_unref(delete);
-	if (old_stmt != NULL)
-		tuple_unref(old_stmt);
-	if (new_stmt != NULL)
-		tuple_unref(new_stmt);
-	return -1;
-}
-
-/**
  * Check that the key can be used for search in a unique index
  * LSM tree.
  * @param  lsm        LSM tree for checking.
@@ -2316,18 +2162,86 @@ static int
 vy_replace(struct vy_env *env, struct vy_tx *tx, struct txn_stmt *stmt,
 	   struct space *space, struct request *request)
 {
+	assert(tx != NULL && tx->state == VINYL_TX_READY);
 	if (vy_is_committed(env, space))
 		return 0;
 	if (request->type == IPROTO_INSERT)
 		return vy_insert(env, tx, stmt, space, request);
 
-	if (space->index_count == 1) {
-		/* Replace in a space with a single index. */
-		return vy_replace_one(env, tx, space, request, stmt);
-	} else {
-		/* Replace in a space with secondary indexes. */
-		return vy_replace_impl(env, tx, space, request, stmt);
+	struct vy_lsm *pk = vy_lsm_find(space, 0);
+	if (pk == NULL)
+		return -1;
+	/* Primary key is dumped last. */
+	assert(!vy_is_committed_one(env, pk));
+
+	/* Validate and create a statement for the new tuple. */
+	if (tuple_validate_raw(pk->mem_format, request->tuple))
+		return -1;
+	stmt->new_tuple = vy_stmt_new_replace(pk->mem_format, request->tuple,
+					      request->tuple_end);
+	if (stmt->new_tuple == NULL)
+		return -1;
+	/*
+	 * Get the overwritten tuple from the primary index if
+	 * the space has on_replace triggers, in which case we
+	 * need to pass the old tuple to trigger callbacks, or
+	 * if the space has secondary indexes and so we need
+	 * the old tuple to delete it from them.
+	 */
+	if (space->index_count > 1 || !rlist_empty(&space->on_replace)) {
+		if (vy_get(pk, tx, vy_tx_read_view(tx),
+			   stmt->new_tuple, &stmt->old_tuple) != 0)
+			return -1;
+		if (stmt->old_tuple == NULL) {
+			/*
+			 * We can turn REPLACE into INSERT if the
+			 * new key does not have history.
+			 */
+			vy_stmt_set_type(stmt->new_tuple, IPROTO_INSERT);
+		}
+	}
+	/*
+	 * Replace in the primary index without explicit deletion
+	 * of the old tuple.
+	 */
+	if (vy_tx_set(tx, pk, stmt->new_tuple) != 0)
+		return -1;
+	if (space->index_count == 1)
+		return 0;
+	/*
+	 * Replace in secondary indexes with explicit deletion
+	 * of the old tuple, if any.
+	 */
+	int rc = 0;
+	struct tuple *delete = NULL;
+	if (stmt->old_tuple != NULL) {
+		delete = vy_stmt_new_surrogate_delete(pk->mem_format,
+						      stmt->old_tuple);
+		if (delete == NULL)
+			return -1;
+	}
+	for (uint32_t i = 1; i < space->index_count; i++) {
+		struct vy_lsm *lsm = vy_lsm(space->index[i]);
+		if (vy_is_committed_one(env, lsm))
+			continue;
+		/*
+		 * DELETE goes first, so if old and new keys
+		 * fully match, there is no look up beyond the
+		 * transaction write set.
+		 */
+		if (delete != NULL) {
+			rc = vy_tx_set(tx, lsm, delete);
+			if (rc != 0)
+				break;
+		}
+		rc = vy_insert_secondary(env, tx, space, lsm,
+					 stmt->new_tuple);
+		if (rc != 0)
+			break;
 	}
+	if (delete != NULL)
+		tuple_unref(delete);
+	return rc;
 }
 
 static int

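To make the folded control flow easier to review, here is a minimal sketch of the decisions vy_replace() takes after this patch. It is a toy model, not vinyl code: struct sketch_replace_plan and sketch_plan_replace() are hypothetical names, and the model ignores error paths and the vy_is_committed_one() skip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of the work one REPLACE implies; not a vinyl type. */
struct sketch_replace_plan {
	/* Look up the overwritten tuple in the primary index. */
	bool read_old_tuple;
	/* The key has no history, so REPLACE may become INSERT. */
	bool turn_into_insert;
	/* Number of DELETEs written to secondary indexes. */
	uint32_t secondary_deletes;
	/* Number of new statements written to secondary indexes. */
	uint32_t secondary_inserts;
};

static inline struct sketch_replace_plan
sketch_plan_replace(uint32_t index_count, bool has_on_replace,
		    bool old_tuple_exists)
{
	assert(index_count >= 1);
	struct sketch_replace_plan plan = {false, false, 0, 0};
	/*
	 * The old tuple is needed either for trigger callbacks or
	 * for deleting it from secondary indexes.
	 */
	plan.read_old_tuple = index_count > 1 || has_on_replace;
	bool found = plan.read_old_tuple && old_tuple_exists;
	plan.turn_into_insert = plan.read_old_tuple && !found;
	/* Every secondary index gets the new tuple ... */
	plan.secondary_inserts = index_count - 1;
	/* ... and a DELETE first, but only if something is overwritten. */
	plan.secondary_deletes = found ? index_count - 1 : 0;
	return plan;
}
```

With a single index and no triggers the plan involves no read at all, which is the case vy_replace_one() used to cover.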
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version
  2018-07-10 16:43       ` Vladimir Davydov
@ 2018-07-11 16:33         ` Vladimir Davydov
  2018-07-31 19:17           ` Konstantin Osipov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-11 16:33 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

BTW this patch makes the behavior of vy_point_lookup() consistent with
vy_read_iterator: both iterators now return the newest tuple version.

Also, it allows us to simplify vy_squash_process() - see the patch below
(I pushed it to the branch; you may want to cherry-pick it as well):

From c001e3f1a320bb804f49ca19ec540fee094dacc3 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Wed, 11 Jul 2018 19:14:24 +0300
Subject: [PATCH] vinyl: simplify vy_squash_process

Since vy_point_lookup() now guarantees that it returns the newest
tuple version, we can remove the code that squashes UPSERTs from
vy_squash_process().

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index a9603560..2d1a6fc0 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3585,11 +3585,6 @@ vy_squash_process(struct vy_squash *squash)
 
 	struct vy_lsm *lsm = squash->lsm;
 	struct vy_env *env = squash->env;
-	/*
-	 * vy_apply_upsert() is used for primary key only,
-	 * so this is the same as lsm->key_def
-	 */
-	struct key_def *def = lsm->cmp_def;
 
 	/* Upserts enabled only in the primary index LSM tree. */
 	assert(lsm->index_id == 0);
@@ -3607,8 +3602,10 @@ vy_squash_process(struct vy_squash *squash)
 
 	/*
 	 * While we were reading on-disk runs, new statements could
-	 * have been inserted into the in-memory tree. Apply them to
-	 * the result.
+	 * have been prepared for the squashed key. We mustn't apply
+	 * them, because they may be rolled back, but we must adjust
+	 * their n_upserts counter so that they will get squashed by
+	 * vy_lsm_commit_upsert().
 	 */
 	struct vy_mem *mem = lsm->mem;
 	struct tree_mem_key tree_key = {
@@ -3625,108 +3622,20 @@ vy_squash_process(struct vy_squash *squash)
 		tuple_unref(result);
 		return 0;
 	}
-	/**
-	 * Algorithm of the squashing.
-	 * Assume, during building the non-UPSERT statement
-	 * 'result' in the mem some new UPSERTs were inserted, and
-	 * some of them were commited, while the other were just
-	 * prepared. And lets UPSERT_THRESHOLD to be equal to 3,
-	 * for example.
-	 *                    Mem
-	 *    -------------------------------------+
-	 *    UPSERT, lsn = 1, n_ups = 0           |
-	 *    UPSERT, lsn = 2, n_ups = 1           | Commited
-	 *    UPSERT, lsn = 3, n_ups = 2           |
-	 *    -------------------------------------+
-	 *    UPSERT, lsn = MAX,     n_ups = 3     |
-	 *    UPSERT, lsn = MAX + 1, n_ups = 4     | Prepared
-	 *    UPSERT, lsn = MAX + 2, n_ups = 5     |
-	 *    -------------------------------------+
-	 * In such a case the UPSERT statements with
-	 * lsns = {1, 2, 3} are squashed. But now the n_upsert
-	 * values in the prepared statements are not correct.
-	 * If we will not update values, then the
-	 * vy_lsm_commit_upsert will not be able to squash them.
-	 *
-	 * So after squashing it is necessary to update n_upsert
-	 * value in the prepared statements:
-	 *                    Mem
-	 *    -------------------------------------+
-	 *    UPSERT, lsn = 1, n_ups = 0           |
-	 *    UPSERT, lsn = 2, n_ups = 1           | Commited
-	 *    REPLACE, lsn = 3                     |
-	 *    -------------------------------------+
-	 *    UPSERT, lsn = MAX,     n_ups = 0 !!! |
-	 *    UPSERT, lsn = MAX + 1, n_ups = 1 !!! | Prepared
-	 *    UPSERT, lsn = MAX + 2, n_ups = 2 !!! |
-	 *    -------------------------------------+
-	 */
 	vy_mem_tree_iterator_prev(&mem->tree, &mem_itr);
-	const struct tuple *mem_stmt;
-	int64_t stmt_lsn;
-	/*
-	 * According to the described algorithm, squash the
-	 * commited UPSERTs at first.
-	 */
+	uint8_t n_upserts = 0;
 	while (!vy_mem_tree_iterator_is_invalid(&mem_itr)) {
+		const struct tuple *mem_stmt;
 		mem_stmt = *vy_mem_tree_iterator_get_elem(&mem->tree, &mem_itr);
-		stmt_lsn = vy_stmt_lsn(mem_stmt);
-		if (vy_tuple_compare(result, mem_stmt, def) != 0)
-			break;
-		/**
-		 * Leave alone prepared statements; they will be handled
-		 * in vy_range_commit_stmt.
-		 */
-		if (stmt_lsn >= MAX_LSN)
+		if (vy_tuple_compare(result, mem_stmt, lsm->cmp_def) != 0 ||
+		    vy_stmt_type(mem_stmt) != IPROTO_UPSERT)
 			break;
-		if (vy_stmt_type(mem_stmt) != IPROTO_UPSERT) {
-			/**
-			 * Somebody inserted non-upsert statement,
-			 * squashing is useless.
-			 */
-			tuple_unref(result);
-			return 0;
-		}
-		assert(lsm->index_id == 0);
-		struct tuple *applied = vy_apply_upsert(mem_stmt, result, def,
-							mem->format, true);
-		lsm->stat.upsert.applied++;
-		tuple_unref(result);
-		if (applied == NULL)
-			return -1;
-		result = applied;
-		/**
-		 * In normal cases we get a result with the same lsn as
-		 * in mem_stmt.
-		 * But if there are buggy upserts that do wrong things,
-		 * they are ignored and the result has lower lsn.
-		 * We should fix the lsn in any case to replace
-		 * exactly mem_stmt in general and the buggy upsert
-		 * in particular.
-		 */
-		vy_stmt_set_lsn(result, stmt_lsn);
+		assert(vy_stmt_lsn(mem_stmt) >= MAX_LSN);
+		vy_stmt_set_n_upserts((struct tuple *)mem_stmt, n_upserts);
+		if (n_upserts <= VY_UPSERT_THRESHOLD)
+			++n_upserts;
 		vy_mem_tree_iterator_prev(&mem->tree, &mem_itr);
 	}
-	/*
-	 * The second step of the algorithm above is updating of
-	 * n_upsert values of the prepared UPSERTs.
-	 */
-	if (stmt_lsn >= MAX_LSN) {
-		uint8_t n_upserts = 0;
-		while (!vy_mem_tree_iterator_is_invalid(&mem_itr)) {
-			mem_stmt = *vy_mem_tree_iterator_get_elem(&mem->tree,
-								  &mem_itr);
-			if (vy_tuple_compare(result, mem_stmt, def) != 0 ||
-			    vy_stmt_type(mem_stmt) != IPROTO_UPSERT)
-				break;
-			assert(vy_stmt_lsn(mem_stmt) >= MAX_LSN);
-			vy_stmt_set_n_upserts((struct tuple *)mem_stmt,
-					      n_upserts);
-			if (n_upserts <= VY_UPSERT_THRESHOLD)
-				++n_upserts;
-			vy_mem_tree_iterator_prev(&mem->tree, &mem_itr);
-		}
-	}
 
 	lsm->stat.upsert.squashed++;
 

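The renumbering loop that remains in vy_squash_process() can be illustrated in isolation. This is a sketch only: struct sketch_upsert and sketch_renumber() are hypothetical stand-ins, the array replaces the in-memory tree iterator, and SKETCH_UPSERT_THRESHOLD is set to 3 purely for illustration (the real VY_UPSERT_THRESHOLD may differ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative value only; the real VY_UPSERT_THRESHOLD may differ. */
enum { SKETCH_UPSERT_THRESHOLD = 3 };

/* Hypothetical stand-in for a prepared UPSERT statement in the mem. */
struct sketch_upsert {
	int64_t lsn;
	uint8_t n_upserts;
};

/*
 * Renumber prepared UPSERTs in the order the loop visits them:
 * counters restart from 0 and saturate one step above the
 * threshold, so that a later commit can still detect that the
 * threshold has been exceeded.
 */
static inline void
sketch_renumber(struct sketch_upsert *stmts, size_t count)
{
	uint8_t n_upserts = 0;
	for (size_t i = 0; i < count; i++) {
		stmts[i].n_upserts = n_upserts;
		if (n_upserts <= SKETCH_UPSERT_THRESHOLD)
			++n_upserts;
	}
}
```

The statements themselves are not applied to the squashed result, because a prepared statement may still be rolled back; only the counters change.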
^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE
  2018-07-08 16:48   ` [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
@ 2018-07-13 10:53     ` Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
                         ` (2 more replies)
  0 siblings, 3 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-13 10:53 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

During our verbal discussion with Kostja, we agreed that it makes sense
to generate deferred DELETEs when the transaction is committed in case
the overwritten tuple is present in memory. This should decrease the
overall number of deferred DELETEs and thus speed up secondary index
lookups.

The patches below implement this feature. They reduce the time it takes
to run vinyl/select_consistency.test.lua on my laptop from 32 seconds
down to only 8 seconds (it takes about 7 seconds on vanilla).

I pushed them to the same branch as the rest of the patch set.

https://github.com/tarantool/tarantool/issues/2129
https://github.com/tarantool/tarantool/tree/dv/gh-2129-vy-eliminate-read-on-replace-delete

Vladimir Davydov (3):
  stailq: add stailq_insert function
  vinyl: link all indexes of the same space
  vinyl: generate deferred DELETEs on tx commit

 src/box/vinyl.c           |  1 +
 src/box/vy_lsm.c          |  7 +++-
 src/box/vy_lsm.h          |  2 +
 src/box/vy_point_lookup.c | 32 ++++++++++++++++
 src/box/vy_point_lookup.h | 18 +++++++++
 src/box/vy_tx.c           | 97 +++++++++++++++++++++++++++++++++++++++++++++++
 src/lib/salad/stailq.h    | 19 ++++++++++
 test/vinyl/quota.result   |  2 +-
 8 files changed, 176 insertions(+), 2 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 1/3] stailq: add stailq_insert function
  2018-07-13 10:53     ` Vladimir Davydov
@ 2018-07-13 10:53       ` Vladimir Davydov
  2018-07-15  7:02         ` Konstantin Osipov
  2018-07-17 10:18         ` Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 2/3] vinyl: link all indexes of the same space Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 3/3] vinyl: generate deferred DELETEs on tx commit Vladimir Davydov
  2 siblings, 2 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-13 10:53 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The new function inserts a new item into the list at the specified
position.
---
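For readers without the tree at hand, here is a self-contained sketch of the insertion. The types are stand-ins modelled on what the diff implies (a list head with a `last` pointer to the final `next` field); the sketch_ names are hypothetical and this is not the real stailq.h.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types modelled on the patch; not the real stailq.h. */
struct stailq_entry {
	struct stailq_entry *next;
};

struct stailq {
	struct stailq_entry *first;
	/* Address of the last entry's 'next' field (or of 'first'). */
	struct stailq_entry **last;
};

static inline void
sketch_create(struct stailq *head)
{
	head->first = NULL;
	head->last = &head->first;
}

static inline void
sketch_add_tail(struct stailq *head, struct stailq_entry *item)
{
	item->next = NULL;
	*head->last = item;
	head->last = &item->next;
}

/* Same body as stailq_insert() in the patch: link item after prev. */
static inline void
sketch_insert(struct stailq *head, struct stailq_entry *item,
	      struct stailq_entry *prev)
{
	item->next = prev->next;
	prev->next = item;
	/* Inserting after the last entry moves the tail pointer. */
	if (item->next == NULL)
		head->last = &item->next;
}

/* Helper for checks: number of entries reachable from the head. */
static inline int
sketch_length(const struct stailq *head)
{
	int n = 0;
	const struct stailq_entry *e;
	for (e = head->first; e != NULL; e = e->next)
		n++;
	return n;
}
```

Note that prev must already be on the list; inserting at the very head is still the job of the existing add function.
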
 src/lib/salad/stailq.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/lib/salad/stailq.h b/src/lib/salad/stailq.h
index 0f51ddb9..0d53369e 100644
--- a/src/lib/salad/stailq.h
+++ b/src/lib/salad/stailq.h
@@ -96,6 +96,19 @@ stailq_add_tail(struct stailq *head, struct stailq_entry *item)
 }
 
 /**
+ * Insert @item into list @head after @prev.
+ */
+inline static void
+stailq_insert(struct stailq *head, struct stailq_entry *item,
+	      struct stailq_entry *prev)
+{
+	item->next = prev->next;
+	prev->next = item;
+	if (item->next == NULL)
+		head->last = &item->next;
+}
+
+/**
  * return first element
  */
 inline static struct stailq_entry *
@@ -234,6 +247,12 @@ stailq_cut_tail(struct stailq *head, struct stailq_entry *last,
 	stailq_add_tail((head), &(item)->member)
 
 /**
+ * insert entry into list
+ */
+#define stailq_insert_entry(head, item, prev, member)			\
+	stailq_insert((head), &(item)->member, &(prev)->member)
+
+/**
  * foreach through list
  */
 #define stailq_foreach(item, head)					\
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 2/3] vinyl: link all indexes of the same space
  2018-07-13 10:53     ` Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
@ 2018-07-13 10:53       ` Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 3/3] vinyl: generate deferred DELETEs on tx commit Vladimir Davydov
  2 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-13 10:53 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

After generating a deferred DELETE, we need to insert it into all
secondary indexes. Let's link all LSM trees of the same space into
a list so that we can iterate over secondary indexes of a space
given the primary index.

Needed for #2129
---
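A minimal sketch of the linking scheme, assuming a circular doubly linked list in place of rlist. All sketch_ names are hypothetical; only the idea (secondary indexes are appended to the ring anchored at the primary index) is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal circular doubly linked list; a stand-in for rlist. */
struct sketch_link {
	struct sketch_link *prev, *next;
};

/* Hypothetical cut-down LSM tree descriptor. */
struct sketch_lsm {
	uint32_t index_id;
	struct sketch_lsm *pk;
	/* Ring of all LSM trees of the same space. */
	struct sketch_link list;
};

static inline void
sketch_link_create(struct sketch_link *l)
{
	l->prev = l->next = l;
}

/* Insert item right before head, i.e. at the tail of head's ring. */
static inline void
sketch_link_add_tail(struct sketch_link *head, struct sketch_link *item)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Mirrors the vy_lsm_new() hunk: join the pk's ring or start one. */
static inline void
sketch_lsm_init(struct sketch_lsm *lsm, uint32_t index_id,
		struct sketch_lsm *pk)
{
	lsm->index_id = index_id;
	lsm->pk = pk;
	if (pk != NULL)
		sketch_link_add_tail(&pk->list, &lsm->list);
	else
		sketch_link_create(&lsm->list);
}

/* Collect ids of all other trees on the ring, starting from the pk. */
static inline size_t
sketch_secondary_ids(struct sketch_lsm *pk, uint32_t *ids, size_t max)
{
	size_t n = 0;
	struct sketch_link *l;
	for (l = pk->list.next; l != &pk->list && n < max; l = l->next) {
		struct sketch_lsm *lsm = (struct sketch_lsm *)
			((char *)l - offsetof(struct sketch_lsm, list));
		ids[n++] = lsm->index_id;
	}
	return n;
}
```

Walking the ring from the primary index visits exactly the secondary indexes, which is what the deferred DELETE code needs.
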
 src/box/vinyl.c  | 1 +
 src/box/vy_lsm.c | 7 ++++++-
 src/box/vy_lsm.h | 2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 2d1a6fc0..ba875040 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -1148,6 +1148,7 @@ vinyl_space_swap_index(struct space *old_space, struct space *new_space,
 	SWAP(old_lsm->opts, new_lsm->opts);
 	key_def_swap(old_lsm->key_def, new_lsm->key_def);
 	key_def_swap(old_lsm->cmp_def, new_lsm->cmp_def);
+	rlist_swap(&old_lsm->list, &new_lsm->list);
 
 	/* Update pointer to the primary key. */
 	vy_lsm_update_pk(old_lsm, vy_lsm(old_space->index_map[0]));
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index cb3c436f..2e3e7947 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -194,8 +194,12 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
 	vy_range_heap_create(&lsm->range_heap);
 	rlist_create(&lsm->runs);
 	lsm->pk = pk;
-	if (pk != NULL)
+	if (pk != NULL) {
 		vy_lsm_ref(pk);
+		rlist_add_tail(&pk->list, &lsm->list);
+	} else {
+		rlist_create(&lsm->list);
+	}
 	lsm->mem_format = format;
 	tuple_format_ref(lsm->mem_format);
 	lsm->in_dump.pos = UINT32_MAX;
@@ -253,6 +257,7 @@ vy_lsm_delete(struct vy_lsm *lsm)
 
 	lsm->env->lsm_count--;
 
+	rlist_del(&lsm->list);
 	if (lsm->pk != NULL)
 		vy_lsm_unref(lsm->pk);
 
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index f0b7ec9c..9849e455 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -204,6 +204,8 @@ struct vy_lsm {
 	 * by each secondary index.
 	 */
 	struct vy_lsm *pk;
+	/** List of all LSM trees of the same space. */
+	struct rlist list;
 	/** LSM tree statistics. */
 	struct vy_lsm_stat stat;
 	/**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* [PATCH 3/3] vinyl: generate deferred DELETEs on tx commit
  2018-07-13 10:53     ` Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
  2018-07-13 10:53       ` [PATCH 2/3] vinyl: link all indexes of the same space Vladimir Davydov
@ 2018-07-13 10:53       ` Vladimir Davydov
  2 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-13 10:53 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We don't need to postpone generation of secondary index DELETEs until
compaction in case the overwritten tuple is present in memory or in
cache. Instead we can produce the DELETEs when the transaction is
committed. This should significantly decrease the number of deferred
DELETEs and hence speed up lookups in secondary indexes.

Follow-up #2129
---
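The commit-time decision boils down to three outcomes of the in-memory lookup. The sketch below models only that decision; the enum and function names are hypothetical and it does not touch the transaction log or statement counters.

```c
#include <assert.h>
#include <stdbool.h>

/* What the in-memory lookup found for the overwritten key. */
enum sketch_found {
	SKETCH_FOUND_NOTHING,	/* no terminal statement in memory */
	SKETCH_FOUND_DELETE,	/* the key is already deleted */
	SKETCH_FOUND_TUPLE,	/* a live tuple is overwritten */
};

enum sketch_action {
	SKETCH_DEFER_TO_COMPACTION,
	SKETCH_NOTHING_TO_DELETE,
	SKETCH_DELETE_NOW,
};

struct sketch_outcome {
	enum sketch_action action;
	/* Whether the deferred-DELETE flag stays set on the statement. */
	bool keep_deferred_flag;
};

static inline struct sketch_outcome
sketch_handle_deferred_delete(enum sketch_found found)
{
	struct sketch_outcome out;
	switch (found) {
	case SKETCH_FOUND_NOTHING:
		/* The old tuple may still be on disk: leave it to compaction. */
		out.action = SKETCH_DEFER_TO_COMPACTION;
		out.keep_deferred_flag = true;
		break;
	case SKETCH_FOUND_DELETE:
		/* Terminal statement found, but there is nothing to delete. */
		out.action = SKETCH_NOTHING_TO_DELETE;
		out.keep_deferred_flag = false;
		break;
	default:
		/* Generate DELETEs for secondary indexes right away. */
		out.action = SKETCH_DELETE_NOW;
		out.keep_deferred_flag = false;
		break;
	}
	return out;
}
```

Only the first outcome leaves work for the compaction task, which is why fewer deferred DELETEs are expected.
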
 src/box/vy_point_lookup.c | 32 ++++++++++++++++
 src/box/vy_point_lookup.h | 18 +++++++++
 src/box/vy_tx.c           | 97 +++++++++++++++++++++++++++++++++++++++++++++++
 test/vinyl/quota.result   |  2 +-
 4 files changed, 148 insertions(+), 1 deletion(-)

diff --git a/src/box/vy_point_lookup.c b/src/box/vy_point_lookup.c
index 5e43340b..7b704b84 100644
--- a/src/box/vy_point_lookup.c
+++ b/src/box/vy_point_lookup.c
@@ -293,3 +293,35 @@ done:
 	}
 	return 0;
 }
+
+int
+vy_point_lookup_mem(struct vy_lsm *lsm, const struct vy_read_view **rv,
+		    struct tuple *key, struct tuple **ret)
+{
+	assert(tuple_field_count(key) >= lsm->cmp_def->part_count);
+
+	int rc;
+	struct vy_history history;
+	vy_history_create(&history, &lsm->env->history_node_pool);
+
+	rc = vy_point_lookup_scan_cache(lsm, rv, key, &history);
+	if (rc != 0 || vy_history_is_terminal(&history))
+		goto done;
+
+	rc = vy_point_lookup_scan_mems(lsm, rv, key, &history);
+	if (rc != 0 || vy_history_is_terminal(&history))
+		goto done;
+
+	*ret = NULL;
+	goto out;
+done:
+	if (rc == 0) {
+		int upserts_applied;
+		rc = vy_history_apply(&history, lsm->cmp_def, lsm->mem_format,
+				      true, &upserts_applied, ret);
+		lsm->stat.upsert.applied += upserts_applied;
+	}
+out:
+	vy_history_cleanup(&history);
+	return rc;
+}
diff --git a/src/box/vy_point_lookup.h b/src/box/vy_point_lookup.h
index 3b7c5a04..6d77ce9c 100644
--- a/src/box/vy_point_lookup.h
+++ b/src/box/vy_point_lookup.h
@@ -71,6 +71,24 @@ vy_point_lookup(struct vy_lsm *lsm, struct vy_tx *tx,
 		const struct vy_read_view **rv,
 		struct tuple *key, struct tuple **ret);
 
+/**
+ * Look up a tuple by key in memory.
+ *
+ * This function works just like vy_point_lookup() except:
+ *
+ * - It only scans in-memory level and cache and hence doesn't yield.
+ * - It doesn't turn DELETE into NULL so it returns NULL if and only
+ *   if no terminal statement matching the key is present in memory
+ *   (there still may be statements stored on disk though).
+ * - It doesn't account the lookup to LSM tree stats (as it never
+ *   descends to lower levels).
+ *
+ * The function returns 0 on success, -1 on memory allocation error.
+ */
+int
+vy_point_lookup_mem(struct vy_lsm *lsm, const struct vy_read_view **rv,
+		    struct tuple *key, struct tuple **ret);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index bfef1ada..1421cb84 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -58,6 +58,7 @@
 #include "vy_history.h"
 #include "vy_read_set.h"
 #include "vy_read_view.h"
+#include "vy_point_lookup.h"
 
 int
 write_set_cmp(struct txv *a, struct txv *b)
@@ -483,6 +484,97 @@ vy_tx_write(struct vy_lsm *lsm, struct vy_mem *mem,
 	return vy_lsm_set(lsm, mem, stmt, region_stmt);
 }
 
+/**
+ * Try to generate a deferred DELETE statement on tx commit.
+ *
+ * This function is supposed to be called for a primary index
+ * statement which was executed without deletion of the overwritten
+ * tuple from secondary indexes. It looks up the overwritten tuple
+ * in memory and, if found, produces the deferred DELETEs and
+ * inserts them into the transaction log.
+ *
+ * Affects @tx->log, @v->stmt.
+ *
+ * Returns 0 on success, -1 on memory allocation error.
+ */
+static int
+vy_tx_handle_deferred_delete(struct vy_tx *tx, struct txv *v)
+{
+	struct vy_lsm *pk = v->lsm;
+	struct tuple *stmt = v->stmt;
+	uint8_t flags = vy_stmt_flags(stmt);
+
+	assert(pk->index_id == 0);
+	assert(flags & VY_STMT_DEFERRED_DELETE);
+
+	/* Look up the tuple overwritten by this statement. */
+	struct tuple *tuple;
+	if (vy_point_lookup_mem(pk, &tx->xm->p_global_read_view,
+				stmt, &tuple) != 0)
+		return -1;
+
+	if (tuple == NULL) {
+		/*
+		 * Nothing's found, but there still may be
+		 * matching statements stored on disk so we
+		 * have to defer generation of DELETE until
+		 * compaction.
+		 */
+		return 0;
+	}
+
+	/*
+	 * If a terminal statement is found, we can produce
+	 * DELETE right away so clear the flag now.
+	 */
+	vy_stmt_set_flags(stmt, flags & ~VY_STMT_DEFERRED_DELETE);
+
+	if (vy_stmt_type(tuple) == IPROTO_DELETE) {
+		/* The tuple's already deleted, nothing to do. */
+		tuple_unref(tuple);
+		return 0;
+	}
+
+	struct tuple *delete_stmt;
+	delete_stmt = vy_stmt_new_surrogate_delete(pk->mem_format, tuple);
+	tuple_unref(tuple);
+	if (delete_stmt == NULL)
+		return -1;
+
+	if (vy_stmt_type(stmt) == IPROTO_DELETE) {
+		/*
+		 * Since primary and secondary indexes of the
+		 * same space share in-memory statements, we
+		 * need to use the new DELETE in the primary
+		 * index, because the original DELETE doesn't
+		 * contain secondary key parts.
+		 */
+		vy_stmt_counter_acct_tuple(&pk->stat.txw.count, delete_stmt);
+		vy_stmt_counter_unacct_tuple(&pk->stat.txw.count, stmt);
+		v->stmt = delete_stmt;
+		tuple_ref(delete_stmt);
+		tuple_unref(stmt);
+	}
+
+	/*
+	 * Make DELETE statements for secondary indexes and
+	 * insert them into the transaction log.
+	 */
+	int rc = 0;
+	struct vy_lsm *lsm;
+	rlist_foreach_entry(lsm, &pk->list, list) {
+		struct txv *delete_txv = txv_new(tx, lsm, delete_stmt);
+		if (delete_txv == NULL) {
+			rc = -1;
+			break;
+		}
+		stailq_insert_entry(&tx->log, delete_txv, v, next_in_log);
+		vy_stmt_counter_acct_tuple(&lsm->stat.txw.count, delete_stmt);
+	}
+	tuple_unref(delete_stmt);
+	return rc;
+}
+
 int
 vy_tx_prepare(struct vy_tx *tx)
 {
@@ -591,6 +683,11 @@ vy_tx_prepare(struct vy_tx *tx)
 			return -1;
 		assert(v->mem != NULL);
 
+		if (lsm->index_id == 0 &&
+		    vy_stmt_flags(v->stmt) & VY_STMT_DEFERRED_DELETE &&
+		    vy_tx_handle_deferred_delete(tx, v) != 0)
+			return -1;
+
 		/* In secondary indexes only REPLACE/DELETE can be written. */
 		vy_stmt_set_lsn(v->stmt, MAX_LSN + tx->psn);
 		const struct tuple **region_stmt =
diff --git a/test/vinyl/quota.result b/test/vinyl/quota.result
index e323bc4e..48042185 100644
--- a/test/vinyl/quota.result
+++ b/test/vinyl/quota.result
@@ -89,7 +89,7 @@ _ = space:replace{1, 1, string.rep('a', 1024 * 1024 * 5)}
 ...
 box.stat.vinyl().quota.used
 ---
-- 5341228
+- 5341267
 ...
 space:drop()
 ---
-- 
2.11.0

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/3] stailq: add stailq_insert function
  2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
@ 2018-07-15  7:02         ` Konstantin Osipov
  2018-07-15 13:17           ` Vladimir Davydov
  2018-07-17 10:18         ` Vladimir Davydov
  1 sibling, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-15  7:02 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/13 17:22]:
> The new function inserts a new item into the list at the specified
> position.
> ---
>  src/lib/salad/stailq.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 

Please add a unit test.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/3] stailq: add stailq_insert function
  2018-07-15  7:02         ` Konstantin Osipov
@ 2018-07-15 13:17           ` Vladimir Davydov
  2018-07-15 18:40             ` Konstantin Osipov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-15 13:17 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Sun, Jul 15, 2018 at 10:02:44AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/13 17:22]:
> > The new function inserts a new item into the list at the specified
> > position.
> > ---
> >  src/lib/salad/stailq.h | 19 +++++++++++++++++++
> >  1 file changed, 19 insertions(+)
> > 
> 
> Please add a unit test.

Done (amended the commit on the branch). The diff is below.

diff --git a/test/unit/stailq.c b/test/unit/stailq.c
index 12f05a0c..600f71a5 100644
--- a/test/unit/stailq.c
+++ b/test/unit/stailq.c
@@ -3,7 +3,7 @@
 #include <stdarg.h>
 #include "unit.h"
 
-#define PLAN		68
+#define PLAN		75
 
 #define ITEMS		7
 
@@ -111,5 +111,19 @@ main(void)
 		is(it, items + i, "head element after concat %d", i);
 		i++;
 	}
+
+	stailq_create(&head);
+	stailq_add_entry(&head, &items[0], next);
+	stailq_insert(&head, &items[2].next, &items[0].next);
+	stailq_insert(&head, &items[1].next, &items[0].next);
+	stailq_insert_entry(&head, &items[4], &items[2], next);
+	stailq_insert_entry(&head, &items[3], &items[2], next);
+	i = 0;
+	stailq_foreach_entry(it, &head, next) {
+		is(it, items + i, "element %d (insert)", i);
+		i++;
+	}
+	is(stailq_first(&head), &items[0].next, "first item (insert)");
+	is(stailq_last(&head), &items[4].next, "last item (insert)");
 	return check_plan();
 }
diff --git a/test/unit/stailq.result b/test/unit/stailq.result
index 78d3e721..04154500 100644
--- a/test/unit/stailq.result
+++ b/test/unit/stailq.result
@@ -1,4 +1,4 @@
-1..68
+1..75
 ok 1 - list is empty
 ok 2 - list is empty after reverse
 ok 3 - first item
@@ -67,3 +67,10 @@ ok 65 - head element after concat 3
 ok 66 - head element after concat 4
 ok 67 - head element after concat 5
 ok 68 - head element after concat 6
+ok 69 - element 0 (insert)
+ok 70 - element 1 (insert)
+ok 71 - element 2 (insert)
+ok 72 - element 3 (insert)
+ok 73 - element 4 (insert)
+ok 74 - first item (insert)
+ok 75 - last item (insert)
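
For readers without the tree at hand, here is a minimal standalone sketch of
the insert-after-prev operation that the test above exercises. It is not the
code from src/lib/salad/stailq.h: the toy_* names are made up for
illustration, and the layout (a head holding a pointer to the first entry and
a pointer to the last "next" slot) is an assumption modelled on the classic
BSD tail queue.

```c
#include <stddef.h>

/* Toy singly-linked tail queue; all toy_* names are hypothetical. */
struct toy_entry {
	struct toy_entry *next;
};

struct toy_queue {
	struct toy_entry *first;
	/* Points at the 'next' slot of the last entry, or at 'first' if empty. */
	struct toy_entry **last;
};

static void
toy_queue_create(struct toy_queue *q)
{
	q->first = NULL;
	q->last = &q->first;
}

static void
toy_queue_add_tail(struct toy_queue *q, struct toy_entry *item)
{
	item->next = NULL;
	*q->last = item;
	q->last = &item->next;
}

/* Insert 'item' right after 'prev', which must already be in the queue. */
static void
toy_queue_insert(struct toy_queue *q, struct toy_entry *item,
		 struct toy_entry *prev)
{
	item->next = prev->next;
	prev->next = item;
	/* If 'prev' was the tail, the tail pointer has to move to 'item'. */
	if (item->next == NULL)
		q->last = &item->next;
}
```

The only subtle part is the tail update: inserting after the current last
entry must also advance the tail pointer, which is what the "last item
(insert)" check in the test guards against.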

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/3] stailq: add stailq_insert function
  2018-07-15 13:17           ` Vladimir Davydov
@ 2018-07-15 18:40             ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-15 18:40 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/15 21:38]:
> On Sun, Jul 15, 2018 at 10:02:44AM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/13 17:22]:
> > > The new function inserts a new item into the list at the specified
> > > position.
> > > ---
> > >  src/lib/salad/stailq.h | 19 +++++++++++++++++++
> > >  1 file changed, 19 insertions(+)
> > > 
> > 
> > Please add a unit test.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup
  2018-07-08 16:48   ` [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup Vladimir Davydov
@ 2018-07-17 10:14     ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-17 10:14 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

On Sun, Jul 08, 2018 at 07:48:34PM +0300, Vladimir Davydov wrote:
> vy_mem_iterator_next is as efficient as the current implementation of
> vy_point_lookup_scan_mem, because it doesn't copy statements anymore
> (see commit 1e1c1fdbedd vinyl: make read iterator always return newest
> tuple version). Let's use it instead of open-coding vy_mem tree lookup.
> ---
>  src/box/vy_point_lookup.c | 47 +++++++++--------------------------------------
>  1 file changed, 9 insertions(+), 38 deletions(-)

Kostja pushed this one a while ago.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions
  2018-07-08 16:48   ` [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions Vladimir Davydov
@ 2018-07-17 10:16     ` Vladimir Davydov
  2018-07-31 20:38     ` Konstantin Osipov
  1 sibling, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-17 10:16 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

On Sun, Jul 08, 2018 at 07:48:43PM +0300, Vladimir Davydov wrote:
> This is not necessary, as we can use fiber()->gc, as we usually do.
> ---
>  src/box/vy_write_iterator.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)

This one is trivial. I pushed it to 1.10.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge
  2018-07-08 16:48   ` [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge Vladimir Davydov
@ 2018-07-17 10:16     ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-17 10:16 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

On Sun, Jul 08, 2018 at 07:48:44PM +0300, Vladimir Davydov wrote:
> If is_first_insert flag is set and vy_stmt_type(rv->tuple) equals
> IPROTO_DELETE, we free rv->tuple, but then we dereference it via
> an on-stack variable to check if we need to turn a REPLACE into an
> INSERT or vice versa. Fix this.
> ---
>  src/box/vy_write_iterator.c | 18 +++++++++---------
>  1 file changed, 9 insertions(+), 9 deletions(-)

This one is trivial. I pushed it to 1.10.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring
  2018-07-08 16:48   ` [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring Vladimir Davydov
@ 2018-07-17 10:17     ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-17 10:17 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

On Sun, Jul 08, 2018 at 07:48:45PM +0300, Vladimir Davydov wrote:
> Move key_def creation to compare_write_iterator_results as it is the
> same for all test cases. Performance is not an issue here, obviously, so
> we can close our eyes to the fact that now we create a new key def for
> each test case.
> ---
>  test/unit/vy_write_iterator.c | 56 +++++++++++++++++++------------------------
>  1 file changed, 25 insertions(+), 31 deletions(-)

This one is trivial. I pushed it to 1.10.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH 1/3] stailq: add stailq_insert function
  2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
  2018-07-15  7:02         ` Konstantin Osipov
@ 2018-07-17 10:18         ` Vladimir Davydov
  1 sibling, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-07-17 10:18 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

On Fri, Jul 13, 2018 at 01:53:52PM +0300, Vladimir Davydov wrote:
> The new function inserts a new item into the list at the specified
> position.
> ---
>  src/lib/salad/stailq.h | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)

Pushed it to 1.10.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version
  2018-07-11 16:33         ` Vladimir Davydov
@ 2018-07-31 19:17           ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 19:17 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/11 19:37]:
> 
> Also, it allows to simplify vy_squash_process() - see the patch below
> (I pushed it on the branch, you may want to cherry-pick it as well):
> 
> From c001e3f1a320bb804f49ca19ec540fee094dacc3 Mon Sep 17 00:00:00 2001
> From: Vladimir Davydov <vdavydov.dev@gmail.com>
> Date: Wed, 11 Jul 2018 19:14:24 +0300
> Subject: [PATCH] vinyl: simplify vy_squash_process
> 

> Since vy_point_lookup() now guarantees that it returns the newest
> tuple version, we can remove the code that squashes UPSERTs from
> vy_squash_process().

Pushed.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl
  2018-07-08 16:48   ` [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl Vladimir Davydov
@ 2018-07-31 20:28     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:28 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> There's no point in separating REPLACE path between the cases when
> the space has secondary indexes and when it only has the primary
> index, because they are quite similar. Let's fold vy_replace_one
> and vy_replace_impl into vy_replace to remove code duplication.
> ---
>  src/box/vinyl.c | 219 +++++++++++++++++---------------------------------------

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 06/23] vinyl: fold vy_delete_impl
  2018-07-08 16:48   ` [RFC PATCH 06/23] vinyl: fold vy_delete_impl Vladimir Davydov
@ 2018-07-31 20:28     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:28 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> vy_delete_impl helper is only used once in vy_delete and it is rather
> small so inlining it definitely won't hurt. On the contrary, it will
> consolidate DELETE logic in one place, making the code easier to follow.

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 07/23] vinyl: refactor unique check
  2018-07-08 16:48   ` [RFC PATCH 07/23] vinyl: refactor unique check Vladimir Davydov
@ 2018-07-31 20:28     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:28 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> For the sake of further patches, let's do some refactoring:
>  - Rename vy_check_is_unique to vy_check_is_unique_primary and use it
>    only for checking the unique constraint of primary indexes. Also,
>    make it return immediately if the primary index doesn't need
>    uniqueness check, like vy_check_is_unique_secondary does.
>  - Open-code uniqueness check in vy_check_is_unique_secondary instead of
>    using vy_check_is_unique.
>  - Reduce indentation level of vy_check_is_unique_secondary by inverting
>    the if statement.

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set
  2018-07-08 16:48   ` [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set Vladimir Davydov
@ 2018-07-31 20:34     ` Konstantin Osipov
  2018-08-01 10:42       ` Vladimir Davydov
  2018-08-09 20:26     ` Konstantin Osipov
  1 sibling, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:34 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:

> Currently, we handle INSERT/REPLACE/UPDATE requests by iterating over
> all space indexes starting from the primary and inserting the
> corresponding statements to tx write set, checking key uniqueness if
> necessary. This means that by the time we write a REPLACE to the write
> set of a secondary index, it has already been written to the primary
> index write set. This is OK, and vy_tx_prepare() relies on that to
> implement the common memory level. However, this also means that when we
> check uniqueness of a secondary index, the new REPLACE can be found via
> the primary index. This is OK now, because all indexes are fully
> independent, but it isn't going to fly after #2129 is implemented. The
> problem is in order to check if a tuple is present in a secondary index,
> we will have to look up the corresponding full tuple in the primary
> index. To illustrate the problem, consider the following situation:

I don't understand how this patch works. You need to put the key
into transaction write set first, and yield second. If there is a
change which happens during yield made on behalf of the unique
check, it must be able to see the keys your transaction is reading
and abort it.

Besides, I don't understand how the order of checks is making any
difference in your example - until a transaction is committed it
is not present in the common memory level anyway. 

> 
>   Primary index covers field 1.
>   Secondary index covers field 2.
> 
>   Committed statements:
> 
>     REPLACE{1, 10, lsn=1} - present in both indexes
>     DELETE{1, lsn=2} - present only in the primary index
> 
>   Transaction:
> 
>     REPLACE{1, 10}
> 
> When we check uniqueness of the secondary index, we find committed
> statement REPLACE{1, 10, lsn=1}, then look up the corresponding full
> tuple in the primary index and find REPLACE{1, 10}. Since the two tuples
> match, we mistakenly assume that there's a conflict.
> 
> To avoid a situation like that, let's check uniqueness before modifying
> the write set of any index.
> 
> Needed for #2129
> ---
>  src/box/vinyl.c | 128 +++++++++++++++++++++++++++-----------------------------
>  1 file changed, 62 insertions(+), 66 deletions(-)
> 

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 11/23] xrow: allow to store flags in DML requests
  2018-07-08 16:48   ` [RFC PATCH 11/23] xrow: allow to store flags in DML requests Vladimir Davydov
@ 2018-07-31 20:36     ` Konstantin Osipov
  2018-08-01 14:10       ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:36 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> In the scope of #2129 we need to mark REPLACE statements for which we
> generated DELETE in secondary indexes so that we don't generate DELETE
> again on compaction. We also need to mark DELETE statements that were
> generated on compaction so that we can skip them on SELECT.
> 
> Let's add flags field to struct vy_stmt. Flags are stored both in memory
> and on disk so to encode/decode them we also need to add a new iproto
> key (IPROTO_FLAGS) and the corresponding field to struct request.

We have been avoiding IPROTO_FLAGS so far. Can we make this
member engine-local, i.e. make sure it's only visible/usable for vinyl rows?

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions
  2018-07-08 16:48   ` [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions Vladimir Davydov
  2018-07-17 10:16     ` Vladimir Davydov
@ 2018-07-31 20:38     ` Konstantin Osipov
  2018-08-01 14:14       ` Vladimir Davydov
  1 sibling, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:38 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> This is not necessary, as we can use fiber()->gc, as we usually do.

The reason Vlad is passing it explicitly is that if we plan to
make transactions interactive, we're going to end up with more
than one transaction per fiber. So it really should be tx->gc
(non-existent at the moment), not fiber->gc.

I think Vlad is passing it around explicitly as a mental note to
self.

It's OK to push.

1.0

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task
  2018-07-08 16:48   ` [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task Vladimir Davydov
@ 2018-07-31 20:39     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:39 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> Currently, we don't really need it, but once we switch communication
> channel between the scheduler and workers from pthread mutex/cond to
> cbus (needed for #2129), tasks won't be completed on behalf of the
> scheduler fiber and hence we will need a back pointer from vy_task to
> vy_scheduler.

OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct
  2018-07-08 16:48   ` [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct Vladimir Davydov
@ 2018-07-31 20:40     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:40 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> I'm planning to add some new members and remove some old members from
> those structs. For this to play nicely, let's do some renames:
> 
>   vy_scheduler::workers_available => idle_worker_count
>   vy_scheduler::input_queue       => pending_tasks
>   vy_scheduler::output_queue      => processed_tasks
>   vy_task::link                   => in_pending, in_processed

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads
  2018-07-08 16:48   ` [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads Vladimir Davydov
@ 2018-07-31 20:43     ` Konstantin Osipov
  2018-08-01 14:26       ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:43 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> We need cbus for forwarding deferred DELETE statements generated in a
> worker thread during primary index compaction to the tx thread where
> they can be inserted into secondary indexes. Since pthread mutex/cond
> and cbus are incompatible by their nature, let's rework communication
> channel between the tx and worker threads using cbus.

OK to push.

> +	/**
> +	 * Fiber that is currently executing this task in
> +	 * a worker thread.
> +	 */
> +	struct fiber *fiber;

You could consider using a fiber pool rather than starting a fiber
for each task, but I guess it's minor.

Have you benched the performance of this patch? I would expect it
to have some positive impact on performance.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running
  2018-07-08 16:48   ` [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running Vladimir Davydov
@ 2018-07-31 20:43     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:43 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> This flag is set iff worker_pool != NULL hence it is pointless.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed
  2018-07-08 16:48   ` [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed Vladimir Davydov
@ 2018-07-31 20:44     ` Konstantin Osipov
  0 siblings, 0 replies; 65+ messages in thread
From: Konstantin Osipov @ 2018-07-31 20:44 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> vy_task::status stores the return code of the ->execute method. There
> are only two codes in use: 0 - success and -1 - failure. So let's change
> this to a boolean flag.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set
  2018-07-31 20:34     ` Konstantin Osipov
@ 2018-08-01 10:42       ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-01 10:42 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 31, 2018 at 11:34:10PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> 
> > Currently, we handle INSERT/REPLACE/UPDATE requests by iterating over
> > all space indexes starting from the primary and inserting the
> > corresponding statements to tx write set, checking key uniqueness if
> > necessary. This means that by the time we write a REPLACE to the write
> > set of a secondary index, it has already been written to the primary
> > index write set. This is OK, and vy_tx_prepare() relies on that to
> > implement the common memory level. However, this also means that when we
> > check uniqueness of a secondary index, the new REPLACE can be found via
> > the primary index. This is OK now, because all indexes are fully
> > independent, but it isn't going to fly after #2129 is implemented. The
> > problem is in order to check if a tuple is present in a secondary index,
> > we will have to look up the corresponding full tuple in the primary
> > index. To illustrate the problem, consider the following situation:
> 
> I don't understand how this patch works. You need to put the key
> into transaction write set first, and yield second. If there is a
> change which happens during yield made on behalf of the unique
> check, it must be able to see the keys your transaction is reading
> and abort it.

It will. We abort transactions on conflicts in the read set.
Write set has nothing to do with transaction conflict resolution.

> 
> Besides, I don't understand how the order of checks is making any
> difference in your example - until a transaction is committed it
> is not present in the common memory level anyway. 

It isn't, but it will see its own write set, which is inconsistent
between the primary and secondary indexes.

> 
> > 
> >   Primary index covers field 1.
> >   Secondary index covers field 2.
> > 
> >   Committed statements:
> > 
> >     REPLACE{1, 10, lsn=1} - present in both indexes
> >     DELETE{1, lsn=2} - present only in the primary index
> > 
> >   Transaction:
> > 
> >     REPLACE{1, 10}
> > 
> > When we check uniqueness of the secondary index, we find committed
> > statement REPLACE{1, 10, lsn=1}, then look up the corresponding full
> > tuple in the primary index and find REPLACE{1, 10}. Since the two tuples
> > match, we mistakenly assume that there's a conflict.
> > 
> > To avoid a situation like that, let's check uniqueness before modifying
> > the write set of any index.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 11/23] xrow: allow to store flags in DML requests
  2018-07-31 20:36     ` Konstantin Osipov
@ 2018-08-01 14:10       ` Vladimir Davydov
  2018-08-17 13:34         ` Vladimir Davydov
  0 siblings, 1 reply; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-01 14:10 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 31, 2018 at 11:36:34PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > In the scope of #2129 we need to mark REPLACE statements for which we
> > generated DELETE in secondary indexes so that we don't generate DELETE
> > again on compaction. We also need to mark DELETE statements that were
> > generated on compaction so that we can skip them on SELECT.
> > 
> > Let's add flags field to struct vy_stmt. Flags are stored both in memory
> > and on disk so to encode/decode them we also need to add a new iproto
> > key (IPROTO_FLAGS) and the corresponding field to struct request.
> 
> We have been avoiding IPROTO_FLAGS so far. Can we make this
> member engine-local, i.e. make sure it's only visible/usable for vinyl rows?

The problem is that vinyl uses the same format (and functions) for
encoding/decoding statements as xlog does, e.g. decoding is done by
xrow_decode_dml via struct request. So I don't see any way to introduce
per-statement flags other than adding a new field to struct request
and making xrow_decode_dml aware of it...

I agree that IPROTO_FLAGS looks too generic though. We can rename it to
IPROTO_TUPLE_FLAGS to avoid confusion.

Alternatively, we could rename it to IPROTO_VY_STMT_FLAGS (and
request::flags to request::vy_stmt_flags) to emphasize that it is only
relevant for vinyl, but we would still have to add handling of this
flag in the generic code (xrow_encode/decode_dml). This would look ugly
IMO.

What do you think?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions
  2018-07-31 20:38     ` Konstantin Osipov
@ 2018-08-01 14:14       ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-01 14:14 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 31, 2018 at 11:38:13PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> > This is not necessary, as we can use fiber()->gc, as we usually do.
> 
> The reason Vlad is passing it explicitly is that if we plan to
> make transactions interactive, we're going to end up with more
> than one transaction per fiber. So it really should be tx->gc
> (non-existent at the moment), not fiber->gc.
> 
> I think Vlad is passing it around explicitly as a mental note to
> self.

No, Vlad doesn't have anything to do with this code. It was written by
Alex Lyapunov when he rewrote the write iterator.

Even when we make the region per tx, we will still use fiber->gc in this
function, because it doesn't have anything to do with transaction
processing. In fact, it uses fiber->gc as a big stack and cleans it up
after it's done. No point in passing it around.
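
To illustrate the "big stack" pattern, here is a minimal standalone sketch.
It uses a toy bump allocator rather than the real region from the small
library, so every toy_* name below is hypothetical; only the idea (remember
the used mark, allocate scratch memory, truncate back to the mark) is what
the write iterator relies on.

```c
#include <stdalign.h>
#include <stddef.h>

/* Toy bump allocator standing in for fiber()->gc; names are hypothetical. */
struct toy_region {
	alignas(max_align_t) char buf[4096];
	size_t used;
};

static void *
toy_region_alloc(struct toy_region *r, size_t size)
{
	size_t align = alignof(max_align_t);
	if (size > sizeof(r->buf))
		return NULL;
	/* Round up so that every allocation stays suitably aligned. */
	size = (size + align - 1) / align * align;
	if (size > sizeof(r->buf) - r->used)
		return NULL;
	void *p = r->buf + r->used;
	r->used += size;
	return p;
}

static size_t
toy_region_used(const struct toy_region *r)
{
	return r->used;
}

static void
toy_region_truncate(struct toy_region *r, size_t mark)
{
	r->used = mark;
}

/*
 * Use the region as scratch space and leave it exactly as it was found,
 * so the caller does not need to know a region was involved at all.
 */
static long
sum_squares(struct toy_region *r, size_t n)
{
	size_t mark = toy_region_used(r);
	long *tmp = toy_region_alloc(r, n * sizeof(*tmp));
	if (tmp == NULL)
		return -1;
	for (size_t i = 0; i < n; i++)
		tmp[i] = (long)(i * i);
	long sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += tmp[i];
	/* Pop everything allocated above, like unwinding a stack frame. */
	toy_region_truncate(r, mark);
	return sum;
}
```

Since the function restores the mark before returning, nothing it allocates
outlives the call, which is why an explicit region argument buys nothing
here.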

> 
> It's OK to push.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads
  2018-07-31 20:43     ` Konstantin Osipov
@ 2018-08-01 14:26       ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-01 14:26 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Jul 31, 2018 at 11:43:42PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:53]:
> > We need cbus for forwarding deferred DELETE statements generated in a
> > worker thread during primary index compaction to the tx thread where
> > they can be inserted into secondary indexes. Since pthread mutex/cond
> > and cbus are incompatible by their nature, let's rework communication
> > channel between the tx and worker threads using cbus.
> 
> OK to push.
> 
> > +	/**
> > +	 * Fiber that is currently executing this task in
> > +	 * a worker thread.
> > +	 */
> > +	struct fiber *fiber;
> 
> You could consider using a fiber pool rather than starting a fiber
> for each task, but I guess it's minor.

Yep, thought about that too. It could simplify the code a little bit.
But we don't really need a fiber pool here. Besides, fiber pool has some
troubles with thread termination IIRC - currently we can't just make a
thread using a fiber pool stop immediately without waiting for existing
fibers to terminate (we don't want to wait for compaction to complete if
tarantool is stopped). I'll look into this.

> 
> Have you benched the performance of this patch? I would expect it
> to have some positive impact on performance.

Alas, there's no performance benefit, because dump/compaction are very
rare operations and they take really long (seconds, sometimes minutes).
That is, the pthread mutex that guards the task queue is taken so
infrequently that you can't even feel it. There's no lock contention or
other problems usually associated with locks either.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set
  2018-07-08 16:48   ` [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set Vladimir Davydov
  2018-07-31 20:34     ` Konstantin Osipov
@ 2018-08-09 20:26     ` Konstantin Osipov
  2018-08-10  8:26       ` Vladimir Davydov
  1 sibling, 1 reply; 65+ messages in thread
From: Konstantin Osipov @ 2018-08-09 20:26 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> +	if (found != NULL && vy_tuple_compare(stmt, found,
> +					      lsm->pk->key_def) == 0) {
> +		/*
> +		 * If the old and new tuples are the same in
> +		 * terms of the primary key definition, the
> +		 * statement doesn't modify the secondary key
> +		 * and so there's actually no conflict.
> +		 */
> +		tuple_unref(found);
> +		return 0;
> +	}

In memtx, we pass old_tuple in txn_stmt around so that we can
check that found == old_tuple and ignore the duplicate (please
take a look at replace_check_dup). Why not do the same here, it
would save us a compare?

> +
> +	/*
> +	 * For secondary indexes, uniqueness must be checked on both
> +	 * INSERT and REPLACE.
> +	 */
> +	for (uint32_t i = 1; i < space->index_count; i++) {
> +		struct vy_lsm *lsm = vy_lsm(space->index[i]);
> +		if (vy_check_is_unique_secondary(env, tx, rv, space_name(space),
> +						 index_name_by_id(space, i),
> +						 lsm, stmt) != 0)
> +			return -1;
> +	}
> +	return 0;

This code calls vy_get(), which in turn makes an unnecessary
lookup in the primary key.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set
  2018-08-09 20:26     ` Konstantin Osipov
@ 2018-08-10  8:26       ` Vladimir Davydov
  0 siblings, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-10  8:26 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Aug 09, 2018 at 11:26:16PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/08 22:52]:
> > +	if (found != NULL && vy_tuple_compare(stmt, found,
> > +					      lsm->pk->key_def) == 0) {
> > +		/*
> > +		 * If the old and new tuples are the same in
> > +		 * terms of the primary key definition, the
> > +		 * statement doesn't modify the secondary key
> > +		 * and so there's actually no conflict.
> > +		 */
> > +		tuple_unref(found);
> > +		return 0;
> > +	}
> 
> In memtx, we pass old_tuple in txn_stmt around so that we can
> check that found == old_tuple and ignore the duplicate (please
> take a look at replace_check_dup). Why not do the same here, it
> would save us a compare?

The problem is vinyl doesn't necessarily return the same tuple for
lookups by the same key - the resulting tuple is reallocated if it is
read from disk or upserts are applied. Besides, the old tuple will be
unavailable for REPLACE (once secondary keys are reworked) and for
certain UPDATEs (if we make UPDATEs that don't touch secondary keys
read-less).

That said, we can pass the old tuple, but we still have to fall back on
tuple comparison if pointers don't match.
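
A minimal sketch of what I mean, with a toy tuple type in place of
struct tuple and a toy comparator in place of vy_tuple_compare (all toy_*
names are hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy tuple: field 1 is the primary key, field 2 is a secondary key. */
struct toy_tuple {
	int pk;
	int sk;
};

/* Three-way comparison by primary key only. */
static int
toy_pk_compare(const struct toy_tuple *a, const struct toy_tuple *b)
{
	return (a->pk > b->pk) - (a->pk < b->pk);
}

/*
 * Return true if 'found' is the tuple the statement is overwriting.
 * The pointer check is a cheap fast path; the comparison covers the
 * case when the lookup returned a reallocated copy of the same tuple.
 */
static bool
toy_is_same_tuple(const struct toy_tuple *found, const struct toy_tuple *old)
{
	if (found == NULL || old == NULL)
		return false;
	if (found == old)
		return true;
	return toy_pk_compare(found, old) == 0;
}
```

So the pointer check only saves the compare in the lucky case; it can't
replace it.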

> 
> > +
> > +	/*
> > +	 * For secondary indexes, uniqueness must be checked on both
> > +	 * INSERT and REPLACE.
> > +	 */
> > +	for (uint32_t i = 1; i < space->index_count; i++) {
> > +		struct vy_lsm *lsm = vy_lsm(space->index[i]);
> > +		if (vy_check_is_unique_secondary(env, tx, rv, space_name(space),
> > +						 index_name_by_id(space, i),
> > +						 lsm, stmt) != 0)
> > +			return -1;
> > +	}
> > +	return 0;
> 
> This code calls vy_get(), which in turn makes an unnecessary
> lookup in the primary key.

This is a preparation for new secondary keys - after this patch set is
applied, secondary index lookup won't be enough for checking duplicates,
because a tuple read from a secondary index may be stale (overwritten in
the primary index without DELETE). There's no way to check that other
than reading the matching tuple from the primary index.
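
To illustrate the idea, here is a toy model of such a lookup. It is only
a sketch under simplifying assumptions: indexes are plain arrays, and
all names (toy_row, toy_sk_entry, toy_pk_lookup, toy_sk_lookup) are
hypothetical, not vinyl code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Primary index row: the full tuple (primary key + secondary key). */
struct toy_row {
	int pk;
	const char *sk;
};

/* Secondary index entry: secondary key + primary key of the tuple. */
struct toy_sk_entry {
	const char *sk;
	int pk;
};

/* Find the full tuple by primary key, NULL if there is none. */
static const struct toy_row *
toy_pk_lookup(const struct toy_row *rows, size_t n_rows, int pk)
{
	for (size_t i = 0; i < n_rows; i++) {
		if (rows[i].pk == pk)
			return &rows[i];
	}
	return NULL;
}

/*
 * Find a live tuple by secondary key. An entry is trusted only if
 * the primary index still holds a tuple with the same secondary key;
 * otherwise it was overwritten without a DELETE and is skipped.
 */
static const struct toy_row *
toy_sk_lookup(const struct toy_sk_entry *entries, size_t n_entries,
	      const struct toy_row *rows, size_t n_rows, const char *sk)
{
	for (size_t i = 0; i < n_entries; i++) {
		if (strcmp(entries[i].sk, sk) != 0)
			continue;
		const struct toy_row *full =
			toy_pk_lookup(rows, n_rows, entries[i].pk);
		if (full != NULL && strcmp(full->sk, sk) == 0)
			return full;
		/* Stale entry: keep scanning. */
	}
	return NULL;
}
```

With a row {1, "b"} in the primary index and entries {"a", 1} and
{"b", 1} in the secondary one, a lookup by "a" finds nothing, because
the entry was left behind by a REPLACE that changed the secondary key.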

Also, always getting the full tuple is useful for shared cache, which is
introduced later in the series. Currently, we store partial tuples in
secondary index cache thus wasting memory. Reading the full tuple will
allow us to avoid that.


* Re: [RFC PATCH 11/23] xrow: allow to store flags in DML requests
  2018-08-01 14:10       ` Vladimir Davydov
@ 2018-08-17 13:34         ` Vladimir Davydov
  2018-08-17 13:34           ` [PATCH 1/2] xrow: allow to store tuple metadata in request Vladimir Davydov
  2018-08-17 13:34           ` [PATCH 2/2] vinyl: introduce statement flags Vladimir Davydov
  0 siblings, 2 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-17 13:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

As discussed verbally with Kostja, it's more flexible to introduce
tuple metadata, which is basically a msgpack map, instead of flags. 
The patches that implement it are right below. The branch has been
updated as well.
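
For reference, a rough sketch of what such a map looks like on the wire
when it carries a single flags byte. The encoder below is hand-rolled
for illustration and does not use msgpuck; toy_meta_encode and
TOY_META_FLAGS are hypothetical names, and the key value 0x01 is taken
from VY_STMT_FLAGS in patch 2/2:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata key; mirrors VY_STMT_FLAGS = 0x01. */
enum { TOY_META_FLAGS = 0x01 };

/*
 * Encode {TOY_META_FLAGS: flags} as a MessagePack map.
 * Returns the number of bytes written (3 or 4).
 */
static size_t
toy_meta_encode(uint8_t flags, uint8_t *buf, size_t cap)
{
	assert(cap >= 4); /* worst case for an 8-bit value */
	size_t n = 0;
	buf[n++] = 0x81;           /* fixmap with one key-value pair */
	buf[n++] = TOY_META_FLAGS; /* key as a positive fixint */
	if (flags > 0x7f)
		buf[n++] = 0xcc;   /* "uint 8" marker for 128..255 */
	buf[n++] = flags;          /* the value itself */
	return n;
}
```

So flags = 0x03 take three bytes (81 01 03) and flags = 0x80 take four
(81 01 cc 80). A new kind of metadata only needs a new key in the map,
which is where the extra flexibility over a bare flags field comes from.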

Vladimir Davydov (2):
  xrow: allow to store tuple metadata in request
  vinyl: introduce statement flags

 src/box/iproto_constants.c |  3 ++-
 src/box/iproto_constants.h |  3 ++-
 src/box/vy_stmt.c          | 64 ++++++++++++++++++++++++++++++++++++++++++++++
 src/box/vy_stmt.h          | 15 +++++++++++
 src/box/xrow.c             | 13 +++++++++-
 src/box/xrow.h             |  3 +++
 6 files changed, 98 insertions(+), 3 deletions(-)

-- 
2.11.0


* [PATCH 1/2] xrow: allow to store tuple metadata in request
  2018-08-17 13:34         ` Vladimir Davydov
@ 2018-08-17 13:34           ` Vladimir Davydov
  2018-08-17 13:34           ` [PATCH 2/2] vinyl: introduce statement flags Vladimir Davydov
  1 sibling, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-17 13:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

This patch allows storing a msgpack map with arbitrary keys inside
a request. In particular, this is needed to store vinyl statement flags
in run files.

Needed for #2129
---
 src/box/iproto_constants.c |  3 ++-
 src/box/iproto_constants.h |  3 ++-
 src/box/xrow.c             | 13 ++++++++++++-
 src/box/xrow.h             |  3 +++
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index e35738b4..a8d15cb1 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -87,6 +87,7 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x27 */	MP_STR, /* IPROTO_EXPR */
 	/* 0x28 */	MP_ARRAY, /* IPROTO_OPS */
 	/* 0x29 */	MP_MAP, /* IPROTO_BALLOT */
+	/* 0x2a */	MP_MAP, /* IPROTO_TUPLE_META */
 	/* }}} */
 };
 
@@ -168,7 +169,7 @@ const char *iproto_key_strs[IPROTO_KEY_MAX] = {
 	"expression",       /* 0x27 */
 	"operations",       /* 0x28 */
 	"ballot",           /* 0x29 */
-	NULL,               /* 0x2a */
+	"tuple meta",       /* 0x2a */
 	NULL,               /* 0x2b */
 	NULL,               /* 0x2c */
 	NULL,               /* 0x2d */
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index f282a0b2..404f97a2 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -78,6 +78,7 @@ enum iproto_key {
 	IPROTO_EXPR = 0x27, /* EVAL */
 	IPROTO_OPS = 0x28, /* UPSERT but not UPDATE ops, because of legacy */
 	IPROTO_BALLOT = 0x29,
+	IPROTO_TUPLE_META = 0x2a,
 	/* Leave a gap between request keys and response keys */
 	IPROTO_DATA = 0x30,
 	IPROTO_ERROR = 0x31,
@@ -96,7 +97,7 @@ enum iproto_ballot_key {
 			  bit(LSN) | bit(SCHEMA_VERSION))
 #define IPROTO_DML_BODY_BMAP (bit(SPACE_ID) | bit(INDEX_ID) | bit(LIMIT) |\
 			      bit(OFFSET) | bit(ITERATOR) | bit(INDEX_BASE) |\
-			      bit(KEY) | bit(TUPLE) | bit(OPS))
+			      bit(KEY) | bit(TUPLE) | bit(OPS) | bit(TUPLE_META))
 
 static inline bool
 xrow_header_has_key(const char *pos, const char *end)
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 269a6e68..7a35d0db 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -532,6 +532,10 @@ error:
 			request->ops = value;
 			request->ops_end = data;
 			break;
+		case IPROTO_TUPLE_META:
+			request->tuple_meta = value;
+			request->tuple_meta_end = data;
+			break;
 		default:
 			break;
 		}
@@ -585,7 +589,8 @@ xrow_encode_dml(const struct request *request, struct iovec *iov)
 	const int MAP_LEN_MAX = 40;
 	uint32_t key_len = request->key_end - request->key;
 	uint32_t ops_len = request->ops_end - request->ops;
-	uint32_t len = MAP_LEN_MAX + key_len + ops_len;
+	uint32_t tuple_meta_len = request->tuple_meta_end - request->tuple_meta;
+	uint32_t len = MAP_LEN_MAX + key_len + ops_len + tuple_meta_len;
 	char *begin = (char *) region_alloc(&fiber()->gc, len);
 	if (begin == NULL) {
 		diag_set(OutOfMemory, len, "region_alloc", "begin");
@@ -620,6 +625,12 @@ xrow_encode_dml(const struct request *request, struct iovec *iov)
 		pos += ops_len;
 		map_size++;
 	}
+	if (request->tuple_meta) {
+		pos = mp_encode_uint(pos, IPROTO_TUPLE_META);
+		memcpy(pos, request->tuple_meta, tuple_meta_len);
+		pos += tuple_meta_len;
+		map_size++;
+	}
 	if (request->tuple) {
 		pos = mp_encode_uint(pos, IPROTO_TUPLE);
 		iov[iovcnt].iov_base = (void *) request->tuple;
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 9887382c..47216705 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -127,6 +127,9 @@ struct request {
 	/** Upsert operations. */
 	const char *ops;
 	const char *ops_end;
+	/** Tuple metadata. */
+	const char *tuple_meta;
+	const char *tuple_meta_end;
 	/** Base field offset for UPDATE/UPSERT, e.g. 0 for C and 1 for Lua. */
 	int index_base;
 };
-- 
2.11.0


* [PATCH 2/2] vinyl: introduce statement flags
  2018-08-17 13:34         ` Vladimir Davydov
  2018-08-17 13:34           ` [PATCH 1/2] xrow: allow to store tuple metadata in request Vladimir Davydov
@ 2018-08-17 13:34           ` Vladimir Davydov
  1 sibling, 0 replies; 65+ messages in thread
From: Vladimir Davydov @ 2018-08-17 13:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In the scope of #2129 we need to mark REPLACE statements for which we
generated DELETE in secondary indexes so that we don't generate DELETE
again on compaction. We also need to mark DELETE statements that were
generated on compaction so that we can skip them on SELECT.

Let's add a flags field to struct vy_stmt. Flags are stored both in
memory and on disk; in the latter case they are encoded in tuple meta.

Needed for #2129
---
 src/box/vy_stmt.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/box/vy_stmt.h | 15 +++++++++++++
 2 files changed, 79 insertions(+)

diff --git a/src/box/vy_stmt.c b/src/box/vy_stmt.c
index a4b7975b..bcf3dd11 100644
--- a/src/box/vy_stmt.c
+++ b/src/box/vy_stmt.c
@@ -45,6 +45,14 @@
 #include "xrow.h"
 #include "fiber.h"
 
+/**
+ * Statement metadata keys.
+ */
+enum vy_stmt_meta_key {
+	/** Statement flags. */
+	VY_STMT_FLAGS = 0x01,
+};
+
 static struct tuple *
 vy_tuple_new(struct tuple_format *format, const char *data, const char *end)
 {
@@ -112,6 +120,7 @@ vy_stmt_alloc(struct tuple_format *format, uint32_t bsize)
 	tuple->data_offset = sizeof(struct vy_stmt) + meta_size;;
 	vy_stmt_set_lsn(tuple, 0);
 	vy_stmt_set_type(tuple, 0);
+	vy_stmt_set_flags(tuple, 0);
 	return tuple;
 }
 
@@ -485,6 +494,56 @@ vy_stmt_extract_key_raw(const char *data, const char *data_end,
 	return key;
 }
 
+/**
+ * Encode the given statement meta data in a request.
+ * Returns 0 on success, -1 on memory allocation error.
+ */
+static int
+vy_stmt_meta_encode(const struct tuple *stmt, struct request *request)
+{
+	if (vy_stmt_flags(stmt) == 0)
+		return 0; /* nothing to encode */
+
+	size_t len = mp_sizeof_map(1) + 2 * mp_sizeof_uint(UINT64_MAX);
+	char *buf = region_alloc(&fiber()->gc, len);
+	if (buf == NULL)
+		return -1;
+	char *pos = buf;
+	pos = mp_encode_map(pos, 1);
+	pos = mp_encode_uint(pos, VY_STMT_FLAGS);
+	pos = mp_encode_uint(pos, vy_stmt_flags(stmt));
+	assert(pos <= buf + len);
+
+	request->tuple_meta = buf;
+	request->tuple_meta_end = pos;
+	return 0;
+}
+
+/**
+ * Decode statement meta data from a request.
+ */
+static void
+vy_stmt_meta_decode(struct request *request, struct tuple *stmt)
+{
+	const char *data = request->tuple_meta;
+	if (data == NULL)
+		return; /* nothing to decode */
+
+	uint32_t size = mp_decode_map(&data);
+	for (uint32_t i = 0; i < size; i++) {
+		uint64_t key = mp_decode_uint(&data);
+		switch (key) {
+		case VY_STMT_FLAGS: {
+			uint64_t flags = mp_decode_uint(&data);
+			vy_stmt_set_flags(stmt, flags);
+			break;
+		}
+		default:
+			mp_next(&data); /* unknown key, ignore */
+		}
+	}
+}
+
 int
 vy_stmt_encode_primary(const struct tuple *value,
 		       const struct key_def *key_def, uint32_t space_id,
@@ -525,6 +584,8 @@ vy_stmt_encode_primary(const struct tuple *value,
 	default:
 		unreachable();
 	}
+	if (vy_stmt_meta_encode(value, &request) != 0)
+		return -1;
 	xrow->bodycnt = xrow_encode_dml(&request, xrow->body);
 	if (xrow->bodycnt < 0)
 		return -1;
@@ -556,6 +617,8 @@ vy_stmt_encode_secondary(const struct tuple *value,
 		request.key = extracted;
 		request.key_end = extracted + size;
 	}
+	if (vy_stmt_meta_encode(value, &request) != 0)
+		return -1;
 	xrow->bodycnt = xrow_encode_dml(&request, xrow->body);
 	if (xrow->bodycnt < 0)
 		return -1;
@@ -613,6 +676,7 @@ vy_stmt_decode(struct xrow_header *xrow, const struct key_def *key_def,
 	if (stmt == NULL)
 		return NULL; /* OOM */
 
+	vy_stmt_meta_decode(&request, stmt);
 	vy_stmt_set_lsn(stmt, xrow->lsn);
 	return stmt;
 }
diff --git a/src/box/vy_stmt.h b/src/box/vy_stmt.h
index e53f98ce..bcf855dd 100644
--- a/src/box/vy_stmt.h
+++ b/src/box/vy_stmt.h
@@ -103,6 +103,7 @@ struct vy_stmt {
 	struct tuple base;
 	int64_t lsn;
 	uint8_t  type; /* IPROTO_SELECT/REPLACE/UPSERT/DELETE */
+	uint8_t flags;
 	/**
 	 * Offsets array concatenated with MessagePack fields
 	 * array.
@@ -138,6 +139,20 @@ vy_stmt_set_type(struct tuple *stmt, enum iproto_type type)
 	((struct vy_stmt *) stmt)->type = type;
 }
 
+/** Get flags of the vinyl statement. */
+static inline uint8_t
+vy_stmt_flags(const struct tuple *stmt)
+{
+	return ((const struct vy_stmt *)stmt)->flags;
+}
+
+/** Set flags of the vinyl statement. */
+static inline void
+vy_stmt_set_flags(struct tuple *stmt, uint8_t flags)
+{
+	((struct vy_stmt *)stmt)->flags = flags;
+}
+
 /**
  * Get upserts count of the vinyl statement.
  * Only for UPSERT statements allocated on lsregion.
-- 
2.11.0
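
A note on the decode side: unknown metadata keys are skipped, so run
files written with additional keys stay readable. Below is a
self-contained sketch of that behaviour. It is not the patch code: it
avoids msgpuck, handles only small maps whose keys and values are
positive fixints or "uint 8" values, and all names (toy_meta_decode,
toy_decode_small_uint, TOY_META_FLAGS) are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical metadata key; mirrors VY_STMT_FLAGS = 0x01. */
enum { TOY_META_FLAGS = 0x01 };

/*
 * Decode a positive fixint or a "uint 8" value and advance *pos.
 * Returns 0 on success, -1 on truncated or unsupported input.
 */
static int
toy_decode_small_uint(const uint8_t **pos, const uint8_t *end, uint8_t *out)
{
	if (*pos >= end)
		return -1;
	uint8_t b = *(*pos)++;
	if (b <= 0x7f) {
		*out = b;
		return 0;
	}
	if (b == 0xcc && *pos < end) {
		*out = *(*pos)++;
		return 0;
	}
	return -1;
}

/*
 * Decode a fixmap of small unsigned keys and values. The flags are
 * taken from TOY_META_FLAGS if present, any other key is ignored.
 * Returns 0 on success, -1 on malformed or unsupported input.
 */
static int
toy_meta_decode(const uint8_t *data, size_t len, uint8_t *flags)
{
	const uint8_t *pos = data, *end = data + len;
	*flags = 0;
	if (pos >= end || (*pos & 0xf0) != 0x80)
		return -1; /* not a fixmap */
	uint32_t size = *pos++ & 0x0f;
	for (uint32_t i = 0; i < size; i++) {
		uint8_t key, value;
		if (toy_decode_small_uint(&pos, end, &key) != 0 ||
		    toy_decode_small_uint(&pos, end, &value) != 0)
			return -1;
		if (key == TOY_META_FLAGS)
			*flags = value;
		/* Unknown key: the value was consumed and is dropped. */
	}
	return 0;
}
```

For example, the bytes 82 02 7f 01 cc 80 decode to flags = 0x80, with
the pair under the unknown key 0x02 silently dropped.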


Thread overview: 65+ messages
2018-07-08 16:48 [RFC PATCH 02/23] vinyl: always get full tuple from pk after reading from secondary index Vladimir Davydov
2018-07-08 16:48 ` [RFC PATCH 00/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 01/23] vinyl: do not turn REPLACE into INSERT when processing DML request Vladimir Davydov
2018-07-10 12:15     ` Konstantin Osipov
2018-07-10 12:19       ` Vladimir Davydov
2018-07-10 18:39         ` Konstantin Osipov
2018-07-11  7:57           ` Vladimir Davydov
2018-07-11 10:25             ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 03/23] vinyl: use vy_mem_iterator for point lookup Vladimir Davydov
2018-07-17 10:14     ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 04/23] vinyl: make point lookup always return the latest tuple version Vladimir Davydov
2018-07-10 16:19     ` Konstantin Osipov
2018-07-10 16:43       ` Vladimir Davydov
2018-07-11 16:33         ` Vladimir Davydov
2018-07-31 19:17           ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 05/23] vinyl: fold vy_replace_one and vy_replace_impl Vladimir Davydov
2018-07-31 20:28     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 06/23] vinyl: fold vy_delete_impl Vladimir Davydov
2018-07-31 20:28     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 07/23] vinyl: refactor unique check Vladimir Davydov
2018-07-31 20:28     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 08/23] vinyl: check key uniqueness before modifying tx write set Vladimir Davydov
2018-07-31 20:34     ` Konstantin Osipov
2018-08-01 10:42       ` Vladimir Davydov
2018-08-09 20:26     ` Konstantin Osipov
2018-08-10  8:26       ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 09/23] vinyl: remove env argument of vy_check_is_unique_{primary,secondary} Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 10/23] vinyl: store full tuples in secondary index cache Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 11/23] xrow: allow to store flags in DML requests Vladimir Davydov
2018-07-31 20:36     ` Konstantin Osipov
2018-08-01 14:10       ` Vladimir Davydov
2018-08-17 13:34         ` Vladimir Davydov
2018-08-17 13:34           ` [PATCH 1/2] xrow: allow to store tuple metadata in request Vladimir Davydov
2018-08-17 13:34           ` [PATCH 2/2] vinyl: introduce statement flags Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 12/23] vinyl: do not pass region explicitly to write iterator functions Vladimir Davydov
2018-07-17 10:16     ` Vladimir Davydov
2018-07-31 20:38     ` Konstantin Osipov
2018-08-01 14:14       ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 13/23] vinyl: fix potential use-after-free in vy_read_view_merge Vladimir Davydov
2018-07-17 10:16     ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 14/23] test: unit/vy_write_iterator: minor refactoring Vladimir Davydov
2018-07-17 10:17     ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 15/23] vinyl: teach write iterator to return overwritten tuples Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 16/23] vinyl: allow to skip certain statements on read Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 17/23] vinyl: do not free pending tasks on shutdown Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 18/23] vinyl: store pointer to scheduler in struct vy_task Vladimir Davydov
2018-07-31 20:39     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 19/23] vinyl: rename some members of vy_scheduler and vy_task struct Vladimir Davydov
2018-07-31 20:40     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 20/23] vinyl: use cbus for communication between scheduler and worker threads Vladimir Davydov
2018-07-31 20:43     ` Konstantin Osipov
2018-08-01 14:26       ` Vladimir Davydov
2018-07-08 16:48   ` [RFC PATCH 21/23] vinyl: zap vy_scheduler::is_worker_pool_running Vladimir Davydov
2018-07-31 20:43     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 22/23] vinyl: rename vy_task::status to is_failed Vladimir Davydov
2018-07-31 20:44     ` Konstantin Osipov
2018-07-08 16:48   ` [RFC PATCH 23/23] vinyl: eliminate read on REPLACE/DELETE Vladimir Davydov
2018-07-13 10:53     ` Vladimir Davydov
2018-07-13 10:53       ` [PATCH 1/3] stailq: add stailq_insert function Vladimir Davydov
2018-07-15  7:02         ` Konstantin Osipov
2018-07-15 13:17           ` Vladimir Davydov
2018-07-15 18:40             ` Konstantin Osipov
2018-07-17 10:18         ` Vladimir Davydov
2018-07-13 10:53       ` [PATCH 2/3] vinyl: link all indexes of the same space Vladimir Davydov
2018-07-13 10:53       ` [PATCH 3/3] vinyl: generate deferred DELETEs on tx commit Vladimir Davydov
