Tarantool development patches archive
* [PATCH v2 0/7] Join replicas off the current read view
@ 2019-08-19 16:53 Vladimir Davydov
  2019-08-19 16:53 ` [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime Vladimir Davydov
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC
  To: tarantool-patches

Currently, we join replicas off the last checkpoint. As a result, we
must keep all files corresponding to the last checkpoint. This means
that we must always create a memtx snapshot file on the initial call to
box.cfg() even though it is virtually the same for all instances.
Besides, we must rotate the vylog file synchronously with snapshot
creation, otherwise we wouldn't be able to pull all vinyl files
corresponding to the last checkpoint. This interconnection between
vylog and xlog makes the code difficult to maintain.

Actually, nothing prevents us from relaying the current read view
instead of the last checkpoint on initial join, as both memtx and
vinyl support a consistent read view. This patch set does exactly that.
This is a step towards making vylog independent of checkpointing
and WAL.

https://github.com/tarantool/tarantool/issues/1271
https://github.com/tarantool/tarantool/issues/4417
https://github.com/tarantool/tarantool/commits/dv/gh-1271-rework-replica-join

Changes in v2:
 - Commit preparatory patches approved by Kostja and rebase on
   the latest master branch.
 - Fix the issue with box.on_schema_init and space.before_replace
   instead of disabling the test (see #4417).

Vladimir Davydov (7):
  vinyl: don't pin index for iterator lifetime
  vinyl: don't exempt dropped indexes from dump and compaction
  vinyl: get rid of vy_env::join_lsn
  memtx: use ref counting to pin indexes for snapshot
  memtx: enter small delayed free mode from snapshot iterator
  space: get rid of apply_initial_join_row method
  relay: join new replicas off read view

 src/box/applier.cc                          |  32 +-
 src/box/blackhole.c                         |   4 +-
 src/box/box.cc                              |  33 +-
 src/box/engine.c                            |  21 -
 src/box/engine.h                            |  27 +-
 src/box/memtx_engine.c                      | 146 ++----
 src/box/memtx_engine.h                      |  28 +-
 src/box/memtx_hash.c                        |  18 +-
 src/box/memtx_space.c                       |  30 --
 src/box/memtx_tree.c                        |  19 +-
 src/box/relay.cc                            | 170 ++++++-
 src/box/relay.h                             |   2 +-
 src/box/space.c                             |   9 -
 src/box/space.h                             |  16 -
 src/box/sysview.c                           |   4 +-
 src/box/vinyl.c                             | 526 ++++----------------
 src/box/vy_lsm.c                            |   4 +
 src/box/vy_lsm.h                            |   7 +
 src/box/vy_run.c                            |   6 -
 src/box/vy_scheduler.c                      |  95 ++--
 src/box/vy_scheduler.h                      |  10 +-
 src/box/vy_tx.c                             |  12 +-
 src/box/vy_tx.h                             |   8 +
 src/box/vy_write_iterator.c                 |   8 +-
 src/lib/core/errinj.h                       |   2 +-
 test/box/errinj.result                      |  38 +-
 test/replication-py/cluster.result          |  13 -
 test/replication-py/cluster.test.py         |  25 -
 test/replication/join_without_snap.result   |  88 ++++
 test/replication/join_without_snap.test.lua |  32 ++
 test/replication/on_schema_init.result      |   6 +
 test/replication/on_schema_init.test.lua    |   3 +
 test/replication/suite.cfg                  |   1 +
 test/vinyl/errinj.result                    |   4 +-
 test/vinyl/errinj.test.lua                  |   4 +-
 test/xlog/panic_on_broken_lsn.result        |  31 +-
 test/xlog/panic_on_broken_lsn.test.lua      |  20 +-
 37 files changed, 610 insertions(+), 892 deletions(-)
 create mode 100644 test/replication/join_without_snap.result
 create mode 100644 test/replication/join_without_snap.test.lua

-- 
2.20.1


* [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:35   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:53 ` [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction Vladimir Davydov
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC
  To: tarantool-patches

vinyl_iterator keeps a reference to the LSM tree it was created for
until it is destroyed, which may take indefinitely long if the
iterator is used from Lua. Actually, we don't need to keep a reference to
the index for the whole iterator lifetime, because iterator_next()
wrapper guarantees that iterator->next won't be called for a dropped
index. What we need to do is keep a reference while we are yielding on
disk read, similarly to vinyl_index_get().

Currently, pinning an index indefinitely is harmless: an LSM tree is
exempted from dump/compaction as soon as it is dropped, so we just pin
some memory. However, the following patches are going to enable
dump/compaction for dropped but pinned indexes in order to implement
snapshot iterators, so we'd better relax the dependency of an iterator
on an index now.

While we are at it, let's remove the env and lsm members of the
vinyl_iterator struct: lsm can be accessed via the vy_read_iterator
embedded in the struct, while env is only needed to access iterator_pool,
so we'd better store a pointer to the pool in vinyl_iterator instead.
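
For illustration only, here is a minimal sketch of the "reference held
only across the yielding call" pattern described above. It is not the
vinyl code: struct lsm, lsm_ref(), lsm_unref() and read_with_yield() are
hypothetical stand-ins, and the "freed" flag replaces a real free() so
that the result can be inspected.

```c
#include <assert.h>

/* Hypothetical stand-in for an LSM tree: only a reference counter. */
struct lsm {
	int refs;
	/* Set when the last reference goes away (instead of free()). */
	int freed;
};

static void
lsm_ref(struct lsm *lsm)
{
	lsm->refs++;
}

static void
lsm_unref(struct lsm *lsm)
{
	assert(lsm->refs > 0);
	if (--lsm->refs == 0)
		lsm->freed = 1;
}

/*
 * Models a disk read that yields. While we are "yielding",
 * the index is dropped and its owner releases its reference.
 */
static int
read_with_yield(struct lsm *lsm)
{
	lsm_unref(lsm);
	return 0;
}

/* Pin the tree only for the duration of one next() call. */
static int
iterator_next(struct lsm *lsm)
{
	lsm_ref(lsm);
	int rc = read_with_yield(lsm);
	/* Thanks to our own reference the tree is still alive here. */
	int alive = !lsm->freed;
	lsm_unref(lsm);
	return (rc == 0 && alive) ? 0 : -1;
}
```

With one owner reference on entry, iterator_next() returns 0 and the
tree is released only by the final lsm_unref() inside it, not in the
middle of the read.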
---
 src/box/vinyl.c | 44 +++++++++++++++++++++++++++-----------------
 1 file changed, 27 insertions(+), 17 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 0dd73045..ee6b2728 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -156,10 +156,8 @@ vy_gc(struct vy_env *env, struct vy_recovery *recovery,
 
 struct vinyl_iterator {
 	struct iterator base;
-	/** Vinyl environment. */
-	struct vy_env *env;
-	/** LSM tree this iterator is for. */
-	struct vy_lsm *lsm;
+	/** Memory pool the iterator was allocated from. */
+	struct mempool *pool;
 	/**
 	 * Points either to tx_autocommit for autocommit mode
 	 * or to a multi-statement transaction active when the
@@ -3730,8 +3728,6 @@ static void
 vinyl_iterator_close(struct vinyl_iterator *it)
 {
 	vy_read_iterator_close(&it->iterator);
-	vy_lsm_unref(it->lsm);
-	it->lsm = NULL;
 	tuple_unref(it->key.stmt);
 	it->key = vy_entry_none();
 	if (it->tx == &it->tx_autocommit) {
@@ -3804,10 +3800,17 @@ vinyl_iterator_primary_next(struct iterator *base, struct tuple **ret)
 
 	assert(base->next = vinyl_iterator_primary_next);
 	struct vinyl_iterator *it = (struct vinyl_iterator *)base;
-	assert(it->lsm->index_id == 0);
+	struct vy_lsm *lsm = it->iterator.lsm;
+	assert(lsm->index_id == 0);
+	/*
+	 * Make sure the LSM tree isn't deleted while we are
+	 * reading from it.
+	 */
+	vy_lsm_ref(lsm);
 
 	if (vinyl_iterator_check_tx(it) != 0)
 		goto fail;
+
 	struct vy_entry entry;
 	if (vy_read_iterator_next(&it->iterator, &entry) != 0)
 		goto fail;
@@ -3820,9 +3823,11 @@ vinyl_iterator_primary_next(struct iterator *base, struct tuple **ret)
 		tuple_bless(entry.stmt);
 	}
 	*ret = entry.stmt;
+	vy_lsm_unref(lsm);
 	return 0;
 fail:
 	vinyl_iterator_close(it);
+	vy_lsm_unref(lsm);
 	return -1;
 }
 
@@ -3833,9 +3838,15 @@ vinyl_iterator_secondary_next(struct iterator *base, struct tuple **ret)
 
 	assert(base->next = vinyl_iterator_secondary_next);
 	struct vinyl_iterator *it = (struct vinyl_iterator *)base;
-	assert(it->lsm->index_id > 0);
-	struct vy_entry partial, entry;
+	struct vy_lsm *lsm = it->iterator.lsm;
+	assert(lsm->index_id > 0);
+	/*
+	 * Make sure the LSM tree isn't deleted while we are
+	 * reading from it.
+	 */
+	vy_lsm_ref(lsm);
 
+	struct vy_entry partial, entry;
 next:
 	if (vinyl_iterator_check_tx(it) != 0)
 		goto fail;
@@ -3849,12 +3860,11 @@ next:
 		vinyl_iterator_account_read(it, start_time, NULL);
 		vinyl_iterator_close(it);
 		*ret = NULL;
-		return 0;
+		goto out;
 	}
 	ERROR_INJECT_YIELD(ERRINJ_VY_DELAY_PK_LOOKUP);
 	/* Get the full tuple from the primary index. */
-	if (vy_get_by_secondary_tuple(it->lsm, it->tx,
-				      vy_tx_read_view(it->tx),
+	if (vy_get_by_secondary_tuple(lsm, it->tx, vy_tx_read_view(it->tx),
 				      partial, &entry) != 0)
 		goto fail;
 	if (entry.stmt == NULL)
@@ -3864,9 +3874,12 @@ next:
 	*ret = entry.stmt;
 	tuple_bless(*ret);
 	tuple_unref(*ret);
+out:
+	vy_lsm_unref(lsm);
 	return 0;
 fail:
 	vinyl_iterator_close(it);
+	vy_lsm_unref(lsm);
 	return -1;
 }
 
@@ -3877,7 +3890,7 @@ vinyl_iterator_free(struct iterator *base)
 	struct vinyl_iterator *it = (struct vinyl_iterator *)base;
 	if (base->next != vinyl_iterator_last)
 		vinyl_iterator_close(it);
-	mempool_free(&it->env->iterator_pool, it);
+	mempool_free(it->pool, it);
 }
 
 static struct iterator *
@@ -3918,10 +3931,7 @@ vinyl_index_create_iterator(struct index *base, enum iterator_type type,
 	else
 		it->base.next = vinyl_iterator_secondary_next;
 	it->base.free = vinyl_iterator_free;
-
-	it->env = env;
-	it->lsm = lsm;
-	vy_lsm_ref(lsm);
+	it->pool = &env->iterator_pool;
 
 	if (tx != NULL) {
 		/*
-- 
2.20.1


* [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
  2019-08-19 16:53 ` [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:47   ` [tarantool-patches] " Konstantin Osipov
  2019-08-20 14:16   ` Vladimir Davydov
  2019-08-19 16:53 ` [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn Vladimir Davydov
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC
  To: tarantool-patches

We remove an LSM tree from the scheduler queues as soon as it is
dropped, even though the tree may hang around for a while after
that, e.g. because it is pinned by an iterator. As a result, once
an index is dropped, it won't be dumped anymore - its memory level
will simply disappear without a trace. This is okay for now, but
to implement snapshot iterators we must make sure that an index
will stay valid as long as there's an iterator that references it.

That said, let's delay removal of an index from the scheduler queues
until it is about to be destroyed.
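
A simplified sketch of the idea, for illustration: the scheduler
registers a destroy callback when a tree is added, and the callback
removes the tree from the queue once the last reference is gone. All
names here (struct tree, struct destroy_trigger, scheduler_add(),
tree_unref()) are hypothetical; the patch itself uses the generic
trigger API and the dump/compaction heaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct tree;

/* Hypothetical minimal trigger: a callback plus opaque data. */
struct destroy_trigger {
	void (*run)(struct destroy_trigger *trigger, struct tree *tree);
	void *data;
	struct destroy_trigger *next;
};

struct tree {
	int refs;
	/* Non-zero while the tree sits in the scheduler queue. */
	int in_queue;
	/* Callbacks to run right before destruction. */
	struct destroy_trigger *on_destroy;
};

struct scheduler {
	/* Number of trees currently queued. */
	int queued;
};

static void
scheduler_on_destroy(struct destroy_trigger *trigger, struct tree *tree)
{
	struct scheduler *scheduler = trigger->data;
	assert(tree->in_queue);
	tree->in_queue = 0;
	scheduler->queued--;
	free(trigger);
}

static int
scheduler_add(struct scheduler *scheduler, struct tree *tree)
{
	struct destroy_trigger *trigger = malloc(sizeof(*trigger));
	if (trigger == NULL)
		return -1;
	trigger->run = scheduler_on_destroy;
	trigger->data = scheduler;
	trigger->next = tree->on_destroy;
	tree->on_destroy = trigger;
	tree->in_queue = 1;
	scheduler->queued++;
	return 0;
}

static void
tree_unref(struct tree *tree)
{
	assert(tree->refs > 0);
	if (--tree->refs > 0)
		return;
	/* Last reference dropped: run destroy callbacks. */
	struct destroy_trigger *trigger = tree->on_destroy;
	tree->on_destroy = NULL;
	while (trigger != NULL) {
		struct destroy_trigger *next = trigger->next;
		trigger->run(trigger, tree);
		trigger = next;
	}
}
```

The point of the design is that dropping an index no longer touches the
scheduler at all; a tree pinned by an iterator stays queued, and hence
dumpable, until the iterator lets go of it.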
---
 src/box/vinyl.c        | 16 +------
 src/box/vy_lsm.c       |  4 ++
 src/box/vy_lsm.h       |  7 ++++
 src/box/vy_scheduler.c | 95 ++++++++++++++++--------------------------
 src/box/vy_scheduler.h | 10 ++---
 5 files changed, 52 insertions(+), 80 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index ee6b2728..9e93153b 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -766,18 +766,8 @@ vinyl_index_open(struct index *index)
 	/*
 	 * Add the new LSM tree to the scheduler so that it can
 	 * be dumped and compacted.
-	 *
-	 * Note, during local recovery an LSM tree may be marked
-	 * as dropped, which means that it will be dropped before
-	 * recovery is complete. In this case there's no need in
-	 * letting the scheduler know about it.
 	 */
-	if (!lsm->is_dropped)
-		vy_scheduler_add_lsm(&env->scheduler, lsm);
-	else
-		assert(env->status == VINYL_INITIAL_RECOVERY_LOCAL ||
-		       env->status == VINYL_FINAL_RECOVERY_LOCAL);
-	return 0;
+	return vy_scheduler_add_lsm(&env->scheduler, lsm);
 }
 
 static void
@@ -856,8 +846,6 @@ vinyl_index_abort_create(struct index *index)
 		return;
 	}
 
-	vy_scheduler_remove_lsm(&env->scheduler, lsm);
-
 	lsm->is_dropped = true;
 
 	vy_log_tx_begin();
@@ -911,8 +899,6 @@ vinyl_index_commit_drop(struct index *index, int64_t lsn)
 	if (env->status == VINYL_FINAL_RECOVERY_LOCAL && lsm->is_dropped)
 		return;
 
-	vy_scheduler_remove_lsm(&env->scheduler, lsm);
-
 	lsm->is_dropped = true;
 
 	vy_log_tx_begin();
diff --git a/src/box/vy_lsm.c b/src/box/vy_lsm.c
index 8fba1792..aa4bce9e 100644
--- a/src/box/vy_lsm.c
+++ b/src/box/vy_lsm.c
@@ -45,6 +45,7 @@
 #include "say.h"
 #include "schema.h"
 #include "tuple.h"
+#include "trigger.h"
 #include "vy_log.h"
 #include "vy_mem.h"
 #include "vy_range.h"
@@ -207,6 +208,7 @@ vy_lsm_new(struct vy_lsm_env *lsm_env, struct vy_cache_env *cache_env,
 	lsm->group_id = group_id;
 	lsm->opts = index_def->opts;
 	vy_lsm_read_set_new(&lsm->read_set);
+	rlist_create(&lsm->on_destroy);
 
 	lsm_env->lsm_count++;
 	return lsm;
@@ -244,6 +246,8 @@ vy_range_tree_free_cb(vy_range_tree_t *t, struct vy_range *range, void *arg)
 void
 vy_lsm_delete(struct vy_lsm *lsm)
 {
+	trigger_run(&lsm->on_destroy, lsm);
+
 	assert(heap_node_is_stray(&lsm->in_dump));
 	assert(heap_node_is_stray(&lsm->in_compaction));
 	assert(vy_lsm_read_set_empty(&lsm->read_set));
diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index c8b0e297..47f8ee6a 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -312,6 +312,13 @@ struct vy_lsm {
 	 * this LSM tree.
 	 */
 	vy_lsm_read_set_t read_set;
+	/**
+	 * Triggers run when the last reference to this LSM tree
+	 * is dropped and the LSM tree is about to be destroyed.
+	 * A pointer to this LSM tree is passed to the trigger
+	 * callback in the 'event' argument.
+	 */
+	struct rlist on_destroy;
 };
 
 /** Extract vy_lsm from an index object. */
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index f3bded20..ee361c31 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -510,35 +510,47 @@ vy_scheduler_reset_stat(struct vy_scheduler *scheduler)
 	stat->compaction_output = 0;
 }
 
-void
+static void
+vy_scheduler_on_delete_lsm(struct trigger *trigger, void *event)
+{
+	struct vy_lsm *lsm = event;
+	struct vy_scheduler *scheduler = trigger->data;
+	assert(! heap_node_is_stray(&lsm->in_dump));
+	assert(! heap_node_is_stray(&lsm->in_compaction));
+	vy_dump_heap_delete(&scheduler->dump_heap, lsm);
+	vy_compaction_heap_delete(&scheduler->compaction_heap, lsm);
+	trigger_clear(trigger);
+	free(trigger);
+}
+
+int
 vy_scheduler_add_lsm(struct vy_scheduler *scheduler, struct vy_lsm *lsm)
 {
-	assert(!lsm->is_dropped);
 	assert(heap_node_is_stray(&lsm->in_dump));
 	assert(heap_node_is_stray(&lsm->in_compaction));
+	/*
+	 * Register a trigger that will remove this LSM tree from
+	 * the scheduler queues on destruction.
+	 */
+	struct trigger *trigger = malloc(sizeof(*trigger));
+	if (trigger == NULL) {
+		diag_set(OutOfMemory, sizeof(*trigger), "malloc", "trigger");
+		return -1;
+	}
+	trigger_create(trigger, vy_scheduler_on_delete_lsm, scheduler, NULL);
+	trigger_add(&lsm->on_destroy, trigger);
+	/*
+	 * Add this LSM tree to the scheduler queues so that it
+	 * can be dumped and compacted in a timely manner.
+	 */
 	vy_dump_heap_insert(&scheduler->dump_heap, lsm);
 	vy_compaction_heap_insert(&scheduler->compaction_heap, lsm);
-}
-
-void
-vy_scheduler_remove_lsm(struct vy_scheduler *scheduler, struct vy_lsm *lsm)
-{
-	assert(!lsm->is_dropped);
-	assert(! heap_node_is_stray(&lsm->in_dump));
-	assert(! heap_node_is_stray(&lsm->in_compaction));
-	vy_dump_heap_delete(&scheduler->dump_heap, lsm);
-	vy_compaction_heap_delete(&scheduler->compaction_heap, lsm);
+	return 0;
 }
 
 static void
 vy_scheduler_update_lsm(struct vy_scheduler *scheduler, struct vy_lsm *lsm)
 {
-	if (lsm->is_dropped) {
-		/* Dropped LSM trees are exempted from scheduling. */
-		assert(heap_node_is_stray(&lsm->in_dump));
-		assert(heap_node_is_stray(&lsm->in_compaction));
-		return;
-	}
 	assert(! heap_node_is_stray(&lsm->in_dump));
 	assert(! heap_node_is_stray(&lsm->in_compaction));
 	vy_dump_heap_update(&scheduler->dump_heap, lsm);
@@ -1267,15 +1279,9 @@ vy_task_dump_abort(struct vy_task *task)
 	/* The iterator has been cleaned up in a worker thread. */
 	task->wi->iface->close(task->wi);
 
-	/*
-	 * It's no use alerting the user if the server is
-	 * shutting down or the LSM tree was dropped.
-	 */
-	if (!lsm->is_dropped) {
-		struct error *e = diag_last_error(&task->diag);
-		error_log(e);
-		say_error("%s: dump failed", vy_lsm_name(lsm));
-	}
+	struct error *e = diag_last_error(&task->diag);
+	error_log(e);
+	say_error("%s: dump failed", vy_lsm_name(lsm));
 
 	vy_run_discard(task->new_run);
 
@@ -1287,18 +1293,6 @@ vy_task_dump_abort(struct vy_task *task)
 
 	assert(scheduler->dump_task_count > 0);
 	scheduler->dump_task_count--;
-
-	/*
-	 * If the LSM tree was dropped during dump, we abort
-	 * the dump task, but we should still poke the scheduler
-	 * to check if the current dump round is complete.
-	 * If we don't and this LSM tree happens to be the last
-	 * one of the current generation, the scheduler will
-	 * never be notified about dump completion and hence
-	 * memory will never be released.
-	 */
-	if (lsm->is_dropped)
-		vy_scheduler_complete_dump(scheduler);
 }
 
 /**
@@ -1317,7 +1311,6 @@ vy_task_dump_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 		.abort = vy_task_dump_abort,
 	};
 
-	assert(!lsm->is_dropped);
 	assert(!lsm->is_dumping);
 	assert(lsm->pin_count == 0);
 	assert(vy_lsm_generation(lsm) == scheduler->dump_generation);
@@ -1602,16 +1595,10 @@ vy_task_compaction_abort(struct vy_task *task)
 	/* The iterator has been cleaned up in worker. */
 	task->wi->iface->close(task->wi);
 
-	/*
-	 * It's no use alerting the user if the server is
-	 * shutting down or the LSM tree was dropped.
-	 */
-	if (!lsm->is_dropped) {
-		struct error *e = diag_last_error(&task->diag);
-		error_log(e);
-		say_error("%s: failed to compact range %s",
-			  vy_lsm_name(lsm), vy_range_str(range));
-	}
+	struct error *e = diag_last_error(&task->diag);
+	error_log(e);
+	say_error("%s: failed to compact range %s",
+		  vy_lsm_name(lsm), vy_range_str(range));
 
 	vy_run_discard(task->new_run);
 
@@ -1629,7 +1616,6 @@ vy_task_compaction_new(struct vy_scheduler *scheduler, struct vy_worker *worker,
 		.complete = vy_task_compaction_complete,
 		.abort = vy_task_compaction_abort,
 	};
-	assert(!lsm->is_dropped);
 
 	struct vy_range *range = vy_range_heap_top(&lsm->range_heap);
 	assert(range != NULL);
@@ -1945,12 +1931,6 @@ vy_task_complete(struct vy_task *task)
 	assert(scheduler->stat.tasks_inprogress > 0);
 	scheduler->stat.tasks_inprogress--;
 
-	if (task->lsm->is_dropped) {
-		if (task->ops->abort)
-			task->ops->abort(task);
-		goto out;
-	}
-
 	struct diag *diag = &task->diag;
 	if (task->is_failed) {
 		assert(!diag_is_empty(diag));
@@ -1967,7 +1947,6 @@ vy_task_complete(struct vy_task *task)
 		diag_move(diag_get(), diag);
 		goto fail;
 	}
-out:
 	scheduler->stat.tasks_completed++;
 	return 0;
 fail:
diff --git a/src/box/vy_scheduler.h b/src/box/vy_scheduler.h
index 2d4352d7..bc953975 100644
--- a/src/box/vy_scheduler.h
+++ b/src/box/vy_scheduler.h
@@ -194,16 +194,12 @@ vy_scheduler_reset_stat(struct vy_scheduler *scheduler);
 
 /**
  * Add an LSM tree to scheduler dump/compaction queues.
+ * When the LSM tree is destroyed, it will be removed
+ * from the queues automatically.
  */
-void
+int
 vy_scheduler_add_lsm(struct vy_scheduler *, struct vy_lsm *);
 
-/**
- * Remove an LSM tree from scheduler dump/compaction queues.
- */
-void
-vy_scheduler_remove_lsm(struct vy_scheduler *, struct vy_lsm *);
-
 /**
  * Trigger dump of all currently existing in-memory trees.
  */
-- 
2.20.1


* [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
  2019-08-19 16:53 ` [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime Vladimir Davydov
  2019-08-19 16:53 ` [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:49   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:53 ` [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot Vladimir Davydov
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC
  To: tarantool-patches

This fake LSN counter, which is used for assigning LSNs to Vinyl
statements during the initial join stage, was introduced a long time
ago, when LSNs were used as identifiers for lsregion allocations and
hence were supposed to grow strictly monotonically with each new
transaction. Later on, they were reused for assigning unique LSNs to
identify indexes in vylog.

These days, we don't need initial join LSNs to be unique, as we switched
to generations for lsregion allocations while in vylog we now use LSNs
only as an incarnation counter, not as a unique identifier. That said,
let's zap vy_env::join_lsn and simply assign 0 to all statements
received during the initial join stage.

To achieve that, we just need to relax an assertion in vy_tx_commit()
and remove the assumption that an LSN can't be zero in the write
iterator implementation.
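
To see why 0 can no longer serve as the "older than everything" bound
once statements may carry LSN 0, consider the toy model below. It is
only an illustration of the sentinel choice, not the write iterator's
merge logic: struct read_views and belongs_to_read_view() are
hypothetical, and read views are assumed to be sorted newest first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: read view LSNs sorted in descending order. */
struct read_views {
	const int64_t *vlsn;
	int count;
};

/*
 * Return the VLSN of read view i, or -1 if i is past the oldest
 * read view. -1 is below any valid LSN, including 0.
 */
static int64_t
get_vlsn(const struct read_views *rv, int i)
{
	if (i >= rv->count)
		return -1;
	return rv->vlsn[i];
}

/* True if lsn falls into the interval (vlsn[i + 1], vlsn[i]]. */
static int
belongs_to_read_view(const struct read_views *rv, int i, int64_t lsn)
{
	return lsn <= get_vlsn(rv, i) && lsn > get_vlsn(rv, i + 1);
}
```

In this model, a statement with LSN 0 lands in the oldest interval only
because the lower bound is -1; with a lower bound of 0 the strict
comparison would reject it.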
---
 src/box/vinyl.c             | 24 ++----------------------
 src/box/vy_tx.c             |  2 +-
 src/box/vy_write_iterator.c |  8 ++++----
 3 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 9e93153b..80ed00a1 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -123,17 +123,6 @@ struct vy_env {
 	struct vy_recovery *recovery;
 	/** Local recovery vclock. */
 	const struct vclock *recovery_vclock;
-	/**
-	 * LSN to assign to the next statement received during
-	 * initial join.
-	 *
-	 * We can't use original statements' LSNs, because we
-	 * send statements not in the chronological order while
-	 * the receiving end expects LSNs to grow monotonically
-	 * due to the design of the lsregion allocator, which is
-	 * used for storing statements in memory.
-	 */
-	int64_t join_lsn;
 	/** Path to the data directory. */
 	char *path;
 	/** Max time a transaction may wait for memory. */
@@ -792,15 +781,6 @@ vinyl_index_commit_create(struct index *index, int64_t lsn)
 			return;
 	}
 
-	if (env->status == VINYL_INITIAL_RECOVERY_REMOTE) {
-		/*
-		 * Records received during initial join do not
-		 * have LSNs so we use a fake one to identify
-		 * the index in vylog.
-		 */
-		lsn = ++env->join_lsn;
-	}
-
 	/*
 	 * Backward compatibility fixup: historically, we used
 	 * box.info.signature for LSN of index creation, which
@@ -3023,7 +3003,7 @@ vy_send_range_f(struct cbus_call_msg *cmsg)
 			break;
 		/*
 		 * Reset the LSN as the replica will ignore it
-		 * anyway - see comment to vy_env::join_lsn.
+		 * anyway.
 		 */
 		xrow.lsn = 0;
 		rc = xstream_write(ctx->stream, &xrow);
@@ -3269,7 +3249,7 @@ vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
 
 	rc = vy_tx_prepare(tx);
 	if (rc == 0)
-		vy_tx_commit(tx, ++env->join_lsn);
+		vy_tx_commit(tx, 0);
 	else
 		vy_tx_rollback(tx);
 
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index 9b300fde..1a5d4837 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -804,7 +804,7 @@ vy_tx_commit(struct vy_tx *tx, int64_t lsn)
 	if (vy_tx_is_ro(tx))
 		goto out;
 
-	assert(xm->lsn < lsn);
+	assert(xm->lsn <= lsn);
 	xm->lsn = lsn;
 
 	/* Fix LSNs of the records and commit changes. */
diff --git a/src/box/vy_write_iterator.c b/src/box/vy_write_iterator.c
index e7bb6f06..e5ed4e42 100644
--- a/src/box/vy_write_iterator.c
+++ b/src/box/vy_write_iterator.c
@@ -511,7 +511,7 @@ vy_write_iterator_merge_step(struct vy_write_iterator *stream)
  * Try to get VLSN of the read view with the specified number in
  * the vy_write_iterator.read_views array.
  * If the requested read view is older than all existing ones,
- * return 0, as the oldest possible VLSN.
+ * return -1, which is less than any possible VLSN.
  *
  * @param stream Write iterator.
  * @param current_rv_i Index of the read view.
@@ -522,7 +522,7 @@ static inline int64_t
 vy_write_iterator_get_vlsn(struct vy_write_iterator *stream, int rv_i)
 {
 	if (rv_i >= stream->rv_count)
-		return 0;
+		return -1;
 	return stream->read_views[rv_i].vlsn;
 }
 
@@ -753,8 +753,8 @@ vy_write_iterator_build_history(struct vy_write_iterator *stream,
 		 * and other optimizations.
 		 */
 		if (vy_stmt_type(src->entry.stmt) == IPROTO_DELETE &&
-		    stream->is_last_level && merge_until_lsn == 0) {
-			current_rv_lsn = 0; /* Force skip */
+		    stream->is_last_level && merge_until_lsn < 0) {
+			current_rv_lsn = -1; /* Force skip */
 			goto next_lsn;
 		}
 
-- 
2.20.1


* [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (2 preceding siblings ...)
  2019-08-19 16:53 ` [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:50   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:53 ` [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator Vladimir Davydov
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC
  To: tarantool-patches

Currently, to prevent an index from going away while it is being
written to a snapshot, we postpone memtx_gc_task's free() invocation
until checkpointing is complete, see commit 94de0a081b3a ("Don't take
schema lock for checkpointing"). This works fine, but makes it rather
difficult to reuse snapshot iterators for other purposes, e.g. feeding
a consistent read view to a newly joined replica.

Let's instead use index reference counting for pinning indexes for
checkpointing. A reference is taken in a snapshot iterator constructor
and released when the snapshot iterator is destroyed.
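
Reduced to its essence, the constructor/destructor pairing looks like
the sketch below. The types and functions are hypothetical stand-ins
(struct pinned_index, pinned_index_ref(), and so on) for the real index
and snapshot iterator objects, and a "destroyed" flag is used instead
of actually freeing the index.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted index. */
struct pinned_index {
	int refs;
	int destroyed;
};

static void
pinned_index_ref(struct pinned_index *index)
{
	index->refs++;
}

static void
pinned_index_unref(struct pinned_index *index)
{
	assert(index->refs > 0);
	if (--index->refs == 0)
		index->destroyed = 1;
}

/* Hypothetical snapshot iterator that pins its index. */
struct snap_iterator {
	struct pinned_index *index;
};

static struct snap_iterator *
snap_iterator_new(struct pinned_index *index)
{
	struct snap_iterator *it = malloc(sizeof(*it));
	if (it == NULL)
		return NULL;
	it->index = index;
	/* The index stays alive for as long as the iterator exists. */
	pinned_index_ref(index);
	return it;
}

static void
snap_iterator_free(struct snap_iterator *it)
{
	pinned_index_unref(it->index);
	free(it);
}
```

Since the pin now travels with the iterator, whoever creates a snapshot
iterator (checkpointing today, a relay later) gets the lifetime
guarantee without any engine-wide bookkeeping.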
---
 src/box/memtx_engine.c | 26 +-------------------------
 src/box/memtx_engine.h |  5 -----
 src/box/memtx_hash.c   | 15 +++++++++------
 src/box/memtx_tree.c   | 16 +++++++++-------
 4 files changed, 19 insertions(+), 43 deletions(-)

diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index 7c3dd846..c18177db 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -649,19 +649,6 @@ memtx_engine_wait_checkpoint(struct engine *engine,
 	return result;
 }
 
-/**
- * Called after checkpointing is complete to free indexes dropped
- * while checkpointing was in progress, see memtx_engine_run_gc().
- */
-static void
-memtx_engine_gc_after_checkpoint(struct memtx_engine *memtx)
-{
-	struct memtx_gc_task *task, *next;
-	stailq_foreach_entry_safe(task, next, &memtx->gc_to_free, link)
-		task->vtab->free(task);
-	stailq_create(&memtx->gc_to_free);
-}
-
 static void
 memtx_engine_commit_checkpoint(struct engine *engine,
 			       const struct vclock *vclock)
@@ -699,8 +686,6 @@ memtx_engine_commit_checkpoint(struct engine *engine,
 
 	checkpoint_delete(memtx->checkpoint);
 	memtx->checkpoint = NULL;
-
-	memtx_engine_gc_after_checkpoint(memtx);
 }
 
 static void
@@ -885,15 +870,7 @@ memtx_engine_run_gc(struct memtx_engine *memtx, bool *stop)
 	task->vtab->run(task, &task_done);
 	if (task_done) {
 		stailq_shift(&memtx->gc_queue);
-		/*
-		 * If checkpointing is in progress, the index may be
-		 * used by the checkpoint thread so we postpone freeing
-		 * until checkpointing is complete.
-		 */
-		if (memtx->checkpoint == NULL)
-			task->vtab->free(task);
-		else
-			stailq_add_entry(&memtx->gc_to_free, task, link);
+		task->vtab->free(task);
 	}
 }
 
@@ -965,7 +942,6 @@ memtx_engine_new(const char *snap_dirname, bool force_recovery,
 	}
 
 	stailq_create(&memtx->gc_queue);
-	stailq_create(&memtx->gc_to_free);
 	memtx->gc_fiber = fiber_new("memtx.gc", memtx_engine_gc_f);
 	if (memtx->gc_fiber == NULL)
 		goto fail;
diff --git a/src/box/memtx_engine.h b/src/box/memtx_engine.h
index fcf595e7..ccb51678 100644
--- a/src/box/memtx_engine.h
+++ b/src/box/memtx_engine.h
@@ -155,11 +155,6 @@ struct memtx_engine {
 	 * memtx_gc_task::link.
 	 */
 	struct stailq gc_queue;
-	/**
-	 * List of tasks awaiting to be freed once checkpointing
-	 * is complete, linked by memtx_gc_task::link.
-	 */
-	struct stailq gc_to_free;
 };
 
 struct memtx_gc_task;
diff --git a/src/box/memtx_hash.c b/src/box/memtx_hash.c
index b53f115c..920f1032 100644
--- a/src/box/memtx_hash.c
+++ b/src/box/memtx_hash.c
@@ -399,7 +399,7 @@ memtx_hash_index_create_iterator(struct index *base, enum iterator_type type,
 
 struct hash_snapshot_iterator {
 	struct snapshot_iterator base;
-	struct light_index_core *hash_table;
+	struct memtx_hash_index *index;
 	struct light_index_iterator iterator;
 };
 
@@ -414,7 +414,8 @@ hash_snapshot_iterator_free(struct snapshot_iterator *iterator)
 	assert(iterator->free == hash_snapshot_iterator_free);
 	struct hash_snapshot_iterator *it =
 		(struct hash_snapshot_iterator *) iterator;
-	light_index_iterator_destroy(it->hash_table, &it->iterator);
+	light_index_iterator_destroy(&it->index->hash_table, &it->iterator);
+	index_unref(&it->index->base);
 	free(iterator);
 }
 
@@ -430,7 +431,8 @@ hash_snapshot_iterator_next(struct snapshot_iterator *iterator,
 	assert(iterator->free == hash_snapshot_iterator_free);
 	struct hash_snapshot_iterator *it =
 		(struct hash_snapshot_iterator *) iterator;
-	struct tuple **res = light_index_iterator_get_and_next(it->hash_table,
+	struct light_index_core *hash_table = &it->index->hash_table;
+	struct tuple **res = light_index_iterator_get_and_next(hash_table,
 							       &it->iterator);
 	if (res == NULL) {
 		*data = NULL;
@@ -459,9 +461,10 @@ memtx_hash_index_create_snapshot_iterator(struct index *base)
 
 	it->base.next = hash_snapshot_iterator_next;
 	it->base.free = hash_snapshot_iterator_free;
-	it->hash_table = &index->hash_table;
-	light_index_iterator_begin(it->hash_table, &it->iterator);
-	light_index_iterator_freeze(it->hash_table, &it->iterator);
+	it->index = index;
+	index_ref(base);
+	light_index_iterator_begin(&index->hash_table, &it->iterator);
+	light_index_iterator_freeze(&index->hash_table, &it->iterator);
 	return (struct snapshot_iterator *) it;
 }
 
diff --git a/src/box/memtx_tree.c b/src/box/memtx_tree.c
index cbd888c5..831a2715 100644
--- a/src/box/memtx_tree.c
+++ b/src/box/memtx_tree.c
@@ -1205,7 +1205,7 @@ memtx_tree_index_end_build(struct index *base)
 
 struct tree_snapshot_iterator {
 	struct snapshot_iterator base;
-	struct memtx_tree *tree;
+	struct memtx_tree_index *index;
 	struct memtx_tree_iterator tree_iterator;
 };
 
@@ -1215,8 +1215,8 @@ tree_snapshot_iterator_free(struct snapshot_iterator *iterator)
 	assert(iterator->free == tree_snapshot_iterator_free);
 	struct tree_snapshot_iterator *it =
 		(struct tree_snapshot_iterator *)iterator;
-	struct memtx_tree *tree = (struct memtx_tree *)it->tree;
-	memtx_tree_iterator_destroy(tree, &it->tree_iterator);
+	memtx_tree_iterator_destroy(&it->index->tree, &it->tree_iterator);
+	index_unref(&it->index->base);
 	free(iterator);
 }
 
@@ -1227,13 +1227,14 @@ tree_snapshot_iterator_next(struct snapshot_iterator *iterator,
 	assert(iterator->free == tree_snapshot_iterator_free);
 	struct tree_snapshot_iterator *it =
 		(struct tree_snapshot_iterator *)iterator;
-	struct memtx_tree_data *res =
-		memtx_tree_iterator_get_elem(it->tree, &it->tree_iterator);
+	struct memtx_tree *tree = &it->index->tree;
+	struct memtx_tree_data *res = memtx_tree_iterator_get_elem(tree,
+							&it->tree_iterator);
 	if (res == NULL) {
 		*data = NULL;
 		return 0;
 	}
-	memtx_tree_iterator_next(it->tree, &it->tree_iterator);
+	memtx_tree_iterator_next(tree, &it->tree_iterator);
 	*data = tuple_data_range(res->tuple, size);
 	return 0;
 }
@@ -1257,7 +1258,8 @@ memtx_tree_index_create_snapshot_iterator(struct index *base)
 
 	it->base.free = tree_snapshot_iterator_free;
 	it->base.next = tree_snapshot_iterator_next;
-	it->tree = &index->tree;
+	it->index = index;
+	index_ref(base);
 	it->tree_iterator = memtx_tree_iterator_first(&index->tree);
 	memtx_tree_iterator_freeze(&index->tree, &it->tree_iterator);
 	return (struct snapshot_iterator *) it;
-- 
2.20.1

* [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (3 preceding siblings ...)
  2019-08-19 16:53 ` [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:51   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:53 ` [PATCH v2 6/7] space: get rid of apply_initial_join_row method Vladimir Davydov
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC (permalink / raw)
  To: tarantool-patches

We must enable SMALL_DELAYED_FREE_MODE to safely use a memtx snapshot
iterator. Currently, we do that in checkpoint-related callbacks, but if
we want to reuse snapshot iterators for other purposes, e.g. feeding
a read view to a newly joined replica, we had better hide this code
behind the snapshot iterator constructors.
---
 src/box/memtx_engine.c | 24 ++++++++++++++++--------
 src/box/memtx_engine.h | 23 +++++++++++++++++++++++
 src/box/memtx_hash.c   |  3 +++
 src/box/memtx_tree.c   |  3 +++
 4 files changed, 45 insertions(+), 8 deletions(-)
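
For readers skimming the diff below, here is a minimal standalone sketch of the counting scheme that memtx_enter_delayed_free_mode() and memtx_leave_delayed_free_mode() rely on. It is an illustration only: struct engine_sketch, the sketch_* functions and the allocator_delays_free flag are hypothetical stand-ins for struct memtx_engine and the small allocator option, and the snapshot_version bump is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct memtx_engine used here. */
struct engine_sketch {
	/* Number of holders (e.g. live snapshot iterators) of the mode. */
	uint32_t delayed_free_mode;
	/* Stand-in for the allocator's SMALL_DELAYED_FREE_MODE option. */
	bool allocator_delays_free;
};

void
sketch_enter_delayed_free_mode(struct engine_sketch *e)
{
	/* Only the first holder switches the allocator option on. */
	if (e->delayed_free_mode++ == 0)
		e->allocator_delays_free = true;
}

void
sketch_leave_delayed_free_mode(struct engine_sketch *e)
{
	assert(e->delayed_free_mode > 0);
	/* Only the last holder switches the allocator option off. */
	if (--e->delayed_free_mode == 0)
		e->allocator_delays_free = false;
}
```

With a counter instead of a plain on/off switch, a checkpoint and a replica join can hold snapshot iterators at the same time without one of them turning delayed free off under the other.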

diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index c18177db..ea197cad 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -610,10 +610,6 @@ memtx_engine_begin_checkpoint(struct engine *engine)
 		memtx->checkpoint = NULL;
 		return -1;
 	}
-
-	/* increment snapshot version; set tuple deletion to delayed mode */
-	memtx->snapshot_version++;
-	small_alloc_setopt(&memtx->alloc, SMALL_DELAYED_FREE_MODE, true);
 	return 0;
 }
 
@@ -661,8 +657,6 @@ memtx_engine_commit_checkpoint(struct engine *engine,
 	/* waitCheckpoint() must have been done. */
 	assert(!memtx->checkpoint->waiting_for_snap_thread);
 
-	small_alloc_setopt(&memtx->alloc, SMALL_DELAYED_FREE_MODE, false);
-
 	if (!memtx->checkpoint->touch) {
 		int64_t lsn = vclock_sum(&memtx->checkpoint->vclock);
 		struct xdir *dir = &memtx->checkpoint->dir;
@@ -703,8 +697,6 @@ memtx_engine_abort_checkpoint(struct engine *engine)
 		memtx->checkpoint->waiting_for_snap_thread = false;
 	}
 
-	small_alloc_setopt(&memtx->alloc, SMALL_DELAYED_FREE_MODE, false);
-
 	/** Remove garbage .inprogress file. */
 	const char *filename =
 		xdir_format_filename(&memtx->checkpoint->dir,
@@ -1014,6 +1006,22 @@ memtx_engine_set_max_tuple_size(struct memtx_engine *memtx, size_t max_size)
 	memtx->max_tuple_size = max_size;
 }
 
+void
+memtx_enter_delayed_free_mode(struct memtx_engine *memtx)
+{
+	memtx->snapshot_version++;
+	if (memtx->delayed_free_mode++ == 0)
+		small_alloc_setopt(&memtx->alloc, SMALL_DELAYED_FREE_MODE, true);
+}
+
+void
+memtx_leave_delayed_free_mode(struct memtx_engine *memtx)
+{
+	assert(memtx->delayed_free_mode > 0);
+	if (--memtx->delayed_free_mode == 0)
+		small_alloc_setopt(&memtx->alloc, SMALL_DELAYED_FREE_MODE, false);
+}
+
 struct tuple *
 memtx_tuple_new(struct tuple_format *format, const char *data, const char *end)
 {
diff --git a/src/box/memtx_engine.h b/src/box/memtx_engine.h
index ccb51678..c092f5d8 100644
--- a/src/box/memtx_engine.h
+++ b/src/box/memtx_engine.h
@@ -137,6 +137,12 @@ struct memtx_engine {
 	size_t max_tuple_size;
 	/** Incremented with each next snapshot. */
 	uint32_t snapshot_version;
+	/**
+	 * Unless zero, freeing of tuples allocated before the last
+	 * call to memtx_enter_delayed_free_mode() is delayed until
+	 * memtx_leave_delayed_free_mode() is called.
+	 */
+	uint32_t delayed_free_mode;
 	/** Memory pool for rtree index iterator. */
 	struct mempool rtree_iterator_pool;
 	/**
@@ -205,6 +211,23 @@ memtx_engine_set_memory(struct memtx_engine *memtx, size_t size);
 void
 memtx_engine_set_max_tuple_size(struct memtx_engine *memtx, size_t max_size);
 
+/**
+ * Enter tuple delayed free mode: tuples allocated before the call
+ * won't be freed until memtx_leave_delayed_free_mode() is called.
+ * This function is reentrant, meaning it's okay to call it multiple
+ * times from the same or different fibers - one just has to leave
+ * the delayed free mode the same number of times.
+ */
+void
+memtx_enter_delayed_free_mode(struct memtx_engine *memtx);
+
+/**
+ * Leave tuple delayed free mode. This function undoes the effect
+ * of memtx_enter_delayed_free_mode().
+ */
+void
+memtx_leave_delayed_free_mode(struct memtx_engine *memtx);
+
 /** Allocate a memtx tuple. @sa tuple_new(). */
 struct tuple *
 memtx_tuple_new(struct tuple_format *format, const char *data, const char *end);
diff --git a/src/box/memtx_hash.c b/src/box/memtx_hash.c
index 920f1032..cdd531cb 100644
--- a/src/box/memtx_hash.c
+++ b/src/box/memtx_hash.c
@@ -414,6 +414,8 @@ hash_snapshot_iterator_free(struct snapshot_iterator *iterator)
 	assert(iterator->free == hash_snapshot_iterator_free);
 	struct hash_snapshot_iterator *it =
 		(struct hash_snapshot_iterator *) iterator;
+	memtx_leave_delayed_free_mode((struct memtx_engine *)
+				      it->index->base.engine);
 	light_index_iterator_destroy(&it->index->hash_table, &it->iterator);
 	index_unref(&it->index->base);
 	free(iterator);
@@ -465,6 +467,7 @@ memtx_hash_index_create_snapshot_iterator(struct index *base)
 	index_ref(base);
 	light_index_iterator_begin(&index->hash_table, &it->iterator);
 	light_index_iterator_freeze(&index->hash_table, &it->iterator);
+	memtx_enter_delayed_free_mode((struct memtx_engine *)base->engine);
 	return (struct snapshot_iterator *) it;
 }
 
diff --git a/src/box/memtx_tree.c b/src/box/memtx_tree.c
index 831a2715..e155ecd6 100644
--- a/src/box/memtx_tree.c
+++ b/src/box/memtx_tree.c
@@ -1215,6 +1215,8 @@ tree_snapshot_iterator_free(struct snapshot_iterator *iterator)
 	assert(iterator->free == tree_snapshot_iterator_free);
 	struct tree_snapshot_iterator *it =
 		(struct tree_snapshot_iterator *)iterator;
+	memtx_leave_delayed_free_mode((struct memtx_engine *)
+				      it->index->base.engine);
 	memtx_tree_iterator_destroy(&it->index->tree, &it->tree_iterator);
 	index_unref(&it->index->base);
 	free(iterator);
@@ -1262,6 +1264,7 @@ memtx_tree_index_create_snapshot_iterator(struct index *base)
 	index_ref(base);
 	it->tree_iterator = memtx_tree_iterator_first(&index->tree);
 	memtx_tree_iterator_freeze(&index->tree, &it->tree_iterator);
+	memtx_enter_delayed_free_mode((struct memtx_engine *)base->engine);
 	return (struct snapshot_iterator *) it;
 }
 
-- 
2.20.1

* [PATCH v2 6/7] space: get rid of apply_initial_join_row method
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (4 preceding siblings ...)
  2019-08-19 16:53 ` [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:54   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:53 ` [PATCH v2 7/7] relay: join new replicas off read view Vladimir Davydov
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC (permalink / raw)
  To: tarantool-patches

There's no reason to use a special method instead of the generic
space_execute_dml for applying rows received from a master during the
initial join stage. Moreover, the special method doesn't run the
space.before_replace trigger, which makes it impossible to, for
example, change the space engine on a replica; see the on_schema_init
test of the replication test suite.

So this patch removes the special method altogether and makes the code
that used it switch to space_execute_dml.

Closes #4417
---
 src/box/applier.cc                       | 32 ++++++-----
 src/box/blackhole.c                      |  1 -
 src/box/memtx_engine.c                   | 22 +++++---
 src/box/memtx_space.c                    | 30 -----------
 src/box/space.c                          |  9 ----
 src/box/space.h                          | 16 ------
 src/box/sysview.c                        |  1 -
 src/box/vinyl.c                          | 68 ------------------------
 test/replication/on_schema_init.result   |  6 +++
 test/replication/on_schema_init.test.lua |  3 ++
 10 files changed, 43 insertions(+), 145 deletions(-)
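
The two call sites converted below (apply_initial_join_row() and memtx_engine_recover_snapshot_row()) share the same control flow: begin a statement, execute the request, commit the statement, commit the transaction, with two rollback labels. A rough standalone sketch of that flow follows. Everything in it is hypothetical: struct txn_sketch, struct fail_plan and sketch_apply_row() are illustrative stubs, not Tarantool's txn API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction state; not Tarantool's struct txn. */
struct txn_sketch {
	bool stmt_open;
	bool committed;
	bool rolled_back;
};

/* Hypothetical switches that make an individual step fail. */
struct fail_plan {
	bool begin_stmt;
	bool execute;
	bool commit_stmt;
};

int
sketch_apply_row(struct txn_sketch *txn, const struct fail_plan *plan)
{
	assert(txn != NULL && plan != NULL);
	/* Begin a statement; on failure there is no statement to undo. */
	if (plan->begin_stmt)
		goto rollback;
	txn->stmt_open = true;
	/* Execute the request; on failure undo the open statement first. */
	if (plan->execute)
		goto rollback_stmt;
	/*
	 * Commit the statement. Mirroring the patch, a failure here
	 * jumps straight to the transaction rollback.
	 */
	txn->stmt_open = false;
	if (plan->commit_stmt)
		goto rollback;
	txn->committed = true;
	return 0;
rollback_stmt:
	txn->stmt_open = false;
rollback:
	txn->rolled_back = true;
	return -1;
}
```

The labels fall through in reverse order of acquisition, so each failure point undoes exactly what was set up before it.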

diff --git a/src/box/applier.cc b/src/box/applier.cc
index cf03ea16..4304ff05 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -202,23 +202,29 @@ applier_writer_f(va_list ap)
 static int
 apply_initial_join_row(struct xrow_header *row)
 {
-	struct txn *txn = txn_begin();
-	if (txn == NULL)
-		return -1;
-	struct request request;
-	xrow_decode_dml(row, &request, dml_request_key_map(row->type));
-	struct space *space = space_cache_find(request.space_id);
-	if (space == NULL)
-		goto rollback;
-	/* no access checks here - applier always works with admin privs */
-	if (space_apply_initial_join_row(space, &request))
-		goto rollback;
 	int rc;
+	struct request request;
+	if (xrow_decode_dml(row, &request, dml_request_key_map(row->type)) != 0)
+		return -1;
+	struct space *space = space_cache_find(request.space_id);
+	if (space == NULL)
+		return -1;
+	struct txn *txn = txn_begin();
+	if (txn == NULL)
+		return -1;
+	if (txn_begin_stmt(txn, space) != 0)
+		goto rollback;
+	/* no access checks here - applier always works with admin privs */
+	struct tuple *unused;
+	if (space_execute_dml(space, txn, &request, &unused) != 0)
+		goto rollback_stmt;
+	if (txn_commit_stmt(txn, &request))
+		goto rollback;
 	rc = txn_commit(txn);
-	if (rc < 0)
-		return -1;
 	fiber_gc();
 	return rc;
+rollback_stmt:
+	txn_rollback_stmt(txn);
 rollback:
 	txn_rollback(txn);
 	fiber_gc();
diff --git a/src/box/blackhole.c b/src/box/blackhole.c
index b69e543a..22ef324b 100644
--- a/src/box/blackhole.c
+++ b/src/box/blackhole.c
@@ -111,7 +111,6 @@ blackhole_space_create_index(struct space *space, struct index_def *def)
 static const struct space_vtab blackhole_space_vtab = {
 	/* .destroy = */ blackhole_space_destroy,
 	/* .bsize = */ generic_space_bsize,
-	/* .apply_initial_join_row = */ generic_space_apply_initial_join_row,
 	/* .execute_replace = */ blackhole_space_execute_replace,
 	/* .execute_delete = */ blackhole_space_execute_delete,
 	/* .execute_update = */ blackhole_space_execute_update,
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index ea197cad..f6a33282 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -205,7 +205,7 @@ memtx_engine_recover_snapshot_row(struct memtx_engine *memtx,
 			 (uint32_t) row->type);
 		return -1;
 	}
-
+	int rc;
 	struct request request;
 	if (xrow_decode_dml(row, &request, dml_request_key_map(row->type)) != 0)
 		return -1;
@@ -220,13 +220,15 @@ memtx_engine_recover_snapshot_row(struct memtx_engine *memtx,
 	struct txn *txn = txn_begin();
 	if (txn == NULL)
 		return -1;
+	if (txn_begin_stmt(txn, space) != 0)
+		goto rollback;
 	/* no access checks here - applier always works with admin privs */
-	if (space_apply_initial_join_row(space, &request) != 0) {
-		txn_rollback(txn);
-		fiber_gc();
-		return -1;
-	}
-	int rc = txn_commit(txn);
+	struct tuple *unused;
+	if (space_execute_dml(space, txn, &request, &unused) != 0)
+		goto rollback_stmt;
+	if (txn_commit_stmt(txn, &request) != 0)
+		goto rollback;
+	rc = txn_commit(txn);
 	/*
 	 * Don't let gc pool grow too much. Yet to
 	 * it before reading the next row, to make
@@ -234,6 +236,12 @@ memtx_engine_recover_snapshot_row(struct memtx_engine *memtx,
 	 */
 	fiber_gc();
 	return rc;
+rollback_stmt:
+	txn_rollback_stmt(txn);
+rollback:
+	txn_rollback(txn);
+	fiber_gc();
+	return -1;
 }
 
 /** Called at start to tell memtx to recover to a given LSN. */
diff --git a/src/box/memtx_space.c b/src/box/memtx_space.c
index cf29cf32..05efb45f 100644
--- a/src/box/memtx_space.c
+++ b/src/box/memtx_space.c
@@ -316,35 +316,6 @@ dup_replace_mode(uint32_t op)
 	return op == IPROTO_INSERT ? DUP_INSERT : DUP_REPLACE_OR_INSERT;
 }
 
-static int
-memtx_space_apply_initial_join_row(struct space *space, struct request *request)
-{
-	struct memtx_space *memtx_space = (struct memtx_space *)space;
-	if (request->type != IPROTO_INSERT) {
-		diag_set(ClientError, ER_UNKNOWN_REQUEST_TYPE, request->type);
-		return -1;
-	}
-	request->header->replica_id = 0;
-	struct txn *txn = in_txn();
-	if (txn_begin_stmt(txn, space) != 0)
-		return -1;
-	struct txn_stmt *stmt = txn_current_stmt(txn);
-	stmt->new_tuple = memtx_tuple_new(space->format, request->tuple,
-					  request->tuple_end);
-	if (stmt->new_tuple == NULL)
-		goto rollback;
-	tuple_ref(stmt->new_tuple);
-	if (memtx_space->replace(space, NULL, stmt->new_tuple,
-				 DUP_INSERT, &stmt->old_tuple) != 0)
-		goto rollback;
-	return txn_commit_stmt(txn, request);
-
-rollback:
-	say_error("rollback: %s", diag_last_error(diag_get())->errmsg);
-	txn_rollback_stmt(txn);
-	return -1;
-}
-
 static int
 memtx_space_execute_replace(struct space *space, struct txn *txn,
 			    struct request *request, struct tuple **result)
@@ -1168,7 +1139,6 @@ memtx_space_prepare_alter(struct space *old_space, struct space *new_space)
 static const struct space_vtab memtx_space_vtab = {
 	/* .destroy = */ memtx_space_destroy,
 	/* .bsize = */ memtx_space_bsize,
-	/* .apply_initial_join_row = */ memtx_space_apply_initial_join_row,
 	/* .execute_replace = */ memtx_space_execute_replace,
 	/* .execute_delete = */ memtx_space_execute_delete,
 	/* .execute_update = */ memtx_space_execute_update,
diff --git a/src/box/space.c b/src/box/space.c
index 0d1ad3b3..226ac9c9 100644
--- a/src/box/space.c
+++ b/src/box/space.c
@@ -624,15 +624,6 @@ generic_space_bsize(struct space *space)
 	return 0;
 }
 
-int
-generic_space_apply_initial_join_row(struct space *space,
-				     struct request *request)
-{
-	(void)space;
-	(void)request;
-	return 0;
-}
-
 int
 generic_space_ephemeral_replace(struct space *space, const char *tuple,
 				const char *tuple_end)
diff --git a/src/box/space.h b/src/box/space.h
index 8f593e93..7926aa65 100644
--- a/src/box/space.h
+++ b/src/box/space.h
@@ -59,8 +59,6 @@ struct space_vtab {
 	/** Return binary size of a space. */
 	size_t (*bsize)(struct space *);
 
-	int (*apply_initial_join_row)(struct space *, struct request *);
-
 	int (*execute_replace)(struct space *, struct txn *,
 			       struct request *, struct tuple **result);
 	int (*execute_delete)(struct space *, struct txn *,
@@ -361,12 +359,6 @@ index_name_by_id(struct space *space, uint32_t id);
 int
 access_check_space(struct space *space, user_access_t access);
 
-static inline int
-space_apply_initial_join_row(struct space *space, struct request *request)
-{
-	return space->vtab->apply_initial_join_row(space, request);
-}
-
 /**
  * Execute a DML request on the given space.
  */
@@ -528,7 +520,6 @@ space_remove_ck_constraint(struct space *space, struct ck_constraint *ck);
  * Virtual method stubs.
  */
 size_t generic_space_bsize(struct space *);
-int generic_space_apply_initial_join_row(struct space *, struct request *);
 int generic_space_ephemeral_replace(struct space *, const char *, const char *);
 int generic_space_ephemeral_delete(struct space *, const char *);
 int generic_space_ephemeral_rowid_next(struct space *, uint64_t *);
@@ -598,13 +589,6 @@ index_find_system_xc(struct space *space, uint32_t index_id)
 	return index_find_xc(space, index_id);
 }
 
-static inline void
-space_apply_initial_join_row_xc(struct space *space, struct request *request)
-{
-	if (space_apply_initial_join_row(space, request) != 0)
-		diag_raise();
-}
-
 static inline void
 space_check_index_def_xc(struct space *space, struct index_def *index_def)
 {
diff --git a/src/box/sysview.c b/src/box/sysview.c
index 46cf1e13..1fbe3aa2 100644
--- a/src/box/sysview.c
+++ b/src/box/sysview.c
@@ -490,7 +490,6 @@ sysview_space_create_index(struct space *space, struct index_def *def)
 static const struct space_vtab sysview_space_vtab = {
 	/* .destroy = */ sysview_space_destroy,
 	/* .bsize = */ generic_space_bsize,
-	/* .apply_initial_join_row = */ generic_space_apply_initial_join_row,
 	/* .execute_replace = */ sysview_space_execute_replace,
 	/* .execute_delete = */ sysview_space_execute_delete,
 	/* .execute_update = */ sysview_space_execute_update,
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index 80ed00a1..de4a06c4 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3199,73 +3199,6 @@ out:
 	return rc;
 }
 
-static int
-vinyl_space_apply_initial_join_row(struct space *space, struct request *request)
-{
-	assert(request->header != NULL);
-	struct vy_env *env = vy_env(space->engine);
-
-	struct vy_tx *tx = vy_tx_begin(env->xm);
-	if (tx == NULL)
-		return -1;
-
-	struct txn_stmt stmt;
-	memset(&stmt, 0, sizeof(stmt));
-
-	int rc = -1;
-	switch (request->type) {
-	case IPROTO_INSERT:
-		rc = vy_insert(env, tx, &stmt, space, request);
-		break;
-	case IPROTO_REPLACE:
-		rc = vy_replace(env, tx, &stmt, space, request);
-		break;
-	case IPROTO_UPSERT:
-		rc = vy_upsert(env, tx, &stmt, space, request);
-		break;
-	case IPROTO_DELETE:
-		rc = vy_delete(env, tx, &stmt, space, request);
-		break;
-	default:
-		diag_set(ClientError, ER_UNKNOWN_REQUEST_TYPE, request->type);
-		break;
-	}
-	if (rc != 0) {
-		vy_tx_rollback(tx);
-		return -1;
-	}
-
-	/*
-	 * Account memory quota, see vinyl_engine_prepare()
-	 * and vinyl_engine_commit() for more details about
-	 * quota accounting.
-	 */
-	size_t reserved = tx->write_size;
-	if (vy_quota_use(&env->quota, VY_QUOTA_CONSUMER_TX,
-			 reserved, TIMEOUT_INFINITY) != 0)
-		unreachable();
-
-	size_t mem_used_before = lsregion_used(&env->mem_env.allocator);
-
-	rc = vy_tx_prepare(tx);
-	if (rc == 0)
-		vy_tx_commit(tx, 0);
-	else
-		vy_tx_rollback(tx);
-
-	if (stmt.old_tuple != NULL)
-		tuple_unref(stmt.old_tuple);
-	if (stmt.new_tuple != NULL)
-		tuple_unref(stmt.new_tuple);
-
-	size_t mem_used_after = lsregion_used(&env->mem_env.allocator);
-	assert(mem_used_after >= mem_used_before);
-	size_t used = mem_used_after - mem_used_before;
-	vy_quota_adjust(&env->quota, VY_QUOTA_CONSUMER_TX, reserved, used);
-	vy_regulator_check_dump_watermark(&env->regulator);
-	return rc;
-}
-
 /* }}} Replication */
 
 /* {{{ Garbage collection */
@@ -4671,7 +4604,6 @@ static const struct engine_vtab vinyl_engine_vtab = {
 static const struct space_vtab vinyl_space_vtab = {
 	/* .destroy = */ vinyl_space_destroy,
 	/* .bsize = */ vinyl_space_bsize,
-	/* .apply_initial_join_row = */ vinyl_space_apply_initial_join_row,
 	/* .execute_replace = */ vinyl_space_execute_replace,
 	/* .execute_delete = */ vinyl_space_execute_delete,
 	/* .execute_update = */ vinyl_space_execute_update,
diff --git a/test/replication/on_schema_init.result b/test/replication/on_schema_init.result
index 3f7ee0bd..6c2857d1 100644
--- a/test/replication/on_schema_init.result
+++ b/test/replication/on_schema_init.result
@@ -15,6 +15,12 @@ test_run:cmd('create server replica with rpl_master=default, script="replication
 test_engine = box.schema.space.create('test_engine', {engine='memtx'})
 ---
 ...
+-- Make sure that space.before_replace trigger is invoked for rows
+-- received during both initial and final join stages.
+box.snapshot()
+---
+- ok
+...
 test_local =  box.schema.space.create('test_local', {is_local=false})
 ---
 ...
diff --git a/test/replication/on_schema_init.test.lua b/test/replication/on_schema_init.test.lua
index 9bb9e477..016a61c1 100644
--- a/test/replication/on_schema_init.test.lua
+++ b/test/replication/on_schema_init.test.lua
@@ -9,6 +9,9 @@ test_run = env.new()
 test_run:cmd('create server replica with rpl_master=default, script="replication/replica_on_schema_init.lua"')
 
 test_engine = box.schema.space.create('test_engine', {engine='memtx'})
+-- Make sure that space.before_replace trigger is invoked for rows
+-- received during both initial and final join stages.
+box.snapshot()
 test_local =  box.schema.space.create('test_local', {is_local=false})
 test_engine.engine
 test_local.is_local
-- 
2.20.1

* [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (5 preceding siblings ...)
  2019-08-19 16:53 ` [PATCH v2 6/7] space: get rid of apply_initial_join_row method Vladimir Davydov
@ 2019-08-19 16:53 ` Vladimir Davydov
  2019-08-19 20:57   ` [tarantool-patches] " Konstantin Osipov
  2019-08-19 16:54 ` [PATCH v2 0/7] Join replicas off the current " Vladimir Davydov
  2019-08-20  8:53 ` Vladimir Davydov
  8 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:53 UTC (permalink / raw)
  To: tarantool-patches

Historically, we join a new replica off the last checkpoint. As a
result, we must always keep the last memtx snapshot and all vinyl data
files corresponding to it. Actually, there's no need to use the last
checkpoint for joining a replica. Instead, we can use the current read
view, as both memtx and vinyl support it. This should speed up the
process of joining a new replica, because we don't need to replay all
xlogs written after the last checkpoint, only those accumulated while
we are relaying the current read view. This should also allow us to
avoid creating a snapshot file on bootstrap, because the only reason
we need it is to let replicas join. Besides, this is a step towards
decoupling the vinyl metadata log from checkpointing in particular and
from xlogs in general.

Closes #1271
---
 src/box/blackhole.c                         |   3 +-
 src/box/box.cc                              |  33 +-
 src/box/engine.c                            |  21 --
 src/box/engine.h                            |  27 +-
 src/box/memtx_engine.c                      |  74 ----
 src/box/relay.cc                            | 170 +++++++--
 src/box/relay.h                             |   2 +-
 src/box/sysview.c                           |   3 +-
 src/box/vinyl.c                             | 378 ++++----------------
 src/box/vy_run.c                            |   6 -
 src/box/vy_tx.c                             |  10 +-
 src/box/vy_tx.h                             |   8 +
 src/lib/core/errinj.h                       |   2 +-
 test/box/errinj.result                      |  38 +-
 test/replication-py/cluster.result          |  13 -
 test/replication-py/cluster.test.py         |  25 --
 test/replication/join_without_snap.result   |  88 +++++
 test/replication/join_without_snap.test.lua |  32 ++
 test/replication/suite.cfg                  |   1 +
 test/vinyl/errinj.result                    |   4 +-
 test/vinyl/errinj.test.lua                  |   4 +-
 test/xlog/panic_on_broken_lsn.result        |  31 +-
 test/xlog/panic_on_broken_lsn.test.lua      |  20 +-
 23 files changed, 419 insertions(+), 574 deletions(-)
 create mode 100644 test/replication/join_without_snap.result
 create mode 100644 test/replication/join_without_snap.test.lua
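
The relay_space_add() callback added in relay.cc below decides which spaces get a read view iterator. Its four skip conditions can be summarized as a single predicate. The sketch below assumes a flattened, hypothetical struct space_props in place of the real struct space accessors; it illustrates only the filtering rule, not the iterator setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the properties the callback checks. */
struct space_props {
	bool is_temporary;      /* temporary space */
	bool is_local;          /* space local to this instance */
	bool engine_excluded;   /* engine has ENGINE_EXCLUDE_FROM_SNAPSHOT */
	bool has_primary_index; /* index with id 0 exists */
};

/* Return true if the space should be fed to a joining replica. */
bool
sketch_space_is_relayed(const struct space_props *sp)
{
	assert(sp != NULL);
	if (sp->is_temporary)
		return false;
	if (sp->is_local)
		return false;
	if (sp->engine_excluded)
		return false;
	if (!sp->has_primary_index)
		return false;
	return true;
}
```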

diff --git a/src/box/blackhole.c b/src/box/blackhole.c
index 22ef324b..a4cbdaf1 100644
--- a/src/box/blackhole.c
+++ b/src/box/blackhole.c
@@ -177,7 +177,6 @@ blackhole_engine_create_space(struct engine *engine, struct space_def *def,
 static const struct engine_vtab blackhole_engine_vtab = {
 	/* .shutdown = */ blackhole_engine_shutdown,
 	/* .create_space = */ blackhole_engine_create_space,
-	/* .join = */ generic_engine_join,
 	/* .begin = */ generic_engine_begin,
 	/* .begin_statement = */ generic_engine_begin_statement,
 	/* .prepare = */ generic_engine_prepare,
@@ -212,6 +211,6 @@ blackhole_engine_new(void)
 
 	engine->vtab = &blackhole_engine_vtab;
 	engine->name = "blackhole";
-	engine->flags = ENGINE_BYPASS_TX;
+	engine->flags = ENGINE_BYPASS_TX | ENGINE_EXCLUDE_FROM_SNAPSHOT;
 	return engine;
 }
diff --git a/src/box/box.cc b/src/box/box.cc
index 66cd6d3a..95ce0bc1 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1467,41 +1467,13 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 			  "wal_mode = 'none'");
 	}
 
-	/*
-	 * The only case when the directory index is empty is
-	 * when someone has deleted a snapshot and tries to join
-	 * as a replica. Our best effort is to not crash in such
-	 * case: raise ER_MISSING_SNAPSHOT.
-	 */
-	struct gc_checkpoint *checkpoint = gc_last_checkpoint();
-	if (checkpoint == NULL)
-		tnt_raise(ClientError, ER_MISSING_SNAPSHOT);
-
-	/* Remember start vclock. */
-	struct vclock start_vclock;
-	vclock_copy(&start_vclock, &checkpoint->vclock);
-
-	/*
-	 * Make sure the checkpoint files won't be deleted while
-	 * initial join is in progress.
-	 */
-	struct gc_checkpoint_ref gc;
-	gc_ref_checkpoint(checkpoint, &gc, "replica %s",
-			  tt_uuid_str(&instance_uuid));
-	auto gc_guard = make_scoped_guard([&]{ gc_unref_checkpoint(&gc); });
-
-	/* Respond to JOIN request with start_vclock. */
-	struct xrow_header row;
-	xrow_encode_vclock_xc(&row, &start_vclock);
-	row.sync = header->sync;
-	coio_write_xrow(io, &row);
-
 	say_info("joining replica %s at %s",
 		 tt_uuid_str(&instance_uuid), sio_socketname(io->fd));
 
 	/*
 	 * Initial stream: feed replica with dirty data from engines.
 	 */
+	struct vclock start_vclock;
 	relay_initial_join(io->fd, header->sync, &start_vclock);
 	say_info("initial data sent.");
 
@@ -1513,6 +1485,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	 */
 	box_on_join(&instance_uuid);
 
+	ERROR_INJECT_YIELD(ERRINJ_REPLICA_JOIN_DELAY);
+
 	/* Remember master's vclock after the last request */
 	struct vclock stop_vclock;
 	vclock_copy(&stop_vclock, &replicaset.vclock);
@@ -1530,6 +1504,7 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 		diag_raise();
 
 	/* Send end of initial stage data marker */
+	struct xrow_header row;
 	xrow_encode_vclock_xc(&row, &stop_vclock);
 	row.sync = header->sync;
 	coio_write_xrow(io, &row);
diff --git a/src/box/engine.c b/src/box/engine.c
index a52d0ed1..73ff0464 100644
--- a/src/box/engine.c
+++ b/src/box/engine.c
@@ -174,17 +174,6 @@ engine_backup(const struct vclock *vclock, engine_backup_cb cb, void *cb_arg)
 	return 0;
 }
 
-int
-engine_join(const struct vclock *vclock, struct xstream *stream)
-{
-	struct engine *engine;
-	engine_foreach(engine) {
-		if (engine->vtab->join(engine, vclock, stream) != 0)
-			return -1;
-	}
-	return 0;
-}
-
 void
 engine_memory_stat(struct engine_memory_stat *stat)
 {
@@ -204,16 +193,6 @@ engine_reset_stat(void)
 
 /* {{{ Virtual method stubs */
 
-int
-generic_engine_join(struct engine *engine, const struct vclock *vclock,
-		    struct xstream *stream)
-{
-	(void)engine;
-	(void)vclock;
-	(void)stream;
-	return 0;
-}
-
 int
 generic_engine_begin(struct engine *engine, struct txn *txn)
 {
diff --git a/src/box/engine.h b/src/box/engine.h
index a302b3bc..c83042b4 100644
--- a/src/box/engine.h
+++ b/src/box/engine.h
@@ -73,11 +73,6 @@ struct engine_vtab {
 	/** Allocate a new space instance. */
 	struct space *(*create_space)(struct engine *engine,
 			struct space_def *def, struct rlist *key_list);
-	/**
-	 * Write statements stored in checkpoint @vclock to @stream.
-	 */
-	int (*join)(struct engine *engine, const struct vclock *vclock,
-		    struct xstream *stream);
 	/**
 	 * Begin a new single or multi-statement transaction.
 	 * Called on first statement in a transaction, not when
@@ -205,6 +200,12 @@ enum {
 	 * transactions w/o throwing ER_CROSS_ENGINE_TRANSACTION.
 	 */
 	ENGINE_BYPASS_TX = 1 << 0,
+	/**
+	 * This flag is set for virtual engines, such as sysview,
+	 * that don't actually store any data. It means that we
+	 * must not relay their content to a newly joined replica.
+	 */
+	ENGINE_EXCLUDE_FROM_SNAPSHOT = 1 << 1,
 };
 
 struct engine {
@@ -330,13 +331,6 @@ engine_begin_final_recovery(void);
 int
 engine_end_recovery(void);
 
-/**
- * Feed checkpoint data as join events to the replicas.
- * (called on the master).
- */
-int
-engine_join(const struct vclock *vclock, struct xstream *stream);
-
 int
 engine_begin_checkpoint(void);
 
@@ -364,8 +358,6 @@ engine_reset_stat(void);
 /*
  * Virtual method stubs.
  */
-int generic_engine_join(struct engine *, const struct vclock *,
-			struct xstream *);
 int generic_engine_begin(struct engine *, struct txn *);
 int generic_engine_begin_statement(struct engine *, struct txn *);
 int generic_engine_prepare(struct engine *, struct txn *);
@@ -468,13 +460,6 @@ engine_end_recovery_xc(void)
 		diag_raise();
 }
 
-static inline void
-engine_join_xc(const struct vclock *vclock, struct xstream *stream)
-{
-	if (engine_join(vclock, stream) != 0)
-		diag_raise();
-}
-
 #endif /* defined(__cplusplus) */
 
 #endif /* TARANTOOL_BOX_ENGINE_H_INCLUDED */
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index f6a33282..0cdf6cc6 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -734,79 +734,6 @@ memtx_engine_backup(struct engine *engine, const struct vclock *vclock,
 	return cb(filename, cb_arg);
 }
 
-/** Used to pass arguments to memtx_initial_join_f */
-struct memtx_join_arg {
-	const char *snap_dirname;
-	int64_t checkpoint_lsn;
-	struct xstream *stream;
-};
-
-/**
- * Invoked from a thread to feed snapshot rows.
- */
-static int
-memtx_initial_join_f(va_list ap)
-{
-	struct memtx_join_arg *arg = va_arg(ap, struct memtx_join_arg *);
-	const char *snap_dirname = arg->snap_dirname;
-	int64_t checkpoint_lsn = arg->checkpoint_lsn;
-	struct xstream *stream = arg->stream;
-
-	struct xdir dir;
-	/*
-	 * snap_dirname and INSTANCE_UUID don't change after start,
-	 * safe to use in another thread.
-	 */
-	xdir_create(&dir, snap_dirname, SNAP, &INSTANCE_UUID,
-		    &xlog_opts_default);
-	struct xlog_cursor cursor;
-	int rc = xdir_open_cursor(&dir, checkpoint_lsn, &cursor);
-	xdir_destroy(&dir);
-	if (rc < 0)
-		return -1;
-
-	struct xrow_header row;
-	while ((rc = xlog_cursor_next(&cursor, &row, true)) == 0) {
-		rc = xstream_write(stream, &row);
-		if (rc < 0)
-			break;
-	}
-	xlog_cursor_close(&cursor, false);
-	if (rc < 0)
-		return -1;
-
-	/**
-	 * We should never try to read snapshots with no EOF
-	 * marker - such snapshots are very likely corrupted and
-	 * should not be trusted.
-	 */
-	/* TODO: replace panic with diag_set() */
-	if (!xlog_cursor_is_eof(&cursor))
-		panic("snapshot `%s' has no EOF marker", cursor.name);
-	return 0;
-}
-
-static int
-memtx_engine_join(struct engine *engine, const struct vclock *vclock,
-		  struct xstream *stream)
-{
-	struct memtx_engine *memtx = (struct memtx_engine *)engine;
-
-	/*
-	 * cord_costart() passes only void * pointer as an argument.
-	 */
-	struct memtx_join_arg arg = {
-		/* .snap_dirname   = */ memtx->snap_dir.dirname,
-		/* .checkpoint_lsn = */ vclock_sum(vclock),
-		/* .stream         = */ stream
-	};
-
-	/* Send snapshot using a thread */
-	struct cord cord;
-	cord_costart(&cord, "initial_join", memtx_initial_join_f, &arg);
-	return cord_cojoin(&cord);
-}
-
 static int
 small_stats_noop_cb(const struct mempool_stats *stats, void *cb_ctx)
 {
@@ -830,7 +757,6 @@ memtx_engine_memory_stat(struct engine *engine, struct engine_memory_stat *stat)
 static const struct engine_vtab memtx_engine_vtab = {
 	/* .shutdown = */ memtx_engine_shutdown,
 	/* .create_space = */ memtx_engine_create_space,
-	/* .join = */ memtx_engine_join,
 	/* .begin = */ memtx_engine_begin,
 	/* .begin_statement = */ generic_engine_begin_statement,
 	/* .prepare = */ generic_engine_prepare,
diff --git a/src/box/relay.cc b/src/box/relay.cc
index a19abf6a..2717b8b6 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -41,11 +41,13 @@
 
 #include "coio.h"
 #include "coio_task.h"
-#include "engine.h"
 #include "gc.h"
+#include "index.h"
 #include "iproto_constants.h"
 #include "recovery.h"
 #include "replication.h"
+#include "schema.h"
+#include "space.h"
 #include "trigger.h"
 #include "vclock.h"
 #include "version.h"
@@ -168,8 +170,6 @@ relay_last_row_time(const struct relay *relay)
 static void
 relay_send(struct relay *relay, struct xrow_header *packet);
 static void
-relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row);
-static void
 relay_send_row(struct xstream *stream, struct xrow_header *row);
 
 struct relay *
@@ -285,20 +285,156 @@ relay_set_cord_name(int fd)
 	cord_set_name(name);
 }
 
+/**
+ * A space to feed to a replica on initial join.
+ */
+struct relay_space {
+	/** Link in the list of spaces to feed. */
+	struct rlist link;
+	/** Space id. */
+	uint32_t space_id;
+	/** Read view iterator. */
+	struct snapshot_iterator *iterator;
+};
+
+/**
+ * Add a space to the list of spaces to feed to a replica
+ * if eligible. We don't need to relay the following kinds
+ * of spaces:
+ *
+ *  - Temporary spaces, as their contents are never replicated.
+ *  - Spaces that are local to this instance.
+ *  - Virtual spaces, such as sysview.
+ *  - Spaces that don't have the primary index.
+ */
+static int
+relay_space_add(struct space *sp, void *data)
+{
+	struct rlist *spaces = (struct rlist *)data;
+
+	if (space_is_temporary(sp))
+		return 0;
+	if (space_group_id(sp) == GROUP_LOCAL)
+		return 0;
+	if (sp->engine->flags & ENGINE_EXCLUDE_FROM_SNAPSHOT)
+		return 0;
+	struct index *pk = space_index(sp, 0);
+	if (pk == NULL)
+		return 0;
+
+	struct relay_space *r = (struct relay_space *)malloc(sizeof(*r));
+	if (r == NULL) {
+		diag_set(OutOfMemory, sizeof(*r),
+			 "malloc", "struct relay_space");
+		return -1;
+	}
+	r->space_id = space_id(sp);
+	r->iterator = index_create_snapshot_iterator(pk);
+	if (r->iterator == NULL) {
+		free(r);
+		return -1;
+	}
+	rlist_add_tail_entry(spaces, r, link);
+	return 0;
+}
+
+/**
+ * Relay a single space row to a replica.
+ */
+static void
+relay_space_send_row(uint32_t space_id, const char *data, uint32_t size,
+		     struct ev_io *io, uint64_t sync)
+{
+	struct request_replace_body body;
+	request_replace_body_create(&body, space_id);
+
+	struct xrow_header row;
+	memset(&row, 0, sizeof(row));
+	row.type = IPROTO_INSERT;
+	row.sync = sync;
+
+	row.bodycnt = 2;
+	row.body[0].iov_base = &body;
+	row.body[0].iov_len = sizeof(body);
+	row.body[1].iov_base = (char *)data;
+	row.body[1].iov_len = size;
+
+	coio_write_xrow(io, &row);
+}
+
+/**
+ * Relay a read view of a space content to a replica.
+ */
+static void
+relay_space_send(struct relay_space *r, struct ev_io *io, uint64_t sync)
+{
+	int rc;
+	struct snapshot_iterator *it = r->iterator;
+
+	uint32_t size;
+	const char *data;
+	while ((rc = it->next(it, &data, &size)) == 0 && data != NULL)
+		relay_space_send_row(r->space_id, data, size, io, sync);
+
+	if (rc != 0)
+		diag_raise();
+}
+
+/**
+ * Close the read view iterator associated with the space
+ * and free the container object.
+ */
+static void
+relay_space_free(struct relay_space *r)
+{
+	rlist_del_entry(r, link);
+	r->iterator->free(r->iterator);
+	free(r);
+}
+
 void
 relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
 {
-	struct relay *relay = relay_new(NULL);
-	if (relay == NULL)
-		diag_raise();
+	struct ev_io io;
+	coio_create(&io, fd);
 
-	relay_start(relay, fd, sync, relay_send_initial_join_row);
-	auto relay_guard = make_scoped_guard([=] {
-		relay_stop(relay);
-		relay_delete(relay);
+	RLIST_HEAD(spaces);
+	auto guard = make_scoped_guard([&spaces] {
+		struct relay_space *r, *next;
+		rlist_foreach_entry_safe(r, &spaces, link, next)
+			relay_space_free(r);
 	});
 
-	engine_join_xc(vclock, &relay->stream);
+	/*
+	 * First, we open read view iterators over spaces that need
+	 * to be fed to the replica. Note, we can't yield in the loop,
+	 * because otherwise we could get an inconsistent view of the
+	 * database.
+	 */
+	if (space_foreach(relay_space_add, &spaces) != 0)
+		diag_raise();
+
+	/*
+	 * Second, we must sync WAL to make sure that all changes
+	 * visible to the iterators are successfully committed.
+	 */
+	if (wal_sync() != 0)
+		diag_raise();
+
+	vclock_copy(vclock, &replicaset.vclock);
+
+	/* Respond to JOIN request with the current vclock. */
+	struct xrow_header row;
+	xrow_encode_vclock_xc(&row, vclock);
+	row.sync = sync;
+	coio_write_xrow(&io, &row);
+
+	/* Finally, send the read view to the replica. */
+	struct relay_space *r, *next;
+	rlist_foreach_entry_safe(r, &spaces, link, next) {
+		relay_space_send(r, &io, sync);
+		relay_space_free(r);
+	}
 }
 
 int
@@ -699,18 +835,6 @@ relay_send(struct relay *relay, struct xrow_header *packet)
 		fiber_sleep(inj->dparam);
 }
 
-static void
-relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
-{
-	struct relay *relay = container_of(stream, struct relay, stream);
-	/*
-	 * Ignore replica local requests as we don't need to promote
-	 * vclock while sending a snapshot.
-	 */
-	if (row->group_id != GROUP_LOCAL)
-		relay_send(relay, row);
-}
-
 /** Send a single row to the client. */
 static void
 relay_send_row(struct xstream *stream, struct xrow_header *packet)
diff --git a/src/box/relay.h b/src/box/relay.h
index 869f8f2e..e1782d78 100644
--- a/src/box/relay.h
+++ b/src/box/relay.h
@@ -102,7 +102,7 @@ relay_last_row_time(const struct relay *relay);
  *
  * @param fd        client connection
  * @param sync      sync from incoming JOIN request
- * @param vclock    vclock of the last checkpoint
+ * @param[out] vclock  vclock of the read view sent to the replica
  */
 void
 relay_initial_join(int fd, uint64_t sync, struct vclock *vclock);
diff --git a/src/box/sysview.c b/src/box/sysview.c
index 1fbe3aa2..8817e941 100644
--- a/src/box/sysview.c
+++ b/src/box/sysview.c
@@ -565,7 +565,6 @@ sysview_engine_create_space(struct engine *engine, struct space_def *def,
 static const struct engine_vtab sysview_engine_vtab = {
 	/* .shutdown = */ sysview_engine_shutdown,
 	/* .create_space = */ sysview_engine_create_space,
-	/* .join = */ generic_engine_join,
 	/* .begin = */ generic_engine_begin,
 	/* .begin_statement = */ generic_engine_begin_statement,
 	/* .prepare = */ generic_engine_prepare,
@@ -600,6 +599,6 @@ sysview_engine_new(void)
 
 	sysview->base.vtab = &sysview_engine_vtab;
 	sysview->base.name = "sysview";
-	sysview->base.flags = ENGINE_BYPASS_TX;
+	sysview->base.flags = ENGINE_BYPASS_TX | ENGINE_EXCLUDE_FROM_SNAPSHOT;
 	return sysview;
 }
diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index de4a06c4..580bb7f6 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -166,6 +166,12 @@ struct vinyl_iterator {
 	struct trigger on_tx_destroy;
 };
 
+struct vinyl_snapshot_iterator {
+	struct snapshot_iterator base;
+	struct vy_read_view *rv;
+	struct vy_read_iterator iterator;
+};
+
 static const struct engine_vtab vinyl_engine_vtab;
 static const struct space_vtab vinyl_space_vtab;
 static const struct index_vtab vinyl_index_vtab;
@@ -2892,315 +2898,6 @@ vinyl_engine_end_recovery(struct engine *engine)
 
 /** }}} Recovery */
 
-/** {{{ Replication */
-
-/** Relay context, passed to all relay functions. */
-struct vy_join_ctx {
-	/** Environment. */
-	struct vy_env *env;
-	/** Stream to relay statements to. */
-	struct xstream *stream;
-	/** Pipe to the relay thread. */
-	struct cpipe relay_pipe;
-	/** Pipe to the tx thread. */
-	struct cpipe tx_pipe;
-	/**
-	 * Cbus message, used for calling functions
-	 * on behalf of the relay thread.
-	 */
-	struct cbus_call_msg cmsg;
-	/** ID of the space currently being relayed. */
-	uint32_t space_id;
-	/**
-	 * LSM tree key definition, as defined by the user.
-	 * We only send the primary key, so the definition
-	 * provided by the user is correct for compare.
-	 */
-	struct key_def *key_def;
-	/** LSM tree format used for REPLACE and DELETE statements. */
-	struct tuple_format *format;
-	/**
-	 * Write iterator for merging runs before sending
-	 * them to the replica.
-	 */
-	struct vy_stmt_stream *wi;
-	/**
-	 * List of run slices of the current range, linked by
-	 * vy_slice::in_join. The newer a slice the closer it
-	 * is to the head of the list.
-	 */
-	struct rlist slices;
-};
-
-/**
- * Recover a slice and add it to the list of slices.
- * Newer slices are supposed to be recovered first.
- * Returns 0 on success, -1 on failure.
- */
-static int
-vy_prepare_send_slice(struct vy_join_ctx *ctx,
-		      struct vy_slice_recovery_info *slice_info)
-{
-	int rc = -1;
-	struct vy_run *run = NULL;
-	struct vy_entry begin = vy_entry_none();
-	struct vy_entry end = vy_entry_none();
-
-	run = vy_run_new(&ctx->env->run_env, slice_info->run->id);
-	if (run == NULL)
-		goto out;
-	if (vy_run_recover(run, ctx->env->path, ctx->space_id, 0,
-			   ctx->key_def) != 0)
-		goto out;
-
-	if (slice_info->begin != NULL) {
-		begin = vy_entry_key_from_msgpack(ctx->env->lsm_env.key_format,
-						  ctx->key_def,
-						  slice_info->begin);
-		if (begin.stmt == NULL)
-			goto out;
-	}
-	if (slice_info->end != NULL) {
-		end = vy_entry_key_from_msgpack(ctx->env->lsm_env.key_format,
-						ctx->key_def,
-						slice_info->end);
-		if (end.stmt == NULL)
-			goto out;
-	}
-
-	struct vy_slice *slice = vy_slice_new(slice_info->id, run,
-					      begin, end, ctx->key_def);
-	if (slice == NULL)
-		goto out;
-
-	rlist_add_tail_entry(&ctx->slices, slice, in_join);
-	rc = 0;
-out:
-	if (run != NULL)
-		vy_run_unref(run);
-	if (begin.stmt != NULL)
-		tuple_unref(begin.stmt);
-	if (end.stmt != NULL)
-		tuple_unref(end.stmt);
-	return rc;
-}
-
-static int
-vy_send_range_f(struct cbus_call_msg *cmsg)
-{
-	struct vy_join_ctx *ctx = container_of(cmsg, struct vy_join_ctx, cmsg);
-
-	int rc = ctx->wi->iface->start(ctx->wi);
-	if (rc != 0)
-		goto err;
-	struct vy_entry entry;
-	while ((rc = ctx->wi->iface->next(ctx->wi, &entry)) == 0 &&
-	       entry.stmt != NULL) {
-		struct xrow_header xrow;
-		rc = vy_stmt_encode_primary(entry.stmt, ctx->key_def,
-					    ctx->space_id, &xrow);
-		if (rc != 0)
-			break;
-		/*
-		 * Reset the LSN as the replica will ignore it
-		 * anyway.
-		 */
-		xrow.lsn = 0;
-		rc = xstream_write(ctx->stream, &xrow);
-		if (rc != 0)
-			break;
-		fiber_gc();
-	}
-err:
-	ctx->wi->iface->stop(ctx->wi);
-	fiber_gc();
-	return rc;
-}
-
-/** Merge and send all runs of the given range. */
-static int
-vy_send_range(struct vy_join_ctx *ctx,
-	      struct vy_range_recovery_info *range_info)
-{
-	int rc;
-	struct vy_slice *slice, *tmp;
-
-	if (rlist_empty(&range_info->slices))
-		return 0; /* nothing to do */
-
-	/* Recover slices. */
-	struct vy_slice_recovery_info *slice_info;
-	rlist_foreach_entry(slice_info, &range_info->slices, in_range) {
-		rc = vy_prepare_send_slice(ctx, slice_info);
-		if (rc != 0)
-			goto out_delete_slices;
-	}
-
-	/* Create a write iterator. */
-	struct rlist fake_read_views;
-	rlist_create(&fake_read_views);
-	ctx->wi = vy_write_iterator_new(ctx->key_def, true, true,
-					&fake_read_views, NULL);
-	if (ctx->wi == NULL) {
-		rc = -1;
-		goto out;
-	}
-	rlist_foreach_entry(slice, &ctx->slices, in_join) {
-		rc = vy_write_iterator_new_slice(ctx->wi, slice, ctx->format);
-		if (rc != 0)
-			goto out_delete_wi;
-	}
-
-	/* Do the actual work from the relay thread. */
-	bool cancellable = fiber_set_cancellable(false);
-	rc = cbus_call(&ctx->relay_pipe, &ctx->tx_pipe, &ctx->cmsg,
-		       vy_send_range_f, NULL, TIMEOUT_INFINITY);
-	fiber_set_cancellable(cancellable);
-
-out_delete_wi:
-	ctx->wi->iface->close(ctx->wi);
-	ctx->wi = NULL;
-out_delete_slices:
-	rlist_foreach_entry_safe(slice, &ctx->slices, in_join, tmp)
-		vy_slice_delete(slice);
-	rlist_create(&ctx->slices);
-out:
-	return rc;
-}
-
-/** Send all tuples stored in the given LSM tree. */
-static int
-vy_send_lsm(struct vy_join_ctx *ctx, struct vy_lsm_recovery_info *lsm_info)
-{
-	int rc = -1;
-
-	if (lsm_info->drop_lsn >= 0 || lsm_info->create_lsn < 0) {
-		/* Dropped or not yet built LSM tree. */
-		return 0;
-	}
-	if (lsm_info->group_id == GROUP_LOCAL) {
-		/* Replica local space. */
-		return 0;
-	}
-
-	/*
-	 * We are only interested in the primary index LSM tree.
-	 * Secondary keys will be rebuilt on the destination.
-	 */
-	if (lsm_info->index_id != 0)
-		return 0;
-
-	ctx->space_id = lsm_info->space_id;
-
-	/* Create key definition and tuple format. */
-	ctx->key_def = key_def_new(lsm_info->key_parts,
-				   lsm_info->key_part_count, false);
-	if (ctx->key_def == NULL)
-		goto out;
-	ctx->format = vy_stmt_format_new(&ctx->env->stmt_env, &ctx->key_def, 1,
-					 NULL, 0, 0, NULL);
-	if (ctx->format == NULL)
-		goto out_free_key_def;
-	tuple_format_ref(ctx->format);
-
-	/* Send ranges. */
-	struct vy_range_recovery_info *range_info;
-	assert(!rlist_empty(&lsm_info->ranges));
-	rlist_foreach_entry(range_info, &lsm_info->ranges, in_lsm) {
-		rc = vy_send_range(ctx, range_info);
-		if (rc != 0)
-			break;
-	}
-
-	tuple_format_unref(ctx->format);
-	ctx->format = NULL;
-out_free_key_def:
-	key_def_delete(ctx->key_def);
-	ctx->key_def = NULL;
-out:
-	return rc;
-}
-
-/** Relay cord function. */
-static int
-vy_join_f(va_list ap)
-{
-	struct vy_join_ctx *ctx = va_arg(ap, struct vy_join_ctx *);
-
-	coio_enable();
-
-	cpipe_create(&ctx->tx_pipe, "tx");
-
-	struct cbus_endpoint endpoint;
-	cbus_endpoint_create(&endpoint, cord_name(cord()),
-			     fiber_schedule_cb, fiber());
-
-	cbus_loop(&endpoint);
-
-	cbus_endpoint_destroy(&endpoint, cbus_process);
-	cpipe_destroy(&ctx->tx_pipe);
-	return 0;
-}
-
-static int
-vinyl_engine_join(struct engine *engine, const struct vclock *vclock,
-		  struct xstream *stream)
-{
-	struct vy_env *env = vy_env(engine);
-	int rc = -1;
-
-	/* Allocate the relay context. */
-	struct vy_join_ctx *ctx = malloc(sizeof(*ctx));
-	if (ctx == NULL) {
-		diag_set(OutOfMemory, PATH_MAX, "malloc", "struct vy_join_ctx");
-		goto out;
-	}
-	memset(ctx, 0, sizeof(*ctx));
-	ctx->env = env;
-	ctx->stream = stream;
-	rlist_create(&ctx->slices);
-
-	/* Start the relay cord. */
-	char name[FIBER_NAME_MAX];
-	snprintf(name, sizeof(name), "initial_join_%p", stream);
-	struct cord cord;
-	if (cord_costart(&cord, name, vy_join_f, ctx) != 0)
-		goto out_free_ctx;
-	cpipe_create(&ctx->relay_pipe, name);
-
-	/*
-	 * Load the recovery context from the given point in time.
-	 * Send all runs stored in it to the replica.
-	 */
-	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(vclock),
-				   VY_RECOVERY_LOAD_CHECKPOINT);
-	if (recovery == NULL) {
-		say_error("failed to recover vylog to join a replica");
-		goto out_join_cord;
-	}
-	rc = 0;
-	struct vy_lsm_recovery_info *lsm_info;
-	rlist_foreach_entry(lsm_info, &recovery->lsms, in_recovery) {
-		rc = vy_send_lsm(ctx, lsm_info);
-		if (rc != 0)
-			break;
-	}
-	vy_recovery_delete(recovery);
-
-out_join_cord:
-	cbus_stop_loop(&ctx->relay_pipe);
-	cpipe_destroy(&ctx->relay_pipe);
-	if (cord_cojoin(&cord) != 0)
-		rc = -1;
-out_free_ctx:
-	free(ctx);
-out:
-	return rc;
-}
-
-/* }}} Replication */
-
 /* {{{ Garbage collection */
 
 /**
@@ -3852,6 +3549,66 @@ vinyl_index_create_iterator(struct index *base, enum iterator_type type,
 	return (struct iterator *)it;
 }
 
+static int
+vinyl_snapshot_iterator_next(struct snapshot_iterator *base,
+			     const char **data, uint32_t *size)
+{
+	assert(base->next == vinyl_snapshot_iterator_next);
+	struct vinyl_snapshot_iterator *it =
+		(struct vinyl_snapshot_iterator *)base;
+	struct vy_entry entry;
+	if (vy_read_iterator_next(&it->iterator, &entry) != 0)
+		return -1;
+	*data = entry.stmt != NULL ? tuple_data_range(entry.stmt, size) : NULL;
+	return 0;
+}
+
+static void
+vinyl_snapshot_iterator_free(struct snapshot_iterator *base)
+{
+	assert(base->free == vinyl_snapshot_iterator_free);
+	struct vinyl_snapshot_iterator *it =
+		(struct vinyl_snapshot_iterator *)base;
+	struct vy_lsm *lsm = it->iterator.lsm;
+	struct vy_env *env = vy_env(lsm->base.engine);
+	vy_read_iterator_close(&it->iterator);
+	tx_manager_destroy_read_view(env->xm, it->rv);
+	vy_lsm_unref(lsm);
+	free(it);
+}
+
+static struct snapshot_iterator *
+vinyl_index_create_snapshot_iterator(struct index *base)
+{
+	struct vy_lsm *lsm = vy_lsm(base);
+	struct vy_env *env = vy_env(base->engine);
+
+	struct vinyl_snapshot_iterator *it = malloc(sizeof(*it));
+	if (it == NULL) {
+		diag_set(OutOfMemory, sizeof(*it), "malloc",
+			 "struct vinyl_snapshot_iterator");
+		return NULL;
+	}
+	it->base.next = vinyl_snapshot_iterator_next;
+	it->base.free = vinyl_snapshot_iterator_free;
+
+	it->rv = tx_manager_read_view(env->xm);
+	if (it->rv == NULL) {
+		free(it);
+		return NULL;
+	}
+	vy_read_iterator_open(&it->iterator, lsm, NULL,
+			      ITER_ALL, lsm->env->empty_key,
+			      (const struct vy_read_view **)&it->rv);
+	/*
+	 * The index may be dropped while we are reading it.
+	 * The iterator must go on as if nothing happened.
+	 */
+	vy_lsm_ref(lsm);
+
+	return &it->base;
+}
+
 static int
 vinyl_index_get(struct index *index, const char *key,
 		uint32_t part_count, struct tuple **ret)
@@ -4578,7 +4335,6 @@ static struct trigger on_replace_vinyl_deferred_delete = {
 static const struct engine_vtab vinyl_engine_vtab = {
 	/* .shutdown = */ vinyl_engine_shutdown,
 	/* .create_space = */ vinyl_engine_create_space,
-	/* .join = */ vinyl_engine_join,
 	/* .begin = */ vinyl_engine_begin,
 	/* .begin_statement = */ vinyl_engine_begin_statement,
 	/* .prepare = */ vinyl_engine_prepare,
@@ -4644,7 +4400,7 @@ static const struct index_vtab vinyl_index_vtab = {
 	/* .replace = */ generic_index_replace,
 	/* .create_iterator = */ vinyl_index_create_iterator,
 	/* .create_snapshot_iterator = */
-		generic_index_create_snapshot_iterator,
+		vinyl_index_create_snapshot_iterator,
 	/* .stat = */ vinyl_index_stat,
 	/* .compact = */ vinyl_index_compact,
 	/* .reset_stat = */ vinyl_index_reset_stat,
diff --git a/src/box/vy_run.c b/src/box/vy_run.c
index c6c17aee..25b6dcd3 100644
--- a/src/box/vy_run.c
+++ b/src/box/vy_run.c
@@ -1675,14 +1675,8 @@ vy_run_recover(struct vy_run *run, const char *dir,
 
 	/* Read run header. */
 	struct xrow_header xrow;
-	ERROR_INJECT(ERRINJ_VYRUN_INDEX_GARBAGE, {
-		errinj(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL)->bparam = true;
-	});
 	/* all rows should be in one tx */
 	int rc = xlog_cursor_next_tx(&cursor);
-	ERROR_INJECT(ERRINJ_VYRUN_INDEX_GARBAGE, {
-		errinj(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL)->bparam = false;
-	});
 
 	if (rc != 0) {
 		if (rc > 0)
diff --git a/src/box/vy_tx.c b/src/box/vy_tx.c
index 1a5d4837..d092e0cd 100644
--- a/src/box/vy_tx.c
+++ b/src/box/vy_tx.c
@@ -156,8 +156,7 @@ tx_manager_mem_used(struct tx_manager *xm)
 	return ret;
 }
 
-/** Create or reuse an instance of a read view. */
-static struct vy_read_view *
+struct vy_read_view *
 tx_manager_read_view(struct tx_manager *xm)
 {
 	struct vy_read_view *rv;
@@ -195,12 +194,9 @@ tx_manager_read_view(struct tx_manager *xm)
 	return rv;
 }
 
-/** Dereference and possibly destroy a read view. */
-static void
-tx_manager_destroy_read_view(struct tx_manager *xm,
-			     const struct vy_read_view *read_view)
+void
+tx_manager_destroy_read_view(struct tx_manager *xm, struct vy_read_view *rv)
 {
-	struct vy_read_view *rv = (struct vy_read_view *) read_view;
 	if (rv == xm->p_global_read_view)
 		return;
 	assert(rv->refs);
diff --git a/src/box/vy_tx.h b/src/box/vy_tx.h
index 376f4330..3144c921 100644
--- a/src/box/vy_tx.h
+++ b/src/box/vy_tx.h
@@ -289,6 +289,14 @@ tx_manager_delete(struct tx_manager *xm);
 size_t
 tx_manager_mem_used(struct tx_manager *xm);
 
+/** Create or reuse an instance of a read view. */
+struct vy_read_view *
+tx_manager_read_view(struct tx_manager *xm);
+
+/** Dereference and possibly destroy a read view. */
+void
+tx_manager_destroy_read_view(struct tx_manager *xm, struct vy_read_view *rv);
+
 /**
  * Abort all rw transactions that affect the given space
  * and haven't reached WAL yet. Called before executing a DDL
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index e75a620d..3072a00e 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -102,11 +102,11 @@ struct errinj {
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_FINAL_SLEEP, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_RELAY_FINAL_JOIN, ERRINJ_BOOL, {.bparam = false}) \
+	_(ERRINJ_REPLICA_JOIN_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_PORT_DUMP, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_META, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_READ, ERRINJ_INT, {.iparam = -1}) \
-	_(ERRINJ_VYRUN_INDEX_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_VYRUN_DATA_READ, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_CHECK_FORMAT_DELAY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_BUILD_INDEX, ERRINJ_INT, {.iparam = -1}) \
diff --git a/test/box/errinj.result b/test/box/errinj.result
index 5784758d..ebe630e1 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -28,7 +28,7 @@ errinj.info()
     state: 0
   ERRINJ_COIO_SENDFILE_CHUNK:
     state: -1
-  ERRINJ_VY_LOG_FILE_RENAME:
+  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
     state: false
   ERRINJ_WAL_WRITE_PARTIAL:
     state: -1
@@ -42,15 +42,15 @@ errinj.info()
     state: false
   ERRINJ_WAL_SYNC:
     state: false
-  ERRINJ_VYRUN_INDEX_GARBAGE:
-    state: false
-  ERRINJ_BUILD_INDEX_DELAY:
-    state: false
   ERRINJ_BUILD_INDEX:
     state: -1
-  ERRINJ_VY_INDEX_FILE_RENAME:
+  ERRINJ_BUILD_INDEX_DELAY:
     state: false
-  ERRINJ_CHECK_FORMAT_DELAY:
+  ERRINJ_VY_RUN_FILE_RENAME:
+    state: false
+  ERRINJ_VY_COMPACTION_DELAY:
+    state: false
+  ERRINJ_VY_DUMP_DELAY:
     state: false
   ERRINJ_VY_DELAY_PK_LOOKUP:
     state: false
@@ -58,16 +58,16 @@ errinj.info()
     state: false
   ERRINJ_PORT_DUMP:
     state: false
-  ERRINJ_VY_DUMP_DELAY:
-    state: false
+  ERRINJ_WAL_BREAK_LSN:
+    state: -1
   ERRINJ_WAL_IO:
     state: false
   ERRINJ_WAL_FALLOCATE:
     state: 0
-  ERRINJ_WAL_BREAK_LSN:
-    state: -1
   ERRINJ_RELAY_BREAK_LSN:
     state: -1
+  ERRINJ_VY_INDEX_FILE_RENAME:
+    state: false
   ERRINJ_TUPLE_FORMAT_COUNT:
     state: -1
   ERRINJ_TUPLE_ALLOC:
@@ -78,7 +78,7 @@ errinj.info()
     state: false
   ERRINJ_RELAY_REPORT_INTERVAL:
     state: 0
-  ERRINJ_VY_RUN_FILE_RENAME:
+  ERRINJ_VY_LOG_FILE_RENAME:
     state: false
   ERRINJ_VY_READ_PAGE_TIMEOUT:
     state: 0
@@ -86,23 +86,23 @@ errinj.info()
     state: false
   ERRINJ_SIO_READ_MAX:
     state: -1
-  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
-    state: false
-  ERRINJ_WAL_WRITE_DISK:
-    state: false
   ERRINJ_SNAP_COMMIT_DELAY:
     state: false
+  ERRINJ_WAL_WRITE_DISK:
+    state: false
   ERRINJ_SNAP_WRITE_DELAY:
     state: false
-  ERRINJ_VY_RUN_WRITE:
-    state: false
   ERRINJ_LOG_ROTATE:
     state: false
+  ERRINJ_VY_RUN_WRITE:
+    state: false
+  ERRINJ_CHECK_FORMAT_DELAY:
+    state: false
   ERRINJ_VY_LOG_FLUSH_DELAY:
     state: false
   ERRINJ_RELAY_FINAL_JOIN:
     state: false
-  ERRINJ_VY_COMPACTION_DELAY:
+  ERRINJ_REPLICA_JOIN_DELAY:
     state: false
   ERRINJ_RELAY_FINAL_SLEEP:
     state: false
diff --git a/test/replication-py/cluster.result b/test/replication-py/cluster.result
index 04f06f74..f68a6af7 100644
--- a/test/replication-py/cluster.result
+++ b/test/replication-py/cluster.result
@@ -23,19 +23,6 @@ box.schema.user.grant('guest', 'replication')
 ...
 ok - join with granted role
 -------------------------------------------------------------
-gh-707: Master crashes on JOIN if it does not have snapshot files
-gh-480: If socket is closed while JOIN, replica wont reconnect
--------------------------------------------------------------
-ok - join without snapshots
-ok - _cluster did not change after unsuccessful JOIN
-box.schema.user.revoke('guest', 'replication')
----
-...
-box.snapshot()
----
-- ok
-...
--------------------------------------------------------------
 gh-434: Assertion if replace _cluster tuple for local server
 -------------------------------------------------------------
 box.space._cluster:replace{1, require('uuid').NULL:str()}
diff --git a/test/replication-py/cluster.test.py b/test/replication-py/cluster.test.py
index 0140a6bd..088ca9c3 100644
--- a/test/replication-py/cluster.test.py
+++ b/test/replication-py/cluster.test.py
@@ -71,31 +71,6 @@ server.iproto.reconnect() # re-connect with new permissions
 server_id = check_join('join with granted role')
 server.iproto.py_con.space('_cluster').delete(server_id)
 
-print '-------------------------------------------------------------'
-print 'gh-707: Master crashes on JOIN if it does not have snapshot files'
-print 'gh-480: If socket is closed while JOIN, replica wont reconnect'
-print '-------------------------------------------------------------'
-
-data_dir = os.path.join(server.vardir, server.name)
-for k in glob.glob(os.path.join(data_dir, '*.snap')):
-    os.unlink(k)
-
-# remember the number of servers in _cluster table
-server_count = len(server.iproto.py_con.space('_cluster').select(()))
-
-rows = list(server.iproto.py_con.join(replica_uuid))
-print len(rows) > 0 and rows[-1].return_message.find('.snap') >= 0 and \
-    'ok' or 'not ok', '-', 'join without snapshots'
-res = server.iproto.py_con.space('_cluster').select(())
-if server_count <= len(res):
-    print 'ok - _cluster did not change after unsuccessful JOIN'
-else:
-    print 'not ok - _cluster did change after unsuccessful JOIN'
-    print res
-
-server.admin("box.schema.user.revoke('guest', 'replication')")
-server.admin('box.snapshot()')
-
 print '-------------------------------------------------------------'
 print 'gh-434: Assertion if replace _cluster tuple for local server'
 print '-------------------------------------------------------------'
diff --git a/test/replication/join_without_snap.result b/test/replication/join_without_snap.result
new file mode 100644
index 00000000..becdfd21
--- /dev/null
+++ b/test/replication/join_without_snap.result
@@ -0,0 +1,88 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-1271: check that replica join works off the current read view,
+-- not the last checkpoint. To do that, delete the last snapshot file
+-- and check that a replica can still join.
+--
+_ = box.schema.space.create('test')
+ | ---
+ | ...
+_ = box.space.test:create_index('pk')
+ | ---
+ | ...
+for i = 1, 5 do box.space.test:insert{i} end
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
+
+fio = require('fio')
+ | ---
+ | ...
+fio.unlink(fio.pathjoin(box.cfg.memtx_dir, string.format('%020d.snap', box.info.signature)))
+ | ---
+ | - true
+ | ...
+
+box.schema.user.grant('guest', 'replication')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('switch replica')
+ | ---
+ | - true
+ | ...
+
+box.space.test:select()
+ | ---
+ | - - [1]
+ |   - [2]
+ |   - [3]
+ |   - [4]
+ |   - [5]
+ | ...
+
+test_run:cmd('switch default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('cleanup server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+test_run:cleanup_cluster()
+ | ---
+ | ...
+
+box.schema.user.revoke('guest', 'replication')
+ | ---
+ | ...
+box.space.test:drop()
+ | ---
+ | ...
+box.snapshot()
+ | ---
+ | - ok
+ | ...
diff --git a/test/replication/join_without_snap.test.lua b/test/replication/join_without_snap.test.lua
new file mode 100644
index 00000000..6a23d741
--- /dev/null
+++ b/test/replication/join_without_snap.test.lua
@@ -0,0 +1,32 @@
+test_run = require('test_run').new()
+
+--
+-- gh-1271: check that replica join works off the current read view,
+-- not the last checkpoint. To do that, delete the last snapshot file
+-- and check that a replica can still join.
+--
+_ = box.schema.space.create('test')
+_ = box.space.test:create_index('pk')
+for i = 1, 5 do box.space.test:insert{i} end
+box.snapshot()
+
+fio = require('fio')
+fio.unlink(fio.pathjoin(box.cfg.memtx_dir, string.format('%020d.snap', box.info.signature)))
+
+box.schema.user.grant('guest', 'replication')
+
+test_run:cmd('create server replica with rpl_master=default, script="replication/replica.lua"')
+test_run:cmd('start server replica')
+test_run:cmd('switch replica')
+
+box.space.test:select()
+
+test_run:cmd('switch default')
+test_run:cmd('stop server replica')
+test_run:cmd('cleanup server replica')
+test_run:cmd('delete server replica')
+test_run:cleanup_cluster()
+
+box.schema.user.revoke('guest', 'replication')
+box.space.test:drop()
+box.snapshot()
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 91e884ec..eb25077d 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -10,6 +10,7 @@
     "force_recovery.test.lua": {},
     "on_schema_init.test.lua": {},
     "long_row_timeout.test.lua": {},
+    "join_without_snap.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
diff --git a/test/vinyl/errinj.result b/test/vinyl/errinj.result
index e8795143..2635da26 100644
--- a/test/vinyl/errinj.result
+++ b/test/vinyl/errinj.result
@@ -1116,7 +1116,7 @@ box.snapshot()
 box.schema.user.grant('guest', 'replication')
 ---
 ...
-errinj.set('ERRINJ_VYRUN_INDEX_GARBAGE', true)
+errinj.set('ERRINJ_VY_READ_PAGE', true)
 ---
 - ok
 ...
@@ -1136,7 +1136,7 @@ test_run:cmd("delete server replica")
 ---
 - true
 ...
-errinj.set('ERRINJ_VYRUN_INDEX_GARBAGE', false)
+errinj.set('ERRINJ_VY_READ_PAGE', false)
 ---
 - ok
 ...
diff --git a/test/vinyl/errinj.test.lua b/test/vinyl/errinj.test.lua
index 034ed34c..4230cfae 100644
--- a/test/vinyl/errinj.test.lua
+++ b/test/vinyl/errinj.test.lua
@@ -404,12 +404,12 @@ _ = s:create_index('pk')
 s:replace{1, 2, 3}
 box.snapshot()
 box.schema.user.grant('guest', 'replication')
-errinj.set('ERRINJ_VYRUN_INDEX_GARBAGE', true)
+errinj.set('ERRINJ_VY_READ_PAGE', true)
 test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
 test_run:cmd("start server replica with crash_expected=True")
 test_run:cmd("cleanup server replica")
 test_run:cmd("delete server replica")
-errinj.set('ERRINJ_VYRUN_INDEX_GARBAGE', false)
+errinj.set('ERRINJ_VY_READ_PAGE', false)
 box.schema.user.revoke('guest', 'replication')
 s:drop()
 
diff --git a/test/xlog/panic_on_broken_lsn.result b/test/xlog/panic_on_broken_lsn.result
index cddc9c3b..60283281 100644
--- a/test/xlog/panic_on_broken_lsn.result
+++ b/test/xlog/panic_on_broken_lsn.result
@@ -123,16 +123,33 @@ box.space.test:auto_increment{'v0'}
 ---
 - [1, 'v0']
 ...
-lsn = box.info.vclock[1]
+-- Inject a broken LSN in the final join stage.
+lsn = -1
 ---
 ...
-box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
+box.error.injection.set("ERRINJ_REPLICA_JOIN_DELAY", true)
 ---
 - ok
 ...
-box.space.test:auto_increment{'v1'}
+fiber = require('fiber')
 ---
-- [2, 'v1']
+...
+test_run:cmd("setopt delimiter ';'")
+---
+- true
+...
+_ = fiber.create(function()
+    test_run:wait_cond(function() return box.space._cluster:get(2) ~= nil end)
+    lsn = box.info.vclock[1]
+    box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
+    box.space.test:auto_increment{'v1'}
+    box.error.injection.set("ERRINJ_REPLICA_JOIN_DELAY", false)
+end);
+---
+...
+test_run:cmd("setopt delimiter ''");
+---
+- true
 ...
 test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 ---
@@ -142,12 +159,6 @@ test_run:cmd('start server replica with crash_expected=True')
 ---
 - false
 ...
-fiber = require('fiber')
----
-...
-while box.info.replication[2] == nil do fiber.sleep(0.001) end
----
-...
 box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", -1)
 ---
 - ok
diff --git a/test/xlog/panic_on_broken_lsn.test.lua b/test/xlog/panic_on_broken_lsn.test.lua
index cdf287a1..ca304345 100644
--- a/test/xlog/panic_on_broken_lsn.test.lua
+++ b/test/xlog/panic_on_broken_lsn.test.lua
@@ -57,14 +57,24 @@ box.schema.user.grant('guest', 'replication')
 _ = box.schema.space.create('test', {id = 9000})
 _ = box.space.test:create_index('pk')
 box.space.test:auto_increment{'v0'}
-lsn = box.info.vclock[1]
-box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
-box.space.test:auto_increment{'v1'}
+
+-- Inject a broken LSN in the final join stage.
+lsn = -1
+box.error.injection.set("ERRINJ_REPLICA_JOIN_DELAY", true)
+
+fiber = require('fiber')
+test_run:cmd("setopt delimiter ';'")
+_ = fiber.create(function()
+    test_run:wait_cond(function() return box.space._cluster:get(2) ~= nil end)
+    lsn = box.info.vclock[1]
+    box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", lsn + 1)
+    box.space.test:auto_increment{'v1'}
+    box.error.injection.set("ERRINJ_REPLICA_JOIN_DELAY", false)
+end);
+test_run:cmd("setopt delimiter ''");
 
 test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 test_run:cmd('start server replica with crash_expected=True')
-fiber = require('fiber')
-while box.info.replication[2] == nil do fiber.sleep(0.001) end
 box.error.injection.set("ERRINJ_RELAY_BREAK_LSN", -1)
 
 -- Check that log contains the mention of broken LSN and the request printout
-- 
2.20.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] Join replicas off the current read view
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (6 preceding siblings ...)
  2019-08-19 16:53 ` [PATCH v2 7/7] relay: join new replicas off read view Vladimir Davydov
@ 2019-08-19 16:54 ` Vladimir Davydov
  2019-08-20  8:53 ` Vladimir Davydov
  8 siblings, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-19 16:54 UTC (permalink / raw)
  To: tarantool-patches

On Mon, Aug 19, 2019 at 07:53:13PM +0300, Vladimir Davydov wrote:
> Currently, we join replicas off the last checkpoint. As a result, we
> must keep all files corresponding to the last checkpoint. This means
> that we must always create a memtx snapshot file on initial call to
> box.cfg() even though it is virtually the same for all instances.
> Besides, we must rotate the vylog file synchronously with snapshot
> creation, otherwise we wouldn't be able to pull all vinyl files
> corresponding to the last checkpoint. This interconnection between
> vylog and xlog makes the code difficult to maintain.
> 
> Actually, nothing prevents us from relaying the current read view
> instead of the last checkpoint on initial join, as both memtx and
> vinyl support a consistent read view. This patch does the trick.
> This is a step towards making vylog independent of checkpointing
> and WAL.
> 
> https://github.com/tarantool/tarantool/issues/1271
> https://github.com/tarantool/tarantool/issues/4417
> https://github.com/tarantool/tarantool/commits/dv/gh-1271-rework-replica-join
> 
> Changes in v2:
>  - Commit preparatory patches approved by Kostja and rebase on
>    the latest master branch.
>  - Fix the issue with box.on_schema_init and space.before_replace
>    instead of disabling the test (see #4417).

v1 can be found here:

https://www.freelists.org/post/tarantool-patches/PATCH-0013-Join-replicas-off-the-current-read-view

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime
  2019-08-19 16:53 ` [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime Vladimir Davydov
@ 2019-08-19 20:35   ` Konstantin Osipov
  0 siblings, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:35 UTC (permalink / raw)
  To: tarantool-patches


lgtm. 

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-19 16:53 ` [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction Vladimir Davydov
@ 2019-08-19 20:47   ` Konstantin Osipov
  2019-08-20  8:12     ` Vladimir Davydov
  2019-08-20 14:16   ` Vladimir Davydov
  1 sibling, 1 reply; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:47 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
>  	vy_lsm_read_set_t read_set;
> +	/**
> +	 * Triggers run when the last reference to this LSM tree
> +	 * is dropped and the LSM tree is about to be destroyed.
> +	 * A pointer to this LSM tree is passed to the trigger
> +	 * callback in the 'event' argument.
> +	 */
> +	struct rlist on_destroy;

Please explain in the comment that the compaction scheduler task takes a
reference as well, so if the dropped index happens to be compacted
at the moment, it will be dropped only when the compaction task
finishes.

Why did you add a trigger - to avoid dependency
loops between vy_lsm and vy_scheduler? But sounds like it's
simpler to make vy_lsm aware of vy_scheduler and register/unregister
itself in create/destroy. Why did you choose to add a trigger
instead?


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn
  2019-08-19 16:53 ` [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn Vladimir Davydov
@ 2019-08-19 20:49   ` Konstantin Osipov
  0 siblings, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:49 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> This fake LSN counter, which is used for assigning LSNs to Vinyl
> statements during the initial join stage, was introduced a long time
> ago, when LSNs were used as identifiers for lsregion allocations and
> hence were supposed to grow strictly monotonically with each new
> transaction. Later on, they were reused for assigning unique LSNs to
> identify indexes in vylog.
> 
> These days, we don't need initial join LSNs to be unique, as we switched
> to generations for lsregion allocations while in vylog we now use LSNs
> only as an incarnation counter, not as a unique identifier. That said,
> let's zap vy_env::join_lsn and simply assign 0 to all statements
> received during the initial join stage.
> 
> To achieve that, we just need to relax an assertion in vy_tx_commit()
> and remove the assumption that an LSN can't be zero in the write
> iterator implementation.

lgtm

 

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot
  2019-08-19 16:53 ` [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot Vladimir Davydov
@ 2019-08-19 20:50   ` Konstantin Osipov
  0 siblings, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:50 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> Currently, to prevent an index from going away while it is being
> written to a snapshot, we postpone memtx_gc_task's free() invocation
> until checkpointing is complete, see commit 94de0a081b3a ("Don't take
> schema lock for checkpointing"). This works fine, but makes it rather
> difficult to reuse snapshot iterators for other purposes, e.g. feeding
> a consistent read view to a newly joined replica.
> 
> Let's instead use index reference counting for pinning indexes for
> checkpointing. A reference is taken in a snapshot iterator constructor
> and released when the snapshot iterator is destroyed.

lgtm


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator
  2019-08-19 16:53 ` [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator Vladimir Davydov
@ 2019-08-19 20:51   ` Konstantin Osipov
  0 siblings, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:51 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> We must enable SMALL_DELAYED_FREE_MODE to safely use a memtx snapshot
> iterator. Currently, we do that in checkpoint related callbacks, but if
> we want to reuse snapshot iterators for other purposes, e.g. feeding
> a read view to a newly joined replica, we better hide this code behind
> snapshot iterator constructors.

As discussed, better garbage collection is a separate patch. With
that, lgtm.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 6/7] space: get rid of apply_initial_join_row method
  2019-08-19 16:53 ` [PATCH v2 6/7] space: get rid of apply_initial_join_row method Vladimir Davydov
@ 2019-08-19 20:54   ` Konstantin Osipov
  0 siblings, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:54 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> There's no reason to use a special method instead of the generic
> space_execute_dml for applying rows received from a master during the
> initial join stage. Moreover, using the special method results in not
> running space.before_replace trigger, which makes it impossible to, for
> example, update space engine on a replica, see the on_schema_init test
> of the replication test suite.
> 
> So this patch removes the special method altogether and makes the code
> that used it switch to space_execute_dml.
> 
> Closes #4417

LGTM

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-19 16:53 ` [PATCH v2 7/7] relay: join new replicas off read view Vladimir Davydov
@ 2019-08-19 20:57   ` Konstantin Osipov
  2019-08-20  8:16     ` Vladimir Davydov
  0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-19 20:57 UTC (permalink / raw)
  To: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> Historically, we join a new replica off the last checkpoint. As a
> result, we must always keep the last memtx snapshot and all vinyl data
> files corresponding to it. Actually, there's no need to use the last
> checkpoint for joining a replica. Instead we can use the current read
> view as both memtx and vinyl support it. This should speed up the
> process of joining a new replica, because we don't need to replay all
> xlogs written after the last checkpoint, only those that are accumulated
> while we are relaying the current read view. This should also allow us
> to avoid creating a snapshot file on bootstrap, because the only reason
> why we need it is allowing joining replicas. Besides, this is a step
> towards decoupling the vinyl metadata log from checkpointing in
> particular and from xlogs in general.
> 

How does this work given relay_* functions are running in a relay thread? 


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-19 20:47   ` [tarantool-patches] " Konstantin Osipov
@ 2019-08-20  8:12     ` Vladimir Davydov
  2019-08-20  9:02       ` Vladimir Davydov
  2019-08-20 11:52       ` Konstantin Osipov
  0 siblings, 2 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20  8:12 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 19, 2019 at 11:47:00PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> >  	vy_lsm_read_set_t read_set;
> > +	/**
> > +	 * Triggers run when the last reference to this LSM tree
> > +	 * is dropped and the LSM tree is about to be destroyed.
> > +	 * A pointer to this LSM tree is passed to the trigger
> > +	 * callback in the 'event' argument.
> > +	 */
> > +	struct rlist on_destroy;
> 
> Please explain in the comment that the compaction scheduler task takes a
> reference as well, so if the dropped index happens to be compacted
> at the moment, it will be dropped only when the compaction task
> finishes.

True. We could avoid that, but for now it isn't worth bothering about
IMO. I'll add a comment.

> 
> Why did you add a trigger - to avoid dependency
> loops between vy_lsm and vy_scheduler? But sounds like it's
> simpler to make vy_lsm aware of vy_scheduler and register/unregister
> itself in create/destroy. Why did you choose to add a trigger
> instead?

To avoid a dependency loop: vy_scheduler.[hc] depends on vy_lsm.[hc],
but not vice versa, which is nice IMO.

Besides, I think we could reuse the trigger for other purposes, e.g. to
force data file deletion when an index is finally released.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-19 20:57   ` [tarantool-patches] " Konstantin Osipov
@ 2019-08-20  8:16     ` Vladimir Davydov
  2019-08-20 11:53       ` Konstantin Osipov
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20  8:16 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Mon, Aug 19, 2019 at 11:57:21PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> > Historically, we join a new replica off the last checkpoint. As a
> > result, we must always keep the last memtx snapshot and all vinyl data
> > files corresponding to it. Actually, there's no need to use the last
> > checkpoint for joining a replica. Instead we can use the current read
> > view as both memtx and vinyl support it. This should speed up the
> > process of joining a new replica, because we don't need to replay all
> > xlogs written after the last checkpoint, only those that are accumulated
> > while we are relaying the current read view. This should also allow us
> > to avoid creating a snapshot file on bootstrap, because the only reason
> > why we need it is allowing joining replicas. Besides, this is a step
> > towards decoupling the vinyl metadata log from checkpointing in
> > particular and from xlogs in general.
> > 
> 
> How does this work given relay_* functions are running in a relay thread? 

Those functions don't run in a relay thread. Just like in case of index
build, we open and use iterators in the tx thread.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] Join replicas off the current read view
  2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
                   ` (7 preceding siblings ...)
  2019-08-19 16:54 ` [PATCH v2 0/7] Join replicas off the current " Vladimir Davydov
@ 2019-08-20  8:53 ` Vladimir Davydov
  8 siblings, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20  8:53 UTC (permalink / raw)
  To: tarantool-patches

Pushed the following patches approved by Kostja to master and
rebased the branch:

>   vinyl: don't pin index for iterator lifetime
>   vinyl: get rid of vy_env::join_lsn
>   memtx: use ref counting to pin indexes for snapshot
>   memtx: enter small delayed free mode from snapshot iterator
>   space: get rid of apply_initial_join_row method

The following patches are still pending review/discussion.

>   vinyl: don't exempt dropped indexes from dump and compaction
>   relay: join new replicas off read view

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-20  8:12     ` Vladimir Davydov
@ 2019-08-20  9:02       ` Vladimir Davydov
  2019-08-20 11:52       ` Konstantin Osipov
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20  9:02 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 20, 2019 at 11:12:27AM +0300, Vladimir Davydov wrote:
> On Mon, Aug 19, 2019 at 11:47:00PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/19 19:57]:
> > >  	vy_lsm_read_set_t read_set;
> > > +	/**
> > > +	 * Triggers run when the last reference to this LSM tree
> > > +	 * is dropped and the LSM tree is about to be destroyed.
> > > +	 * A pointer to this LSM tree is passed to the trigger
> > > +	 * callback in the 'event' argument.
> > > +	 */
> > > +	struct rlist on_destroy;
> > 
> > Please explain in the comment that the compaction scheduler task takes a
> > reference as well, so if the dropped index happens to be compacted
> > at the moment, it will be dropped only when the compaction task
> > finishes.
> 
> True. We could avoid that, but for now it isn't worth bothering about
> IMO. I'll add a comment.

Here goes the comment:

diff --git a/src/box/vy_lsm.h b/src/box/vy_lsm.h
index 47f8ee6a..3b553ea5 100644
--- a/src/box/vy_lsm.h
+++ b/src/box/vy_lsm.h
@@ -317,6 +317,13 @@ struct vy_lsm {
 	 * is dropped and the LSM tree is about to be destroyed.
 	 * A pointer to this LSM tree is passed to the trigger
 	 * callback in the 'event' argument.
+	 *
+	 * For instance, this trigger is used to remove a dropped
+	 * LSM tree from the scheduler before it gets destroyed.
+	 * Since each dump/compaction task takes a reference to
+	 * the target index, this means that a dropped index will
+	 * not get destroyed until all tasks scheduled for it have
+	 * been completed.
 	 */
 	struct rlist on_destroy;
 };
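
The lifetime rule described in the comment above can be sketched in isolation. This is a minimal, hypothetical model, not the vinyl code: struct lsm, struct destroy_trigger and struct scheduler are stand-ins invented for illustration, and a plain singly linked list is assumed in place of rlist. It shows only that an object dropped while a task still holds a reference is destroyed, and its on_destroy trigger run, when the last reference goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a trigger; not the real trigger API. */
struct destroy_trigger {
	void (*run)(void *data, void *event);
	void *data;
	struct destroy_trigger *next;
};

/* Hypothetical stand-in for an LSM tree: a ref counter plus triggers. */
struct lsm {
	int refs;
	struct destroy_trigger *on_destroy;
};

/* Hypothetical stand-in for the scheduler: counts the trees it tracks. */
struct scheduler {
	int tracked;
};

static struct lsm *
lsm_new(void)
{
	struct lsm *lsm = calloc(1, sizeof(*lsm));
	if (lsm != NULL)
		lsm->refs = 1;
	return lsm;
}

static void
lsm_ref(struct lsm *lsm)
{
	lsm->refs++;
}

/* Drop one reference. Returns 1 if the object was destroyed, else 0. */
static int
lsm_unref(struct lsm *lsm)
{
	assert(lsm->refs > 0);
	if (--lsm->refs > 0)
		return 0;
	/* Last reference: run the triggers, passing the object as the event. */
	for (struct destroy_trigger *t = lsm->on_destroy; t != NULL; t = t->next)
		t->run(t->data, lsm);
	free(lsm);
	return 1;
}

/* Trigger callback: the scheduler stops tracking the destroyed tree. */
static void
scheduler_forget_lsm(void *data, void *event)
{
	struct scheduler *scheduler = data;
	(void)event;
	scheduler->tracked--;
}

/* The scheduler registers itself; struct lsm never names the scheduler. */
static void
scheduler_add_lsm(struct scheduler *scheduler, struct lsm *lsm,
		  struct destroy_trigger *trigger)
{
	trigger->run = scheduler_forget_lsm;
	trigger->data = scheduler;
	trigger->next = lsm->on_destroy;
	lsm->on_destroy = trigger;
	scheduler->tracked++;
}

/* Returns 1 if destruction is deferred until the task drops its reference. */
int
demo_drop_during_compaction(void)
{
	struct scheduler scheduler = {0};
	struct destroy_trigger trigger;
	struct lsm *lsm = lsm_new();
	if (lsm == NULL)
		return -1;
	scheduler_add_lsm(&scheduler, lsm, &trigger);
	lsm_ref(lsm);					/* a task pins the tree */
	int destroyed_on_drop = lsm_unref(lsm);		/* the index is dropped */
	int tracked_during_task = scheduler.tracked;
	int destroyed_on_finish = lsm_unref(lsm);	/* the task completes */
	return destroyed_on_drop == 0 && tracked_during_task == 1 &&
	       destroyed_on_finish == 1 && scheduler.tracked == 0;
}
```

In this model the tree type has no knowledge of the scheduler type. The scheduler attaches a callback, which is the one-way dependency argued for earlier in the thread.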

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-20  8:12     ` Vladimir Davydov
  2019-08-20  9:02       ` Vladimir Davydov
@ 2019-08-20 11:52       ` Konstantin Osipov
  1 sibling, 0 replies; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-20 11:52 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 11:15]:
> > Why did you add a trigger - to avoid dependency
> > loops between vy_lsm and vy_scheduler? But sounds like it's
> > simpler to make vy_lsm aware of vy_scheduler and register/unregister
> > itself in create/destroy. Why did you choose to add a trigger
> > instead?
> 
> To avoid a dependency loop: vy_scheduler.[hc] depends on vy_lsm.[hc],
> but not vice versa, which is nice IMO.
> 
> Besides, I think we could reuse the trigger for other purposes, e.g. to
> force data file deletion when an index is finally released.

OK. The dependency loop is still there, it's just less explicit.
But let it be, it's not a big deal.
-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-20  8:16     ` Vladimir Davydov
@ 2019-08-20 11:53       ` Konstantin Osipov
  2019-08-20 12:05         ` Vladimir Davydov
  0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-20 11:53 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 11:22]:
> > > Historically, we join a new replica off the last checkpoint. As a
> > > result, we must always keep the last memtx snapshot and all vinyl data
> > > files corresponding to it. Actually, there's no need to use the last
> > > checkpoint for joining a replica. Instead we can use the current read
> > > view as both memtx and vinyl support it. This should speed up the
> > > process of joining a new replica, because we don't need to replay all
> > > xlogs written after the last checkpoint, only those that are accumulated
> > > while we are relaying the current read view. This should also allow us
> > > to avoid creating a snapshot file on bootstrap, because the only reason
> > > why we need it is allowing joining replicas. Besides, this is a step
> > > towards decoupling the vinyl metadata log from checkpointing in
> > > particular and from xlogs in general.
> > > 
> > 
> > How does this work given relay_* functions are running in a relay thread? 
> 
> Those functions don't run in a relay thread. Just like in case of index
> build, we open and use iterators in the tx thread.

Then they shouldn't be prefixed relay_*, this is confusing.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-20 11:53       ` Konstantin Osipov
@ 2019-08-20 12:05         ` Vladimir Davydov
  2019-08-20 13:50           ` Konstantin Osipov
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20 12:05 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 20, 2019 at 02:53:33PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 11:22]:
> > > > Historically, we join a new replica off the last checkpoint. As a
> > > > result, we must always keep the last memtx snapshot and all vinyl data
> > > > files corresponding to it. Actually, there's no need to use the last
> > > > checkpoint for joining a replica. Instead we can use the current read
> > > > view as both memtx and vinyl support it. This should speed up the
> > > > process of joining a new replica, because we don't need to replay all
> > > > xlogs written after the last checkpoint, only those that are accumulated
> > > > while we are relaying the current read view. This should also allow us
> > > > to avoid creating a snapshot file on bootstrap, because the only reason
> > > > why we need it is allowing joining replicas. Besides, this is a step
> > > > towards decoupling the vinyl metadata log from checkpointing in
> > > > particular and from xlogs in general.
> > > > 
> > > 
> > > How does this work given relay_* functions are running in a relay thread? 
> > 
> > Those functions don't run in a relay thread. Just like in case of index
> > build, we open and use iterators in the tx thread.
> 
> Then they shouldn't be prefixed relay_*, this is confusing.

Well, yeah, kinda. OTOH they do relay data to a replica that's why I
named them relay_something :-/ Also, those functions live in relay.cc,
which is consistent with the relay_ prefix.

If not relay_, what prefix do you think we should use then? join_?
Maybe we should also move those functions to a separate file? join.c?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-20 12:05         ` Vladimir Davydov
@ 2019-08-20 13:50           ` Konstantin Osipov
  2019-08-20 14:03             ` Vladimir Davydov
  0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-20 13:50 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 15:10]:
> Well, yeah, kinda. OTOH they do relay data to a replica that's why I
> named them relay_something :-/ Also, those functions live in relay.cc,
> which is consistent with the relay_ prefix.

engine_send? I'll think about it.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-20 13:50           ` Konstantin Osipov
@ 2019-08-20 14:03             ` Vladimir Davydov
  2019-08-21 22:08               ` Konstantin Osipov
  0 siblings, 1 reply; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20 14:03 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Tue, Aug 20, 2019 at 04:50:07PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 15:10]:
> > Well, yeah, kinda. OTOH they do relay data to a replica that's why I
> > named them relay_something :-/ Also, those functions live in relay.cc,
> > which is consistent with the relay_ prefix.
> 
> engine_send? I'll think about it.

But the code is engine-agnostic now - it just opens read-view iterators
and sends whatever they return to a replica :-/
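
A rough sketch of what "engine-agnostic" means here, under stated assumptions: struct rv_iterator, struct array_iterator and send_read_view() are hypothetical names invented for this illustration, not the functions from the patch. The send loop sees only a next() callback, so it does not care which engine produced the iterator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-view iterator interface; not the real Tarantool type. */
struct rv_iterator {
	/* Returns the next row, or NULL once the read view is exhausted. */
	const char *(*next)(struct rv_iterator *it);
};

/* Toy "engine": iterates over an in-memory array of rows. */
struct array_iterator {
	struct rv_iterator base;	/* must be first, see the cast below */
	const char **rows;
	size_t count;
	size_t pos;
};

static const char *
array_iterator_next(struct rv_iterator *base)
{
	/* Valid because base is the first member of struct array_iterator. */
	struct array_iterator *it = (struct array_iterator *)base;
	if (it->pos >= it->count)
		return NULL;
	return it->rows[it->pos++];
}

static void
array_iterator_create(struct array_iterator *it, const char **rows,
		      size_t count)
{
	it->base.next = array_iterator_next;
	it->rows = rows;
	it->count = count;
	it->pos = 0;
}

/* Stub transport: a real sender would write the row to a socket. */
static void
discard_row(const char *row, void *ctx)
{
	(void)row;
	(void)ctx;
}

/*
 * Generic send loop: drain each iterator in turn and hand every row
 * to send(). Returns the number of rows sent.
 */
size_t
send_read_view(struct rv_iterator **iterators, size_t iterator_count,
	       void (*send)(const char *row, void *ctx), void *ctx)
{
	size_t sent = 0;
	for (size_t i = 0; i < iterator_count; i++) {
		const char *row;
		while ((row = iterators[i]->next(iterators[i])) != NULL) {
			send(row, ctx);
			sent++;
		}
	}
	return sent;
}

/* Two iterators standing in for two engines; returns the rows sent. */
size_t
demo_send_two_engines(void)
{
	const char *first_rows[] = {"a", "b"};
	const char *second_rows[] = {"c"};
	struct array_iterator first, second;
	array_iterator_create(&first, first_rows, 2);
	array_iterator_create(&second, second_rows, 1);
	struct rv_iterator *iterators[] = {&first.base, &second.base};
	return send_read_view(iterators, 2, discard_row, NULL);
}
```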

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction
  2019-08-19 16:53 ` [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction Vladimir Davydov
  2019-08-19 20:47   ` [tarantool-patches] " Konstantin Osipov
@ 2019-08-20 14:16   ` Vladimir Davydov
  1 sibling, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-20 14:16 UTC (permalink / raw)
  To: tarantool-patches

Pushed to master.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-20 14:03             ` Vladimir Davydov
@ 2019-08-21 22:08               ` Konstantin Osipov
  2019-08-22  8:05                 ` Vladimir Davydov
  0 siblings, 1 reply; 28+ messages in thread
From: Konstantin Osipov @ 2019-08-21 22:08 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 17:08]:
> On Tue, Aug 20, 2019 at 04:50:07PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 15:10]:
> > > Well, yeah, kinda. OTOH they do relay data to a replica that's why I
> > > named them relay_something :-/ Also, those functions live in relay.cc,
> > > which is consistent with the relay_ prefix.
> > 
> > engine_send? I'll think about it.
> 
> But the code is engine-agnostic now - it just opens read-view iterators
> and sends whatever they return to a replica :-/

OK. Let's move it out of relay then. It's OK to put it in a
special file or move to engine.cc. It's OK that it's
engine-agnostic: engine.cc can have code common to all engines.

relay prefix is also confusing since in replication "relaying" is
used to refer to re-sending some existing files or logs.

The only issue about this patch is that since, unlike a snapshot
iterator, the code runs in tx thread, it may require some
throttling. If the network is fast enough, the send loop may run
at very high speeds. It could even hog 100% of CPU time and nearly
never yield CPU to other transactions.

I see two options here: move it to a separate thread (memtx
iterators allow it, so it would be preferred), or throttle.

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [tarantool-patches] Re: [PATCH v2 7/7] relay: join new replicas off read view
  2019-08-21 22:08               ` Konstantin Osipov
@ 2019-08-22  8:05                 ` Vladimir Davydov
  0 siblings, 0 replies; 28+ messages in thread
From: Vladimir Davydov @ 2019-08-22  8:05 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Aug 22, 2019 at 01:08:07AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 17:08]:
> > On Tue, Aug 20, 2019 at 04:50:07PM +0300, Konstantin Osipov wrote:
> > > * Vladimir Davydov <vdavydov.dev@gmail.com> [19/08/20 15:10]:
> > > > Well, yeah, kinda. OTOH they do relay data to a replica that's why I
> > > > named them relay_something :-/ Also, those functions live in relay.cc,
> > > > which is consistent with the relay_ prefix.
> > > 
> > > engine_send? I'll think about it.
> > 
> > But the code is engine-agnostic now - it just opens read-view iterators
> > and sends whatever they return to a replica :-/
> 
> OK. Let's move it out of relay then. It's OK to put it in a
> special file or move to engine.cc. It's OK that it's
> engine-agnostic: engine.cc can have code common to all engines.

Okay, will do.

> 
> relay prefix is also confusing since in replication "relaying" is
> used to refer to re-sending some existing files or logs.
> 
> The only issue about this patch is that since, unlike a snapshot
> iterator, the code runs in tx thread, it may require some
> throttling. If the network is fast enough, the send loop may run
> at very high speeds. It could even hog 100% of CPU time and nearly
> never yield CPU to other transactions.

Good point.

> 
> I see two options here: move it to a separate thread (memtx
> iterators allow it, so it would be preferred), or throttle.

Unfortunately, I don't think we can move this code out of the tx thread,
because vinyl iterator must be run from tx (well, we could iterate over
run files in a thread, but not over the memory level; and we don't want
to suspend dumps/compaction until we're done joining a replica). So I
guess I will simply call fiber_sleep() periodically, similarly to how we
handle index build.
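
For illustration only, the periodic-yield idea could look roughly like the sketch below. Everything in it is an assumption: ROWS_PER_YIELD is an arbitrary budget, and the yield callback is a stub standing in for whatever call actually gives up the tx thread (the message above mentions fiber_sleep()). It is not taken from the patch or from the index build code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical budget: rows sent between two consecutive yields. */
enum { ROWS_PER_YIELD = 1000 };

typedef int (*send_row_f)(void *ctx, size_t row_no);
typedef void (*yield_f)(void *ctx);

/*
 * Send row_count rows, yielding after every ROWS_PER_YIELD rows so
 * that other work in the same thread can make progress.
 * Returns 0 on success, -1 if sending a row fails.
 */
int
send_rows_with_yield(size_t row_count, send_row_f send_row, yield_f yield,
		     void *ctx)
{
	for (size_t i = 0; i < row_count; i++) {
		if (send_row(ctx, i) != 0)
			return -1;
		if ((i + 1) % ROWS_PER_YIELD == 0)
			yield(ctx);
	}
	return 0;
}

/* Counters updated by the stub callbacks below. */
struct send_stats {
	size_t sent;
	size_t yields;
};

/* Stub sender: counts the row instead of writing it to a socket. */
int
count_send(void *ctx, size_t row_no)
{
	(void)row_no;
	((struct send_stats *)ctx)->sent++;
	return 0;
}

/* Stub yield: counts the call instead of suspending a fiber. */
void
count_yield(void *ctx)
{
	((struct send_stats *)ctx)->yields++;
}
```

With the stubs above, sending 2500 rows triggers the yield callback after rows 1000 and 2000. A count-based budget is the simplest choice; a time-based one would bound latency more directly but needs a clock read inside the loop.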

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-08-22  8:05 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
2019-08-19 16:53 [PATCH v2 0/7] Join replicas off the current read view Vladimir Davydov
2019-08-19 16:53 ` [PATCH v2 1/7] vinyl: don't pin index for iterator lifetime Vladimir Davydov
2019-08-19 20:35   ` [tarantool-patches] " Konstantin Osipov
2019-08-19 16:53 ` [PATCH v2 2/7] vinyl: don't exempt dropped indexes from dump and compaction Vladimir Davydov
2019-08-19 20:47   ` [tarantool-patches] " Konstantin Osipov
2019-08-20  8:12     ` Vladimir Davydov
2019-08-20  9:02       ` Vladimir Davydov
2019-08-20 11:52       ` Konstantin Osipov
2019-08-20 14:16   ` Vladimir Davydov
2019-08-19 16:53 ` [PATCH v2 3/7] vinyl: get rid of vy_env::join_lsn Vladimir Davydov
2019-08-19 20:49   ` [tarantool-patches] " Konstantin Osipov
2019-08-19 16:53 ` [PATCH v2 4/7] memtx: use ref counting to pin indexes for snapshot Vladimir Davydov
2019-08-19 20:50   ` [tarantool-patches] " Konstantin Osipov
2019-08-19 16:53 ` [PATCH v2 5/7] memtx: enter small delayed free mode from snapshot iterator Vladimir Davydov
2019-08-19 20:51   ` [tarantool-patches] " Konstantin Osipov
2019-08-19 16:53 ` [PATCH v2 6/7] space: get rid of apply_initial_join_row method Vladimir Davydov
2019-08-19 20:54   ` [tarantool-patches] " Konstantin Osipov
2019-08-19 16:53 ` [PATCH v2 7/7] relay: join new replicas off read view Vladimir Davydov
2019-08-19 20:57   ` [tarantool-patches] " Konstantin Osipov
2019-08-20  8:16     ` Vladimir Davydov
2019-08-20 11:53       ` Konstantin Osipov
2019-08-20 12:05         ` Vladimir Davydov
2019-08-20 13:50           ` Konstantin Osipov
2019-08-20 14:03             ` Vladimir Davydov
2019-08-21 22:08               ` Konstantin Osipov
2019-08-22  8:05                 ` Vladimir Davydov
2019-08-19 16:54 ` [PATCH v2 0/7] Join replicas off the current " Vladimir Davydov
2019-08-20  8:53 ` Vladimir Davydov