Tarantool development patches archive
* [PATCH v2 00/11] Replica rejoin
@ 2018-06-08 17:34 Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery Vladimir Davydov
                   ` (10 more replies)
  0 siblings, 11 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

After this patch set is applied, an instance will try to detect whether
it has fallen too far behind its peers in the cluster and therefore
needs to be rebootstrapped. If it has, it will skip local recovery and
instead proceed to bootstrap from a remote master. Old files (xlog,
snap) are not deleted during rebootstrap; they will be removed by gc as
usual.
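
For orientation, here is a condensed sketch of the startup flow this
series arrives at (function names are the ones used in the patches
below; hot standby, the wal_dir lock and error handling are omitted
for brevity):

	if (last_checkpoint_lsn >= 0) {
		/* Local data exists: prepare for local recovery. */
		recovery_end_vclock(recovery, &replicaset.vclock);
		box_listen();
		box_sync_replication(replication_connect_timeout, false);
		struct replica *master;
		if (replicaset_needs_rejoin(&master)) {
			/* Peers no longer keep the xlogs we need. */
			bootstrap_from_master(master);
		} else {
			/* Proceed with the usual local recovery. */
		}
	} else {
		/* No local data: regular bootstrap. */
		bootstrap(&instance_uuid, &replicaset_uuid,
			  &is_bootstrap_leader);
	}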

TODO: write a test checking that garbage collection works as expected.

https://github.com/tarantool/tarantool/issues/461
https://github.com/tarantool/tarantool/commits/gh-461-replica-rejoin

Changes in v2:
 - Implement rebootstrap support for vinyl engine.
 - Call recover_remaining_wals() explicitly after recovery_stop_local()
   as suggested by @kostja.
 - Add comment to memtx_engine_new() explaining why we need to init
   INSTANCE_UUID before proceeding to local recovery.

v1: https://www.freelists.org/post/tarantool-patches/RFC-PATCH-0012-Replica-rejoin

Vladimir Davydov (11):
  box: retrieve instance uuid before starting local recovery
  box: refactor hot standby recovery
  box: retrieve end vclock before starting local recovery
  box: open the port before starting local recovery
  box: connect to remote peers before starting local recovery
  box: factor out local recovery function
  applier: inquire oldest vclock on connect
  replication: rebootstrap instance on startup if it fell behind
  vinyl: simplify vylog recovery from backup
  vinyl: pass flags to vy_recovery_new
  vinyl: implement rebootstrap support

 src/box/applier.cc                       |  15 ++
 src/box/applier.h                        |   2 +
 src/box/box.cc                           | 312 +++++++++++++++++--------------
 src/box/box.h                            |   4 +-
 src/box/iproto.cc                        |  30 ++-
 src/box/iproto.h                         |   5 +-
 src/box/iproto_constants.h               |   2 +
 src/box/lua/cfg.cc                       |   1 -
 src/box/memtx_engine.c                   |  21 ++-
 src/box/recovery.cc                      |  34 +++-
 src/box/recovery.h                       |   5 +-
 src/box/replication.cc                   |  15 ++
 src/box/replication.h                    |   9 +
 src/box/vinyl.c                          |   8 +-
 src/box/vy_log.c                         | 208 +++++++++++++++------
 src/box/vy_log.h                         |  50 ++++-
 src/box/xrow.c                           |  36 ++++
 src/box/xrow.h                           |  31 +++
 test/replication/replica_rejoin.result   | 201 ++++++++++++++++++++
 test/replication/replica_rejoin.test.lua |  75 ++++++++
 20 files changed, 832 insertions(+), 232 deletions(-)
 create mode 100644 test/replication/replica_rejoin.result
 create mode 100644 test/replication/replica_rejoin.test.lua

-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-08 17:51   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 02/11] box: refactor hot standby recovery Vladimir Davydov
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to find out whether the current instance has fallen too far
behind its peers in the cluster and therefore needs to be
rebootstrapped, we need to connect it to remote peers before proceeding
to local recovery. The problem is that box.cfg.replication may contain
an entry corresponding to the instance itself, so before connecting we
have to start listening for incoming connections. Since an instance is
supposed to send its uuid in the greeting message, we also have to
initialize INSTANCE_UUID early, before local recovery starts. So this
patch makes the memtx engine constructor not only scan the snapshot
directory, but also read the header of the most recent snapshot to
initialize INSTANCE_UUID.

Needed for #461
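
For illustration only (not part of the patch): the binary protocol
greeting is sent right after accept() and embeds the instance uuid,
which is why INSTANCE_UUID has to be known before listening may start.
Roughly (helper names are quoted from memory and may differ slightly;
salt/salt_len are placeholders):

	char greeting[IPROTO_GREETING_SIZE];
	/* "Tarantool 1.10.1 (Binary) <instance uuid>\n<salt>\n" */
	greeting_encode(greeting, tarantool_version_id(),
			&INSTANCE_UUID, salt, salt_len);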
---
 src/box/box.cc         | 18 ++++++++++--------
 src/box/memtx_engine.c | 21 ++++++++++++++++++++-
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 61bfa117..e1bf3934 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1839,6 +1839,15 @@ box_cfg_xc(void)
 	}
 	bool is_bootstrap_leader = false;
 	if (last_checkpoint_lsn >= 0) {
+		/* Check instance UUID. */
+		assert(!tt_uuid_is_nil(&INSTANCE_UUID));
+		if (!tt_uuid_is_nil(&instance_uuid) &&
+		    !tt_uuid_is_equal(&instance_uuid, &INSTANCE_UUID)) {
+			tnt_raise(ClientError, ER_INSTANCE_UUID_MISMATCH,
+				  tt_uuid_str(&instance_uuid),
+				  tt_uuid_str(&INSTANCE_UUID));
+		}
+
 		struct wal_stream wal_stream;
 		wal_stream_create(&wal_stream, cfg_geti64("rows_per_wal"));
 
@@ -1882,7 +1891,6 @@ box_cfg_xc(void)
 				      cfg_getd("wal_dir_rescan_delay"));
 		title("hot_standby");
 
-		assert(!tt_uuid_is_nil(&INSTANCE_UUID));
 		/*
 		 * Leave hot standby mode, if any, only
 		 * after acquiring the lock.
@@ -1902,13 +1910,7 @@ box_cfg_xc(void)
 		recovery_finalize(recovery, &wal_stream.base);
 		engine_end_recovery_xc();
 
-		/* Check replica set and instance UUID. */
-		if (!tt_uuid_is_nil(&instance_uuid) &&
-		    !tt_uuid_is_equal(&instance_uuid, &INSTANCE_UUID)) {
-			tnt_raise(ClientError, ER_INSTANCE_UUID_MISMATCH,
-				  tt_uuid_str(&instance_uuid),
-				  tt_uuid_str(&INSTANCE_UUID));
-		}
+		/* Check replica set UUID. */
 		if (!tt_uuid_is_nil(&replicaset_uuid) &&
 		    !tt_uuid_is_equal(&replicaset_uuid, &REPLICASET_UUID)) {
 			tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index fac84ce1..de9fd1ba 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -164,7 +164,6 @@ memtx_engine_recover_snapshot(struct memtx_engine *memtx,
 	struct xlog_cursor cursor;
 	if (xlog_cursor_open(&cursor, filename) < 0)
 		return -1;
-	INSTANCE_UUID = cursor.meta.instance_uuid;
 
 	int rc;
 	struct xrow_header row;
@@ -1001,6 +1000,26 @@ memtx_engine_new(const char *snap_dirname, bool force_recovery,
 	if (xdir_scan(&memtx->snap_dir) != 0)
 		goto fail;
 
+	/*
+	 * To check if the instance needs to be rebootstrapped, we
+	 * need to connect it to remote peers before proceeding to
+	 * local recovery. In order to do that, we have to start
+	 * listening for incoming connections, because one of remote
+	 * peers may be self. This, in turn, requires us to know the
+	 * instance UUID, as it is a part of a greeting message.
+	 * So if the local directory isn't empty, read the snapshot
+	 * signature right now to initialize the instance UUID.
+	 */
+	int64_t snap_signature = xdir_last_vclock(&memtx->snap_dir, NULL);
+	if (snap_signature >= 0) {
+		struct xlog_cursor cursor;
+		if (xdir_open_cursor(&memtx->snap_dir,
+				     snap_signature, &cursor) != 0)
+			goto fail;
+		INSTANCE_UUID = cursor.meta.instance_uuid;
+		xlog_cursor_close(&cursor, false);
+	}
+
 	stailq_create(&memtx->gc_queue);
 	memtx->gc_fiber = fiber_new("memtx.gc", memtx_engine_gc_f);
 	if (memtx->gc_fiber == NULL)
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 02/11] box: refactor hot standby recovery
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 03/11] box: retrieve end vclock before starting local recovery Vladimir Davydov
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, we start a hot standby fiber even when not in hot standby
mode (see recovery_follow_local), and we scan the wal directory twice -
first in recovery_follow_local(), then again in recovery_finalize().
Let's factor recover_remaining_wals() out of those functions and call
it explicitly, and let's call recovery_follow_local() and
recovery_stop_local() only when actually in hot standby mode.

Needed for #461
---
 src/box/box.cc      | 14 +++++++++-----
 src/box/recovery.cc | 11 +----------
 src/box/recovery.h  |  2 +-
 3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index e1bf3934..c1d15644 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1887,16 +1887,17 @@ box_cfg_xc(void)
 				&last_checkpoint_vclock);
 
 		engine_begin_final_recovery_xc();
-		recovery_follow_local(recovery, &wal_stream.base, "hot_standby",
-				      cfg_getd("wal_dir_rescan_delay"));
-		title("hot_standby");
-
+		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
 		/*
 		 * Leave hot standby mode, if any, only
 		 * after acquiring the lock.
 		 */
 		if (wal_dir_lock < 0) {
+			title("hot_standby");
 			say_info("Entering hot standby mode");
+			recovery_follow_local(recovery, &wal_stream.base,
+					      "hot_standby",
+					      cfg_getd("wal_dir_rescan_delay"));
 			while (true) {
 				if (path_lock(cfg_gets("wal_dir"),
 					      &wal_dir_lock))
@@ -1905,9 +1906,12 @@ box_cfg_xc(void)
 					break;
 				fiber_sleep(0.1);
 			}
+			recovery_stop_local(recovery);
+			recover_remaining_wals(recovery, &wal_stream.base,
+					       NULL, true);
 			box_bind();
 		}
-		recovery_finalize(recovery, &wal_stream.base);
+		recovery_finalize(recovery);
 		engine_end_recovery_xc();
 
 		/* Check replica set UUID. */
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index cf348d29..71f6bd8c 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -313,12 +313,8 @@ recover_current_wal:
 }
 
 void
-recovery_finalize(struct recovery *r, struct xstream *stream)
+recovery_finalize(struct recovery *r)
 {
-	recovery_stop_local(r);
-
-	recover_remaining_wals(r, stream, NULL, true);
-
 	recovery_close_log(r);
 
 	/*
@@ -498,11 +494,6 @@ recovery_follow_local(struct recovery *r, struct xstream *stream,
 		      const char *name, ev_tstamp wal_dir_rescan_delay)
 {
 	/*
-	 * Scan wal_dir and recover all existing at the moment xlogs.
-	 * Blocks until finished.
-	 */
-	recover_remaining_wals(r, stream, NULL, true);
-	/*
 	 * Start 'hot_standby' background fiber to follow xlog changes.
 	 * It will pick up from the position of the currently open
 	 * xlog.
diff --git a/src/box/recovery.h b/src/box/recovery.h
index 3a950e47..6aba922b 100644
--- a/src/box/recovery.h
+++ b/src/box/recovery.h
@@ -76,7 +76,7 @@ void
 recovery_stop_local(struct recovery *r);
 
 void
-recovery_finalize(struct recovery *r, struct xstream *stream);
+recovery_finalize(struct recovery *r);
 
 #if defined(__cplusplus)
 } /* extern "C" */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 03/11] box: retrieve end vclock before starting local recovery
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 02/11] box: refactor hot standby recovery Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-14 12:58   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 04/11] box: open the port " Vladimir Davydov
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to find out whether the current instance has fallen too far
behind its peers in the cluster and therefore needs to be
rebootstrapped, we need to know its vclock before we start local
recovery. To do that, let's scan the most recent xlog. In the future,
we can optimize this by either storing the end vclock in the xlog EOF
marker or by starting a new xlog on server stop.

Needed for #461
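
An assumed example of why this matters (numbers are made up): suppose
the last checkpoint has vclock {1: 90}, while the local WAL contains
rows up to {1: 100} plus rows received from a remote master with
replica id 2 up to lsn 5. The computed end vclock is then
{1: 100, 2: 5}. Initializing replicaset.vclock with it makes the later
SUBSCRIBE request ask the master only for rows after {1: 100, 2: 5},
so the rows already stored in the local WAL are not fetched and applied
a second time.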
---
 src/box/box.cc      | 20 +++++++++++++-------
 src/box/recovery.cc | 23 +++++++++++++++++++++++
 src/box/recovery.h  |  3 +++
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index c1d15644..3457cf19 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1858,6 +1858,14 @@ box_cfg_xc(void)
 		auto guard = make_scoped_guard([=]{ recovery_delete(recovery); });
 
 		/*
+		 * Initialize the replica set vclock from recovery.
+		 * The local WAL may contain rows from remote masters,
+		 * so we must reflect this in replicaset vclock to
+		 * not attempt to apply these rows twice.
+		 */
+		recovery_end_vclock(recovery, &replicaset.vclock);
+
+		/*
 		 * recovery->vclock is needed by Vinyl to filter
 		 * WAL rows that were dumped before restart.
 		 *
@@ -1909,6 +1917,11 @@ box_cfg_xc(void)
 			recovery_stop_local(recovery);
 			recover_remaining_wals(recovery, &wal_stream.base,
 					       NULL, true);
+			/*
+			 * Advance replica set vclock to reflect records
+			 * applied in hot standby mode.
+			 */
+			vclock_copy(&replicaset.vclock, &recovery->vclock);
 			box_bind();
 		}
 		recovery_finalize(recovery);
@@ -1924,13 +1937,6 @@ box_cfg_xc(void)
 
 		/* Clear the pointer to journal before it goes out of scope */
 		journal_set(NULL);
-		/*
-		 * Initialize the replica set vclock from recovery.
-		 * The local WAL may contain rows from remote masters,
-		 * so we must reflect this in replicaset vclock to
-		 * not attempt to apply these rows twice.
-		 */
-		vclock_copy(&replicaset.vclock, &recovery->vclock);
 
 		/** Begin listening only when the local recovery is complete. */
 		box_listen();
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 71f6bd8c..eb77476d 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -137,6 +137,29 @@ recovery_new(const char *wal_dirname, bool force_recovery,
 	return r;
 }
 
+void
+recovery_end_vclock(struct recovery *r, struct vclock *end_vclock)
+{
+	xdir_scan_xc(&r->wal_dir);
+
+	struct vclock *vclock = vclockset_last(&r->wal_dir.index);
+	if (vclock == NULL || vclock_compare(vclock, &r->vclock) < 0) {
+		/* No xlogs after last checkpoint. */
+		vclock_copy(end_vclock, &r->vclock);
+		return;
+	}
+
+	/* Scan the last xlog to find end vclock. */
+	vclock_copy(end_vclock, vclock);
+	struct xlog_cursor cursor;
+	if (xdir_open_cursor(&r->wal_dir, vclock_sum(vclock), &cursor) != 0)
+		return;
+	struct xrow_header row;
+	while (xlog_cursor_next(&cursor, &row, true) == 0)
+		vclock_follow(end_vclock, row.replica_id, row.lsn);
+	xlog_cursor_close(&cursor, false);
+}
+
 static inline void
 recovery_close_log(struct recovery *r)
 {
diff --git a/src/box/recovery.h b/src/box/recovery.h
index 6aba922b..1ae6f2c3 100644
--- a/src/box/recovery.h
+++ b/src/box/recovery.h
@@ -69,6 +69,9 @@ void
 recovery_delete(struct recovery *r);
 
 void
+recovery_end_vclock(struct recovery *r, struct vclock *end_vclock);
+
+void
 recovery_follow_local(struct recovery *r, struct xstream *stream,
 		      const char *name, ev_tstamp wal_dir_rescan_delay);
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 04/11] box: open the port before starting local recovery
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 03/11] box: retrieve end vclock before starting local recovery Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:43   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 05/11] box: connect to remote peers before starting local recovery Vladimir Davydov
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to find out whether the current instance has fallen too far
behind its peers in the cluster and therefore needs to be
rebootstrapped, we need to connect it to remote peers before proceeding
to local recovery. The problem is that box.cfg.replication may contain
an entry corresponding to the instance itself, so before connecting we
have to start listening for incoming connections. So this patch moves
the call to box_listen() before recovery is started, unless the
instance is in hot standby mode. It also folds box_bind() into
box_listen(), as the former is no longer needed as a separate function.

Needed for #461
---
 src/box/box.cc     | 25 ++++++-------------------
 src/box/box.h      |  1 -
 src/box/iproto.cc  | 23 ++++++-----------------
 src/box/iproto.h   |  5 +----
 src/box/lua/cfg.cc |  1 -
 5 files changed, 13 insertions(+), 42 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 3457cf19..4b8d5aa2 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -681,17 +681,11 @@ box_set_replication_skip_conflict(void)
 }
 
 void
-box_bind(void)
+box_listen(void)
 {
 	const char *uri = cfg_gets("listen");
 	box_check_uri(uri, "listen");
-	iproto_bind(uri);
-}
-
-void
-box_listen(void)
-{
-	iproto_listen();
+	iproto_listen(uri);
 }
 
 void
@@ -1829,13 +1823,6 @@ box_cfg_xc(void)
 		 */
 		if (!cfg_geti("hot_standby") || last_checkpoint_lsn < 0)
 			tnt_raise(ClientError, ER_ALREADY_RUNNING, cfg_gets("wal_dir"));
-	} else {
-		/*
-		 * Try to bind the port before recovery, to fail
-		 * early if the port is busy. In hot standby mode,
-		 * the port is most likely busy.
-		 */
-		box_bind();
 	}
 	bool is_bootstrap_leader = false;
 	if (last_checkpoint_lsn >= 0) {
@@ -1865,6 +1852,9 @@ box_cfg_xc(void)
 		 */
 		recovery_end_vclock(recovery, &replicaset.vclock);
 
+		if (wal_dir_lock >= 0)
+			box_listen();
+
 		/*
 		 * recovery->vclock is needed by Vinyl to filter
 		 * WAL rows that were dumped before restart.
@@ -1922,7 +1912,7 @@ box_cfg_xc(void)
 			 * applied in hot standby mode.
 			 */
 			vclock_copy(&replicaset.vclock, &recovery->vclock);
-			box_bind();
+			box_listen();
 		}
 		recovery_finalize(recovery);
 		engine_end_recovery_xc();
@@ -1938,9 +1928,6 @@ box_cfg_xc(void)
 		/* Clear the pointer to journal before it goes out of scope */
 		journal_set(NULL);
 
-		/** Begin listening only when the local recovery is complete. */
-		box_listen();
-
 		title("orphan");
 
 		/* Wait for the cluster to start up */
diff --git a/src/box/box.h b/src/box/box.h
index d3967891..182e1b72 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -170,7 +170,6 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header);
 void
 box_check_config();
 
-void box_bind(void);
 void box_listen(void);
 void box_set_replication(void);
 void box_set_log_level(void);
diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index 76844555..c6b13934 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -1763,7 +1763,6 @@ iproto_init()
 
 /** Available IProto configuration changes. */
 enum iproto_cfg_op {
-	IPROTO_CFG_BIND,
 	IPROTO_CFG_MSG_MAX,
 	IPROTO_CFG_LISTEN
 };
@@ -1801,12 +1800,6 @@ iproto_do_cfg_f(struct cbus_call_msg *m)
 	int old;
 	try {
 		switch (cfg_msg->op) {
-		case IPROTO_CFG_BIND:
-			if (evio_service_is_active(&binary))
-				evio_service_stop(&binary);
-			if (cfg_msg->uri != NULL)
-				evio_service_bind(&binary, cfg_msg->uri);
-			break;
 		case IPROTO_CFG_MSG_MAX:
 			cpipe_set_max_input(&tx_pipe,
 					    cfg_msg->iproto_msg_max / 2);
@@ -1817,7 +1810,11 @@ iproto_do_cfg_f(struct cbus_call_msg *m)
 			break;
 		case IPROTO_CFG_LISTEN:
 			if (evio_service_is_active(&binary))
+				evio_service_stop(&binary);
+			if (cfg_msg->uri != NULL) {
+				evio_service_bind(&binary, cfg_msg->uri);
 				evio_service_listen(&binary);
+			}
 			break;
 		default:
 			unreachable();
@@ -1837,19 +1834,11 @@ iproto_do_cfg(struct iproto_cfg_msg *msg)
 }
 
 void
-iproto_bind(const char *uri)
-{
-	struct iproto_cfg_msg cfg_msg;
-	iproto_cfg_msg_create(&cfg_msg, IPROTO_CFG_BIND);
-	cfg_msg.uri = uri;
-	iproto_do_cfg(&cfg_msg);
-}
-
-void
-iproto_listen()
+iproto_listen(const char *uri)
 {
 	struct iproto_cfg_msg cfg_msg;
 	iproto_cfg_msg_create(&cfg_msg, IPROTO_CFG_LISTEN);
+	cfg_msg.uri = uri;
 	iproto_do_cfg(&cfg_msg);
 }
 
diff --git a/src/box/iproto.h b/src/box/iproto.h
index b6591469..b9a6cf8f 100644
--- a/src/box/iproto.h
+++ b/src/box/iproto.h
@@ -75,10 +75,7 @@ void
 iproto_init();
 
 void
-iproto_bind(const char *uri);
-
-void
-iproto_listen();
+iproto_listen(const char *uri);
 
 void
 iproto_set_msg_max(int iproto_msg_max);
diff --git a/src/box/lua/cfg.cc b/src/box/lua/cfg.cc
index 5afebc94..0f6b8a5a 100644
--- a/src/box/lua/cfg.cc
+++ b/src/box/lua/cfg.cc
@@ -69,7 +69,6 @@ static int
 lbox_cfg_set_listen(struct lua_State *L)
 {
 	try {
-		box_bind();
 		box_listen();
 	} catch (Exception *) {
 		luaT_error(L);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 05/11] box: connect to remote peers before starting local recovery
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 04/11] box: open the port " Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:45   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 06/11] box: factor out local recovery function Vladimir Davydov
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

box_sync_replication() can now be called before recovery, right after
box_listen(). This is a step toward detecting whether the instance has
fallen too far behind its peers in the cluster and needs to be
rebootstrapped.

Needed for #461
---
 src/box/box.cc | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 4b8d5aa2..3f0c1176 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1852,8 +1852,10 @@ box_cfg_xc(void)
 		 */
 		recovery_end_vclock(recovery, &replicaset.vclock);
 
-		if (wal_dir_lock >= 0)
+		if (wal_dir_lock >= 0) {
 			box_listen();
+			box_sync_replication(replication_connect_timeout, false);
+		}
 
 		/*
 		 * recovery->vclock is needed by Vinyl to filter
@@ -1913,6 +1915,7 @@ box_cfg_xc(void)
 			 */
 			vclock_copy(&replicaset.vclock, &recovery->vclock);
 			box_listen();
+			box_sync_replication(replication_connect_timeout, false);
 		}
 		recovery_finalize(recovery);
 		engine_end_recovery_xc();
@@ -1927,11 +1930,6 @@ box_cfg_xc(void)
 
 		/* Clear the pointer to journal before it goes out of scope */
 		journal_set(NULL);
-
-		title("orphan");
-
-		/* Wait for the cluster to start up */
-		box_sync_replication(replication_connect_timeout, false);
 	} else {
 		if (!tt_uuid_is_nil(&instance_uuid))
 			INSTANCE_UUID = instance_uuid;
@@ -1943,8 +1941,6 @@ box_cfg_xc(void)
 		 */
 		box_listen();
 
-		title("orphan");
-
 		/*
 		 * Wait for the cluster to start up.
 		 *
@@ -1980,6 +1976,8 @@ box_cfg_xc(void)
 
 	rmean_cleanup(rmean_box);
 
+	title("orphan");
+
 	/*
 	 * If this instance is a leader of a newly bootstrapped
 	 * cluster, it is uptodate by definition so leave the
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 06/11] box: factor out local recovery function
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 05/11] box: connect to remote peers before starting local recovery Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:50   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 07/11] applier: inquire oldest vclock on connect Vladimir Davydov
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

 - Factor out local_recovery() from box_cfg_xc(). Make it set up
   replication and handle local recovery and hot standby cases.
 - Move replication setup in case of initial bootstrap from box_cfg_xc()
   to bootstrap() to make bootstrap() consistent with local_recovery().
 - Move initial snapshot creation from bootstrap() to bootstrap_master()
   and bootstrap_from_master().

Needed for #461
---
 src/box/box.cc | 280 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 151 insertions(+), 129 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 3f0c1176..922e8604 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1644,6 +1644,11 @@ bootstrap_master(const struct tt_uuid *replicaset_uuid)
 
 	/* Set UUID of a new replica set */
 	box_set_replicaset_uuid(replicaset_uuid);
+
+	/* Make the initial checkpoint */
+	if (engine_begin_checkpoint() ||
+	    engine_commit_checkpoint(&replicaset.vclock))
+		panic("failed to create a checkpoint");
 }
 
 /**
@@ -1698,6 +1703,11 @@ bootstrap_from_master(struct replica *master)
 	/* Switch applier to initial state */
 	applier_resume_to_state(applier, APPLIER_READY, TIMEOUT_INFINITY);
 	assert(applier->state == APPLIER_READY);
+
+	/* Make the initial checkpoint */
+	if (engine_begin_checkpoint() ||
+	    engine_commit_checkpoint(&replicaset.vclock))
+		panic("failed to create a checkpoint");
 }
 
 /**
@@ -1708,8 +1718,31 @@ bootstrap_from_master(struct replica *master)
  *                                  the leader of a new cluster
  */
 static void
-bootstrap(const struct tt_uuid *replicaset_uuid, bool *is_bootstrap_leader)
-{
+bootstrap(const struct tt_uuid *instance_uuid,
+	  const struct tt_uuid *replicaset_uuid,
+	  bool *is_bootstrap_leader)
+{
+	/* Initialize instance UUID. */
+	assert(tt_uuid_is_nil(&INSTANCE_UUID));
+	if (!tt_uuid_is_nil(instance_uuid))
+		INSTANCE_UUID = *instance_uuid;
+	else
+		tt_uuid_create(&INSTANCE_UUID);
+	/*
+	 * Begin listening on the socket to enable
+	 * master-master replication leader election.
+	 */
+	box_listen();
+	/*
+	 * Wait for the cluster to start up.
+	 *
+	 * Note, when bootstrapping a new instance, we have to
+	 * connect to all masters to make sure all replicas
+	 * receive the same replica set UUID when a new cluster
+	 * is deployed.
+	 */
+	box_sync_replication(TIMEOUT_INFINITY, true);
+
 	/* Use the first replica by URI as a bootstrap leader */
 	struct replica *master = replicaset_leader();
 	assert(master == NULL || master->applier != NULL);
@@ -1727,9 +1760,117 @@ bootstrap(const struct tt_uuid *replicaset_uuid, bool *is_bootstrap_leader)
 		bootstrap_master(replicaset_uuid);
 		*is_bootstrap_leader = true;
 	}
-	if (engine_begin_checkpoint() ||
-	    engine_commit_checkpoint(&replicaset.vclock))
-		panic("failed to create a checkpoint");
+}
+
+/**
+ * Recover the instance from the local directory.
+ * Enter hot standby if the directory is locked.
+ */
+static void
+local_recovery(const struct tt_uuid *instance_uuid,
+	       const struct tt_uuid *replicaset_uuid,
+	       const struct vclock *checkpoint_vclock)
+{
+	/* Check instance UUID. */
+	assert(!tt_uuid_is_nil(&INSTANCE_UUID));
+	if (!tt_uuid_is_nil(instance_uuid) &&
+	    !tt_uuid_is_equal(instance_uuid, &INSTANCE_UUID)) {
+		tnt_raise(ClientError, ER_INSTANCE_UUID_MISMATCH,
+			  tt_uuid_str(instance_uuid),
+			  tt_uuid_str(&INSTANCE_UUID));
+	}
+
+	struct wal_stream wal_stream;
+	wal_stream_create(&wal_stream, cfg_geti64("rows_per_wal"));
+
+	struct recovery *recovery;
+	recovery = recovery_new(cfg_gets("wal_dir"),
+				cfg_geti("force_recovery"),
+				checkpoint_vclock);
+	auto guard = make_scoped_guard([=]{ recovery_delete(recovery); });
+
+	/*
+	 * Initialize the replica set vclock from recovery.
+	 * The local WAL may contain rows from remote masters,
+	 * so we must reflect this in replicaset vclock to
+	 * not attempt to apply these rows twice.
+	 */
+	recovery_end_vclock(recovery, &replicaset.vclock);
+
+	if (wal_dir_lock >= 0) {
+		box_listen();
+		box_sync_replication(replication_connect_timeout, false);
+	}
+
+	/*
+	 * recovery->vclock is needed by Vinyl to filter
+	 * WAL rows that were dumped before restart.
+	 *
+	 * XXX: Passing an internal member of the recovery
+	 * object to an engine is an ugly hack. Instead we
+	 * should introduce space_vtab::apply_wal_row method
+	 * and explicitly pass the statement LSN to it.
+	 */
+	engine_begin_initial_recovery_xc(&recovery->vclock);
+
+	struct memtx_engine *memtx;
+	memtx = (struct memtx_engine *)engine_by_name("memtx");
+	assert(memtx != NULL);
+
+	struct recovery_journal journal;
+	recovery_journal_create(&journal, &recovery->vclock);
+	journal_set(&journal.base);
+
+	/*
+	 * We explicitly request memtx to recover its
+	 * snapshot as a separate phase since it contains
+	 * data for system spaces, and triggers on
+	 * recovery of system spaces issue DDL events in
+	 * other engines.
+	 */
+	memtx_engine_recover_snapshot_xc(memtx, checkpoint_vclock);
+
+	engine_begin_final_recovery_xc();
+	recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
+	/*
+	 * Leave hot standby mode, if any, only after
+	 * acquiring the lock.
+	 */
+	if (wal_dir_lock < 0) {
+		title("hot_standby");
+		say_info("Entering hot standby mode");
+		recovery_follow_local(recovery, &wal_stream.base, "hot_standby",
+				      cfg_getd("wal_dir_rescan_delay"));
+		while (true) {
+			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock))
+				diag_raise();
+			if (wal_dir_lock >= 0)
+				break;
+			fiber_sleep(0.1);
+		}
+		recovery_stop_local(recovery);
+		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
+		/*
+		 * Advance replica set vclock to reflect records
+		 * applied in hot standby mode.
+		 */
+		vclock_copy(&replicaset.vclock, &recovery->vclock);
+		box_listen();
+		box_sync_replication(replication_connect_timeout, false);
+	}
+	recovery_finalize(recovery);
+	engine_end_recovery_xc();
+
+	/* Check replica set UUID. */
+	if (!tt_uuid_is_nil(replicaset_uuid) &&
+	    !tt_uuid_is_equal(replicaset_uuid, &REPLICASET_UUID)) {
+		tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
+			  tt_uuid_str(replicaset_uuid),
+			  tt_uuid_str(&REPLICASET_UUID));
+	}
+
+	/* Clear the pointer to journal before it goes out of scope */
+	journal_set(NULL);
 }
 
 static void
@@ -1826,132 +1967,13 @@ box_cfg_xc(void)
 	}
 	bool is_bootstrap_leader = false;
 	if (last_checkpoint_lsn >= 0) {
-		/* Check instance UUID. */
-		assert(!tt_uuid_is_nil(&INSTANCE_UUID));
-		if (!tt_uuid_is_nil(&instance_uuid) &&
-		    !tt_uuid_is_equal(&instance_uuid, &INSTANCE_UUID)) {
-			tnt_raise(ClientError, ER_INSTANCE_UUID_MISMATCH,
-				  tt_uuid_str(&instance_uuid),
-				  tt_uuid_str(&INSTANCE_UUID));
-		}
-
-		struct wal_stream wal_stream;
-		wal_stream_create(&wal_stream, cfg_geti64("rows_per_wal"));
-
-		struct recovery *recovery;
-		recovery = recovery_new(cfg_gets("wal_dir"),
-					cfg_geti("force_recovery"),
-					&last_checkpoint_vclock);
-		auto guard = make_scoped_guard([=]{ recovery_delete(recovery); });
-
-		/*
-		 * Initialize the replica set vclock from recovery.
-		 * The local WAL may contain rows from remote masters,
-		 * so we must reflect this in replicaset vclock to
-		 * not attempt to apply these rows twice.
-		 */
-		recovery_end_vclock(recovery, &replicaset.vclock);
-
-		if (wal_dir_lock >= 0) {
-			box_listen();
-			box_sync_replication(replication_connect_timeout, false);
-		}
-
-		/*
-		 * recovery->vclock is needed by Vinyl to filter
-		 * WAL rows that were dumped before restart.
-		 *
-		 * XXX: Passing an internal member of the recovery
-		 * object to an engine is an ugly hack. Instead we
-		 * should introduce Engine::applyWALRow method and
-		 * explicitly pass the statement LSN to it.
-		 */
-		engine_begin_initial_recovery_xc(&recovery->vclock);
-
-		struct memtx_engine *memtx;
-		memtx = (struct memtx_engine *)engine_by_name("memtx");
-		assert(memtx != NULL);
-
-		struct recovery_journal journal;
-		recovery_journal_create(&journal, &recovery->vclock);
-		journal_set(&journal.base);
-
-		/**
-		 * We explicitly request memtx to recover its
-		 * snapshot as a separate phase since it contains
-		 * data for system spaces, and triggers on
-		 * recovery of system spaces issue DDL events in
-		 * other engines.
-		 */
-		memtx_engine_recover_snapshot_xc(memtx,
-				&last_checkpoint_vclock);
-
-		engine_begin_final_recovery_xc();
-		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
-		/*
-		 * Leave hot standby mode, if any, only
-		 * after acquiring the lock.
-		 */
-		if (wal_dir_lock < 0) {
-			title("hot_standby");
-			say_info("Entering hot standby mode");
-			recovery_follow_local(recovery, &wal_stream.base,
-					      "hot_standby",
-					      cfg_getd("wal_dir_rescan_delay"));
-			while (true) {
-				if (path_lock(cfg_gets("wal_dir"),
-					      &wal_dir_lock))
-					diag_raise();
-				if (wal_dir_lock >= 0)
-					break;
-				fiber_sleep(0.1);
-			}
-			recovery_stop_local(recovery);
-			recover_remaining_wals(recovery, &wal_stream.base,
-					       NULL, true);
-			/*
-			 * Advance replica set vclock to reflect records
-			 * applied in hot standby mode.
-			 */
-			vclock_copy(&replicaset.vclock, &recovery->vclock);
-			box_listen();
-			box_sync_replication(replication_connect_timeout, false);
-		}
-		recovery_finalize(recovery);
-		engine_end_recovery_xc();
-
-		/* Check replica set UUID. */
-		if (!tt_uuid_is_nil(&replicaset_uuid) &&
-		    !tt_uuid_is_equal(&replicaset_uuid, &REPLICASET_UUID)) {
-			tnt_raise(ClientError, ER_REPLICASET_UUID_MISMATCH,
-				  tt_uuid_str(&replicaset_uuid),
-				  tt_uuid_str(&REPLICASET_UUID));
-		}
-
-		/* Clear the pointer to journal before it goes out of scope */
-		journal_set(NULL);
+		/* Recover the instance from the local directory */
+		local_recovery(&instance_uuid, &replicaset_uuid,
+			       &last_checkpoint_vclock);
 	} else {
-		if (!tt_uuid_is_nil(&instance_uuid))
-			INSTANCE_UUID = instance_uuid;
-		else
-			tt_uuid_create(&INSTANCE_UUID);
-		/*
-		 * Begin listening on the socket to enable
-		 * master-master replication leader election.
-		 */
-		box_listen();
-
-		/*
-		 * Wait for the cluster to start up.
-		 *
-		 * Note, when bootstrapping a new instance, we have to
-		 * connect to all masters to make sure all replicas
-		 * receive the same replica set UUID when a new cluster
-		 * is deployed.
-		 */
-		box_sync_replication(TIMEOUT_INFINITY, true);
 		/* Bootstrap a new master */
-		bootstrap(&replicaset_uuid, &is_bootstrap_leader);
+		bootstrap(&instance_uuid, &replicaset_uuid,
+			  &is_bootstrap_leader);
 	}
 	fiber_gc();
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 07/11] applier: inquire oldest vclock on connect
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (5 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 06/11] box: factor out local recovery function Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:51   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Introduce a new iproto command IPROTO_GET_GC_VCLOCK that returns the
vclock of the oldest checkpoint available at the master. Use this
command when the applier is connected to set applier->gc_vclock. We
will need it to check whether a replica has fallen too far behind its
peers in the cluster and so needs to be rebootstrapped.

Needed for #461
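
For reference, a rough sketch of the exchange on the wire (keys are the
usual iproto header/body constants, values shown JSON-style):

	applier -> master:
	  header: {IPROTO_REQUEST_TYPE: IPROTO_GET_GC_VCLOCK, IPROTO_SYNC: <sync>}
	  (no body)
	master -> applier:
	  header: {IPROTO_REQUEST_TYPE: IPROTO_OK, IPROTO_SYNC: <sync>,
	           IPROTO_SCHEMA_VERSION: <version>}
	  body:   {IPROTO_VCLOCK: {<replica_id>: <lsn>, ...}}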
---
 src/box/applier.cc         | 15 +++++++++++++++
 src/box/applier.h          |  2 ++
 src/box/box.cc             | 12 ++++++++++++
 src/box/box.h              |  3 +++
 src/box/iproto.cc          |  7 +++++++
 src/box/iproto_constants.h |  2 ++
 src/box/xrow.c             | 36 ++++++++++++++++++++++++++++++++++++
 src/box/xrow.h             | 31 +++++++++++++++++++++++++++++++
 8 files changed, 108 insertions(+)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 556502bf..8d750dc6 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -228,6 +228,21 @@ applier_connect(struct applier *applier)
 					    &applier->remote_is_ro);
 	}
 
+	/**
+	 * Tarantool >= 1.10.1: send an IPROTO_GET_GC_VCLOCK message
+	 * to find out the oldest vclock available at the remote end.
+	 * Needed to check if the replica has to be rebootstrapped.
+	 */
+	if (applier->version_id >= version_id(1, 10, 1)) {
+		xrow_encode_get_gc_vclock(&row);
+		coio_write_xrow(coio, &row);
+		coio_read_xrow(coio, ibuf, &row);
+		if (row.type != IPROTO_OK)
+			xrow_decode_error_xc(&row);
+		vclock_create(&applier->gc_vclock);
+		xrow_decode_vclock_xc(&row, &applier->gc_vclock);
+	}
+
 	applier_set_state(applier, APPLIER_CONNECTED);
 
 	/* Detect connection to itself */
diff --git a/src/box/applier.h b/src/box/applier.h
index c33562cc..d0ae1ed1 100644
--- a/src/box/applier.h
+++ b/src/box/applier.h
@@ -96,6 +96,8 @@ struct applier {
 	uint32_t version_id;
 	/** Remote vclock at time of connect. */
 	struct vclock vclock;
+	/** Oldest vclock available at remote at time of connect. */
+	struct vclock gc_vclock;
 	/** Remote peer mode, true if read-only, default: false */
 	bool remote_is_ro;
 	/** Remote address */
diff --git a/src/box/box.cc b/src/box/box.cc
index 922e8604..0aaed562 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1540,6 +1540,18 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 			replica_version_id);
 }
 
+void
+box_get_gc_vclock(struct vclock *vclock)
+{
+	struct checkpoint_iterator it;
+	checkpoint_iterator_init(&it);
+	const struct vclock *oldest = checkpoint_iterator_next(&it);
+	if (oldest != NULL)
+		vclock_copy(vclock, oldest);
+	else
+		vclock_create(vclock);
+}
+
 /** Insert a new cluster into _schema */
 static void
 box_set_replicaset_uuid(const struct tt_uuid *replicaset_uuid)
diff --git a/src/box/box.h b/src/box/box.h
index 182e1b72..10c54102 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -163,6 +163,9 @@ box_process_join(struct ev_io *io, struct xrow_header *header);
 void
 box_process_subscribe(struct ev_io *io, struct xrow_header *header);
 
+void
+box_get_gc_vclock(struct vclock *vclock);
+
 /**
  * Check Lua configuration before initialization or
  * in case of a configuration change.
diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index c6b13934..fdb286ad 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -1062,6 +1062,7 @@ iproto_msg_decode(struct iproto_msg *msg, const char **pos, const char *reqend,
 		*stop_input = true;
 		break;
 	case IPROTO_REQUEST_VOTE:
+	case IPROTO_GET_GC_VCLOCK:
 		cmsg_init(&msg->base, misc_route);
 		break;
 	case IPROTO_AUTH:
@@ -1423,6 +1424,7 @@ tx_process_misc(struct cmsg *m)
 	struct iproto_msg *msg = tx_accept_msg(m);
 	struct iproto_connection *con = msg->connection;
 	struct obuf *out = con->tx.p_obuf;
+	struct vclock vclock;
 
 	tx_fiber_init(con->session, msg->header.sync);
 
@@ -1446,6 +1448,11 @@ tx_process_misc(struct cmsg *m)
 						     &replicaset.vclock,
 						     cfg_geti("read_only"));
 			break;
+		case IPROTO_GET_GC_VCLOCK:
+			box_get_gc_vclock(&vclock);
+			iproto_reply_vclock_xc(out, msg->header.sync,
+					       ::schema_version, &vclock);
+			break;
 		default:
 			unreachable();
 		}
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index 46d47719..cb2fdbf1 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -156,6 +156,8 @@ enum iproto_type {
 	IPROTO_SUBSCRIBE = 66,
 	/** Vote request command for master election */
 	IPROTO_REQUEST_VOTE = 67,
+	/** Command to inquire garbage collection state */
+	IPROTO_GET_GC_VCLOCK = 68,
 
 	/** Vinyl run info stored in .index file */
 	VY_INDEX_RUN_INFO = 100,
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 532e1296..dc5fa0a2 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -282,6 +282,35 @@ iproto_reply_ok(struct obuf *out, uint64_t sync, uint32_t schema_version)
 }
 
 int
+iproto_reply_vclock(struct obuf *out, uint64_t sync, uint32_t schema_version,
+		    const struct vclock *vclock)
+{
+	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(vclock);
+
+	char *buf = obuf_reserve(out, max_size);
+	if (buf == NULL) {
+		diag_set(OutOfMemory, max_size,
+			 "obuf_alloc", "buf");
+		return -1;
+	}
+
+	char *data = buf + IPROTO_HEADER_LEN;
+	data = mp_encode_map(data, 1);
+	data = mp_encode_uint(data, IPROTO_VCLOCK);
+	data = mp_encode_vclock(data, vclock);
+	size_t size = data - buf;
+	assert(size <= max_size);
+
+	iproto_header_encode(buf, IPROTO_OK, sync, schema_version,
+			     size - IPROTO_HEADER_LEN);
+
+	char *ptr = obuf_alloc(out, size);
+	assert(ptr == buf);
+	return 0;
+}
+
+int
 iproto_reply_request_vote(struct obuf *out, uint64_t sync,
 			  uint32_t schema_version, const struct vclock *vclock,
 			  bool read_only)
@@ -811,6 +840,13 @@ xrow_encode_request_vote(struct xrow_header *row)
 	row->type = IPROTO_REQUEST_VOTE;
 }
 
+void
+xrow_encode_get_gc_vclock(struct xrow_header *row)
+{
+	memset(row, 0, sizeof(*row));
+	row->type = IPROTO_GET_GC_VCLOCK;
+}
+
 int
 xrow_encode_subscribe(struct xrow_header *row,
 		      const struct tt_uuid *replicaset_uuid,
diff --git a/src/box/xrow.h b/src/box/xrow.h
index b10bf26d..edf16ec2 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -230,6 +230,13 @@ void
 xrow_encode_request_vote(struct xrow_header *row);
 
 /**
+ * Encode a request for gc state inquiry.
+ * @param row[out] Row to encode into.
+ */
+void
+xrow_encode_get_gc_vclock(struct xrow_header *row);
+
+/**
  * Encode SUBSCRIBE command.
  * @param[out] Row.
  * @param replicaset_uuid Replica set uuid.
@@ -393,6 +400,21 @@ iproto_reply_ok(struct obuf *out, uint64_t sync, uint32_t schema_version);
  * @param sync Request sync.
  * @param schema_version.
  * @param vclock.
+ *
+ * @retval  0 Success.
+ * @retval -1 Memory error.
+ */
+int
+iproto_reply_vclock(struct obuf *out, uint64_t sync, uint32_t schema_version,
+		    const struct vclock *vclock);
+
+/**
+ * Encode iproto header with IPROTO_OK response code
+ * and vclock in the body.
+ * @param out Encode to.
+ * @param sync Request sync.
+ * @param schema_version.
+ * @param vclock.
  * @param read_only.
  *
  * @retval  0 Success.
@@ -646,6 +668,15 @@ iproto_reply_ok_xc(struct obuf *out, uint64_t sync, uint32_t schema_version)
 		diag_raise();
 }
 
+/** @copydoc iproto_reply_vclock. */
+static inline void
+iproto_reply_vclock_xc(struct obuf *out, uint64_t sync, uint32_t schema_version,
+		       const struct vclock *vclock)
+{
+	if (iproto_reply_vclock(out, sync, schema_version, vclock) != 0)
+		diag_raise();
+}
+
 /** @copydoc iproto_reply_request_vote_xc. */
 static inline void
 iproto_reply_request_vote_xc(struct obuf *out, uint64_t sync,
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (6 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 07/11] applier: inquire oldest vclock on connect Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:55   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 09/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If a replica has fallen too far behind its peers in the cluster and the
xlog files needed for it to catch up have already been removed, it
won't be able to proceed without rebootstrap. This patch makes the
recovery procedure detect such cases and initiate the rebootstrap
procedure if necessary.

Note that rebootstrap is currently supported only by the memtx engine.
If there are vinyl spaces on the replica, rebootstrap will fail. This
is fixed by the following patches.

Part of #461
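
An assumed example of the condition (numbers are made up): the replica
restarts and recovers replicaset.vclock = {1: 100} from its local WALs,
while the master reports gc_vclock = {1: 150}, i.e. its oldest kept
checkpoint already starts past the rows the replica is missing. The
xlogs covering rows 101..150 are gone, so replicaset_needs_rejoin()
returns true and the replica is rebootstrapped from that master instead
of subscribing.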
---
 src/box/box.cc                           |   9 ++
 src/box/replication.cc                   |  15 +++
 src/box/replication.h                    |   9 ++
 test/replication/replica_rejoin.result   | 198 +++++++++++++++++++++++++++++++
 test/replication/replica_rejoin.test.lua |  76 ++++++++++++
 test/replication/suite.cfg               |   1 +
 6 files changed, 308 insertions(+)
 create mode 100644 test/replication/replica_rejoin.result
 create mode 100644 test/replication/replica_rejoin.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index 0aaed562..b4ecb357 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1777,6 +1777,9 @@ bootstrap(const struct tt_uuid *instance_uuid,
 /**
  * Recover the instance from the local directory.
  * Enter hot standby if the directory is locked.
+ * Invoke rebootstrap if the instance fell too much
+ * behind its peers in the replica set and needs
+ * to be rebootstrapped.
  */
 static void
 local_recovery(const struct tt_uuid *instance_uuid,
@@ -1812,6 +1815,12 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	if (wal_dir_lock >= 0) {
 		box_listen();
 		box_sync_replication(replication_connect_timeout, false);
+
+		struct replica *master;
+		if (replicaset_needs_rejoin(&master)) {
+			say_info("replica is too old, initiating rejoin");
+			return bootstrap_from_master(master);
+		}
 	}
 
 	/*
diff --git a/src/box/replication.cc b/src/box/replication.cc
index c1e17698..0dda5dec 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -625,6 +625,21 @@ error:
 		  "failed to connect to one or more replicas");
 }
 
+bool
+replicaset_needs_rejoin(struct replica **master)
+{
+	replicaset_foreach(replica) {
+		if (replica->applier != NULL &&
+		    vclock_compare(&replica->applier->gc_vclock,
+				   &replicaset.vclock) > 0) {
+			*master = replica;
+			return true;
+		}
+	}
+	*master = NULL;
+	return false;
+}
+
 void
 replicaset_follow(void)
 {
diff --git a/src/box/replication.h b/src/box/replication.h
index fdf995c3..e8b391af 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -360,6 +360,15 @@ replicaset_connect(struct applier **appliers, int count,
 		   double timeout, bool connect_all);
 
 /**
+ * Check if the current instance fell too much behind its
+ * peers in the replica set and needs to be rebootstrapped.
+ * If it does, return true and set @master to the instance
+ * to use for rebootstrap, otherwise return false.
+ */
+bool
+replicaset_needs_rejoin(struct replica **master);
+
+/**
  * Resume all appliers registered with the replica set.
  */
 void
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
new file mode 100644
index 00000000..2148625c
--- /dev/null
+++ b/test/replication/replica_rejoin.result
@@ -0,0 +1,198 @@
+env = require('test_run')
+---
+...
+test_run = env.new()
+---
+...
+-- Cleanup the instance to remove vylog files left from previous
+-- tests, since vinyl doesn't support rebootstrap yet.
+test_run:cmd('restart server default with cleanup=1')
+--
+-- gh-461: check that a replica refetches the last checkpoint
+-- in case it fell behind the master.
+--
+box.schema.user.grant('guest', 'replication')
+---
+...
+_ = box.schema.space.create('test')
+---
+...
+_ = box.space.test:create_index('pk')
+---
+...
+_ = box.space.test:insert{1}
+---
+...
+_ = box.space.test:insert{2}
+---
+...
+_ = box.space.test:insert{3}
+---
+...
+-- Join a replica, then stop it.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [1]
+  - [2]
+  - [3]
+...
+_ = box.schema.space.create('replica') -- will disappear after rejoin
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Restart the server to purge the replica from
+-- the garbage collection state.
+test_run:cmd("restart server default")
+-- Make some checkpoints to remove old xlogs.
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+_ = box.space.test:delete{1}
+---
+...
+_ = box.space.test:insert{10}
+---
+...
+box.snapshot()
+---
+- ok
+...
+_ = box.space.test:delete{2}
+---
+...
+_ = box.space.test:insert{20}
+---
+...
+box.snapshot()
+---
+- ok
+...
+_ = box.space.test:delete{3}
+---
+...
+_ = box.space.test:insert{30}
+---
+...
+#box.info.gc().checkpoints -- 1
+---
+- 1
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Restart the replica. Since xlogs have been removed,
+-- it is supposed to rejoin without changing id.
+test_run:cmd("start server replica")
+---
+- true
+...
+box.info.replication[2].downstream.vclock ~= nil or box.info
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [10]
+  - [20]
+  - [30]
+...
+box.space.replica == nil -- was removed by rejoin
+---
+- true
+...
+_ = box.schema.space.create('replica')
+---
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Make sure the replica follows new changes.
+for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
+---
+...
+vclock = test_run:get_vclock('default')
+---
+...
+_ = test_run:wait_vclock('replica', vclock)
+---
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.space.test:select()
+---
+- - [10, 10]
+  - [20, 20]
+  - [30, 30]
+...
+-- Check that restart works as usual.
+test_run:cmd("restart server replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [10, 10]
+  - [20, 20]
+  - [30, 30]
+...
+box.space.replica ~= nil
+---
+- true
+...
+-- Cleanup.
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+box.space.test:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
new file mode 100644
index 00000000..ea38bf81
--- /dev/null
+++ b/test/replication/replica_rejoin.test.lua
@@ -0,0 +1,76 @@
+env = require('test_run')
+test_run = env.new()
+
+-- Cleanup the instance to remove vylog files left from previous
+-- tests, since vinyl doesn't support rebootstrap yet.
+test_run:cmd('restart server default with cleanup=1')
+
+--
+-- gh-461: check that a replica refetches the last checkpoint
+-- in case it fell behind the master.
+--
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test')
+_ = box.space.test:create_index('pk')
+_ = box.space.test:insert{1}
+_ = box.space.test:insert{2}
+_ = box.space.test:insert{3}
+
+-- Join a replica, then stop it.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+_ = box.schema.space.create('replica') -- will disappear after rejoin
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Restart the server to purge the replica from
+-- the garbage collection state.
+test_run:cmd("restart server default")
+
+-- Make some checkpoints to remove old xlogs.
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+_ = box.space.test:delete{1}
+_ = box.space.test:insert{10}
+box.snapshot()
+_ = box.space.test:delete{2}
+_ = box.space.test:insert{20}
+box.snapshot()
+_ = box.space.test:delete{3}
+_ = box.space.test:insert{30}
+#box.info.gc().checkpoints -- 1
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Restart the replica. Since xlogs have been removed,
+-- it is supposed to rejoin without changing id.
+test_run:cmd("start server replica")
+box.info.replication[2].downstream.vclock ~= nil or box.info
+test_run:cmd("switch replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+box.space.replica == nil -- was removed by rejoin
+_ = box.schema.space.create('replica')
+test_run:cmd("switch default")
+
+-- Make sure the replica follows new changes.
+for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
+vclock = test_run:get_vclock('default')
+_ = test_run:wait_vclock('replica', vclock)
+test_run:cmd("switch replica")
+box.space.test:select()
+
+-- Check that restart works as usual.
+test_run:cmd("restart server replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+box.space.replica ~= nil
+
+-- Cleanup.
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 95e94e5a..2b609f16 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,6 +6,7 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
+    "replica_rejoin.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 09/11] vinyl: simplify vylog recovery from backup
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (7 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
  2018-06-08 17:34 ` [PATCH v2 11/11] vinyl: implement rebootstrap support Vladimir Davydov
  10 siblings, 0 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Since we don't create snapshot files for the vylog, but instead append
records written after a checkpoint to the same file, we have to use the
previous vylog file for backup (see vy_log_backup_path()). So when
recovering from a backup, we need to rotate the last vylog to keep the
vylog and checkpoint signatures in sync. Currently, we do this on
recovery completion, and we use vy_log_create() instead of
vy_log_rotate() so that we can reuse the context built during recovery
instead of rereading the vylog for rotation. There is no real point in
this micro-optimization, though, because the vylog is rotated this way
only when recovering from a backup. Let's remove it and use
vy_log_rotate() instead.

Needed for #461
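
An assumed example (signatures are made up): the last two checkpoints
have signatures 100 and 150, and a backup of checkpoint 150 therefore
contains vylog 100 (see vy_log_backup_path()). After restoring such a
backup, the newest vylog signature (100) is less than the checkpoint
signature (150); this is the cmp < 0 case in the hunk below, which is
now handled simply by rotating the vylog to 150 with vy_log_rotate().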
---
 src/box/vy_log.c | 60 +++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 38 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 6556dd37..8330a26c 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -182,6 +182,12 @@ static int
 vy_recovery_process_record(struct vy_recovery *recovery,
 			   const struct vy_log_record *record);
 
+static int
+vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
+
+int
+vy_log_rotate(const struct vclock *vclock);
+
 /**
  * Return the name of the vylog file that has the given signature.
  */
@@ -866,10 +872,11 @@ vy_log_begin_recovery(const struct vclock *vclock)
 	if (xdir_scan(&vy_log.dir) < 0 && errno != ENOENT)
 		return NULL;
 
-	struct vclock vy_log_vclock;
-	vclock_create(&vy_log_vclock);
-	if (xdir_last_vclock(&vy_log.dir, &vy_log_vclock) >= 0 &&
-	    vclock_compare(&vy_log_vclock, vclock) > 0) {
+	if (xdir_last_vclock(&vy_log.dir, &vy_log.last_checkpoint) < 0)
+		vclock_copy(&vy_log.last_checkpoint, vclock);
+
+	int cmp = vclock_compare(&vy_log.last_checkpoint, vclock);
+	if (cmp > 0) {
 		/*
 		 * Last vy_log log is newer than the last snapshot.
 		 * This can't normally happen, as vy_log is rotated
@@ -879,21 +886,27 @@ vy_log_begin_recovery(const struct vclock *vclock)
 		diag_set(ClientError, ER_MISSING_SNAPSHOT);
 		return NULL;
 	}
+	if (cmp < 0) {
+		/*
+		 * Last vy_log log is older than the last snapshot.
+		 * This happens if we are recovering from a backup.
+		 * Rotate the log to keep its signature in sync with
+		 * checkpoint.
+		 */
+		if (vy_log_rotate(vclock) != 0)
+			return NULL;
+	}
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log_vclock), false);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), false);
 	if (recovery == NULL)
 		return NULL;
 
 	vy_log.next_id = recovery->max_id + 1;
 	vy_log.recovery = recovery;
-	vclock_copy(&vy_log.last_checkpoint, vclock);
 	return recovery;
 }
 
-static int
-vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
-
 int
 vy_log_end_recovery(void)
 {
@@ -914,35 +927,6 @@ vy_log_end_recovery(void)
 		return -1;
 	}
 
-	/*
-	 * On backup we copy files corresponding to the most recent
-	 * checkpoint. Since vy_log does not create snapshots of its log
-	 * files, but instead appends records written after checkpoint
-	 * to the most recent log file, the signature of the vy_log file
-	 * corresponding to the last checkpoint equals the signature
-	 * of the previous checkpoint. So upon successful recovery
-	 * from a backup we need to rotate the log to keep checkpoint
-	 * and vy_log signatures in sync.
-	 */
-	struct vclock *vclock = vclockset_last(&vy_log.dir.index);
-	if (vclock == NULL ||
-	    vclock_compare(vclock, &vy_log.last_checkpoint) != 0) {
-		vclock = malloc(sizeof(*vclock));
-		if (vclock == NULL) {
-			diag_set(OutOfMemory, sizeof(*vclock),
-				 "malloc", "struct vclock");
-			return -1;
-		}
-		vclock_copy(vclock, &vy_log.last_checkpoint);
-		xdir_add_vclock(&vy_log.dir, vclock);
-		if (vy_log_create(vclock, vy_log.recovery) < 0) {
-			diag_log();
-			say_error("failed to write `%s'",
-				  vy_log_filename(vclock_sum(vclock)));
-			return -1;
-		}
-	}
-
 	vy_log.recovery = NULL;
 	return 0;
 }
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (8 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 09/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-13 20:56   ` Konstantin Osipov
  2018-06-08 17:34 ` [PATCH v2 11/11] vinyl: implement rebootstrap support Vladimir Davydov
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, this function takes a single boolean argument, but I'm
planning to add another one. Since two bool arguments look rather
confusing, let's turn this argument into flags.

Needed for #461
---
 src/box/vinyl.c  |  8 +++++---
 src/box/vy_log.c | 19 +++++++++----------
 src/box/vy_log.h | 16 ++++++++++++----
 3 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index d2e3da7e..8cc2ab0e 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3229,7 +3229,8 @@ vinyl_engine_join(struct engine *engine, const struct vclock *vclock,
 	 * Send all runs stored in it to the replica.
 	 */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(vclock), true);
+	recovery = vy_recovery_new(vclock_sum(vclock),
+				   VY_RECOVERY_LOAD_CHECKPOINT);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog to join a replica");
 		goto out_join_cord;
@@ -3432,7 +3433,7 @@ vinyl_engine_collect_garbage(struct engine *engine, int64_t lsn)
 
 	/* Cleanup run files. */
 	int64_t signature = checkpoint_last(NULL);
-	struct vy_recovery *recovery = vy_recovery_new(signature, false);
+	struct vy_recovery *recovery = vy_recovery_new(signature, 0);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for garbage collection");
 		return 0;
@@ -3461,7 +3462,8 @@ vinyl_engine_backup(struct engine *engine, const struct vclock *vclock,
 
 	/* Backup run files. */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(vclock), true);
+	recovery = vy_recovery_new(vclock_sum(vclock),
+				   VY_RECOVERY_LOAD_CHECKPOINT);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for backup");
 		return -1;
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 8330a26c..e44db5b8 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -176,7 +176,7 @@ struct vy_log {
 static struct vy_log vy_log;
 
 static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, bool only_checkpoint);
+vy_recovery_new_locked(int64_t signature, int flags);
 
 static int
 vy_recovery_process_record(struct vy_recovery *recovery,
@@ -898,7 +898,7 @@ vy_log_begin_recovery(const struct vclock *vclock)
 	}
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), false);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), 0);
 	if (recovery == NULL)
 		return NULL;
 
@@ -980,7 +980,7 @@ vy_log_rotate(const struct vclock *vclock)
 	latch_lock(&vy_log.latch);
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(prev_signature, false);
+	recovery = vy_recovery_new_locked(prev_signature, 0);
 	if (recovery == NULL)
 		goto fail;
 
@@ -2007,7 +2007,7 @@ static ssize_t
 vy_recovery_new_f(va_list ap)
 {
 	int64_t signature = va_arg(ap, int64_t);
-	bool only_checkpoint = va_arg(ap, int);
+	int flags = va_arg(ap, int);
 	struct vy_recovery **p_recovery = va_arg(ap, struct vy_recovery **);
 
 	say_verbose("loading vylog %lld", (long long)signature);
@@ -2064,7 +2064,7 @@ vy_recovery_new_f(va_list ap)
 		say_verbose("load vylog record: %s",
 			    vy_log_record_str(&record));
 		if (record.type == VY_LOG_SNAPSHOT) {
-			if (only_checkpoint)
+			if ((flags & VY_RECOVERY_LOAD_CHECKPOINT) != 0)
 				break;
 			continue;
 		}
@@ -2099,7 +2099,7 @@ fail:
  * Must be called with the log latch held.
  */
 static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
+vy_recovery_new_locked(int64_t signature, int flags)
 {
 	int rc;
 	struct vy_recovery *recovery;
@@ -2117,8 +2117,7 @@ vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
 	}
 
 	/* Load the log from coio so as not to stall tx thread. */
-	rc = coio_call(vy_recovery_new_f, signature,
-		       (int)only_checkpoint, &recovery);
+	rc = coio_call(vy_recovery_new_f, signature, flags, &recovery);
 	if (rc != 0) {
 		diag_log();
 		say_error("failed to load `%s'", vy_log_filename(signature));
@@ -2128,12 +2127,12 @@ vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
 }
 
 struct vy_recovery *
-vy_recovery_new(int64_t signature, bool only_checkpoint)
+vy_recovery_new(int64_t signature, int flags)
 {
 	/* Lock out concurrent writers while we are loading the log. */
 	latch_lock(&vy_log.latch);
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(signature, only_checkpoint);
+	recovery = vy_recovery_new_locked(signature, flags);
 	latch_unlock(&vy_log.latch);
 	return recovery;
 }
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 0a216de8..cdac293e 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -520,18 +520,26 @@ vy_log_begin_recovery(const struct vclock *vclock);
 int
 vy_log_end_recovery(void);
 
+/** Flags passed to vy_recovery_new(). */
+enum vy_recovery_flag {
+	/**
+	 * Do not load records written to the log after checkpoint,
+	 * i.e. get a consistent view of vinyl database at the time
+	 * of the last checkpoint.
+	 */
+	VY_RECOVERY_LOAD_CHECKPOINT	= 1 << 0,
+};
+
 /**
  * Create a recovery context from the metadata log created
  * by checkpoint with the given signature.
  *
- * If @only_checkpoint is set, do not load records appended to
- * the log after checkpoint (i.e. get a consistent view of
- * Vinyl at the time of the checkpoint).
+ * For valid values of @flags, see vy_recovery_flag.
  *
  * Returns NULL on failure.
  */
 struct vy_recovery *
-vy_recovery_new(int64_t signature, bool only_checkpoint);
+vy_recovery_new(int64_t signature, int flags);
 
 /**
  * Free a recovery context created by vy_recovery_new().
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 11/11] vinyl: implement rebootstrap support
  2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
                   ` (9 preceding siblings ...)
  2018-06-08 17:34 ` [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
@ 2018-06-08 17:34 ` Vladimir Davydov
  2018-06-10 12:02   ` Vladimir Davydov
  10 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-08 17:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If vy_log_bootstrap() finds a vylog file in the vinyl directory, it
assumes it has to be rebootstrapped and calls vy_log_rebootstrap().
The latter scans the old vylog file to find the max vinyl object id,
from which it will start numbering objects created during rebootstrap to
avoid conflicts with old objects, then it writes VY_LOG_REBOOTSTRAP
record to the old vylog to denote the beginning of a rebootstrap
section. After that initial join proceeds as usual, writing information
about new objects to the old vylog file after VY_LOG_REBOOTSTRAP marker.
Upon successful rebootstrap completion, checkpoint, which is always
called right after bootstrap, rotates the old vylog and marks all
objects created before the VY_LOG_REBOOTSTRAP marker as dropped in the
new vylog. The old objects will be purged by the garbage collector as
usual.

In case rebootstrap fails and checkpoint never happens, local recovery
writes VY_LOG_ABORT_REBOOTSTRAP record to the vylog. This marker
indicates that the rebootstrap attempt failed and all objects created
during rebootstrap should be discarded. They will be purged by the
garbage collector on checkpoint. Thus even if rebootstrap fails, it is
possible to recover the database to the state that existed right before
a failed rebootstrap attempt.
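
Schematically (record payloads omitted; the exact record mix depends on
the workload), the old vylog of an instance that went through a failed
and then a successful rebootstrap attempt ends up looking like this:

  ... records of the original objects ...
  VY_LOG_REBOOTSTRAP           <- first (failed) attempt begins
  ... records of objects created during the failed attempt ...
  VY_LOG_ABORT_REBOOTSTRAP     <- written by local recovery after the failure
  VY_LOG_REBOOTSTRAP           <- second (successful) attempt begins
  ... records of objects created during the successful attempt ...

Everything recorded before the last VY_LOG_REBOOTSTRAP marker is then
marked as dropped on the next checkpoint, as described above.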

TODO: write a test checking that garbage collection works as expected.

Closes #461
---
 src/box/vy_log.c                         | 133 +++++++++++++++++++++++++++++--
 src/box/vy_log.h                         |  34 ++++++++
 test/replication/replica_rejoin.result   |  11 ++-
 test/replication/replica_rejoin.test.lua |   7 +-
 test/replication/suite.cfg               |   1 -
 5 files changed, 169 insertions(+), 17 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index e44db5b8..e01a802a 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -122,6 +122,8 @@ static const char *vy_log_type_name[] = {
 	[VY_LOG_MODIFY_LSM]		= "modify_lsm",
 	[VY_LOG_FORGET_LSM]		= "forget_lsm",
 	[VY_LOG_PREPARE_LSM]		= "prepare_lsm",
+	[VY_LOG_REBOOTSTRAP]		= "rebootstrap",
+	[VY_LOG_ABORT_REBOOTSTRAP]	= "abort_rebootstrap",
 };
 
 /** Metadata log object. */
@@ -835,17 +837,43 @@ vy_log_next_id(void)
 	return vy_log.next_id++;
 }
 
+/**
+ * If a vylog file already exists, we are doing a rebootstrap:
+ * - Load the vylog to find out the id to start indexing new
+ *   objects with.
+ * - Mark the beginning of a new rebootstrap attempt by writing
+ *   VY_LOG_REBOOTSTRAP record.
+ */
+static int
+vy_log_rebootstrap(void)
+{
+	struct vy_recovery *recovery;
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint),
+				   VY_RECOVERY_ABORT_REBOOTSTRAP);
+	if (recovery == NULL)
+		return -1;
+
+	vy_log.next_id = recovery->max_id + 1;
+	vy_recovery_delete(recovery);
+
+	struct vy_log_record record;
+	vy_log_record_init(&record);
+	record.type = VY_LOG_REBOOTSTRAP;
+	vy_log_tx_begin();
+	vy_log_write(&record);
+	if (vy_log_tx_commit() != 0)
+		return -1;
+
+	return 0;
+}
+
 int
 vy_log_bootstrap(void)
 {
-	/*
-	 * Scan the directory to make sure there is no
-	 * vylog files left from previous setups.
-	 */
 	if (xdir_scan(&vy_log.dir) < 0 && errno != ENOENT)
 		return -1;
-	if (xdir_last_vclock(&vy_log.dir, NULL) >= 0)
-		panic("vinyl directory is not empty");
+	if (xdir_last_vclock(&vy_log.dir, &vy_log.last_checkpoint) >= 0)
+		return vy_log_rebootstrap();
 
 	/* Add initial vclock to the xdir. */
 	struct vclock *vclock = malloc(sizeof(*vclock));
@@ -897,11 +925,29 @@ vy_log_begin_recovery(const struct vclock *vclock)
 			return NULL;
 	}
 
+	/*
+	 * If we are recovering from a vylog that has an unfinished
+	 * rebootstrap section, checkpoint (and hence rebootstrap)
+	 * failed, and we need to mark rebootstrap as aborted.
+	 */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), 0);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint),
+				   VY_RECOVERY_ABORT_REBOOTSTRAP);
 	if (recovery == NULL)
 		return NULL;
 
+	if (recovery->in_rebootstrap) {
+		struct vy_log_record record;
+		vy_log_record_init(&record);
+		record.type = VY_LOG_ABORT_REBOOTSTRAP;
+		vy_log_tx_begin();
+		vy_log_write(&record);
+		if (vy_log_tx_commit() != 0) {
+			vy_recovery_delete(recovery);
+			return NULL;
+		}
+	}
+
 	vy_log.next_id = recovery->max_id + 1;
 	vy_log.recovery = recovery;
 	return recovery;
@@ -1272,6 +1318,7 @@ vy_recovery_do_create_lsm(struct vy_recovery *recovery, int64_t id,
 	 * before the final version.
 	 */
 	rlist_add_tail_entry(&recovery->lsms, lsm, in_recovery);
+	lsm->in_rebootstrap = recovery->in_rebootstrap;
 	if (recovery->max_id < id)
 		recovery->max_id = id;
 	return lsm;
@@ -1852,6 +1899,42 @@ vy_recovery_delete_slice(struct vy_recovery *recovery, int64_t slice_id)
 }
 
 /**
+ * Mark all LSM trees created during rebootstrap as dropped so
+ * that they will be purged on the next garbage collection.
+ */
+static void
+vy_recovery_do_abort_rebootstrap(struct vy_recovery *recovery)
+{
+	struct vy_lsm_recovery_info *lsm;
+	rlist_foreach_entry(lsm, &recovery->lsms, in_recovery) {
+		if (lsm->in_rebootstrap) {
+			lsm->in_rebootstrap = false;
+			lsm->create_lsn = -1;
+			lsm->modify_lsn = -1;
+			lsm->drop_lsn = 0;
+		}
+	}
+}
+
+/** Handle a VY_LOG_REBOOTSTRAP log record. */
+static void
+vy_recovery_rebootstrap(struct vy_recovery *recovery)
+{
+	if (recovery->in_rebootstrap)
+		vy_recovery_do_abort_rebootstrap(recovery);
+	recovery->in_rebootstrap = true;
+}
+
+/** Handle VY_LOG_ABORT_REBOOTSTRAP record. */
+static void
+vy_recovery_abort_rebootstrap(struct vy_recovery *recovery)
+{
+	if (recovery->in_rebootstrap)
+		vy_recovery_do_abort_rebootstrap(recovery);
+	recovery->in_rebootstrap = false;
+}
+
+/**
  * Update a recovery context with a new log record.
  * Return 0 on success, -1 on failure.
  *
@@ -1862,7 +1945,7 @@ static int
 vy_recovery_process_record(struct vy_recovery *recovery,
 			   const struct vy_log_record *record)
 {
-	int rc;
+	int rc = 0;
 	switch (record->type) {
 	case VY_LOG_PREPARE_LSM:
 		rc = vy_recovery_prepare_lsm(recovery, record->lsm_id,
@@ -1926,6 +2009,12 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 		/* Not used anymore, ignore. */
 		rc = 0;
 		break;
+	case VY_LOG_REBOOTSTRAP:
+		vy_recovery_rebootstrap(recovery);
+		break;
+	case VY_LOG_ABORT_REBOOTSTRAP:
+		vy_recovery_abort_rebootstrap(recovery);
+		break;
 	default:
 		unreachable();
 	}
@@ -1936,6 +2025,26 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 }
 
 /**
+ * Commit the last rebootstrap attempt - drop all objects created
+ * before rebootstrap.
+ */
+static void
+vy_recovery_commit_rebootstrap(struct vy_recovery *recovery)
+{
+	assert(recovery->in_rebootstrap);
+	struct vy_lsm_recovery_info *lsm;
+	rlist_foreach_entry(lsm, &recovery->lsms, in_recovery) {
+		if (!lsm->in_rebootstrap && lsm->drop_lsn < 0) {
+			/*
+			 * The files will be removed when the current
+			 * checkpoint is purged by garbage collector.
+			 */
+			lsm->drop_lsn = vy_log_signature();
+		}
+	}
+}
+
+/**
  * Fill index_id_hash with LSM trees recovered from vylog.
  */
 static int
@@ -2026,6 +2135,7 @@ vy_recovery_new_f(va_list ap)
 	recovery->run_hash = NULL;
 	recovery->slice_hash = NULL;
 	recovery->max_id = -1;
+	recovery->in_rebootstrap = false;
 
 	recovery->index_id_hash = mh_i64ptr_new();
 	recovery->lsm_hash = mh_i64ptr_new();
@@ -2079,6 +2189,13 @@ vy_recovery_new_f(va_list ap)
 
 	xlog_cursor_close(&cursor, false);
 
+	if (recovery->in_rebootstrap) {
+		if ((flags & VY_RECOVERY_ABORT_REBOOTSTRAP) != 0)
+			vy_recovery_do_abort_rebootstrap(recovery);
+		else
+			vy_recovery_commit_rebootstrap(recovery);
+	}
+
 	if (vy_recovery_build_index_id_hash(recovery) != 0)
 		goto fail_free;
 out:
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index cdac293e..c724d36a 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -195,6 +195,27 @@ enum vy_log_record_type {
 	 * a VY_LOG_CREATE_LSM record to commit it.
 	 */
 	VY_LOG_PREPARE_LSM		= 15,
+	/**
+	 * This record denotes the beginning of a rebootstrap section.
+	 * A rebootstrap section ends either by another record of this
+	 * type or by VY_LOG_ABORT_REBOOTSTRAP or at the end of the file.
+	 * All objects created between a VY_LOG_REBOOTSTRAP record and
+	 * VY_LOG_ABORT_REBOOTSTRAP or another VY_LOG_REBOOTSTRAP are
+	 * considered to be garbage and marked as dropped on recovery.
+	 *
+	 * We write a record of this type if a vylog file already exists
+	 * at bootstrap time, which means we are going to rebootstrap.
+	 * If rebootstrap succeeds, we rotate the vylog on checkpoint and
+	 * mark all objects written before the last VY_LOG_REBOOTSTRAP
+	 * record as dropped in the rotated vylog. If rebootstrap fails,
+	 * we write VY_LOG_ABORT_REBOOTSTRAP on recovery.
+	 */
+	VY_LOG_REBOOTSTRAP		= 16,
+	/**
+	 * This record is written on recovery if rebootstrap failed.
+	 * See also VY_LOG_REBOOTSTRAP.
+	 */
+	VY_LOG_ABORT_REBOOTSTRAP	= 17,
 
 	vy_log_record_type_MAX
 };
@@ -273,6 +294,12 @@ struct vy_recovery {
 	 * or -1 in case no vinyl objects were recovered.
 	 */
 	int64_t max_id;
+	/**
+	 * Set if we are currently processing a rebootstrap section,
+	 * i.e. we encountered a VY_LOG_REBOOTSTRAP record and haven't
+	 * seen matching VY_LOG_ABORT_REBOOTSTRAP.
+	 */
+	bool in_rebootstrap;
 };
 
 /** LSM tree info stored in a recovery context. */
@@ -321,6 +348,8 @@ struct vy_lsm_recovery_info {
 	 * this one after successful ALTER.
 	 */
 	struct vy_lsm_recovery_info *prepared;
+	/** Set if this LSM tree was created during rebootstrap. */
+	bool in_rebootstrap;
 };
 
 /** Vinyl range info stored in a recovery context. */
@@ -528,6 +557,11 @@ enum vy_recovery_flag {
 	 * of the last checkpoint.
 	 */
 	VY_RECOVERY_LOAD_CHECKPOINT	= 1 << 0,
+	/**
+	 * Consider the last attempt to rebootstrap aborted even if
+	 * there's no VY_LOG_ABORT_REBOOTSTRAP record.
+	 */
+	VY_RECOVERY_ABORT_REBOOTSTRAP	= 1 << 1,
 };
 
 /**
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index 2148625c..e8b76056 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -4,9 +4,12 @@ env = require('test_run')
 test_run = env.new()
 ---
 ...
--- Cleanup the instance to remove vylog files left from previous
--- tests, since vinyl doesn't support rebootstrap yet.
-test_run:cmd('restart server default with cleanup=1')
+engine = test_run:get_cfg('engine')
+---
+...
+test_run:cleanup_cluster()
+---
+...
 --
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
@@ -14,7 +17,7 @@ test_run:cmd('restart server default with cleanup=1')
 box.schema.user.grant('guest', 'replication')
 ---
 ...
-_ = box.schema.space.create('test')
+_ = box.schema.space.create('test', {engine = engine})
 ---
 ...
 _ = box.space.test:create_index('pk')
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index ea38bf81..b598c4fb 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -1,16 +1,15 @@
 env = require('test_run')
 test_run = env.new()
+engine = test_run:get_cfg('engine')
 
--- Cleanup the instance to remove vylog files left from previous
--- tests, since vinyl doesn't support rebootstrap yet.
-test_run:cmd('restart server default with cleanup=1')
+test_run:cleanup_cluster()
 
 --
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
 --
 box.schema.user.grant('guest', 'replication')
-_ = box.schema.space.create('test')
+_ = box.schema.space.create('test', {engine = engine})
 _ = box.space.test:create_index('pk')
 _ = box.space.test:insert{1}
 _ = box.space.test:insert{2}
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 2b609f16..95e94e5a 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,7 +6,6 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
-    "replica_rejoin.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery
  2018-06-08 17:34 ` [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery Vladimir Davydov
@ 2018-06-08 17:51   ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-08 17:51 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> In order to find out if the current instance fell too much behind its
> peers in the cluster and so needs to be rebootstrapped, we need to
> connect it to remote peers before proceeding to local recovery. The
> problem is box.cfg.replication may have an entry corresponding to the
> instance itself so before connecting we have to start listening to
> incoming connections. Since an instance is supposed to send its uuid in
> the greeting message, we also have to initialize INSTANCE_UUID early,
> before we start local recovery. So this patch makes memtx engine
> constructor not only scan the snapshot directory, but also read the
> header of the most recent snapshot to initialize INSTANCE_UUID.

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 11/11] vinyl: implement rebootstrap support
  2018-06-08 17:34 ` [PATCH v2 11/11] vinyl: implement rebootstrap support Vladimir Davydov
@ 2018-06-10 12:02   ` Vladimir Davydov
  0 siblings, 0 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-10 12:02 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

On Fri, Jun 08, 2018 at 08:34:29PM +0300, Vladimir Davydov wrote:
> If vy_log_bootstrap() finds a vylog file in the vinyl directory, it
> assumes it has to be rebootstrapped and calls vy_log_rebootstrap().
> The latter scans the old vylog file to find the max vinyl object id,
> from which it will start numbering objects created during rebootstrap to
> avoid conflicts with old objects, then it writes VY_LOG_REBOOTSTRAP
> record to the old vylog to denote the beginning of a rebootstrap
> section. After that initial join proceeds as usual, writing information
> about new objects to the old vylog file after VY_LOG_REBOOTSTRAP marker.
> Upon successful rebootstrap completion, checkpoint, which is always
> called right after bootstrap, rotates the old vylog and marks all
> objects created before the VY_LOG_REBOOTSTRAP marker as dropped in the
> new vylog. The old objects will be purged by the garbage collector as
> usual.
> 
> In case rebootstrap fails and checkpoint never happens, local recovery
> writes VY_LOG_ABORT_REBOOTSTRAP record to the vylog. This marker
> indicates that the rebootstrap attempt failed and all objects created
> during rebootstrap should be discarded. They will be purged by the
> garbage collector on checkpoint. Thus even if rebootstrap fails, it is
> possible to recover the database to the state that existed right before
> a failed rebootstrap attempt.
> 
> TODO: write a test checking that garbage collection works as expected.

Here goes the test. Note, it needs the following pull requests merged
into test-run (updated on the branch):

  https://github.com/tarantool/test-run/pull/93

---

From 5c2127a1b600559e5e51f19a8b6bea1a75c5aad0 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Sat, 9 Jun 2018 20:31:18 +0300
Subject: [PATCH] test: check that gc works as expected after rebootstrap

Follow-up #461

diff --git a/src/box/relay.cc b/src/box/relay.cc
index a25cc540..985a3e5a 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -287,6 +287,9 @@ relay_final_join(struct replica *replica, int fd, uint64_t sync,
 	if (rc != 0)
 		diag_raise();
 
+	ERROR_INJECT(ERRINJ_RELAY_FINAL_JOIN,
+		     tnt_raise(ClientError, ER_INJECTION, "relay final join"));
+
 	ERROR_INJECT(ERRINJ_RELAY_FINAL_SLEEP, {
 		while (vclock_compare(stop_vclock, &replicaset.vclock) == 0)
 			fiber_sleep(0.001);
diff --git a/src/errinj.h b/src/errinj.h
index ab578274..e3adb7d7 100644
--- a/src/errinj.h
+++ b/src/errinj.h
@@ -97,6 +97,7 @@ struct errinj {
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_FINAL_SLEEP, ERRINJ_BOOL, {.bparam = false}) \
+	_(ERRINJ_RELAY_FINAL_JOIN, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_PORT_DUMP, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_META, ERRINJ_BOOL, {.bparam = false}) \
diff --git a/test/box/errinj.result b/test/box/errinj.result
index e25a4594..aad07c4c 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -58,6 +58,8 @@ errinj.info()
     state: false
   ERRINJ_VY_LOG_FLUSH_DELAY:
     state: false
+  ERRINJ_RELAY_FINAL_JOIN:
+    state: false
   ERRINJ_SNAP_COMMIT_DELAY:
     state: false
   ERRINJ_RELAY_FINAL_SLEEP:
diff --git a/test/vinyl/replica_rejoin.lua b/test/vinyl/replica_rejoin.lua
new file mode 100644
index 00000000..7cb7e09a
--- /dev/null
+++ b/test/vinyl/replica_rejoin.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+local replication = os.getenv("MASTER")
+if arg[1] == 'disable_replication' then
+    replication = nil
+end
+
+box.cfg({
+    replication     = replication,
+    vinyl_memory    = 1024 * 1024,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/vinyl/replica_rejoin.result b/test/vinyl/replica_rejoin.result
new file mode 100644
index 00000000..9116dfbb
--- /dev/null
+++ b/test/vinyl/replica_rejoin.result
@@ -0,0 +1,257 @@
+env = require('test_run')
+---
+...
+test_run = env.new()
+---
+...
+--
+-- gh-461: check that garbage collection works as expected
+-- after rebootstrap.
+--
+box.schema.user.grant('guest', 'replication')
+---
+...
+_ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
+---
+...
+_ = box.space.test:create_index('pk')
+---
+...
+pad = string.rep('x', 15 * 1024)
+---
+...
+for i = 1, 100 do box.space.test:replace{i, pad} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+-- Join a replica. Check its files.
+test_run:cmd("create server replica with rpl_master=default, script='vinyl/replica_rejoin.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000002.index
+  - 9000/0/00000000000000000002.run
+  - 9000/0/00000000000000000004.index
+  - 9000/0/00000000000000000004.run
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.space.test:delete(1)
+---
+...
+box.snapshot()
+---
+- ok
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Rebootstrap the replica. Check that old files are removed
+-- by garbage collector.
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000008.index
+  - 9000/0/00000000000000000008.run
+  - 9000/0/00000000000000000010.index
+  - 9000/0/00000000000000000010.run
+...
+box.space.test:count() -- 99
+---
+- 99
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.space.test:delete(2)
+---
+...
+box.snapshot()
+---
+- ok
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Make the master fail join after sending data. Check that
+-- files written during failed rebootstrap attempt are removed
+-- by garbage collector.
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', true)
+---
+- ok
+...
+test_run:cmd("start server replica") -- fail
+---
+- Can't start server 'replica'
+...
+test_run:cmd("start server replica") -- fail again
+---
+- Can't start server 'replica'
+...
+test_run:cmd("start server replica with args='disable_replication'")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000008.index
+  - 9000/0/00000000000000000008.run
+  - 9000/0/00000000000000000010.index
+  - 9000/0/00000000000000000010.run
+...
+box.space.test:count() -- 99
+---
+- 99
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', false)
+---
+- ok
+...
+-- Rebootstrap after several failed attempts and make sure
+-- old files are removed.
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000022.index
+  - 9000/0/00000000000000000022.run
+  - 9000/0/00000000000000000024.index
+  - 9000/0/00000000000000000024.run
+...
+box.space.test:count() -- 98
+---
+- 98
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Cleanup.
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+box.space.test:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/vinyl/replica_rejoin.test.lua b/test/vinyl/replica_rejoin.test.lua
new file mode 100644
index 00000000..61e8199d
--- /dev/null
+++ b/test/vinyl/replica_rejoin.test.lua
@@ -0,0 +1,88 @@
+env = require('test_run')
+test_run = env.new()
+
+--
+-- gh-461: check that garbage collection works as expected
+-- after rebootstrap.
+--
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
+_ = box.space.test:create_index('pk')
+pad = string.rep('x', 15 * 1024)
+for i = 1, 100 do box.space.test:replace{i, pad} end
+box.snapshot()
+
+-- Join a replica. Check its files.
+test_run:cmd("create server replica with rpl_master=default, script='vinyl/replica_rejoin.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+box.space.test:delete(1)
+box.snapshot()
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Rebootstrap the replica. Check that old files are removed
+-- by garbage collector.
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 99
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+box.space.test:delete(2)
+box.snapshot()
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Make the master fail join after sending data. Check that
+-- files written during failed rebootstrap attempt are removed
+-- by garbage collector.
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', true)
+test_run:cmd("start server replica") -- fail
+test_run:cmd("start server replica") -- fail again
+test_run:cmd("start server replica with args='disable_replication'")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 99
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', false)
+
+-- Rebootstrap after several failed attempts and make sure
+-- old files are removed.
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 98
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Cleanup.
+test_run:cmd("cleanup server replica")
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index ca964289..b9dae380 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -2,7 +2,7 @@
 core = tarantool
 description = vinyl integration tests
 script = vinyl.lua
-release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua
+release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua
 config = suite.cfg
 lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua
 use_unix_sockets = True

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 04/11] box: open the port before starting local recovery
  2018-06-08 17:34 ` [PATCH v2 04/11] box: open the port " Vladimir Davydov
@ 2018-06-13 20:43   ` Konstantin Osipov
  2018-06-14  8:31     ` Vladimir Davydov
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:43 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> In order to find out if the current instance fell too much behind its
> peers in the cluster and so needs to be re-bootstrapped we need to
> connect it to remote peers before proceeding to local recovery. The
> problem is box.cfg.replication may have an entry corresponding to the
> instance itself so before connecting we have to start listening to
> incoming connections. So this patch moves the call to box_listen()
> before recovery is started unless the instance is in hot standby mode.
> It also folds box_bind() into box_listen() as it is no longer needed
> as a separate function.

This may actually break some applications, since we now begin
processing requests before local recovery. 

But this was bound to happen sometime; we have always wanted
to separate networking and the database.

Plus I expect this to be needed for Vlad's work on the new
require('net.box').listen(), which should forward write requests
to the replica set leader by default.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/11] box: connect to remote peers before starting local recovery
  2018-06-08 17:34 ` [PATCH v2 05/11] box: connect to remote peers before starting local recovery Vladimir Davydov
@ 2018-06-13 20:45   ` Konstantin Osipov
  2018-06-14  8:34     ` Vladimir Davydov
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:45 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> box_sync_replication() can now be called before recovery, right after
> box_listen(). This is a step toward detecting if the instance fell too
> much behind its peers in the cluster and so needs to be rebootstrapped.

Shouldn't you move setting of the orphan status to someplace
before box_sync_replication()?


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 06/11] box: factor out local recovery function
  2018-06-08 17:34 ` [PATCH v2 06/11] box: factor out local recovery function Vladimir Davydov
@ 2018-06-13 20:50   ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:50 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
>  - Factor out local_recovery() from box_cfg_xc(). Make it setup
>    replication and handle local recovery and hot standby cases.
>  - Move replication setup in case of initial bootstrap from box_cfg_xc()
>    to bootstrap() to make bootstrap() consistent with local_recovery().
>  - Move initial snapshot creation from bootstrap() to bootsrap_master()
>    and bootstrap_from_master().
> 
> Needed for #461

OK to push assuming it's still needed when review fixes for the
following patches are done.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 07/11] applier: inquire oldest vclock on connect
  2018-06-08 17:34 ` [PATCH v2 07/11] applier: inquire oldest vclock on connect Vladimir Davydov
@ 2018-06-13 20:51   ` Konstantin Osipov
  2018-06-14  8:40     ` Vladimir Davydov
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:51 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> Introduce a new iproto command IPROTO_GET_GC_VCLOCK that returns the
> vclock of the oldest checkpoint available at the master. Use this
> command when applier is connected to set applier->gc_vclock. We will
> need it to check whether a replica fell too much behind its peers in
> the cluster and so needs to be rebootstrapped.

Why do you think it's better to have a separate command rather
than folding it into iproto_request_vote?


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind
  2018-06-08 17:34 ` [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
@ 2018-06-13 20:55   ` Konstantin Osipov
  2018-06-14  8:58     ` Vladimir Davydov
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:55 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> If a replica fell too much behind its peers in the cluster and xlog
> files needed for it to get up to speed have been removed, it won't be
> able to proceed without rebootstrap. This patch makes the recovery
> procedure detect such cases and initiate rebootstrap procedure if
> necessary.
> 
> Note, rebootstrap is currently only supported by memtx engine. If there
> are vinyl spaces on the replica, rebootstrap will fail. This is fixed by
> the following patches.

A nitpick, but this makes the whole point of factoring out local
recovery less valid.


If local_recovery() can fall back to bootstrap_from_master(), then
the name is misleading.

Please make sure the control flow and decision making stays in box_cfg().

>  
> +bool
> +replicaset_needs_rejoin(struct replica **master)
> +{
> +	replicaset_foreach(replica) {
> +		if (replica->applier != NULL &&
> +		    vclock_compare(&replica->applier->gc_vclock,
> +				   &replicaset.vclock) > 0) {
> +			*master = replica;
> +			return true;
> +		}
> +	}
> +	*master = NULL;
> +	return false;
> +}

Intuitively this function should return true only if none of the
masters can provide the replica with the necessary logs, not if just
*any* one of them can't.
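
Something along these lines, perhaps (an untested sketch, just to
illustrate the idea - rejoin only when every connected master reports
that the xlogs we need are already gone, and pick any such master):

bool
replicaset_needs_rejoin(struct replica **master)
{
	struct replica *leader = NULL;
	replicaset_foreach(replica) {
		if (replica->applier == NULL)
			continue;
		if (vclock_compare(&replica->applier->gc_vclock,
				   &replicaset.vclock) > 0) {
			/* This master has already purged the xlogs we need. */
			leader = replica;
			continue;
		}
		/* At least one master can still feed us - no rejoin. */
		*master = NULL;
		return false;
	}
	*master = leader;
	return leader != NULL;
}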

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new
  2018-06-08 17:34 ` [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
@ 2018-06-13 20:56   ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-13 20:56 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> Currently, this function takes a single boolean argument, but I'm
> planning to add another one. Since two bool arguments look rather
> confusing, let's turn this argument into flags.

This is obviously OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 04/11] box: open the port before starting local recovery
  2018-06-13 20:43   ` Konstantin Osipov
@ 2018-06-14  8:31     ` Vladimir Davydov
  2018-06-14 12:59       ` Konstantin Osipov
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-14  8:31 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Jun 13, 2018 at 11:43:42PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> > In order to find out if the current instance fell too much behind its
> > peers in the cluster and so needs to be re-bootstrapped we need to
> > connect it to remote peers before proceeding to local recovery. The
> > problem is box.cfg.replication may have an entry corresponding to the
> > instance itself so before connecting we have to start listening to
> > incoming connections. So this patch moves the call to box_listen()
> > before recovery is started unless the instance is in hot standby mode.
> > It also folds box_bind() into box_listen() as it is no longer needed
> > as a separate function.
> 
> This may actually break some applications, since we now begin
> processing requests before local recovery. 
> 
> But this was bound to happen sometime, we have always had wanted
> to separate networking and the database.
> 
> Plus I expect this to be needed for Vlad's work on new
> require('net.box').listen(), which should forward write requests
> to replica set leader by default.
> 
> OK to push.

Please review patch 3 first so that I can push this one.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/11] box: connect to remote peers before starting local recovery
  2018-06-13 20:45   ` Konstantin Osipov
@ 2018-06-14  8:34     ` Vladimir Davydov
  2018-06-14 12:59       ` Konstantin Osipov
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-14  8:34 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Jun 13, 2018 at 11:45:45PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> > box_sync_replication() can now be called before recovery, right after
> > box_listen(). This is a step toward detecting if the instance fell too
> > much behind its peers in the cluster and so needs to be rebootstrapped.
> 
> Shouldn't you move setting of the orphan status to someplace
> before box_sync_replication()?

I don't think it's necessary - the instance is in 'loading' state at
that time, because it hasn't read xlogs yet. Let's think of 'orphan'
state as 'recovery complete, but the instance is still syncing'?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 07/11] applier: inquire oldest vclock on connect
  2018-06-13 20:51   ` Konstantin Osipov
@ 2018-06-14  8:40     ` Vladimir Davydov
  0 siblings, 0 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-14  8:40 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Jun 13, 2018 at 11:51:39PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> > Introduce a new iproto command IPROTO_GET_GC_VCLOCK that returns the
> > vclock of the oldest checkpoint available at the master. Use this
> > command when applier is connected to set applier->gc_vclock. We will
> > need it to check whether a replica fell too much behind its peers in
> > the cluster and so needs to be rebootstrapped.
> 
> Why do you think it's better to have a separate command rather
> than folding it into iproto_request_vote?

Because IPROTO_REQUEST_VOTE returns a vclock and so the IPROTO_VCLOCK key is
busy. If we decided to reuse that command, we would have to add a new
key, IPROTO_GC_VCLOCK or something like that, which doesn't look good to
me. Besides, generally speaking, master election doesn't have anything to do
with garbage collection AFAICT. Anyway, if you really think we'd better
reuse IPROTO_REQUEST_VOTE, I won't really mind. What do you think?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind
  2018-06-13 20:55   ` Konstantin Osipov
@ 2018-06-14  8:58     ` Vladimir Davydov
  0 siblings, 0 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-14  8:58 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Wed, Jun 13, 2018 at 11:55:27PM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> > If a replica fell too much behind its peers in the cluster and xlog
> > files needed for it to get up to speed have been removed, it won't be
> > able to proceed without rebootstrap. This patch makes the recovery
> > procedure detect such cases and initiate rebootstrap procedure if
> > necessary.
> > 
> > Note, rebootstrap is currently only supported by memtx engine. If there
> > are vinyl spaces on the replica, rebootstrap will fail. This is fixed by
> > the following patches.
> 
> A nitpick, but this makes the whole point of factoring out local
> recovery less valid.
> 
> 
> If local_recovery() can fall back to bootstrap_from_master(), then
> the name is misleading.
> 
> Please make sure the control flow and decision making stays in box_cfg().

I rather disagree. Rejoin can be thought of as an optional step of the
local recovery process. Although we do overwrite local data, we don't
purge it - it is still in the local directory and can be recovered upon
request. I can, of course, move bootstrap_from_master to box_cfg, but
frankly I think it would make the code look worse.

> 
> >  
> > +bool
> > +replicaset_needs_rejoin(struct replica **master)
> > +{
> > +	replicaset_foreach(replica) {
> > +		if (replica->applier != NULL &&
> > +		    vclock_compare(&replica->applier->gc_vclock,
> > +				   &replicaset.vclock) > 0) {
> > +			*master = replica;
> > +			return true;
> > +		}
> > +	}
> > +	*master = NULL;
> > +	return false;
> > +}
> 
> Intuitively this function should return true only if none of the
> masters can provide it with necessary logs, not *any* of the
> masters.

Well, yes, but if an instance fell behind a master, it won't retry to
subscribe to it, even after it has synced up with a master that has all
xlogs available. I thought about fixing it in the future while leaving
it plain and simple for now - rejoin if at least one master reports the
replica is stale...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 03/11] box: retrieve end vclock before starting local recovery
  2018-06-08 17:34 ` [PATCH v2 03/11] box: retrieve end vclock before starting local recovery Vladimir Davydov
@ 2018-06-14 12:58   ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-14 12:58 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> In order to find out if the current instance fell too much behind its
> peers in the cluster and so needs to be rebootstrapped, we need to know
> its vclock before we start local recovery. To do that, let's scan the
> most recent xlog. In future, we can optimize that by either storing end
> vclock in xlog eof marker or by making a new xlog on server stop.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 04/11] box: open the port before starting local recovery
  2018-06-14  8:31     ` Vladimir Davydov
@ 2018-06-14 12:59       ` Konstantin Osipov
  2018-06-15 15:48         ` [PATCH 0/3] Speed up recovery in case rebootstrap is not needed Vladimir Davydov
  0 siblings, 1 reply; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-14 12:59 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/14 11:33]:
> > 
> > This may actually break some applications, since we now begin
> > processing requests before local recovery. 
> > 
> > But this was bound to happen sometime; we have always wanted
> > to separate networking and the database.
> > 
> > Plus I expect this to be needed for Vlad's work on the new
> > require('net.box').listen(), which should forward write requests
> > to the replica set leader by default.
> > 
> > OK to push.
> 
> Please review patch 3 before I can push this one.

Patch #3 is OK to push, but it needs an immediate follow-up patch
which speeds recovery up for most cases.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 05/11] box: connect to remote peers before starting local recovery
  2018-06-14  8:34     ` Vladimir Davydov
@ 2018-06-14 12:59       ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-14 12:59 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/14 11:38]:
> On Wed, Jun 13, 2018 at 11:45:45PM +0300, Konstantin Osipov wrote:
> > * Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/08 20:38]:
> > > box_sync_replication() can now be called before recovery, right after
> > > box_listen(). This is a step toward detecting if the instance fell too
> > > much behind its peers in the cluster and so needs to be rebootstrapped.
> > 
> > Shouldn't you move setting of the orphan status to someplace
> > before box_sync_replication()?
> 
> I don't think it's necessary - the instance is in 'loading' state at
> that time, because it hasn't read xlogs yet. Let's think of 'orphan'
> state as 'recovery complete, but the instance is still syncing'?

OK, I see.
-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 0/3] Speed up recovery in case rebootstrap is not needed
  2018-06-14 12:59       ` Konstantin Osipov
@ 2018-06-15 15:48         ` Vladimir Davydov
  2018-06-15 15:48           ` [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing Vladimir Davydov
                             ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-15 15:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The following patches address the comment by @kostja to the commit that
added an extra scan of the last WAL file in order to find the instance vclock
before proceeding to local recovery, which is needed to determine if the
instance fell too much behind its peers in the cluster and so should be
rebootstrapped:

On Thu, Jun 14, 2018 at 03:59:20PM +0300, Konstantin Osipov wrote:
> Patch #3 is OK to push, but it needs an immediate follow-up patch
> which speeds recovery up for most cases.

To speed up recovery, this patch set makes tarantool create a new empty
xlog file on shutdown and reopen it on restart.

https://github.com/tarantool/tarantool/commits/gh-461-replica-rejoin

Vladimir Davydov (3):
  xlog: erase eof marker when reopening existing file for writing
  wal: rollback vclock on write failure
  wal: create empty xlog on shutdown

 src/box/recovery.cc                   | 23 --------------
 src/box/wal.c                         | 58 +++++++++++++++++++++++++++++++++--
 src/box/xlog.c                        | 12 +++++---
 test/replication/hot_standby.result   | 12 ++++----
 test/replication/hot_standby.test.lua |  4 +--
 test/xlog-py/dup_key.result           | 20 +++---------
 test/xlog-py/dup_key.test.py          | 29 ++++++------------
 test/xlog/panic_on_lsn_gap.result     | 34 +++++++++-----------
 test/xlog/panic_on_lsn_gap.test.lua   | 15 +++------
 test/xlog/panic_on_wal_error.result   | 23 +-------------
 test/xlog/panic_on_wal_error.test.lua |  9 +-----
 11 files changed, 106 insertions(+), 133 deletions(-)

-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing
  2018-06-15 15:48         ` [PATCH 0/3] Speed up recovery in case rebootstrap is not needed Vladimir Davydov
@ 2018-06-15 15:48           ` Vladimir Davydov
  2018-06-27 17:09             ` Konstantin Osipov
  2018-06-15 15:48           ` [PATCH 2/3] wal: rollback vclock on write failure Vladimir Davydov
  2018-06-15 15:48           ` [PATCH 3/3] wal: create empty xlog on shutdown Vladimir Davydov
  2 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-15 15:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

When reopening an existing xlog file (as in the case of vylog), we do not
erase the eof marker immediately. Instead we reposition the file offset
to (file_size - sizeof eof_marker), assuming the eof marker will be
overwritten on the first write.

However, this isn't enough if we want to reuse this function for reopening
WAL files, because when scanning the WAL directory we close a file once we
read the eof marker and never reopen it again, see recover_remaining_wals().
So to avoid skipping rows written to a once-closed WAL, we have to erase
the eof marker when reopening an xlog file. Let's do it with truncate().
---
 src/box/xlog.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/box/xlog.c b/src/box/xlog.c
index be7b3459..142262b4 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -787,10 +787,7 @@ xlog_open(struct xlog *xlog, const char *name)
 		goto err_read;
 	}
 
-	/*
-	 * If the file has eof marker, reposition the file pointer so
-	 * that the next write will overwrite it.
-	 */
+	/* Check if the file has EOF marker. */
 	xlog->offset = fio_lseek(xlog->fd, -(off_t)sizeof(magic), SEEK_END);
 	if (xlog->offset < 0)
 		goto no_eof;
@@ -809,6 +806,13 @@ no_eof:
 				 xlog->filename);
 			goto err_read;
 		}
+	} else {
+		/* Truncate the file to erase the EOF marker. */
+		if (ftruncate(xlog->fd, xlog->offset) != 0) {
+			diag_set(SystemError, "failed to truncate file '%s'",
+				 xlog->filename);
+			goto err_read;
+		}
 	}
 	return 0;
 err_read:
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/3] wal: rollback vclock on write failure
  2018-06-15 15:48         ` [PATCH 0/3] Speed up recovery in case rebootstrap is not needed Vladimir Davydov
  2018-06-15 15:48           ` [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing Vladimir Davydov
@ 2018-06-15 15:48           ` Vladimir Davydov
  2018-06-27 17:22             ` Konstantin Osipov
  2018-06-15 15:48           ` [PATCH 3/3] wal: create empty xlog on shutdown Vladimir Davydov
  2 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-15 15:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to determine whether we need to rebootstrap the instance on
startup, we need to know its vclock. To find it out, we scan the last
xlog file before proceeding to local recovery, but this means that when
rebootstrap is not required we scan the last xlog twice, which is
sub-optimal. To avoid the double scan, we can create a new empty xlog
before shutting down the server and reopen it after restart. However,
since we promote the WAL writer vclock even if an xlog write fails,
there will be an LSN gap between the last xlog and the one created on
shutdown if we failed to write the last few records. To avoid that,
let's roll back the WAL writer vclock if a write fails. BTW this will
make it consistent with the replicaset vclock - see commit 3c4bac715960a
("Follow vclock only for success wal writes").
---
 src/box/wal.c                       |  8 +++++++-
 test/xlog/panic_on_lsn_gap.result   | 33 +++++++++++++--------------------
 test/xlog/panic_on_lsn_gap.test.lua | 15 ++++-----------
 3 files changed, 24 insertions(+), 32 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index f6b0fa66..1c6d2422 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -637,14 +637,18 @@ wal_write_to_disk(struct cmsg *msg)
 	 */
 	struct journal_entry *entry;
 	struct stailq_entry *last_committed = NULL;
+	struct vclock last_committed_vclock;
+	vclock_copy(&last_committed_vclock, &writer->vclock);
 	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
 		wal_assign_lsn(writer, entry->rows, entry->rows + entry->n_rows);
 		entry->res = vclock_sum(&writer->vclock);
 		int rc = xlog_write_entry(l, entry);
 		if (rc < 0)
 			goto done;
-		if (rc > 0)
+		if (rc > 0) {
 			last_committed = &entry->fifo;
+			vclock_copy(&last_committed_vclock, &writer->vclock);
+		}
 		/* rc == 0: the write is buffered in xlog_tx */
 	}
 	if (xlog_flush(l) < 0)
@@ -670,6 +674,8 @@ done:
 	stailq_cut_tail(&wal_msg->commit, last_committed, &rollback);
 
 	if (!stailq_empty(&rollback)) {
+		/* Reset WAL writer vclock. */
+		vclock_copy(&writer->vclock, &last_committed_vclock);
 		/* Update status of the successfully committed requests. */
 		stailq_foreach_entry(entry, &rollback, fifo)
 			entry->res = -1;
diff --git a/test/xlog/panic_on_lsn_gap.result b/test/xlog/panic_on_lsn_gap.result
index 313850a6..731eec4e 100644
--- a/test/xlog/panic_on_lsn_gap.result
+++ b/test/xlog/panic_on_lsn_gap.result
@@ -31,10 +31,6 @@ box.info.vclock
 s = box.space._schema
 ---
 ...
--- we need to have at least one record in the
--- xlog otherwise the server believes that there
--- is an lsn gap during recovery.
---
 s:replace{"key", 'test 1'}
 ---
 - ['key', 'test 1']
@@ -83,8 +79,8 @@ t
   - Failed to write to disk
 ...
 --
--- Before restart: oops, our LSN is 11,
--- even though we didn't insert anything.
+-- Before restart: our LSN is 1, because
+-- we didn't insert anything.
 --
 name = string.match(arg[0], "([^,]+)%.lua")
 ---
@@ -100,8 +96,7 @@ require('fio').glob(name .. "/*.xlog")
 test_run:cmd("restart server panic")
 --
 -- after restart: our LSN is the LSN of the
--- last *written* row, all the failed
--- rows are gone from lsn counter.
+-- last written row, i.e. 1 again.
 --
 box.info.vclock
 ---
@@ -161,9 +156,7 @@ box.error.injection.set("ERRINJ_WAL_WRITE", false)
 ...
 --
 -- Write a good row after a series of failed
--- rows. There is a gap in LSN, correct,
--- but it's *inside* a single WAL, so doesn't
--- affect WAL search in recover_remaining_wals()
+-- rows. There is no gap in LSN.
 --
 s:replace{'key', 'test 2'}
 ---
@@ -176,12 +169,12 @@ s:replace{'key', 'test 2'}
 --
 box.info.vclock
 ---
-- {1: 12}
+- {1: 2}
 ...
 test_run:cmd("restart server panic")
 box.info.vclock
 ---
-- {1: 12}
+- {1: 2}
 ...
 box.space._schema:select{'key'}
 ---
@@ -217,7 +210,7 @@ require('fio').glob(name .. "/*.xlog")
 ---
 - - panic/00000000000000000000.xlog
   - panic/00000000000000000001.xlog
-  - panic/00000000000000000012.xlog
+  - panic/00000000000000000002.xlog
 ...
 box.error.injection.set("ERRINJ_WAL_WRITE", true)
 ---
@@ -229,14 +222,14 @@ box.space._schema:replace{"key", 'test 3'}
 ...
 box.info.vclock
 ---
-- {1: 22}
+- {1: 12}
 ...
 require('fio').glob(name .. "/*.xlog")
 ---
 - - panic/00000000000000000000.xlog
   - panic/00000000000000000001.xlog
+  - panic/00000000000000000002.xlog
   - panic/00000000000000000012.xlog
-  - panic/00000000000000000022.xlog
 ...
 -- and the next one (just to be sure
 box.space._schema:replace{"key", 'test 3'}
@@ -245,14 +238,14 @@ box.space._schema:replace{"key", 'test 3'}
 ...
 box.info.vclock
 ---
-- {1: 22}
+- {1: 12}
 ...
 require('fio').glob(name .. "/*.xlog")
 ---
 - - panic/00000000000000000000.xlog
   - panic/00000000000000000001.xlog
+  - panic/00000000000000000002.xlog
   - panic/00000000000000000012.xlog
-  - panic/00000000000000000022.xlog
 ...
 box.error.injection.set("ERRINJ_WAL_WRITE", false)
 ---
@@ -265,14 +258,14 @@ box.space._schema:replace{"key", 'test 4'}
 ...
 box.info.vclock
 ---
-- {1: 25}
+- {1: 13}
 ...
 require('fio').glob(name .. "/*.xlog")
 ---
 - - panic/00000000000000000000.xlog
   - panic/00000000000000000001.xlog
+  - panic/00000000000000000002.xlog
   - panic/00000000000000000012.xlog
-  - panic/00000000000000000022.xlog
 ...
 -- restart is ok
 test_run:cmd("restart server panic")
diff --git a/test/xlog/panic_on_lsn_gap.test.lua b/test/xlog/panic_on_lsn_gap.test.lua
index 248a3e63..7f16d68e 100644
--- a/test/xlog/panic_on_lsn_gap.test.lua
+++ b/test/xlog/panic_on_lsn_gap.test.lua
@@ -13,10 +13,6 @@ test_run:cmd("start server panic")
 test_run:cmd("switch panic")
 box.info.vclock
 s = box.space._schema
--- we need to have at least one record in the
--- xlog otherwise the server believes that there
--- is an lsn gap during recovery.
---
 s:replace{"key", 'test 1'}
 box.info.vclock
 box.error.injection.set("ERRINJ_WAL_WRITE", true)
@@ -34,8 +30,8 @@ end;
 test_run:cmd("setopt delimiter ''");
 t
 --
--- Before restart: oops, our LSN is 11,
--- even though we didn't insert anything.
+-- Before restart: our LSN is 1, because
+-- we didn't insert anything.
 --
 name = string.match(arg[0], "([^,]+)%.lua")
 box.info.vclock
@@ -43,8 +39,7 @@ require('fio').glob(name .. "/*.xlog")
 test_run:cmd("restart server panic")
 --
 -- after restart: our LSN is the LSN of the
--- last *written* row, all the failed
--- rows are gone from lsn counter.
+-- last written row, i.e. 1 again.
 --
 box.info.vclock
 box.space._schema:select{'key'}
@@ -65,9 +60,7 @@ box.info.vclock
 box.error.injection.set("ERRINJ_WAL_WRITE", false)
 --
 -- Write a good row after a series of failed
--- rows. There is a gap in LSN, correct,
--- but it's *inside* a single WAL, so doesn't
--- affect WAL search in recover_remaining_wals()
+-- rows. There is no gap in LSN.
 --
 s:replace{'key', 'test 2'}
 --
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 3/3] wal: create empty xlog on shutdown
  2018-06-15 15:48         ` [PATCH 0/3] Speed up recovery in case rebootstrap is not needed Vladimir Davydov
  2018-06-15 15:48           ` [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing Vladimir Davydov
  2018-06-15 15:48           ` [PATCH 2/3] wal: rollback vclock on write failure Vladimir Davydov
@ 2018-06-15 15:48           ` Vladimir Davydov
  2018-06-27 17:29             ` Konstantin Osipov
  2 siblings, 1 reply; 34+ messages in thread
From: Vladimir Davydov @ 2018-06-15 15:48 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to determine whether we need to rebootstrap the instance on
startup, we need to know its vclock. To find it out, we scan the last
xlog file before proceeding to local recovery, but this means that when
rebootstrap is not required we scan the last xlog twice, which is
sub-optimal. To speed up this procedure, let's create a new empty xlog
before shutting down the server and reopen it after restart.
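
Roughly, the shutdown side boils down to the following sketch (the helper
name and the plain-file creation are made up for illustration; the real
change lives in wal.c and also writes a proper xlog header):

#include <stdio.h>
#include <unistd.h>

/*
 * Xlog file names encode the vclock signature, so leaving an empty
 * file behind on shutdown lets the next start learn the vclock
 * without rescanning the rows of the previous xlog.
 */
static void
leave_empty_xlog(const char *wal_dir, long long vclock_signature)
{
	char path[4096];
	snprintf(path, sizeof(path), "%s/%020lld.xlog",
		 wal_dir, vclock_signature);
	if (access(path, F_OK) == 0)
		return; /* already there, e.g. created by a prior shutdown */
	FILE *f = fopen(path, "w");
	if (f != NULL)
		fclose(f); /* the real code also writes the xlog header */
}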
---
 src/box/recovery.cc                   | 23 ----------------
 src/box/wal.c                         | 50 +++++++++++++++++++++++++++++++++--
 test/replication/hot_standby.result   | 12 ++++-----
 test/replication/hot_standby.test.lua |  4 +--
 test/xlog-py/dup_key.result           | 20 ++++----------
 test/xlog-py/dup_key.test.py          | 29 +++++++-------------
 test/xlog/panic_on_lsn_gap.result     |  1 +
 test/xlog/panic_on_wal_error.result   | 23 +---------------
 test/xlog/panic_on_wal_error.test.lua |  9 +------
 9 files changed, 74 insertions(+), 97 deletions(-)

diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index eb77476d..1f7a11e6 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -339,29 +339,6 @@ void
 recovery_finalize(struct recovery *r)
 {
 	recovery_close_log(r);
-
-	/*
-	 * Check if next xlog exists. If it's true this xlog is
-	 * corrupted and we should rename it (to avoid getting
-	 * problem on the next xlog write with the same name).
-	 * Possible reasons are:
-	 *  - last xlog has corrupted rows
-	 *  - last xlog has corrupted header
-	 *  - last xlog has zero size
-	 */
-	char *name = xdir_format_filename(&r->wal_dir,
-					  vclock_sum(&r->vclock),
-					  NONE);
-	if (access(name, F_OK) == 0) {
-		say_info("rename corrupted xlog %s", name);
-		char to[PATH_MAX];
-		snprintf(to, sizeof(to), "%s.corrupted", name);
-		if (rename(name, to) != 0) {
-			tnt_raise(SystemError,
-				  "%s: can't rename corrupted xlog",
-				  name);
-		}
-	}
 }
 
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 1c6d2422..1456d3e7 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -310,6 +310,39 @@ wal_thread_start()
 	cpipe_set_max_input(&wal_thread.wal_pipe, IOV_MAX);
 }
 
+static int
+wal_init_f(struct cbus_call_msg *msg)
+{
+	(void)msg;
+
+	struct wal_writer *writer = &wal_writer_singleton;
+
+	/*
+	 * Check if the next WAL file already exists. If it does,
+	 * it must have been created on shutdown, try to reopen it.
+	 */
+	const char *path = xdir_format_filename(&writer->wal_dir,
+				vclock_sum(&writer->vclock), NONE);
+	if (access(path, F_OK) == 0) {
+		if (xlog_open(&writer->current_wal, path) == 0)
+			return 0;
+		/*
+		 * The WAL file seems to be corrupted. Rename it
+		 * so that we can proceed.
+		 */
+		say_info("rename corrupted %s", path);
+		char new_path[PATH_MAX];
+		snprintf(new_path, sizeof(new_path), "%s.corrupted", path);
+		if (rename(path, new_path) != 0) {
+			diag_set(SystemError,
+				 "%s: can't rename corrupted xlog", path);
+			diag_log();
+			return -1;
+		}
+	}
+	return 0;
+}
+
 /**
  * Initialize WAL writer.
  *
@@ -332,6 +365,11 @@ wal_init(enum wal_mode wal_mode, const char *wal_dirname,
 	if (xdir_scan(&writer->wal_dir))
 		return -1;
 
+	struct cbus_call_msg msg;
+	if (cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_pipe, &msg,
+		      wal_init_f, NULL, TIMEOUT_INFINITY) != 0)
+		return -1;
+
 	journal_set(&writer->base);
 	return 0;
 }
@@ -382,8 +420,7 @@ wal_checkpoint_f(struct cmsg *data)
 
 		xlog_close(&writer->current_wal, false);
 		/*
-		 * Avoid creating an empty xlog if this is the
-		 * last snapshot before shutdown.
+		 * The next WAL will be created on first write.
 		 */
 	}
 	vclock_copy(msg->vclock, &writer->vclock);
@@ -712,6 +749,15 @@ wal_thread_f(va_list ap)
 	if (xlog_is_open(&writer->current_wal))
 		xlog_close(&writer->current_wal, false);
 
+	/*
+	 * Create a new empty WAL on shutdown so that we don't have
+	 * to rescan the last WAL to find the instance vclock.
+	 */
+	if (writer->wal_mode != WAL_NONE &&
+	    xdir_create_xlog(&writer->wal_dir, &writer->current_wal,
+			     &writer->vclock) == 0)
+		xlog_close(&writer->current_wal, false);
+
 	if (xlog_is_open(&vy_log_writer.xlog))
 		xlog_close(&vy_log_writer.xlog, false);
 
diff --git a/test/replication/hot_standby.result b/test/replication/hot_standby.result
index 66ede5b7..24be0a94 100644
--- a/test/replication/hot_standby.result
+++ b/test/replication/hot_standby.result
@@ -284,27 +284,27 @@ _select(11, 20)
   - [19, 'the tuple 19']
   - [20, 'the tuple 20']
 ...
-test_run:cmd("deploy server default")
+test_run:cmd("stop server hot_standby")
 ---
 - true
 ...
-test_run:cmd("start server default")
+test_run:cmd("cleanup server hot_standby")
 ---
 - true
 ...
-test_run:cmd("switch default")
+test_run:cmd("deploy server default")
 ---
 - true
 ...
-test_run:cmd("stop server hot_standby")
+test_run:cmd("start server default")
 ---
 - true
 ...
-test_run:cmd("stop server replica")
+test_run:cmd("switch default")
 ---
 - true
 ...
-test_run:cmd("cleanup server hot_standby")
+test_run:cmd("stop server replica")
 ---
 - true
 ...
diff --git a/test/replication/hot_standby.test.lua b/test/replication/hot_standby.test.lua
index 8a7c837e..adb3fb6f 100644
--- a/test/replication/hot_standby.test.lua
+++ b/test/replication/hot_standby.test.lua
@@ -109,10 +109,10 @@ test_run:cmd("switch replica")
 _wait_lsn(10)
 _select(11, 20)
 
+test_run:cmd("stop server hot_standby")
+test_run:cmd("cleanup server hot_standby")
 test_run:cmd("deploy server default")
 test_run:cmd("start server default")
 test_run:cmd("switch default")
-test_run:cmd("stop server hot_standby")
 test_run:cmd("stop server replica")
-test_run:cmd("cleanup server hot_standby")
 test_run:cmd("cleanup server replica")
diff --git a/test/xlog-py/dup_key.result b/test/xlog-py/dup_key.result
index 53ae7322..f387e8e8 100644
--- a/test/xlog-py/dup_key.result
+++ b/test/xlog-py/dup_key.result
@@ -4,6 +4,10 @@ space = box.schema.space.create('test')
 index = box.space.test:create_index('primary')
 ---
 ...
+box.snapshot()
+---
+- ok
+...
 box.space.test:insert{1, 'first tuple'}
 ---
 - [1, 'first tuple']
@@ -13,20 +17,6 @@ box.space.test:insert{2, 'second tuple'}
 - [2, 'second tuple']
 ...
 .xlog exists
-space = box.schema.space.create('test')
----
-...
-index = box.space.test:create_index('primary')
----
-...
-box.space.test:insert{1, 'first tuple'}
----
-- [1, 'first tuple']
-...
-box.space.test:delete{1}
----
-- [1, 'first tuple']
-...
 box.space.test:insert{1, 'third tuple'}
 ---
 - [1, 'third tuple']
@@ -35,7 +25,7 @@ box.space.test:insert{2, 'fourth tuple'}
 ---
 - [2, 'fourth tuple']
 ...
-.xlog exists
+.xlog does not exist
 check log line for 'Duplicate key'
 
 'Duplicate key' exists in server log
diff --git a/test/xlog-py/dup_key.test.py b/test/xlog-py/dup_key.test.py
index 058d9e3f..1c033da4 100644
--- a/test/xlog-py/dup_key.test.py
+++ b/test/xlog-py/dup_key.test.py
@@ -8,6 +8,11 @@ import yaml
 
 server.stop()
 server.deploy()
+
+server.admin("space = box.schema.space.create('test')")
+server.admin("index = box.space.test:create_index('primary')")
+server.admin("box.snapshot()")
+
 lsn = int(yaml.load(server.admin("box.info.lsn", silent=True))[0])
 filename = str(lsn).zfill(20) + ".xlog"
 vardir = os.path.join(server.vardir, server.name)
@@ -15,40 +20,26 @@ wal_old = os.path.join(vardir, "old_" + filename)
 wal = os.path.join(vardir, filename)
 
 # Create wal#1
-server.admin("space = box.schema.space.create('test')")
-server.admin("index = box.space.test:create_index('primary')")
 server.admin("box.space.test:insert{1, 'first tuple'}")
 server.admin("box.space.test:insert{2, 'second tuple'}")
 server.stop()
 
-# Save wal #1
+# Save wal#1
 if os.access(wal, os.F_OK):
     print ".xlog exists"
     os.rename(wal, wal_old)
 
-lsn += 4
-
-# Create another wal#1
-server.start()
-server.admin("space = box.schema.space.create('test')")
-server.admin("index = box.space.test:create_index('primary')")
-server.admin("box.space.test:insert{1, 'first tuple'}")
-server.admin("box.space.test:delete{1}")
-server.stop()
-
-# Create wal#2
+# Write wal#2
 server.start()
 server.admin("box.space.test:insert{1, 'third tuple'}")
 server.admin("box.space.test:insert{2, 'fourth tuple'}")
 server.stop()
 
-if os.access(wal, os.F_OK):
-    print ".xlog exists"
-    # Replace wal#1 with saved copy
-    os.unlink(wal)
+# Restore wal#1
+if not os.access(wal, os.F_OK):
+    print ".xlog does not exist"
     os.rename(wal_old, wal)
 
-
 server.start()
 line = 'Duplicate key'
 print "check log line for '%s'" % line
diff --git a/test/xlog/panic_on_lsn_gap.result b/test/xlog/panic_on_lsn_gap.result
index 731eec4e..d0978e40 100644
--- a/test/xlog/panic_on_lsn_gap.result
+++ b/test/xlog/panic_on_lsn_gap.result
@@ -188,6 +188,7 @@ require('fio').glob(name .. "/*.xlog")
 ---
 - - panic/00000000000000000000.xlog
   - panic/00000000000000000001.xlog
+  - panic/00000000000000000002.xlog
 ...
 -- now insert 10 rows - so that the next
 -- row will need to switch the WAL
diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
index 267b5340..345534ba 100644
--- a/test/xlog/panic_on_wal_error.result
+++ b/test/xlog/panic_on_wal_error.result
@@ -5,28 +5,7 @@ env = require('test_run')
 test_run = env.new()
 ---
 ...
-fio = require('fio')
----
-...
-glob = fio.pathjoin(box.cfg.wal_dir, '*.xlog')
----
-...
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
----
-...
-glob = fio.pathjoin(box.cfg.vinyl_dir, '*.vylog')
----
-...
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
----
-...
-glob = fio.pathjoin(box.cfg.memtx_dir, '*.snap')
----
-...
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
----
-...
-test_run:cmd("restart server default")
+test_run:cmd("restart server default with cleanup=True")
 box.schema.user.grant('guest', 'replication')
 ---
 ...
diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
index 4f598e33..29410cb2 100644
--- a/test/xlog/panic_on_wal_error.test.lua
+++ b/test/xlog/panic_on_wal_error.test.lua
@@ -2,14 +2,7 @@
 env = require('test_run')
 test_run = env.new()
 
-fio = require('fio')
-glob = fio.pathjoin(box.cfg.wal_dir, '*.xlog')
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
-glob = fio.pathjoin(box.cfg.vinyl_dir, '*.vylog')
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
-glob = fio.pathjoin(box.cfg.memtx_dir, '*.snap')
-for _, file in pairs(fio.glob(glob)) do fio.unlink(file) end
-test_run:cmd("restart server default")
+test_run:cmd("restart server default with cleanup=True")
 box.schema.user.grant('guest', 'replication')
 _ = box.schema.space.create('test')
 _ = box.space.test:create_index('pk')
-- 
2.11.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing
  2018-06-15 15:48           ` [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing Vladimir Davydov
@ 2018-06-27 17:09             ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-27 17:09 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/15 23:28]:
> When reopening an existing xlog file (as in the case of vylog), we do not
> erase the eof marker immediately. Instead we reposition the file offset
> to (file_size - sizeof eof_marker), assuming the eof marker will be
> overwritten on the first write.
>
> However, this isn't enough if we want to reuse this function for reopening
> WAL files, because when scanning the WAL directory we close a file once we
> read the eof marker and never reopen it again, see recover_remaining_wals().
> So to avoid skipping rows written to a once-closed WAL, we have to erase
> the eof marker when reopening an xlog file. Let's do it with ftruncate().
> ---
>  src/box/xlog.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)

Looks good, will push shortly.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] wal: rollback vclock on write failure
  2018-06-15 15:48           ` [PATCH 2/3] wal: rollback vclock on write failure Vladimir Davydov
@ 2018-06-27 17:22             ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-27 17:22 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/15 23:28]:
> In order to determine whether we need to rebootstrap the instance on
> startup, we need to know its vclock. To find it out, we scan the last
> xlog file before proceeding to local recovery, but this means that when
> rebootstrap is not required we scan the last xlog twice, which is
> sub-optimal. To avoid the double scan, we can create a new empty xlog
> before shutting down the server and reopen it after restart. However,
> since we promote the WAL writer vclock even if an xlog write fails,
> there will be an LSN gap between the last xlog and the one created on
> shutdown if we failed to write the last few records. To avoid that,
> let's roll back the WAL writer vclock if a write fails. BTW this will
> make it consistent with the replicaset vclock - see commit 3c4bac715960a
> ("Follow vclock only for success wal writes").

Please add the previous xlog's vclock to the next xlog header and use
it to detect gaps/missing xlogs in recover_remaining_wals().

If the necessary header key is missing, simply ignore the gaps.

Ignore LSN gaps: they are harmless, and using LSN gaps as a marker of
xlog directory corruption was a bad idea from the start.

Logical markers should not be used to verify the consistency of the
physical layer.
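
A possible shape of that check, as a sketch with made-up types and field
names (the suggestion is to store the previous xlog's vclock under a new
key in the xlog header and compare it during the directory scan):

#include <stdbool.h>
#include <string.h>

enum { VCLOCK_MAX_SKETCH = 32 };

/* Simplified stand-in for struct vclock, for illustration only. */
struct vclock_sketch {
	long lsn[VCLOCK_MAX_SKETCH];
};

/* Hypothetical per-file metadata; prev_vclock would be a new header key. */
struct xlog_meta_sketch {
	struct vclock_sketch vclock;      /* vclock the xlog starts from */
	struct vclock_sketch prev_vclock; /* vclock of the preceding xlog */
	bool has_prev_vclock;             /* absent in old files */
};

/*
 * Returns 0 if 'next' directly follows 'prev', -1 if some xlog in
 * between is missing. Old files without the new header key are not
 * checked, as suggested above.
 */
static int
check_wal_continuity(const struct xlog_meta_sketch *prev,
		     const struct xlog_meta_sketch *next)
{
	if (!next->has_prev_vclock)
		return 0;
	if (memcmp(&next->prev_vclock, &prev->vclock,
		   sizeof(prev->vclock)) != 0)
		return -1;
	return 0;
}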



-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] wal: create empty xlog on shutdown
  2018-06-15 15:48           ` [PATCH 3/3] wal: create empty xlog on shutdown Vladimir Davydov
@ 2018-06-27 17:29             ` Konstantin Osipov
  0 siblings, 0 replies; 34+ messages in thread
From: Konstantin Osipov @ 2018-06-27 17:29 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/06/15 23:28]:
>  recovery_finalize(struct recovery *r)
>  {
>  	recovery_close_log(r);
> -
> -	/*
> -	 * Check if next xlog exists. If it's true this xlog is
> -	 * corrupted and we should rename it (to avoid getting
> -	 * problem on the next xlog write with the same name).
> -	 * Possible reasons are:
> -	 *  - last xlog has corrupted rows
> -	 *  - last xlog has corrupted header
> -	 *  - last xlog has zero size
> -	 */
> -	char *name = xdir_format_filename(&r->wal_dir,
> -					  vclock_sum(&r->vclock),
> -					  NONE);
> -	if (access(name, F_OK) == 0) {
> -		say_info("rename corrupted xlog %s", name);
> -		char to[PATH_MAX];
> -		snprintf(to, sizeof(to), "%s.corrupted", name);
> -		if (rename(name, to) != 0) {
> -			tnt_raise(SystemError,
> -				  "%s: can't rename corrupted xlog",
> -				  name);
> -		}
> -	}

I agree this hunk should be moved to wal.c, but I don't understand
why it has to be done in the WAL thread.

Please make it obvious by leaving only the necessary parts in
wal_init_f and adding a comment.

> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -310,6 +310,39 @@ wal_thread_start()
>  	cpipe_set_max_input(&wal_thread.wal_pipe, IOV_MAX);
>  }
>  
> +static int
> +wal_init_f(struct cbus_call_msg *msg)

Please add a formal comment for this function.

>  /**
>   * Initialize WAL writer.
>   *
> @@ -332,6 +365,11 @@ wal_init(enum wal_mode wal_mode, const char *wal_dirname,
>  	if (xdir_scan(&writer->wal_dir))
>  		return -1;
>  
> +	struct cbus_call_msg msg;
> +	if (cbus_call(&wal_thread.wal_pipe, &wal_thread.tx_pipe, &msg,
> +		      wal_init_f, NULL, TIMEOUT_INFINITY) != 0)
> +		return -1;
> +

> +	/*
> +	 * Create a new empty WAL on shutdown so that we don't have
> +	 * to rescan the last WAL to find the instance vclock.
> +	 */
> +	if (writer->wal_mode != WAL_NONE &&
> +	    xdir_create_xlog(&writer->wal_dir, &writer->current_wal,
> +			     &writer->vclock) == 0)
> +		xlog_close(&writer->current_wal, false);

If there is an existing empty xlog file, this function fails to
create a new one and returns an error, which you ignore.

Please handle this case gracefully and avoid trying to create a new
xlog file when it is not necessary.
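
One way to handle it, sketched with plain POSIX calls and a hypothetical
creation callback (not the actual wal.c code):

#include <unistd.h>

/*
 * Create the shutdown xlog only if it does not exist yet. 'path' would
 * be derived from the writer's current vclock; 'create_empty_xlog'
 * stands in for the real creation routine.
 */
static int
maybe_create_shutdown_xlog(const char *path,
			   int (*create_empty_xlog)(const char *path))
{
	if (access(path, F_OK) == 0)
		return 0; /* already created, e.g. by a previous shutdown */
	return create_empty_xlog(path);
}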

>  

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread

Thread overview: 34+ messages
2018-06-08 17:34 [PATCH v2 00/11] Replica rejoin Vladimir Davydov
2018-06-08 17:34 ` [PATCH v2 01/11] box: retrieve instance uuid before starting local recovery Vladimir Davydov
2018-06-08 17:51   ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 02/11] box: refactor hot standby recovery Vladimir Davydov
2018-06-08 17:34 ` [PATCH v2 03/11] box: retrieve end vclock before starting local recovery Vladimir Davydov
2018-06-14 12:58   ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 04/11] box: open the port " Vladimir Davydov
2018-06-13 20:43   ` Konstantin Osipov
2018-06-14  8:31     ` Vladimir Davydov
2018-06-14 12:59       ` Konstantin Osipov
2018-06-15 15:48         ` [PATCH 0/3] Speed up recovery in case rebootstrap is not needed Vladimir Davydov
2018-06-15 15:48           ` [PATCH 1/3] xlog: erase eof marker when reopening existing file for writing Vladimir Davydov
2018-06-27 17:09             ` Konstantin Osipov
2018-06-15 15:48           ` [PATCH 2/3] wal: rollback vclock on write failure Vladimir Davydov
2018-06-27 17:22             ` Konstantin Osipov
2018-06-15 15:48           ` [PATCH 3/3] wal: create empty xlog on shutdown Vladimir Davydov
2018-06-27 17:29             ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 05/11] box: connect to remote peers before starting local recovery Vladimir Davydov
2018-06-13 20:45   ` Konstantin Osipov
2018-06-14  8:34     ` Vladimir Davydov
2018-06-14 12:59       ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 06/11] box: factor out local recovery function Vladimir Davydov
2018-06-13 20:50   ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 07/11] applier: inquire oldest vclock on connect Vladimir Davydov
2018-06-13 20:51   ` Konstantin Osipov
2018-06-14  8:40     ` Vladimir Davydov
2018-06-08 17:34 ` [PATCH v2 08/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
2018-06-13 20:55   ` Konstantin Osipov
2018-06-14  8:58     ` Vladimir Davydov
2018-06-08 17:34 ` [PATCH v2 09/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
2018-06-08 17:34 ` [PATCH v2 10/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
2018-06-13 20:56   ` Konstantin Osipov
2018-06-08 17:34 ` [PATCH v2 11/11] vinyl: implement rebootstrap support Vladimir Davydov
2018-06-10 12:02   ` Vladimir Davydov
