[PATCH v3 00/11] Replica rejoin

Tarantool development patches archive
 help / color / mirror / Atom feed

* [PATCH v3 00/11] Replica rejoin
@ 2018-07-14 20:49 Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 01/11] recovery: clean up WAL dir scan code Vladimir Davydov
                   ` (10 more replies)
  0 siblings, 11 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

After this patch set is applied, an instance will try to detect if
it fell too much behind its peers in the cluster and so needs to be
rebootstrapped. If it does, it will skip local recovery and instead
proceed to bootstrap from a remote master. Old files (xlog, snap)
are not deleted during rebootstrap. They will be removed by gc as
usual.

https://github.com/tarantool/tarantool/issues/461
https://github.com/tarantool/tarantool/commits/dv/gh-461-replica-rejoin

Changes in v3:
 - Remove merged patches, add some new ones.
 - Rebase on top of the latest 1.10: this required patching gc to make
   it track vclocks instead of signatures so that it could report the
   vclock of the oldest xlog stored on the instance.
 - Follow-up on the recently committed patch for recovery subsystem: add
   some comments and remove double scanning of the WAL directory.
 - Introduce a new IPROTO command, IPROTO_REQUEST_STATUS, to be used
   instead of IPROTO_REQUEST_VOTE; send a map in reply to this command.
   Rationale: a map is more flexible and can be extended. In particular,
   we can use the very same message for inquiring the oldest vclock
   stored on the master to detect if a replica needs to be rejoined,
   instead of introducing a new IPROTO command, as we did in v2.
 - Do NOT rebootstrap a replica if it has some data that is absent on
   the master. Rationale: we don't want to lose ANY data by rejoining a
   replica; besides, if a replica's vclock is incomparable with the
   master's, xdir_scan may break.

v2: https://www.freelists.org/post/tarantool-patches/PATCH-v2-0011-Replica-rejoin

Changes in v2:
 - Implement rebootstrap support for vinyl engine.
 - Call recover_remaining_wals() explicitly after recovery_stop_local()
   as suggested by @kostja.
 - Add comment to memtx_engine_new() explaining why we need to init
   INSTANCE_UUID before proceeding to local recovery.

v1: https://www.freelists.org/post/tarantool-patches/RFC-PATCH-0012-Replica-rejoin

Vladimir Davydov (11):
  recovery: clean up WAL dir scan code
  xrow: factor out function for decoding vclock
  Introduce IPROTO_REQUEST_STATUS command
  Get rid of IPROTO_SERVER_IS_RO
  gc: keep track of vclocks instead of signatures
  Include oldest vclock available on the instance in IPROTO_STATUS
  replication: rebootstrap instance on startup if it fell behind
  vinyl: simplify vylog recovery from backup
  vinyl: pass flags to vy_recovery_new
  Update test-run
  vinyl: implement rebootstrap support

 src/box/applier.cc                       |   6 +-
 src/box/applier.h                        |   8 +-
 src/box/box.cc                           |  26 +++-
 src/box/box.h                            |   3 +
 src/box/gc.c                             |  89 ++++++-----
 src/box/gc.h                             |  23 +--
 src/box/iproto.cc                        |  16 +-
 src/box/iproto_constants.c               |   4 +-
 src/box/iproto_constants.h               |  15 +-
 src/box/lua/info.c                       |   4 +-
 src/box/recovery.cc                      |   2 +-
 src/box/recovery.h                       |   7 +-
 src/box/relay.cc                         |  21 +--
 src/box/replication.cc                   |  36 ++++-
 src/box/replication.h                    |   9 ++
 src/box/vinyl.c                          |   8 +-
 src/box/vy_log.c                         | 207 ++++++++++++++++++-------
 src/box/vy_log.h                         |  50 +++++-
 src/box/wal.c                            |   9 ++
 src/box/xrow.c                           | 179 +++++++++++++++------
 src/box/xrow.h                           | 106 +++++++------
 src/errinj.h                             |   1 +
 test-run                                 |   2 +-
 test/box/errinj.result                   |   6 +-
 test/replication/replica_rejoin.result   | 250 ++++++++++++++++++++++++++++++
 test/replication/replica_rejoin.test.lua |  91 +++++++++++
 test/vinyl/replica_rejoin.lua            |  13 ++
 test/vinyl/replica_rejoin.result         | 257 +++++++++++++++++++++++++++++++
 test/vinyl/replica_rejoin.test.lua       |  88 +++++++++++
 test/vinyl/suite.ini                     |   2 +-
 30 files changed, 1293 insertions(+), 245 deletions(-)
 create mode 100644 test/replication/replica_rejoin.result
 create mode 100644 test/replication/replica_rejoin.test.lua
 create mode 100644 test/vinyl/replica_rejoin.lua
 create mode 100644 test/vinyl/replica_rejoin.result
 create mode 100644 test/vinyl/replica_rejoin.test.lua

-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 01/11] recovery: clean up WAL dir scan code
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:08   ` Konstantin Osipov
  2018-07-14 20:49 ` [PATCH v3 02/11] xrow: factor out function for decoding vclock Vladimir Davydov
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

 - Remove extra scan of the WAL directory from local_recovery() - we
   scan the directory in recovery_end_vclock() hence we can skip scan in
   recover_remaining_wals() by passing scan_dir = false.

 - Rename recovery_end_vclock() to recovery_scan() to emphasize the fact
   that this function scans the WAL directory. Write a comment to this
   function.

 - Add comments to wal.c explaining why we scan the WAL directory there.

Follow-up 0695fbbb96aa ("box: retrieve end vclock before starting local
recovery").
---
 src/box/box.cc      | 4 ++--
 src/box/recovery.cc | 2 +-
 src/box/recovery.h  | 7 ++++++-
 src/box/wal.c       | 9 +++++++++
 4 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index e15d121f..7fc15f33 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1819,7 +1819,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	 * so we must reflect this in replicaset vclock to
 	 * not attempt to apply these rows twice.
 	 */
-	recovery_end_vclock(recovery, &replicaset.vclock);
+	recovery_scan(recovery, &replicaset.vclock);
 
 	if (wal_dir_lock >= 0) {
 		box_listen();
@@ -1855,7 +1855,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	memtx_engine_recover_snapshot_xc(memtx, checkpoint_vclock);
 
 	engine_begin_final_recovery_xc();
-	recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
+	recover_remaining_wals(recovery, &wal_stream.base, NULL, false);
 	/*
 	 * Leave hot standby mode, if any, only after
 	 * acquiring the lock.
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 7267b345..fe14defe 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -117,7 +117,7 @@ recovery_new(const char *wal_dirname, bool force_recovery,
 }
 
 void
-recovery_end_vclock(struct recovery *r, struct vclock *end_vclock)
+recovery_scan(struct recovery *r, struct vclock *end_vclock)
 {
 	xdir_scan_xc(&r->wal_dir);
 
diff --git a/src/box/recovery.h b/src/box/recovery.h
index 1ae6f2c3..5882d969 100644
--- a/src/box/recovery.h
+++ b/src/box/recovery.h
@@ -68,8 +68,13 @@ recovery_new(const char *wal_dirname, bool force_recovery,
 void
 recovery_delete(struct recovery *r);
 
+/**
+ * Scan the WAL directory, build an index of all found
+ * WAL files, then scan the most recent WAL file to find
+ * the vclock of the last record (returned in @end_vclock).
+ */
 void
-recovery_end_vclock(struct recovery *r, struct vclock *end_vclock);
+recovery_scan(struct recovery *r, struct vclock *end_vclock);
 
 void
 recovery_follow_local(struct recovery *r, struct xstream *stream,
diff --git a/src/box/wal.c b/src/box/wal.c
index 19c9138e..41762a59 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -400,6 +400,11 @@ wal_init(enum wal_mode wal_mode, const char *wal_dirname,
 	wal_writer_create(writer, wal_mode, wal_dirname, instance_uuid,
 			  vclock, wal_max_rows, wal_max_size);
 
+	/*
+	 * Scan the WAL directory to build an index of all
+	 * existing WAL files. Required for garbage collection,
+	 * see wal_collect_garbage().
+	 */
 	if (xdir_scan(&writer->wal_dir))
 		return -1;
 
@@ -589,6 +594,10 @@ wal_opt_rotate(struct wal_writer *writer)
 		free(vclock);
 		return -1;
 	}
+	/*
+	 * Keep track of the new WAL vclock. Required for garbage
+	 * collection, see wal_collect_garbage().
+	 */
 	xdir_add_vclock(&writer->wal_dir, vclock);
 
 	wal_notify_watchers(writer, WAL_EVENT_ROTATE);
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 02/11] xrow: factor out function for decoding vclock
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 01/11] recovery: clean up WAL dir scan code Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:08   ` Konstantin Osipov
  2018-07-14 20:49 ` [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command Vladimir Davydov
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

We will need it in other places.
---
 src/box/xrow.c | 45 +++++++++++++++++++++------------------------
 1 file changed, 21 insertions(+), 24 deletions(-)

diff --git a/src/box/xrow.c b/src/box/xrow.c
index 11316906..56197d0e 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -64,6 +64,26 @@ mp_encode_vclock(char *data, const struct vclock *vclock)
 	return data;
 }
 
+static int
+mp_decode_vclock(const char **data, struct vclock *vclock)
+{
+	vclock_create(vclock);
+	if (mp_typeof(**data) != MP_MAP)
+		return -1;
+	uint32_t size = mp_decode_map(data);
+	for (uint32_t i = 0; i < size; i++) {
+		if (mp_typeof(**data) != MP_UINT)
+			return -1;
+		uint32_t id = mp_decode_uint(data);
+		if (mp_typeof(**data) != MP_UINT)
+			return -1;
+		int64_t lsn = mp_decode_uint(data);
+		if (lsn > 0)
+			vclock_follow(vclock, id, lsn);
+	}
+	return 0;
+}
+
 int
 xrow_header_decode(struct xrow_header *header, const char **pos,
 		   const char *end)
@@ -885,7 +905,6 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 	/* For backward compatibility initialize read-only with false. */
 	if (read_only)
 		*read_only = false;
-	const char *lsnmap = NULL;
 	d = data;
 	uint32_t map_size = mp_decode_map(&d);
 	for (uint32_t i = 0; i < map_size; i++) {
@@ -911,13 +930,11 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		case IPROTO_VCLOCK:
 			if (vclock == NULL)
 				goto skip;
-			if (mp_typeof(*d) != MP_MAP) {
+			if (mp_decode_vclock(&d, vclock) != 0) {
 				diag_set(ClientError, ER_INVALID_MSGPACK,
 					 "invalid VCLOCK");
 				return -1;
 			}
-			lsnmap = d;
-			mp_next(&d);
 			break;
 		case IPROTO_SERVER_VERSION:
 			if (version_id == NULL)
@@ -943,26 +960,6 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 			mp_next(&d); /* value */
 		}
 	}
-
-	if (lsnmap == NULL)
-		return 0;
-
-	/* Check & save LSNMAP */
-	d = lsnmap;
-	uint32_t lsnmap_size = mp_decode_map(&d);
-	for (uint32_t i = 0; i < lsnmap_size; i++) {
-		if (mp_typeof(*d) != MP_UINT) {
-		map_error:
-			diag_set(ClientError, ER_INVALID_MSGPACK, "VCLOCK");
-			return -1;
-		}
-		uint32_t id = mp_decode_uint(&d);
-		if (mp_typeof(*d) != MP_UINT)
-			goto map_error;
-		int64_t lsn = (int64_t) mp_decode_uint(&d);
-		if (lsn > 0)
-			vclock_follow(vclock, id, lsn);
-	}
 	return 0;
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 01/11] recovery: clean up WAL dir scan code Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 02/11] xrow: factor out function for decoding vclock Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:10   ` Konstantin Osipov
  2018-07-21 10:25   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO Vladimir Davydov
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

The new command is supposed to supersede IPROTO_REQUEST_VOTE, which is
difficult to extend, because it uses the global iproto key namespace.
The new command returns a map (IPROTO_STATUS), to which we can add
various information without polluting the global namespace. Currently,
the map contains IPROTO_STATUS_IS_RO and IPROTO_STATUS_VCLOCK keys,
but soon it will be added info needed for replica rebootstrap feature.

Needed for #461
---
 src/box/applier.cc         |  6 +--
 src/box/applier.h          |  8 ++--
 src/box/box.cc             |  7 ++++
 src/box/box.h              |  3 ++
 src/box/iproto.cc          |  7 ++++
 src/box/iproto_constants.c |  3 +-
 src/box/iproto_constants.h | 13 +++++-
 src/box/replication.cc     |  9 +++--
 src/box/xrow.c             | 98 +++++++++++++++++++++++++++++++++++++++++++++-
 src/box/xrow.h             | 80 +++++++++++++++++++++++--------------
 10 files changed, 188 insertions(+), 46 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 556502bf..ad2710a3 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -218,14 +218,12 @@ applier_connect(struct applier *applier)
 	 * It will be used for leader election on bootstrap.
 	 */
 	if (applier->version_id >= version_id(1, 7, 7)) {
-		xrow_encode_request_vote(&row);
+		xrow_encode_status_request(&row);
 		coio_write_xrow(coio, &row);
 		coio_read_xrow(coio, ibuf, &row);
 		if (row.type != IPROTO_OK)
 			xrow_decode_error_xc(&row);
-		vclock_create(&applier->vclock);
-		xrow_decode_request_vote_xc(&row, &applier->vclock,
-					    &applier->remote_is_ro);
+		xrow_decode_status_xc(&row, &applier->remote_status);
 	}
 
 	applier_set_state(applier, APPLIER_CONNECTED);
diff --git a/src/box/applier.h b/src/box/applier.h
index c33562cc..29b4e5af 100644
--- a/src/box/applier.h
+++ b/src/box/applier.h
@@ -43,7 +43,7 @@
 #include "tt_uuid.h"
 #include "uri.h"
 
-#include "vclock.h"
+#include "xrow.h"
 
 struct xstream;
 
@@ -94,10 +94,8 @@ struct applier {
 	struct uri uri;
 	/** Remote version encoded as a number, see version_id() macro */
 	uint32_t version_id;
-	/** Remote vclock at time of connect. */
-	struct vclock vclock;
-	/** Remote peer mode, true if read-only, default: false */
-	bool remote_is_ro;
+	/** Remote status at time of connect. */
+	struct status remote_status;
 	/** Remote address */
 	union {
 		struct sockaddr addr;
diff --git a/src/box/box.cc b/src/box/box.cc
index 7fc15f33..200e49a1 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1564,6 +1564,13 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 			replica_version_id);
 }
 
+void
+box_process_status_request(struct status *status)
+{
+	status->is_ro = cfg_geti("read_only") != 0;
+	vclock_copy(&status->vclock, &replicaset.vclock);
+}
+
 /** Insert a new cluster into _schema */
 static void
 box_set_replicaset_uuid(const struct tt_uuid *replicaset_uuid)
diff --git a/src/box/box.h b/src/box/box.h
index 182e1b72..8c38b416 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -163,6 +163,9 @@ box_process_join(struct ev_io *io, struct xrow_header *header);
 void
 box_process_subscribe(struct ev_io *io, struct xrow_header *header);
 
+void
+box_process_status_request(struct status *status);
+
 /**
  * Check Lua configuration before initialization or
  * in case of a configuration change.
diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index cba81a22..17f161a3 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -1159,6 +1159,7 @@ iproto_msg_decode(struct iproto_msg *msg, const char **pos, const char *reqend,
 		*stop_input = true;
 		break;
 	case IPROTO_REQUEST_VOTE:
+	case IPROTO_REQUEST_STATUS:
 		cmsg_init(&msg->base, misc_route);
 		break;
 	case IPROTO_AUTH:
@@ -1526,6 +1527,7 @@ tx_process_misc(struct cmsg *m)
 		goto error;
 
 	try {
+		struct status status;
 		switch (msg->header.type) {
 		case IPROTO_AUTH:
 			box_process_auth(&msg->auth, con->salt);
@@ -1542,6 +1544,11 @@ tx_process_misc(struct cmsg *m)
 						     &replicaset.vclock,
 						     cfg_geti("read_only"));
 			break;
+		case IPROTO_REQUEST_STATUS:
+			box_process_status_request(&status);
+			iproto_reply_status_xc(out, &status, msg->header.sync,
+					       ::schema_version);
+			break;
 		default:
 			unreachable();
 		}
diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 3bc965bd..bc7dfd7d 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -87,6 +87,7 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x27 */	MP_STR, /* IPROTO_EXPR */
 	/* 0x28 */	MP_ARRAY, /* IPROTO_OPS */
 	/* 0x29 */	MP_BOOL, /* IPROTO_SERVER_IS_RO */
+	/* 0x2a */	MP_MAP, /* IPROTO_STATUS */
 	/* }}} */
 };
 
@@ -168,7 +169,7 @@ const char *iproto_key_strs[IPROTO_KEY_MAX] = {
 	"expression",       /* 0x27 */
 	"operations",       /* 0x28 */
 	"server is ro",     /* 0x29 */
-	NULL,               /* 0x2a */
+	"status",           /* 0x2a */
 	NULL,               /* 0x2b */
 	NULL,               /* 0x2c */
 	NULL,               /* 0x2d */
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index ccbf2da5..ce4366ec 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -78,12 +78,18 @@ enum iproto_key {
 	IPROTO_EXPR = 0x27, /* EVAL */
 	IPROTO_OPS = 0x28, /* UPSERT but not UPDATE ops, because of legacy */
 	IPROTO_SERVER_IS_RO = 0x29,
+	IPROTO_STATUS = 0x2a,
 	/* Leave a gap between request keys and response keys */
 	IPROTO_DATA = 0x30,
 	IPROTO_ERROR = 0x31,
 	IPROTO_KEY_MAX
 };
 
+enum iproto_status_key {
+	IPROTO_STATUS_IS_RO = 0x01,
+	IPROTO_STATUS_VCLOCK = 0x02,
+};
+
 #define bit(c) (1ULL<<IPROTO_##c)
 
 #define IPROTO_HEAD_BMAP (bit(REQUEST_TYPE) | bit(SYNC) | bit(REPLICA_ID) |\
@@ -155,8 +161,13 @@ enum iproto_type {
 	IPROTO_JOIN = 65,
 	/** Replication SUBSCRIBE command */
 	IPROTO_SUBSCRIBE = 66,
-	/** Vote request command for master election */
+	/**
+	 * Vote request command for master election
+	 * DEPRECATED: use IPROTO_REQUEST_STATUS instead
+	 */
 	IPROTO_REQUEST_VOTE = 67,
+	/** Instance status request command */
+	IPROTO_REQUEST_STATUS = 68,
 
 	/** Vinyl run info stored in .index file */
 	VY_INDEX_RUN_INFO = 100,
diff --git a/src/box/replication.cc b/src/box/replication.cc
index c1e17698..f12244c9 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -732,7 +732,8 @@ replicaset_round(bool skip_ro)
 {
 	struct replica *leader = NULL;
 	replicaset_foreach(replica) {
-		if (replica->applier == NULL)
+		struct applier *applier = replica->applier;
+		if (applier == NULL)
 			continue;
 		/**
 		 * While bootstrapping a new cluster, read-only
@@ -741,7 +742,7 @@ replicaset_round(bool skip_ro)
 		 * replicas since there is still a possibility
 		 * that all replicas exist in cluster table.
 		 */
-		if (skip_ro && replica->applier->remote_is_ro)
+		if (skip_ro && applier->remote_status.is_ro)
 			continue;
 		if (leader == NULL) {
 			leader = replica;
@@ -753,8 +754,8 @@ replicaset_round(bool skip_ro)
 		 * with the same vclock, prefer the one with
 		 * the lowest uuid.
 		 */
-		int cmp = vclock_compare(&replica->applier->vclock,
-					 &leader->applier->vclock);
+		int cmp = vclock_compare(&applier->remote_status.vclock,
+				&leader->applier->remote_status.vclock);
 		if (cmp < 0)
 			continue;
 		if (cmp == 0 && tt_uuid_compare(&replica->uuid,
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 56197d0e..4bc1f81e 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -344,6 +344,41 @@ iproto_reply_request_vote(struct obuf *out, uint64_t sync,
 }
 
 int
+iproto_reply_status(struct obuf *out, const struct status *status,
+		    uint64_t sync, uint32_t schema_version)
+{
+	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(2) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(status->is_ro) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&status->vclock);
+
+	char *buf = obuf_reserve(out, max_size);
+	if (buf == NULL) {
+		diag_set(OutOfMemory, max_size,
+			 "obuf_alloc", "buf");
+		return -1;
+	}
+
+	char *data = buf + IPROTO_HEADER_LEN;
+	data = mp_encode_map(data, 1);
+	data = mp_encode_uint(data, IPROTO_STATUS);
+	data = mp_encode_map(data, 2);
+	data = mp_encode_uint(data, IPROTO_STATUS_IS_RO);
+	data = mp_encode_bool(data, status->is_ro);
+	data = mp_encode_uint(data, IPROTO_STATUS_VCLOCK);
+	data = mp_encode_vclock(data, &status->vclock);
+	size_t size = data - buf;
+	assert(size <= max_size);
+
+	iproto_header_encode(buf, IPROTO_OK, sync, schema_version,
+			     size - IPROTO_HEADER_LEN);
+
+	char *ptr = obuf_alloc(out, size);
+	assert(ptr == buf);
+	return 0;
+}
+
+int
 iproto_reply_error(struct obuf *out, const struct error *e, uint64_t sync,
 		   uint32_t schema_version)
 {
@@ -847,10 +882,69 @@ error:
 }
 
 void
-xrow_encode_request_vote(struct xrow_header *row)
+xrow_encode_status_request(struct xrow_header *row)
 {
 	memset(row, 0, sizeof(*row));
-	row->type = IPROTO_REQUEST_VOTE;
+	row->type = IPROTO_REQUEST_STATUS;
+}
+
+int
+xrow_decode_status(struct xrow_header *row, struct status *status)
+{
+	status->is_ro = false;
+	vclock_create(&status->vclock);
+
+	if (row->bodycnt == 0)
+		goto err;
+	assert(row->bodycnt == 1);
+
+	const char *data = (const char *) row->body[0].iov_base;
+	const char *end = data + row->body[0].iov_len;
+	const char *tmp = data;
+	if (mp_check(&tmp, end) != 0 || mp_typeof(*data) != MP_MAP)
+		goto err;
+
+	/* Find STATUS key. */
+	uint32_t map_size = mp_decode_map(&data);
+	for (uint32_t i = 0; i < map_size; i++) {
+		if (mp_typeof(*data) != MP_UINT) {
+			mp_next(&data); /* key */
+			mp_next(&data); /* value */
+			continue;
+		}
+		if (mp_decode_uint(&data) == IPROTO_STATUS)
+			break;
+	}
+	if (data == end)
+		return 0;
+
+	/* Decode STATUS map. */
+	map_size = mp_decode_map(&data);
+	for (uint32_t i = 0; i < map_size; i++) {
+		if (mp_typeof(*data) != MP_UINT) {
+			mp_next(&data); /* key */
+			mp_next(&data); /* value */
+			continue;
+		}
+		uint32_t key = mp_decode_uint(&data);
+		switch (key) {
+		case IPROTO_STATUS_IS_RO:
+			if (mp_typeof(*data) != MP_BOOL)
+				goto err;
+			status->is_ro = mp_decode_bool(&data);
+			break;
+		case IPROTO_STATUS_VCLOCK:
+			if (mp_decode_vclock(&data, &status->vclock) != 0)
+				goto err;
+			break;
+		default:
+			mp_next(&data);
+		}
+	}
+	return 0;
+err:
+	diag_set(ClientError, ER_INVALID_MSGPACK, "packet body");
+	return -1;
 }
 
 int
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 92ea3c97..67910d52 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -30,19 +30,19 @@
  * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
+#include <stdbool.h>
 #include <stdint.h>
 #include <stddef.h>
 #include <sys/uio.h> /* struct iovec */
 
 #include "tt_uuid.h"
 #include "diag.h"
+#include "vclock.h"
 
 #if defined(__cplusplus)
 extern "C" {
 #endif
 
-struct vclock;
-
 enum {
 	XROW_HEADER_IOVMAX = 1,
 	XROW_BODY_IOVMAX = 2,
@@ -223,12 +223,28 @@ xrow_encode_auth(struct xrow_header *row, const char *salt, size_t salt_len,
 		 const char *login, size_t login_len, const char *password,
 		 size_t password_len);
 
+/** Instance status. */
+struct status {
+	/** Set if the instance is running in read-only mode. */
+	bool is_ro;
+	/** Current instance vclock. */
+	struct vclock vclock;
+};
+
 /**
- * Encode a vote request for master election.
+ * Decode STATUS from MessagePack.
+ * @param row Row to decode.
+ * @param[out] status
+ */
+int
+xrow_decode_status(struct xrow_header *row, struct status *status);
+
+/**
+ * Encode an instance status request.
  * @param row[out] Row to encode into.
  */
 void
-xrow_encode_request_vote(struct xrow_header *row);
+xrow_encode_status_request(struct xrow_header *row);
 
 /**
  * Encode SUBSCRIBE command.
@@ -315,22 +331,6 @@ xrow_decode_vclock(struct xrow_header *row, struct vclock *vclock)
 }
 
 /**
- * Decode peer vclock and access rights (a response to VOTE command).
- * @param row Row to decode.
- * @param[out] vclock.
- * @param[out] read_only.
- *
- * @retval  0 Success.
- * @retval -1 Memory or format error.
- */
-static inline int
-xrow_decode_request_vote(struct xrow_header *row, struct vclock *vclock,
-			 bool *read_only)
-{
-	return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, read_only);
-}
-
-/**
  * Encode a heartbeat message.
  * @param row[out] Row to encode into.
  * @param replica_id Instance id.
@@ -405,6 +405,20 @@ iproto_reply_request_vote(struct obuf *out, uint64_t sync,
 			 bool read_only);
 
 /**
+ * Encode a reply to an instance status request.
+ * @param out Buffer to write to.
+ * @param status Instance status to encode.
+ * @param sync Request sync.
+ * @param schema_version Actual schema version.
+ *
+ * @retval  0 Success.
+ * @retval -1 Memory error.
+ */
+int
+iproto_reply_status(struct obuf *out, const struct status *status,
+		    uint64_t sync, uint32_t schema_version);
+
+/**
  * Write an error packet int output buffer. Doesn't throw if out
  * of memory
  */
@@ -585,6 +599,14 @@ xrow_encode_auth_xc(struct xrow_header *row, const char *salt, size_t salt_len,
 		diag_raise();
 }
 
+/** @copydoc xrow_decode_status. */
+static inline void
+xrow_decode_status_xc(struct xrow_header *row, struct status *status)
+{
+	if (xrow_decode_status(row, status) != 0)
+		diag_raise();
+}
+
 /** @copydoc xrow_encode_subscribe. */
 static inline void
 xrow_encode_subscribe_xc(struct xrow_header *row,
@@ -642,15 +664,6 @@ xrow_decode_vclock_xc(struct xrow_header *row, struct vclock *vclock)
 		diag_raise();
 }
 
-/** @copydoc xrow_decode_request_vote. */
-static inline void
-xrow_decode_request_vote_xc(struct xrow_header *row, struct vclock *vclock,
-			    bool *read_only)
-{
-	if (xrow_decode_request_vote(row, vclock, read_only) != 0)
-		diag_raise();
-}
-
 /** @copydoc iproto_reply_ok. */
 static inline void
 iproto_reply_ok_xc(struct obuf *out, uint64_t sync, uint32_t schema_version)
@@ -670,6 +683,15 @@ iproto_reply_request_vote_xc(struct obuf *out, uint64_t sync,
 		diag_raise();
 }
 
+/** @copydoc iproto_reply_status. */
+static inline void
+iproto_reply_status_xc(struct obuf *out, const struct status *status,
+		       uint64_t sync, uint32_t schema_version)
+{
+	if (iproto_reply_status(out, status, sync, schema_version) != 0)
+		diag_raise();
+}
+
 #endif
 
 #endif /* TARANTOOL_XROW_H_INCLUDED */
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (2 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:10   ` Konstantin Osipov
  2018-07-21 12:07   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 05/11] gc: keep track of vclocks instead of signatures Vladimir Davydov
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Not needed anymore as we now use IPROTO_REQUEST_STATUS instead of
IPROTO_REQUEST_VOTE. Let's remove it altogether and reuse its code
for IPROTO_STATUS (they are never decoded together so no conflict
should happen). Worst that can happen is we choose a read-only master
when bootstrapping an older version of tarantool.
---
 src/box/iproto.cc          |  9 +++------
 src/box/iproto_constants.c |  7 +++----
 src/box/iproto_constants.h |  3 +--
 src/box/xrow.c             | 29 ++++++-----------------------
 src/box/xrow.h             | 28 +++++++++++-----------------
 5 files changed, 24 insertions(+), 52 deletions(-)

diff --git a/src/box/iproto.cc b/src/box/iproto.cc
index 17f161a3..6e4f0f8b 100644
--- a/src/box/iproto.cc
+++ b/src/box/iproto.cc
@@ -61,8 +61,6 @@
 #include "iproto_constants.h"
 #include "rmean.h"
 #include "errinj.h"
-#include "applier.h"
-#include "cfg.h"
 
 enum {
 	IPROTO_SALT_SIZE = 32,
@@ -1539,10 +1537,9 @@ tx_process_misc(struct cmsg *m)
 					   ::schema_version);
 			break;
 		case IPROTO_REQUEST_VOTE:
-			iproto_reply_request_vote_xc(out, msg->header.sync,
-						     ::schema_version,
-						     &replicaset.vclock,
-						     cfg_geti("read_only"));
+			iproto_reply_vclock_xc(out, &replicaset.vclock,
+					       msg->header.sync,
+					       ::schema_version);
 			break;
 		case IPROTO_REQUEST_STATUS:
 			box_process_status_request(&status);
diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index bc7dfd7d..78c83c1d 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -86,8 +86,7 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x26 */	MP_MAP, /* IPROTO_VCLOCK */
 	/* 0x27 */	MP_STR, /* IPROTO_EXPR */
 	/* 0x28 */	MP_ARRAY, /* IPROTO_OPS */
-	/* 0x29 */	MP_BOOL, /* IPROTO_SERVER_IS_RO */
-	/* 0x2a */	MP_MAP, /* IPROTO_STATUS */
+	/* 0x29 */	MP_MAP, /* IPROTO_STATUS */
 	/* }}} */
 };
 
@@ -168,8 +167,8 @@ const char *iproto_key_strs[IPROTO_KEY_MAX] = {
 	"vector clock",     /* 0x26 */
 	"expression",       /* 0x27 */
 	"operations",       /* 0x28 */
-	"server is ro",     /* 0x29 */
-	"status",           /* 0x2a */
+	"status",           /* 0x29 */
+	NULL,               /* 0x2a */
 	NULL,               /* 0x2b */
 	NULL,               /* 0x2c */
 	NULL,               /* 0x2d */
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index ce4366ec..fe452817 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -77,8 +77,7 @@ enum iproto_key {
 	IPROTO_VCLOCK = 0x26,
 	IPROTO_EXPR = 0x27, /* EVAL */
 	IPROTO_OPS = 0x28, /* UPSERT but not UPDATE ops, because of legacy */
-	IPROTO_SERVER_IS_RO = 0x29,
-	IPROTO_STATUS = 0x2a,
+	IPROTO_STATUS = 0x29,
 	/* Leave a gap between request keys and response keys */
 	IPROTO_DATA = 0x30,
 	IPROTO_ERROR = 0x31,
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 4bc1f81e..8a87fd4d 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -311,13 +311,11 @@ iproto_reply_ok(struct obuf *out, uint64_t sync, uint32_t schema_version)
 }
 
 int
-iproto_reply_request_vote(struct obuf *out, uint64_t sync,
-			  uint32_t schema_version, const struct vclock *vclock,
-			  bool read_only)
+iproto_reply_vclock(struct obuf *out, const struct vclock *vclock,
+		    uint64_t sync, uint32_t schema_version)
 {
-	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(2) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(vclock) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(true);
+	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(vclock);
 
 	char *buf = obuf_reserve(out, max_size);
 	if (buf == NULL) {
@@ -327,9 +325,7 @@ iproto_reply_request_vote(struct obuf *out, uint64_t sync,
 	}
 
 	char *data = buf + IPROTO_HEADER_LEN;
-	data = mp_encode_map(data, 2);
-	data = mp_encode_uint(data, IPROTO_SERVER_IS_RO);
-	data = mp_encode_bool(data, read_only);
+	data = mp_encode_map(data, 1);
 	data = mp_encode_uint(data, IPROTO_VCLOCK);
 	data = mp_encode_vclock(data, vclock);
 	size_t size = data - buf;
@@ -981,7 +977,7 @@ xrow_encode_subscribe(struct xrow_header *row,
 int
 xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		      struct tt_uuid *instance_uuid, struct vclock *vclock,
-		      uint32_t *version_id, bool *read_only)
+		      uint32_t *version_id)
 {
 	if (row->bodycnt == 0) {
 		diag_set(ClientError, ER_INVALID_MSGPACK, "request body");
@@ -996,9 +992,6 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		return -1;
 	}
 
-	/* For backward compatibility initialize read-only with false. */
-	if (read_only)
-		*read_only = false;
 	d = data;
 	uint32_t map_size = mp_decode_map(&d);
 	for (uint32_t i = 0; i < map_size; i++) {
@@ -1040,16 +1033,6 @@ xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 			}
 			*version_id = mp_decode_uint(&d);
 			break;
-		case IPROTO_SERVER_IS_RO:
-			if (read_only == NULL)
-				goto skip;
-			if (mp_typeof(*d) != MP_BOOL) {
-				diag_set(ClientError, ER_INVALID_MSGPACK,
-					 "invalid STATUS");
-				return -1;
-			}
-			*read_only = mp_decode_bool(&d);
-			break;
 		default: skip:
 			mp_next(&d); /* value */
 		}
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 67910d52..1ea30fb1 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -269,7 +269,6 @@ xrow_encode_subscribe(struct xrow_header *row,
  * @param[out] instance_uuid.
  * @param[out] vclock.
  * @param[out] version_id.
- * @param[out] read_only.
  *
  * @retval  0 Success.
  * @retval -1 Memory or format error.
@@ -277,7 +276,7 @@ xrow_encode_subscribe(struct xrow_header *row,
 int
 xrow_decode_subscribe(struct xrow_header *row, struct tt_uuid *replicaset_uuid,
 		      struct tt_uuid *instance_uuid, struct vclock *vclock,
-		      uint32_t *version_id, bool *read_only);
+		      uint32_t *version_id);
 
 /**
  * Encode JOIN command.
@@ -301,8 +300,7 @@ xrow_encode_join(struct xrow_header *row, const struct tt_uuid *instance_uuid);
 static inline int
 xrow_decode_join(struct xrow_header *row, struct tt_uuid *instance_uuid)
 {
-	return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL,
-				     NULL);
+	return xrow_decode_subscribe(row, NULL, instance_uuid, NULL, NULL);
 }
 
 /**
@@ -327,7 +325,7 @@ xrow_encode_vclock(struct xrow_header *row, const struct vclock *vclock);
 static inline int
 xrow_decode_vclock(struct xrow_header *row, struct vclock *vclock)
 {
-	return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL, NULL);
+	return xrow_decode_subscribe(row, NULL, NULL, vclock, NULL);
 }
 
 /**
@@ -391,18 +389,16 @@ iproto_reply_ok(struct obuf *out, uint64_t sync, uint32_t schema_version);
  * Encode iproto header with IPROTO_OK response code
  * and vclock in the body.
  * @param out Encode to.
+ * @param vclock Vclock to encode.
  * @param sync Request sync.
  * @param schema_version.
- * @param vclock.
- * @param read_only.
  *
  * @retval  0 Success.
  * @retval -1 Memory error.
  */
 int
-iproto_reply_request_vote(struct obuf *out, uint64_t sync,
-			 uint32_t schema_version, const struct vclock *vclock,
-			 bool read_only);
+iproto_reply_vclock(struct obuf *out, const struct vclock *vclock,
+		    uint64_t sync, uint32_t schema_version);
 
 /**
  * Encode a reply to an instance status request.
@@ -627,7 +623,7 @@ xrow_decode_subscribe_xc(struct xrow_header *row,
 			 uint32_t *replica_version_id)
 {
 	if (xrow_decode_subscribe(row, replicaset_uuid, instance_uuid,
-				  vclock, replica_version_id, NULL) != 0)
+				  vclock, replica_version_id) != 0)
 		diag_raise();
 }
 
@@ -672,14 +668,12 @@ iproto_reply_ok_xc(struct obuf *out, uint64_t sync, uint32_t schema_version)
 		diag_raise();
 }
 
-/** @copydoc iproto_reply_request_vote_xc. */
+/** @copydoc iproto_reply_vclock. */
 static inline void
-iproto_reply_request_vote_xc(struct obuf *out, uint64_t sync,
-			     uint32_t schema_version,
-			     const struct vclock *vclock, bool read_only)
+iproto_reply_vclock_xc(struct obuf *out, const struct vclock *vclock,
+		       uint64_t sync, uint32_t schema_version)
 {
-	if (iproto_reply_request_vote(out, sync, schema_version,
-				      vclock, read_only) != 0)
+	if (iproto_reply_vclock(out, vclock, sync, schema_version) != 0)
 		diag_raise();
 }
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 05/11] gc: keep track of vclocks instead of signatures
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (3 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:11   ` Konstantin Osipov
  2018-07-14 20:49 ` [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS Vladimir Davydov
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

In order to check if a replica needs to be rebootstrapped, we need to
know the vclock of the oldest WAL stored on the master, but the garbage
collector works with signatures and hence can't report the vclock it was
last called for. Actually, all gc users have a vclock and can pass it
instead of signature so it's pretty easy to switch garbage collection
infrastructure to vclock.

Needed for #461
---
 src/box/box.cc     |  5 ++--
 src/box/gc.c       | 83 +++++++++++++++++++++++++++++-------------------------
 src/box/gc.h       | 17 ++++++-----
 src/box/lua/info.c |  4 ++-
 src/box/relay.cc   | 18 ++++++------
 5 files changed, 66 insertions(+), 61 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 200e49a1..7aac0a13 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1422,7 +1422,7 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	/* Register the replica with the garbage collector. */
 	struct gc_consumer *gc = gc_consumer_register(
 		tt_sprintf("replica %s", tt_uuid_str(&instance_uuid)),
-		vclock_sum(&start_vclock), GC_CONSUMER_WAL);
+		&start_vclock, GC_CONSUMER_WAL);
 	if (gc == NULL)
 		diag_raise();
 	auto gc_guard = make_scoped_guard([=]{
@@ -2124,8 +2124,7 @@ box_backup_start(int checkpoint_idx, box_backup_cb cb, void *cb_arg)
 			return -1;
 		}
 	} while (checkpoint_idx-- > 0);
-	backup_gc = gc_consumer_register("backup", vclock_sum(vclock),
-					 GC_CONSUMER_ALL);
+	backup_gc = gc_consumer_register("backup", vclock, GC_CONSUMER_ALL);
 	if (backup_gc == NULL)
 		return -1;
 	int rc = engine_backup(vclock, cb, cb_arg);
diff --git a/src/box/gc.c b/src/box/gc.c
index 6a05b298..6c324220 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -59,8 +59,8 @@ struct gc_consumer {
 	gc_node_t node;
 	/** Human-readable name. */
 	char *name;
-	/** The vclock signature tracked by this consumer. */
-	int64_t signature;
+	/** The vclock tracked by this consumer. */
+	struct vclock vclock;
 	/** Consumer type, indicating that consumer only consumes
 	 * WAL files, or both - SNAP and WAL.
 	 */
@@ -73,10 +73,10 @@ typedef rb_tree(struct gc_consumer) gc_tree_t;
 struct gc_state {
 	/** Number of checkpoints to maintain. */
 	int checkpoint_count;
-	/** Max signature WAL garbage collection has been called for. */
-	int64_t wal_signature;
-	/** Max signature checkpoint garbage collection has been called for. */
-	int64_t checkpoint_signature;
+	/** Max vclock WAL garbage collection has been called for. */
+	struct vclock wal_vclock;
+	/** Max vclock checkpoint garbage collection has been called for. */
+	struct vclock checkpoint_vclock;
 	/** Registered consumers, linked by gc_consumer::node. */
 	gc_tree_t consumers;
 	/**
@@ -94,9 +94,9 @@ static struct gc_state gc;
 static inline int
 gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
 {
-	if (a->signature < b->signature)
+	if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
 		return -1;
-	if (a->signature > b->signature)
+	if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
 		return 1;
 	if ((intptr_t)a < (intptr_t)b)
 		return -1;
@@ -110,7 +110,7 @@ rb_gen(MAYBE_UNUSED static inline, gc_tree_, gc_tree_t,
 
 /** Allocate a consumer object. */
 static struct gc_consumer *
-gc_consumer_new(const char *name, int64_t signature,
+gc_consumer_new(const char *name, const struct vclock *vclock,
 		enum gc_consumer_type type)
 {
 	struct gc_consumer *consumer = calloc(1, sizeof(*consumer));
@@ -126,7 +126,7 @@ gc_consumer_new(const char *name, int64_t signature,
 		free(consumer);
 		return NULL;
 	}
-	consumer->signature = signature;
+	vclock_copy(&consumer->vclock, vclock);
 	consumer->type = type;
 	return consumer;
 }
@@ -143,8 +143,8 @@ gc_consumer_delete(struct gc_consumer *consumer)
 void
 gc_init(void)
 {
-	gc.wal_signature = -1;
-	gc.checkpoint_signature = -1;
+	vclock_create(&gc.wal_vclock);
+	vclock_create(&gc.checkpoint_vclock);
 	gc_tree_new(&gc.consumers);
 	latch_create(&gc.latch);
 }
@@ -191,7 +191,8 @@ gc_run(void)
 	 * We have to maintain @checkpoint_count oldest checkpoints,
 	 * plus we can't remove checkpoints that are still in use.
 	 */
-	int64_t gc_checkpoint_signature = -1;
+	struct vclock gc_checkpoint_vclock;
+	vclock_create(&gc_checkpoint_vclock);
 
 	struct checkpoint_iterator checkpoints;
 	checkpoint_iterator_init(&checkpoints);
@@ -201,18 +202,21 @@ gc_run(void)
 		if (--checkpoint_count > 0)
 			continue;
 		if (leftmost_checkpoint != NULL &&
-		    leftmost_checkpoint->signature < vclock_sum(vclock))
+		    vclock_sum(&leftmost_checkpoint->vclock) < vclock_sum(vclock))
 			continue;
-		gc_checkpoint_signature = vclock_sum(vclock);
+		vclock_copy(&gc_checkpoint_vclock, vclock);
 		break;
 	}
 
-	int64_t gc_wal_signature = MIN(gc_checkpoint_signature,
-				       leftmost != NULL ?
-				       leftmost->signature : INT64_MAX);
+	struct vclock gc_wal_vclock;
+	if (leftmost != NULL &&
+	    vclock_sum(&leftmost->vclock) < vclock_sum(&gc_checkpoint_vclock))
+		vclock_copy(&gc_wal_vclock, &leftmost->vclock);
+	else
+		vclock_copy(&gc_wal_vclock, &gc_checkpoint_vclock);
 
-	if (gc_wal_signature <= gc.wal_signature &&
-	    gc_checkpoint_signature <= gc.checkpoint_signature)
+	if (vclock_sum(&gc_wal_vclock) <= vclock_sum(&gc.wal_vclock) &&
+	    vclock_sum(&gc_checkpoint_vclock) <= vclock_sum(&gc.checkpoint_vclock))
 		return; /* nothing to do */
 
 	/*
@@ -231,14 +235,14 @@ gc_run(void)
 	 */
 	int rc = 0;
 
-	if (gc_checkpoint_signature > gc.checkpoint_signature) {
-		gc.checkpoint_signature = gc_checkpoint_signature;
-		rc = engine_collect_garbage(gc_checkpoint_signature);
+	if (vclock_sum(&gc_checkpoint_vclock) > vclock_sum(&gc.checkpoint_vclock)) {
+		vclock_copy(&gc.checkpoint_vclock, &gc_checkpoint_vclock);
+		rc = engine_collect_garbage(vclock_sum(&gc_checkpoint_vclock));
 	}
-	if (gc_wal_signature > gc.wal_signature) {
-		gc.wal_signature = gc_wal_signature;
+	if (vclock_sum(&gc_wal_vclock) > vclock_sum(&gc.wal_vclock)) {
+		vclock_copy(&gc.wal_vclock, &gc_wal_vclock);
 		if (rc == 0)
-			wal_collect_garbage(gc_wal_signature);
+			wal_collect_garbage(vclock_sum(&gc_wal_vclock));
 	}
 
 	latch_unlock(&gc.latch);
@@ -251,11 +255,10 @@ gc_set_checkpoint_count(int checkpoint_count)
 }
 
 struct gc_consumer *
-gc_consumer_register(const char *name, int64_t signature,
+gc_consumer_register(const char *name, const struct vclock *vclock,
 		     enum gc_consumer_type type)
 {
-	struct gc_consumer *consumer = gc_consumer_new(name, signature,
-						       type);
+	struct gc_consumer *consumer = gc_consumer_new(name, vclock, type);
 	if (consumer != NULL)
 		gc_tree_insert(&gc.consumers, consumer);
 	return consumer;
@@ -264,7 +267,7 @@ gc_consumer_register(const char *name, int64_t signature,
 void
 gc_consumer_unregister(struct gc_consumer *consumer)
 {
-	int64_t signature = consumer->signature;
+	int64_t signature = vclock_sum(&consumer->vclock);
 
 	gc_tree_remove(&gc.consumers, consumer);
 	gc_consumer_delete(consumer);
@@ -274,14 +277,15 @@ gc_consumer_unregister(struct gc_consumer *consumer)
 	 * if it referenced the oldest vclock.
 	 */
 	struct gc_consumer *leftmost = gc_tree_first(&gc.consumers);
-	if (leftmost == NULL || leftmost->signature > signature)
+	if (leftmost == NULL || vclock_sum(&leftmost->vclock) > signature)
 		gc_run();
 }
 
 void
-gc_consumer_advance(struct gc_consumer *consumer, int64_t signature)
+gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock)
 {
-	int64_t prev_signature = consumer->signature;
+	int64_t signature = vclock_sum(vclock);
+	int64_t prev_signature = vclock_sum(&consumer->vclock);
 
 	assert(signature >= prev_signature);
 	if (signature == prev_signature)
@@ -292,12 +296,13 @@ gc_consumer_advance(struct gc_consumer *consumer, int64_t signature)
 	 * is violated.
 	 */
 	struct gc_consumer *next = gc_tree_next(&gc.consumers, consumer);
-	bool update_tree = (next != NULL && signature >= next->signature);
+	bool update_tree = (next != NULL &&
+			    signature >= vclock_sum(&next->vclock));
 
 	if (update_tree)
 		gc_tree_remove(&gc.consumers, consumer);
 
-	consumer->signature = signature;
+	vclock_copy(&consumer->vclock, vclock);
 
 	if (update_tree)
 		gc_tree_insert(&gc.consumers, consumer);
@@ -307,7 +312,7 @@ gc_consumer_advance(struct gc_consumer *consumer, int64_t signature)
 	 * if it referenced the oldest vclock.
 	 */
 	struct gc_consumer *leftmost = gc_tree_first(&gc.consumers);
-	if (leftmost == NULL || leftmost->signature > prev_signature)
+	if (leftmost == NULL || vclock_sum(&leftmost->vclock) > prev_signature)
 		gc_run();
 }
 
@@ -317,10 +322,10 @@ gc_consumer_name(const struct gc_consumer *consumer)
 	return consumer->name;
 }
 
-int64_t
-gc_consumer_signature(const struct gc_consumer *consumer)
+void
+gc_consumer_vclock(const struct gc_consumer *consumer, struct vclock *vclock)
 {
-	return consumer->signature;
+	vclock_copy(vclock, &consumer->vclock);
 }
 
 struct gc_consumer *
diff --git a/src/box/gc.h b/src/box/gc.h
index 6a890b7b..7e061768 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -32,13 +32,12 @@
  */
 
 #include <stddef.h>
-#include <stdint.h>
-#include <stdbool.h>
 
 #if defined(__cplusplus)
 extern "C" {
 #endif /* defined(__cplusplus) */
 
+struct vclock;
 struct gc_consumer;
 
 /** Consumer type: WAL consumer, or SNAP */
@@ -79,7 +78,7 @@ gc_set_checkpoint_count(int checkpoint_count);
  * Register a consumer.
  *
  * This will stop garbage collection of objects newer than
- * @signature until the consumer is unregistered or advanced.
+ * @vclock until the consumer is unregistered or advanced.
  * @name is a human-readable name of the consumer, it will
  * be used for reporting the consumer to the user.
  * @type consumer type, reporting whether consumer only depends
@@ -89,7 +88,7 @@ gc_set_checkpoint_count(int checkpoint_count);
  * memory allocation failure.
  */
 struct gc_consumer *
-gc_consumer_register(const char *name, int64_t signature,
+gc_consumer_register(const char *name, const struct vclock *vclock,
 		     enum gc_consumer_type type);
 
 /**
@@ -100,19 +99,19 @@ void
 gc_consumer_unregister(struct gc_consumer *consumer);
 
 /**
- * Advance the vclock signature tracked by a consumer and
+ * Advance the vclock tracked by a consumer and
  * invoke garbage collection if needed.
  */
 void
-gc_consumer_advance(struct gc_consumer *consumer, int64_t signature);
+gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock);
 
 /** Return the name of a consumer. */
 const char *
 gc_consumer_name(const struct gc_consumer *consumer);
 
-/** Return the signature a consumer tracks. */
-int64_t
-gc_consumer_signature(const struct gc_consumer *consumer);
+/** Return the vclock a consumer tracks. */
+void
+gc_consumer_vclock(const struct gc_consumer *consumer, struct vclock *vclock);
 
 /**
  * Iterator over registered consumers. The iterator is valid
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index d6697df9..4544d8b6 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -396,7 +396,9 @@ lbox_info_gc_call(struct lua_State *L)
 		lua_settable(L, -3);
 
 		lua_pushstring(L, "signature");
-		luaL_pushint64(L, gc_consumer_signature(consumer));
+		struct vclock vclock;
+		gc_consumer_vclock(consumer, &vclock);
+		luaL_pushint64(L, vclock_sum(&vclock));
 		lua_settable(L, -3);
 
 		lua_rawseti(L, -2, ++count);
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 75c3d56a..4cacbc84 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -78,8 +78,8 @@ struct relay_gc_msg {
 	struct stailq_entry in_pending;
 	/** Relay instance */
 	struct relay *relay;
-	/** Vclock signature to advance to */
-	int64_t signature;
+	/** Vclock to advance to */
+	struct vclock vclock;
 };
 
 /** State of a replication relay. */
@@ -325,7 +325,7 @@ static void
 tx_gc_advance(struct cmsg *msg)
 {
 	struct relay_gc_msg *m = (struct relay_gc_msg *)msg;
-	gc_consumer_advance(m->relay->replica->gc, m->signature);
+	gc_consumer_advance(m->relay->replica->gc, &m->vclock);
 	free(m);
 }
 
@@ -343,7 +343,7 @@ relay_on_close_log_f(struct trigger *trigger, void * /* event */)
 	}
 	cmsg_init(&m->msg, route);
 	m->relay = relay;
-	m->signature = vclock_sum(&relay->r->vclock);
+	vclock_copy(&m->vclock, &relay->r->vclock);
 	/*
 	 * Do not invoke garbage collection until the replica
 	 * confirms that it has received data stored in the
@@ -356,16 +356,16 @@ relay_on_close_log_f(struct trigger *trigger, void * /* event */)
  * Invoke pending garbage collection requests.
  *
  * This function schedules the most recent gc message whose
- * signature is less than or equal to the given one. Older
+ * vclock is less than or equal to the given one. Older
  * messages are discarded as their job will be done by the
  * scheduled message anyway.
  */
 static inline void
-relay_schedule_pending_gc(struct relay *relay, int64_t signature)
+relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
 {
 	struct relay_gc_msg *curr, *next, *gc_msg = NULL;
 	stailq_foreach_entry_safe(curr, next, &relay->pending_gc, in_pending) {
-		if (curr->signature > signature)
+		if (vclock_sum(&curr->vclock) > vclock_sum(vclock))
 			break;
 		stailq_shift(&relay->pending_gc);
 		free(gc_msg);
@@ -533,7 +533,7 @@ relay_subscribe_f(va_list ap)
 		relay->status_msg.relay = relay;
 		cpipe_push(&relay->tx_pipe, &relay->status_msg.msg);
 		/* Collect xlog files received by the replica. */
-		relay_schedule_pending_gc(relay, vclock_sum(send_vclock));
+		relay_schedule_pending_gc(relay, send_vclock);
 	}
 
 	say_crit("exiting the relay loop");
@@ -578,7 +578,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 	if (replica->gc == NULL) {
 		replica->gc = gc_consumer_register(
 			tt_sprintf("replica %s", tt_uuid_str(&replica->uuid)),
-			vclock_sum(replica_clock), GC_CONSUMER_WAL);
+			replica_clock, GC_CONSUMER_WAL);
 		if (replica->gc == NULL)
 			diag_raise();
 	}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (4 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 05/11] gc: keep track of vclocks instead of signatures Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:12   ` Konstantin Osipov
  2018-07-21 12:07   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

It will be used to check if a replica fell too much behind its peers and
so needs to be rebootstrapped.

Needed for #461
---
 src/box/box.cc             |  1 +
 src/box/gc.c               |  6 ++++++
 src/box/gc.h               |  6 ++++++
 src/box/iproto_constants.h |  1 +
 src/box/xrow.c             | 13 ++++++++++---
 src/box/xrow.h             |  2 ++
 6 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 7aac0a13..b629a4d8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1569,6 +1569,7 @@ box_process_status_request(struct status *status)
 {
 	status->is_ro = cfg_geti("read_only") != 0;
 	vclock_copy(&status->vclock, &replicaset.vclock);
+	gc_vclock(&status->gc_vclock);
 }
 
 /** Insert a new cluster into _schema */
diff --git a/src/box/gc.c b/src/box/gc.c
index 6c324220..bf221274 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -164,6 +164,12 @@ gc_free(void)
 	latch_destroy(&gc.latch);
 }
 
+void
+gc_vclock(struct vclock *vclock)
+{
+	vclock_copy(vclock, &gc.wal_vclock);
+}
+
 /** Find the consumer that uses the oldest checkpoint. */
 struct gc_consumer *
 gc_tree_first_checkpoint(gc_tree_t *consumers)
diff --git a/src/box/gc.h b/src/box/gc.h
index 7e061768..e8ee2d09 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -60,6 +60,12 @@ void
 gc_free(void);
 
 /**
+ * Get the oldest available vclock.
+ */
+void
+gc_vclock(struct vclock *vclock);
+
+/**
  * Invoke garbage collection in order to remove files left
  * from old checkpoints. The number of checkpoints saved by
  * this function is specified by box.cfg.checkpoint_count.
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index fe452817..e2bea886 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -87,6 +87,7 @@ enum iproto_key {
 enum iproto_status_key {
 	IPROTO_STATUS_IS_RO = 0x01,
 	IPROTO_STATUS_VCLOCK = 0x02,
+	IPROTO_STATUS_GC_VCLOCK = 0x03,
 };
 
 #define bit(c) (1ULL<<IPROTO_##c)
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 8a87fd4d..345c7391 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -344,9 +344,10 @@ iproto_reply_status(struct obuf *out, const struct status *status,
 		    uint64_t sync, uint32_t schema_version)
 {
 	size_t max_size = IPROTO_HEADER_LEN + mp_sizeof_map(1) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(2) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_map(3) +
 		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_bool(status->is_ro) +
-		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&status->vclock);
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&status->vclock) +
+		mp_sizeof_uint(UINT32_MAX) + mp_sizeof_vclock(&status->gc_vclock);
 
 	char *buf = obuf_reserve(out, max_size);
 	if (buf == NULL) {
@@ -358,11 +359,13 @@ iproto_reply_status(struct obuf *out, const struct status *status,
 	char *data = buf + IPROTO_HEADER_LEN;
 	data = mp_encode_map(data, 1);
 	data = mp_encode_uint(data, IPROTO_STATUS);
-	data = mp_encode_map(data, 2);
+	data = mp_encode_map(data, 3);
 	data = mp_encode_uint(data, IPROTO_STATUS_IS_RO);
 	data = mp_encode_bool(data, status->is_ro);
 	data = mp_encode_uint(data, IPROTO_STATUS_VCLOCK);
 	data = mp_encode_vclock(data, &status->vclock);
+	data = mp_encode_uint(data, IPROTO_STATUS_GC_VCLOCK);
+	data = mp_encode_vclock(data, &status->gc_vclock);
 	size_t size = data - buf;
 	assert(size <= max_size);
 
@@ -933,6 +936,10 @@ xrow_decode_status(struct xrow_header *row, struct status *status)
 			if (mp_decode_vclock(&data, &status->vclock) != 0)
 				goto err;
 			break;
+		case IPROTO_STATUS_GC_VCLOCK:
+			if (mp_decode_vclock(&data, &status->gc_vclock) != 0)
+				goto err;
+			break;
 		default:
 			mp_next(&data);
 		}
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 1ea30fb1..341385d1 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -229,6 +229,8 @@ struct status {
 	bool is_ro;
 	/** Current instance vclock. */
 	struct vclock vclock;
+	/** Oldest vclock available on the instance. */
+	struct vclock gc_vclock;
 };
 
 /**
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (5 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-19  7:19   ` Konstantin Osipov
  2018-07-14 20:49 ` [PATCH v3 08/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If a replica fell too much behind its peers in the cluster and xlog
files needed for it to get up to speed have been removed, it won't be
able to proceed without rebootstrap. This patch makes the recovery
procedure detect such cases and initiate rebootstrap procedure if
necessary.

Note, rebootstrap is currently only supported by memtx engine. If there
are vinyl spaces on the replica, rebootstrap will fail. This is fixed by
the following patches.

Part of #461
---
 src/box/box.cc                           |   9 ++
 src/box/replication.cc                   |  27 ++++
 src/box/replication.h                    |   9 ++
 test/replication/replica_rejoin.result   | 247 +++++++++++++++++++++++++++++++
 test/replication/replica_rejoin.test.lua |  92 ++++++++++++
 test/replication/suite.cfg               |   1 +
 6 files changed, 385 insertions(+)
 create mode 100644 test/replication/replica_rejoin.result
 create mode 100644 test/replication/replica_rejoin.test.lua

diff --git a/src/box/box.cc b/src/box/box.cc
index b629a4d8..baf30fce 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1797,6 +1797,9 @@ bootstrap(const struct tt_uuid *instance_uuid,
 /**
  * Recover the instance from the local directory.
  * Enter hot standby if the directory is locked.
+ * Invoke rebootstrap if the instance fell too much
+ * behind its peers in the replica set and needs
+ * to be rebootstrapped.
  */
 static void
 local_recovery(const struct tt_uuid *instance_uuid,
@@ -1832,6 +1835,12 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	if (wal_dir_lock >= 0) {
 		box_listen();
 		box_sync_replication(replication_connect_timeout, false);
+
+		struct replica *master;
+		if (replicaset_needs_rejoin(&master)) {
+			say_info("replica is too old, initiating rejoin");
+			return bootstrap_from_master(master);
+		}
 	}
 
 	/*
diff --git a/src/box/replication.cc b/src/box/replication.cc
index f12244c9..d61a984f 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -625,6 +625,33 @@ error:
 		  "failed to connect to one or more replicas");
 }
 
+bool
+replicaset_needs_rejoin(struct replica **master)
+{
+	replicaset_foreach(replica) {
+		/*
+		 * Rebootstrap this instance from a master if:
+		 * - the oldest vclock stored on the master is greater
+		 *   than or incomparable with the instance vclock
+		 *   (so that the instance can't follow the master) and
+		 * - the instance is strictly behind the master (so
+		 *   that we won't lose any data by rebootstrapping
+		 *   this instance)
+		 */
+		struct applier *applier = replica->applier;
+		if (applier != NULL &&
+		    vclock_compare(&applier->remote_status.gc_vclock,
+				   &replicaset.vclock) > 0 &&
+		    vclock_compare(&replicaset.vclock,
+				   &applier->remote_status.vclock) < 0) {
+			*master = replica;
+			return true;
+		}
+	}
+	*master = NULL;
+	return false;
+}
+
 void
 replicaset_follow(void)
 {
diff --git a/src/box/replication.h b/src/box/replication.h
index fdf995c3..e8b391af 100644
--- a/src/box/replication.h
+++ b/src/box/replication.h
@@ -360,6 +360,15 @@ replicaset_connect(struct applier **appliers, int count,
 		   double timeout, bool connect_all);
 
 /**
+ * Check if the current instance fell too much behind its
+ * peers in the replica set and needs to be rebootstrapped.
+ * If it does, return true and set @master to the instance
+ * to use for rebootstrap, otherwise return false.
+ */
+bool
+replicaset_needs_rejoin(struct replica **master);
+
+/**
  * Resume all appliers registered with the replica set.
  */
 void
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
new file mode 100644
index 00000000..b7563ed9
--- /dev/null
+++ b/test/replication/replica_rejoin.result
@@ -0,0 +1,247 @@
+env = require('test_run')
+---
+...
+test_run = env.new()
+---
+...
+-- Cleanup the instance to remove vylog files left from previous
+-- tests, since vinyl doesn't support rebootstrap yet.
+test_run:cmd('restart server default with cleanup=1')
+--
+-- gh-461: check that a replica refetches the last checkpoint
+-- in case it fell behind the master.
+--
+box.schema.user.grant('guest', 'replication')
+---
+...
+_ = box.schema.space.create('test')
+---
+...
+_ = box.space.test:create_index('pk')
+---
+...
+_ = box.space.test:insert{1}
+---
+...
+_ = box.space.test:insert{2}
+---
+...
+_ = box.space.test:insert{3}
+---
+...
+-- Join a replica, then stop it.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [1]
+  - [2]
+  - [3]
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Restart the server to purge the replica from
+-- the garbage collection state.
+test_run:cmd("restart server default")
+-- Make some checkpoints to remove old xlogs.
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+_ = box.space.test:delete{1}
+---
+...
+_ = box.space.test:insert{10}
+---
+...
+box.snapshot()
+---
+- ok
+...
+_ = box.space.test:delete{2}
+---
+...
+_ = box.space.test:insert{20}
+---
+...
+box.snapshot()
+---
+- ok
+...
+_ = box.space.test:delete{3}
+---
+...
+_ = box.space.test:insert{30}
+---
+...
+fio = require('fio')
+---
+...
+#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+---
+- 1
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Restart the replica. Since xlogs have been removed,
+-- it is supposed to rejoin without changing id.
+test_run:cmd("start server replica")
+---
+- true
+...
+box.info.replication[2].downstream.vclock ~= nil or box.info
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [10]
+  - [20]
+  - [30]
+...
+test_run:cmd("switch default")
+---
+- true
+...
+-- Make sure the replica follows new changes.
+for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
+---
+...
+vclock = test_run:get_vclock('default')
+---
+...
+_ = test_run:wait_vclock('replica', vclock)
+---
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.space.test:select()
+---
+- - [10, 10]
+  - [20, 20]
+  - [30, 30]
+...
+-- Check that restart works as usual.
+test_run:cmd("restart server replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+---
+- true
+...
+box.space.test:select()
+---
+- - [10, 10]
+  - [20, 20]
+  - [30, 30]
+...
+-- Check that rebootstrap is NOT initiated unless the replica
+-- is strictly behind the master.
+box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
+---
+- [1, 2, 3]
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+for i = 1, 3 do box.space.test:delete{i * 10} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+for i = 1, 3 do box.space.test:insert{i * 100} end
+---
+...
+fio = require('fio')
+---
+...
+#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+---
+- 1
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.info.status -- orphan
+---
+- orphan
+...
+box.space.test:select()
+---
+- - [1, 2, 3]
+  - [10, 10]
+  - [20, 20]
+  - [30, 30]
+...
+-- Cleanup.
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+box.space.test:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
new file mode 100644
index 00000000..dfcb79cf
--- /dev/null
+++ b/test/replication/replica_rejoin.test.lua
@@ -0,0 +1,92 @@
+env = require('test_run')
+test_run = env.new()
+
+-- Cleanup the instance to remove vylog files left from previous
+-- tests, since vinyl doesn't support rebootstrap yet.
+test_run:cmd('restart server default with cleanup=1')
+
+--
+-- gh-461: check that a replica refetches the last checkpoint
+-- in case it fell behind the master.
+--
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test')
+_ = box.space.test:create_index('pk')
+_ = box.space.test:insert{1}
+_ = box.space.test:insert{2}
+_ = box.space.test:insert{3}
+
+-- Join a replica, then stop it.
+test_run:cmd("create server replica with rpl_master=default, script='replication/replica.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Restart the server to purge the replica from
+-- the garbage collection state.
+test_run:cmd("restart server default")
+
+-- Make some checkpoints to remove old xlogs.
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+_ = box.space.test:delete{1}
+_ = box.space.test:insert{10}
+box.snapshot()
+_ = box.space.test:delete{2}
+_ = box.space.test:insert{20}
+box.snapshot()
+_ = box.space.test:delete{3}
+_ = box.space.test:insert{30}
+fio = require('fio')
+#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Restart the replica. Since xlogs have been removed,
+-- it is supposed to rejoin without changing id.
+test_run:cmd("start server replica")
+box.info.replication[2].downstream.vclock ~= nil or box.info
+test_run:cmd("switch replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+test_run:cmd("switch default")
+
+-- Make sure the replica follows new changes.
+for i = 10, 30, 10 do box.space.test:update(i, {{'!', 1, i}}) end
+vclock = test_run:get_vclock('default')
+_ = test_run:wait_vclock('replica', vclock)
+test_run:cmd("switch replica")
+box.space.test:select()
+
+-- Check that restart works as usual.
+test_run:cmd("restart server replica")
+box.info.replication[1].upstream.status == 'follow' or box.info
+box.space.test:select()
+
+-- Check that rebootstrap is NOT initiated unless the replica
+-- is strictly behind the master.
+box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+for i = 1, 3 do box.space.test:delete{i * 10} end
+box.snapshot()
+for i = 1, 3 do box.space.test:insert{i * 100} end
+fio = require('fio')
+#fio.glob(fio.pathjoin(box.cfg.wal_dir, '*.xlog')) -- 1
+box.cfg{checkpoint_count = checkpoint_count}
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.info.status -- orphan
+box.space.test:select()
+
+-- Cleanup.
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+test_run:cmd("cleanup server replica")
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 95e94e5a..2b609f16 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,6 +6,7 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
+    "replica_rejoin.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 08/11] vinyl: simplify vylog recovery from backup
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (6 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-31  8:21   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Since we don't create snapshot files for vylog, but instead append
records written after checkpoint to the same file, we have to use the
previous vylog file for backup (see vy_log_backup_path()). So when
recovering from a backup we need to rotate the last vylog to keep vylog
and checkpoint signatures in sync. Currently, we do it on recovery
completion and we use vy_log_create() instead of vy_log_rotate() for it.
This is done so that we can reuse the context that was used for recovery
instead of rereading vylog for rotation. Actually, there's no point in
this micro-optimization, because we rotate vylog only when recovering
from a backup. Let's remove it and use vy_log_rotate() for this.

Needed for #461
---
 src/box/vy_log.c | 59 +++++++++++++++++++++-----------------------------------
 1 file changed, 22 insertions(+), 37 deletions(-)

diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index fbbc7a7f..2ee5553d 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -184,6 +184,12 @@ static int
 vy_recovery_process_record(struct vy_recovery *recovery,
 			   const struct vy_log_record *record);
 
+static int
+vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
+
+int
+vy_log_rotate(const struct vclock *vclock);
+
 /**
  * Return the name of the vylog file that has the given signature.
  */
@@ -883,10 +889,11 @@ vy_log_begin_recovery(const struct vclock *vclock)
 	if (xdir_scan(&vy_log.dir) < 0 && errno != ENOENT)
 		return NULL;
 
-	struct vclock vy_log_vclock;
-	vclock_create(&vy_log_vclock);
-	if (xdir_last_vclock(&vy_log.dir, &vy_log_vclock) >= 0 &&
-	    vclock_compare(&vy_log_vclock, vclock) > 0) {
+	if (xdir_last_vclock(&vy_log.dir, &vy_log.last_checkpoint) < 0)
+		vclock_copy(&vy_log.last_checkpoint, vclock);
+
+	int cmp = vclock_compare(&vy_log.last_checkpoint, vclock);
+	if (cmp > 0) {
 		/*
 		 * Last vy_log log is newer than the last snapshot.
 		 * This can't normally happen, as vy_log is rotated
@@ -896,21 +903,27 @@ vy_log_begin_recovery(const struct vclock *vclock)
 		diag_set(ClientError, ER_MISSING_SNAPSHOT);
 		return NULL;
 	}
+	if (cmp < 0) {
+		/*
+		 * Last vy_log log is older than the last snapshot.
+		 * This happens if we are recovering from a backup.
+		 * Rotate the log to keep its signature in sync with
+		 * checkpoint.
+		 */
+		if (vy_log_rotate(vclock) != 0)
+			return NULL;
+	}
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log_vclock), false);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), false);
 	if (recovery == NULL)
 		return NULL;
 
 	vy_log.next_id = recovery->max_id + 1;
 	vy_log.recovery = recovery;
-	vclock_copy(&vy_log.last_checkpoint, vclock);
 	return recovery;
 }
 
-static int
-vy_log_create(const struct vclock *vclock, struct vy_recovery *recovery);
-
 int
 vy_log_end_recovery(void)
 {
@@ -931,34 +944,6 @@ vy_log_end_recovery(void)
 		return -1;
 	}
 
-	/*
-	 * On backup we copy files corresponding to the most recent
-	 * checkpoint. Since vy_log does not create snapshots of its log
-	 * files, but instead appends records written after checkpoint
-	 * to the most recent log file, the signature of the vy_log file
-	 * corresponding to the last checkpoint equals the signature
-	 * of the previous checkpoint. So upon successful recovery
-	 * from a backup we need to rotate the log to keep checkpoint
-	 * and vy_log signatures in sync.
-	 */
-	struct vclock *vclock = vclockset_last(&vy_log.dir.index);
-	if (vclock == NULL ||
-	    vclock_compare(vclock, &vy_log.last_checkpoint) != 0) {
-		vclock = malloc(sizeof(*vclock));
-		if (vclock == NULL) {
-			diag_set(OutOfMemory, sizeof(*vclock),
-				 "malloc", "struct vclock");
-			return -1;
-		}
-		vclock_copy(vclock, &vy_log.last_checkpoint);
-		xdir_add_vclock(&vy_log.dir, vclock);
-		if (vy_log_create(vclock, vy_log.recovery) < 0) {
-			diag_log();
-			say_error("failed to write `%s'",
-				  vy_log_filename(vclock_sum(vclock)));
-			return -1;
-		}
-	}
 	xdir_collect_inprogress(&vy_log.dir);
 	vy_log.recovery = NULL;
 	return 0;
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (7 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 08/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-21 11:12   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 10/11] Update test-run Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 11/11] vinyl: implement rebootstrap support Vladimir Davydov
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently, this function takes a single boolean argument, but I'm
planning to add another one. Since two bool arguments look rather
confusing, let's turn this arguments into flags.

Needed for #461
---
 src/box/vinyl.c  |  8 +++++---
 src/box/vy_log.c | 19 +++++++++----------
 src/box/vy_log.h | 16 ++++++++++++----
 3 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/src/box/vinyl.c b/src/box/vinyl.c
index ef458921..a802b169 100644
--- a/src/box/vinyl.c
+++ b/src/box/vinyl.c
@@ -3226,7 +3226,8 @@ vinyl_engine_join(struct engine *engine, const struct vclock *vclock,
 	 * Send all runs stored in it to the replica.
 	 */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(vclock), true);
+	recovery = vy_recovery_new(vclock_sum(vclock),
+				   VY_RECOVERY_LOAD_CHECKPOINT);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog to join a replica");
 		goto out_join_cord;
@@ -3429,7 +3430,7 @@ vinyl_engine_collect_garbage(struct engine *engine, int64_t lsn)
 
 	/* Cleanup run files. */
 	int64_t signature = checkpoint_last(NULL);
-	struct vy_recovery *recovery = vy_recovery_new(signature, false);
+	struct vy_recovery *recovery = vy_recovery_new(signature, 0);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for garbage collection");
 		return 0;
@@ -3458,7 +3459,8 @@ vinyl_engine_backup(struct engine *engine, const struct vclock *vclock,
 
 	/* Backup run files. */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(vclock), true);
+	recovery = vy_recovery_new(vclock_sum(vclock),
+				   VY_RECOVERY_LOAD_CHECKPOINT);
 	if (recovery == NULL) {
 		say_error("failed to recover vylog for backup");
 		return -1;
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 2ee5553d..10648106 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -178,7 +178,7 @@ struct vy_log {
 static struct vy_log vy_log;
 
 static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, bool only_checkpoint);
+vy_recovery_new_locked(int64_t signature, int flags);
 
 static int
 vy_recovery_process_record(struct vy_recovery *recovery,
@@ -915,7 +915,7 @@ vy_log_begin_recovery(const struct vclock *vclock)
 	}
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), false);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), 0);
 	if (recovery == NULL)
 		return NULL;
 
@@ -998,7 +998,7 @@ vy_log_rotate(const struct vclock *vclock)
 	latch_lock(&vy_log.latch);
 
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(prev_signature, false);
+	recovery = vy_recovery_new_locked(prev_signature, 0);
 	if (recovery == NULL)
 		goto fail;
 
@@ -2031,7 +2031,7 @@ static ssize_t
 vy_recovery_new_f(va_list ap)
 {
 	int64_t signature = va_arg(ap, int64_t);
-	bool only_checkpoint = va_arg(ap, int);
+	int flags = va_arg(ap, int);
 	struct vy_recovery **p_recovery = va_arg(ap, struct vy_recovery **);
 
 	say_verbose("loading vylog %lld", (long long)signature);
@@ -2088,7 +2088,7 @@ vy_recovery_new_f(va_list ap)
 		say_verbose("load vylog record: %s",
 			    vy_log_record_str(&record));
 		if (record.type == VY_LOG_SNAPSHOT) {
-			if (only_checkpoint)
+			if ((flags & VY_RECOVERY_LOAD_CHECKPOINT) != 0)
 				break;
 			continue;
 		}
@@ -2123,7 +2123,7 @@ fail:
  * Must be called with the log latch held.
  */
 static struct vy_recovery *
-vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
+vy_recovery_new_locked(int64_t signature, int flags)
 {
 	int rc;
 	struct vy_recovery *recovery;
@@ -2141,8 +2141,7 @@ vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
 	}
 
 	/* Load the log from coio so as not to stall tx thread. */
-	rc = coio_call(vy_recovery_new_f, signature,
-		       (int)only_checkpoint, &recovery);
+	rc = coio_call(vy_recovery_new_f, signature, flags, &recovery);
 	if (rc != 0) {
 		diag_log();
 		say_error("failed to load `%s'", vy_log_filename(signature));
@@ -2152,12 +2151,12 @@ vy_recovery_new_locked(int64_t signature, bool only_checkpoint)
 }
 
 struct vy_recovery *
-vy_recovery_new(int64_t signature, bool only_checkpoint)
+vy_recovery_new(int64_t signature, int flags)
 {
 	/* Lock out concurrent writers while we are loading the log. */
 	latch_lock(&vy_log.latch);
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new_locked(signature, only_checkpoint);
+	recovery = vy_recovery_new_locked(signature, flags);
 	latch_unlock(&vy_log.latch);
 	return recovery;
 }
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index da0745b2..98cbf6ee 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -525,18 +525,26 @@ vy_log_begin_recovery(const struct vclock *vclock);
 int
 vy_log_end_recovery(void);
 
+/** Flags passed to vy_recovery_new(). */
+enum vy_recovery_flag {
+	/**
+	 * Do not load records written to the log after checkpoint,
+	 * i.e. get a consistent view of vinyl database at the time
+	 * of the last checkpoint.
+	 */
+	VY_RECOVERY_LOAD_CHECKPOINT	= 1 << 0,
+};
+
 /**
  * Create a recovery context from the metadata log created
  * by checkpoint with the given signature.
  *
- * If @only_checkpoint is set, do not load records appended to
- * the log after checkpoint (i.e. get a consistent view of
- * Vinyl at the time of the checkpoint).
+ * For valid values of @flags, see vy_recovery_flag.
  *
  * Returns NULL on failure.
  */
 struct vy_recovery *
-vy_recovery_new(int64_t signature, bool only_checkpoint);
+vy_recovery_new(int64_t signature, int flags);
 
 /**
  * Free a recovery context created by vy_recovery_new().
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 10/11] Update test-run
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (8 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-21 11:13   ` Vladimir Davydov
  2018-07-14 20:49 ` [PATCH v3 11/11] vinyl: implement rebootstrap support Vladimir Davydov
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

To bring crash_expected option of "start server" command.
---
 test-run | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/test-run b/test-run
index 77e93279..95562e95 160000
--- a/test-run
+++ b/test-run
@@ -1 +1 @@
-Subproject commit 77e93279210f8c5c1fd0ed03416fa19a184f0b6d
+Subproject commit 95562e95401fef4e0b755ab0bb430974b5d1a29a
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v3 11/11] vinyl: implement rebootstrap support
  2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
                   ` (9 preceding siblings ...)
  2018-07-14 20:49 ` [PATCH v3 10/11] Update test-run Vladimir Davydov
@ 2018-07-14 20:49 ` Vladimir Davydov
  2018-07-31  8:23   ` Vladimir Davydov
  10 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-14 20:49 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

If vy_log_bootstrap() finds a vylog file in the vinyl directory, it
assumes it has to be rebootstrapped and calls vy_log_rebootstrap().
The latter scans the old vylog file to find the max vinyl object id,
from which it will start numbering objects created during rebootstrap to
avoid conflicts with old objects, then it writes VY_LOG_REBOOTSTRAP
record to the old vylog to denote the beginning of a rebootstrap
section. After that initial join proceeds as usual, writing information
about new objects to the old vylog file after VY_LOG_REBOOTSTRAP marker.
Upon successful rebootstrap completion, checkpoint, which is always
called right after bootstrap, rotates the old vylog and marks all
objects created before the VY_LOG_REBOOTSTRAP marker as dropped in the
new vylog. The old objects will be purged by the garbage collector as
usual.

In case rebootstrap fails and checkpoint never happens, local recovery
writes VY_LOG_ABORT_REBOOTSTRAP record to the vylog. This marker
indicates that the rebootstrap attempt failed and all objects created
during rebootstrap should be discarded. They will be purged by the
garbage collector on checkpoint. Thus even if rebootstrap fails, it is
possible to recover the database to the state that existed right before
a failed rebootstrap attempt.

Closes #461
---
 src/box/relay.cc                         |   3 +
 src/box/vy_log.c                         | 133 +++++++++++++++-
 src/box/vy_log.h                         |  34 ++++
 src/errinj.h                             |   1 +
 test/box/errinj.result                   |   6 +-
 test/replication/replica_rejoin.result   |  11 +-
 test/replication/replica_rejoin.test.lua |   7 +-
 test/replication/suite.cfg               |   1 -
 test/vinyl/replica_rejoin.lua            |  13 ++
 test/vinyl/replica_rejoin.result         | 257 +++++++++++++++++++++++++++++++
 test/vinyl/replica_rejoin.test.lua       |  88 +++++++++++
 test/vinyl/suite.ini                     |   2 +-
 12 files changed, 536 insertions(+), 20 deletions(-)
 create mode 100644 test/vinyl/replica_rejoin.lua
 create mode 100644 test/vinyl/replica_rejoin.result
 create mode 100644 test/vinyl/replica_rejoin.test.lua

diff --git a/src/box/relay.cc b/src/box/relay.cc
index 4cacbc84..05468f20 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -287,6 +287,9 @@ relay_final_join(struct replica *replica, int fd, uint64_t sync,
 	if (rc != 0)
 		diag_raise();
 
+	ERROR_INJECT(ERRINJ_RELAY_FINAL_JOIN,
+		     tnt_raise(ClientError, ER_INJECTION, "relay final join"));
+
 	ERROR_INJECT(ERRINJ_RELAY_FINAL_SLEEP, {
 		while (vclock_compare(stop_vclock, &replicaset.vclock) == 0)
 			fiber_sleep(0.001);
diff --git a/src/box/vy_log.c b/src/box/vy_log.c
index 10648106..3843cad6 100644
--- a/src/box/vy_log.c
+++ b/src/box/vy_log.c
@@ -124,6 +124,8 @@ static const char *vy_log_type_name[] = {
 	[VY_LOG_MODIFY_LSM]		= "modify_lsm",
 	[VY_LOG_FORGET_LSM]		= "forget_lsm",
 	[VY_LOG_PREPARE_LSM]		= "prepare_lsm",
+	[VY_LOG_REBOOTSTRAP]		= "rebootstrap",
+	[VY_LOG_ABORT_REBOOTSTRAP]	= "abort_rebootstrap",
 };
 
 /** Metadata log object. */
@@ -852,17 +854,43 @@ vy_log_next_id(void)
 	return vy_log.next_id++;
 }
 
+/**
+ * If a vylog file already exists, we are doing a rebootstrap:
+ * - Load the vylog to find out the id to start indexing new
+ *   objects with.
+ * - Mark the beginning of a new rebootstrap attempt by writing
+ *   VY_LOG_REBOOTSTRAP record.
+ */
+static int
+vy_log_rebootstrap(void)
+{
+	struct vy_recovery *recovery;
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint),
+				   VY_RECOVERY_ABORT_REBOOTSTRAP);
+	if (recovery == NULL)
+		return -1;
+
+	vy_log.next_id = recovery->max_id + 1;
+	vy_recovery_delete(recovery);
+
+	struct vy_log_record record;
+	vy_log_record_init(&record);
+	record.type = VY_LOG_REBOOTSTRAP;
+	vy_log_tx_begin();
+	vy_log_write(&record);
+	if (vy_log_tx_commit() != 0)
+		return -1;
+
+	return 0;
+}
+
 int
 vy_log_bootstrap(void)
 {
-	/*
-	 * Scan the directory to make sure there is no
-	 * vylog files left from previous setups.
-	 */
 	if (xdir_scan(&vy_log.dir) < 0 && errno != ENOENT)
 		return -1;
-	if (xdir_last_vclock(&vy_log.dir, NULL) >= 0)
-		panic("vinyl directory is not empty");
+	if (xdir_last_vclock(&vy_log.dir, &vy_log.last_checkpoint) >= 0)
+		return vy_log_rebootstrap();
 
 	/* Add initial vclock to the xdir. */
 	struct vclock *vclock = malloc(sizeof(*vclock));
@@ -914,11 +942,29 @@ vy_log_begin_recovery(const struct vclock *vclock)
 			return NULL;
 	}
 
+	/*
+	 * If we are recovering from a vylog that has an unfinished
+	 * rebootstrap section, checkpoint (and hence rebootstrap)
+	 * failed, and we need to mark rebootstrap as aborted.
+	 */
 	struct vy_recovery *recovery;
-	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint), 0);
+	recovery = vy_recovery_new(vclock_sum(&vy_log.last_checkpoint),
+				   VY_RECOVERY_ABORT_REBOOTSTRAP);
 	if (recovery == NULL)
 		return NULL;
 
+	if (recovery->in_rebootstrap) {
+		struct vy_log_record record;
+		vy_log_record_init(&record);
+		record.type = VY_LOG_ABORT_REBOOTSTRAP;
+		vy_log_tx_begin();
+		vy_log_write(&record);
+		if (vy_log_tx_commit() != 0) {
+			vy_recovery_delete(recovery);
+			return NULL;
+		}
+	}
+
 	vy_log.next_id = recovery->max_id + 1;
 	vy_log.recovery = recovery;
 	return recovery;
@@ -1292,6 +1338,7 @@ vy_recovery_do_create_lsm(struct vy_recovery *recovery, int64_t id,
 	 * before the final version.
 	 */
 	rlist_add_tail_entry(&recovery->lsms, lsm, in_recovery);
+	lsm->in_rebootstrap = recovery->in_rebootstrap;
 	if (recovery->max_id < id)
 		recovery->max_id = id;
 	return lsm;
@@ -1875,6 +1922,42 @@ vy_recovery_delete_slice(struct vy_recovery *recovery, int64_t slice_id)
 }
 
 /**
+ * Mark all LSM trees created during rebootstrap as dropped so
+ * that they will be purged on the next garbage collection.
+ */
+static void
+vy_recovery_do_abort_rebootstrap(struct vy_recovery *recovery)
+{
+	struct vy_lsm_recovery_info *lsm;
+	rlist_foreach_entry(lsm, &recovery->lsms, in_recovery) {
+		if (lsm->in_rebootstrap) {
+			lsm->in_rebootstrap = false;
+			lsm->create_lsn = -1;
+			lsm->modify_lsn = -1;
+			lsm->drop_lsn = 0;
+		}
+	}
+}
+
+/** Handle a VY_LOG_REBOOTSTRAP log record. */
+static void
+vy_recovery_rebootstrap(struct vy_recovery *recovery)
+{
+	if (recovery->in_rebootstrap)
+		vy_recovery_do_abort_rebootstrap(recovery);
+	recovery->in_rebootstrap = true;
+}
+
+/** Handle VY_LOG_ABORT_REBOOTSTRAP record. */
+static void
+vy_recovery_abort_rebootstrap(struct vy_recovery *recovery)
+{
+	if (recovery->in_rebootstrap)
+		vy_recovery_do_abort_rebootstrap(recovery);
+	recovery->in_rebootstrap = false;
+}
+
+/**
  * Update a recovery context with a new log record.
  * Return 0 on success, -1 on failure.
  *
@@ -1885,7 +1968,7 @@ static int
 vy_recovery_process_record(struct vy_recovery *recovery,
 			   const struct vy_log_record *record)
 {
-	int rc;
+	int rc = 0;
 	switch (record->type) {
 	case VY_LOG_PREPARE_LSM:
 		rc = vy_recovery_prepare_lsm(recovery, record->lsm_id,
@@ -1950,6 +2033,12 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 		/* Not used anymore, ignore. */
 		rc = 0;
 		break;
+	case VY_LOG_REBOOTSTRAP:
+		vy_recovery_rebootstrap(recovery);
+		break;
+	case VY_LOG_ABORT_REBOOTSTRAP:
+		vy_recovery_abort_rebootstrap(recovery);
+		break;
 	default:
 		unreachable();
 	}
@@ -1960,6 +2049,26 @@ vy_recovery_process_record(struct vy_recovery *recovery,
 }
 
 /**
+ * Commit the last rebootstrap attempt - drop all objects created
+ * before rebootstrap.
+ */
+static void
+vy_recovery_commit_rebootstrap(struct vy_recovery *recovery)
+{
+	assert(recovery->in_rebootstrap);
+	struct vy_lsm_recovery_info *lsm;
+	rlist_foreach_entry(lsm, &recovery->lsms, in_recovery) {
+		if (!lsm->in_rebootstrap && lsm->drop_lsn < 0) {
+			/*
+			 * The files will be removed when the current
+			 * checkpoint is purged by garbage collector.
+			 */
+			lsm->drop_lsn = vy_log_signature();
+		}
+	}
+}
+
+/**
  * Fill index_id_hash with LSM trees recovered from vylog.
  */
 static int
@@ -2050,6 +2159,7 @@ vy_recovery_new_f(va_list ap)
 	recovery->run_hash = NULL;
 	recovery->slice_hash = NULL;
 	recovery->max_id = -1;
+	recovery->in_rebootstrap = false;
 
 	recovery->index_id_hash = mh_i64ptr_new();
 	recovery->lsm_hash = mh_i64ptr_new();
@@ -2103,6 +2213,13 @@ vy_recovery_new_f(va_list ap)
 
 	xlog_cursor_close(&cursor, false);
 
+	if (recovery->in_rebootstrap) {
+		if ((flags & VY_RECOVERY_ABORT_REBOOTSTRAP) != 0)
+			vy_recovery_do_abort_rebootstrap(recovery);
+		else
+			vy_recovery_commit_rebootstrap(recovery);
+	}
+
 	if (vy_recovery_build_index_id_hash(recovery) != 0)
 		goto fail_free;
 out:
diff --git a/src/box/vy_log.h b/src/box/vy_log.h
index 98cbf6ee..7718d9c6 100644
--- a/src/box/vy_log.h
+++ b/src/box/vy_log.h
@@ -196,6 +196,27 @@ enum vy_log_record_type {
 	 * a VY_LOG_CREATE_LSM record to commit it.
 	 */
 	VY_LOG_PREPARE_LSM		= 15,
+	/**
+	 * This record denotes the beginning of a rebootstrap section.
+	 * A rebootstrap section ends either by another record of this
+	 * type or by VY_LOG_ABORT_REBOOTSTRAP or at the end of the file.
+	 * All objects created between a VY_LOG_REBOOTSTRAP record and
+	 * VY_LOG_ABORT_REBOOTSTRAP or another VY_LOG_REBOOTSTRAP are
+	 * considered to be garbage and marked as dropped on recovery.
+	 *
+	 * We write a record of this type if a vylog file already exists
+	 * at bootstrap time, which means we are going to rebootstrap.
+	 * If rebootstrap succeeds, we rotate the vylog on checkpoint and
+	 * mark all objects written before the last VY_LOG_REBOOTSTRAP
+	 * record as dropped in the rotated vylog. If rebootstrap fails,
+	 * we write VY_LOG_ABORT_REBOOTSTRAP on recovery.
+	 */
+	VY_LOG_REBOOTSTRAP		= 16,
+	/**
+	 * This record is written on recovery if rebootstrap failed.
+	 * See also VY_LOG_REBOOTSTRAP.
+	 */
+	VY_LOG_ABORT_REBOOTSTRAP	= 17,
 
 	vy_log_record_type_MAX
 };
@@ -276,6 +297,12 @@ struct vy_recovery {
 	 * or -1 in case no vinyl objects were recovered.
 	 */
 	int64_t max_id;
+	/**
+	 * Set if we are currently processing a rebootstrap section,
+	 * i.e. we encountered a VY_LOG_REBOOTSTRAP record and haven't
+	 * seen matching VY_LOG_ABORT_REBOOTSTRAP.
+	 */
+	bool in_rebootstrap;
 };
 
 /** LSM tree info stored in a recovery context. */
@@ -326,6 +353,8 @@ struct vy_lsm_recovery_info {
 	 * this one after successful ALTER.
 	 */
 	struct vy_lsm_recovery_info *prepared;
+	/** Set if this LSM tree was created during rebootstrap. */
+	bool in_rebootstrap;
 };
 
 /** Vinyl range info stored in a recovery context. */
@@ -533,6 +562,11 @@ enum vy_recovery_flag {
 	 * of the last checkpoint.
 	 */
 	VY_RECOVERY_LOAD_CHECKPOINT	= 1 << 0,
+	/**
+	 * Consider the last attempt to rebootstrap aborted even if
+	 * there's no VY_LOG_ABORT_REBOOTSTRAP record.
+	 */
+	VY_RECOVERY_ABORT_REBOOTSTRAP	= 1 << 1,
 };
 
 /**
diff --git a/src/errinj.h b/src/errinj.h
index cde58d48..64d13b02 100644
--- a/src/errinj.h
+++ b/src/errinj.h
@@ -97,6 +97,7 @@ struct errinj {
 	_(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_REPORT_INTERVAL, ERRINJ_DOUBLE, {.dparam = 0}) \
 	_(ERRINJ_RELAY_FINAL_SLEEP, ERRINJ_BOOL, {.bparam = false}) \
+	_(ERRINJ_RELAY_FINAL_JOIN, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_PORT_DUMP, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_GARBAGE, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_XLOG_META, ERRINJ_BOOL, {.bparam = false}) \
diff --git a/test/box/errinj.result b/test/box/errinj.result
index 54b6d578..c6b2bbac 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -60,13 +60,15 @@ errinj.info()
     state: false
   ERRINJ_WAL_WRITE_DISK:
     state: false
+  ERRINJ_VY_LOG_FILE_RENAME:
+    state: false
   ERRINJ_VY_RUN_WRITE:
     state: false
-  ERRINJ_VY_LOG_FILE_RENAME:
+  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
     state: false
   ERRINJ_VY_LOG_FLUSH_DELAY:
     state: false
-  ERRINJ_HTTP_RESPONSE_ADD_WAIT:
+  ERRINJ_RELAY_FINAL_JOIN:
     state: false
   ERRINJ_SNAP_COMMIT_DELAY:
     state: false
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index b7563ed9..4370fae4 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -4,9 +4,12 @@ env = require('test_run')
 test_run = env.new()
 ---
 ...
--- Cleanup the instance to remove vylog files left from previous
--- tests, since vinyl doesn't support rebootstrap yet.
-test_run:cmd('restart server default with cleanup=1')
+engine = test_run:get_cfg('engine')
+---
+...
+test_run:cleanup_cluster()
+---
+...
 --
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
@@ -14,7 +17,7 @@ test_run:cmd('restart server default with cleanup=1')
 box.schema.user.grant('guest', 'replication')
 ---
 ...
-_ = box.schema.space.create('test')
+_ = box.schema.space.create('test', {engine = engine})
 ---
 ...
 _ = box.space.test:create_index('pk')
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index dfcb79cf..f998f60d 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -1,16 +1,15 @@
 env = require('test_run')
 test_run = env.new()
+engine = test_run:get_cfg('engine')
 
--- Cleanup the instance to remove vylog files left from previous
--- tests, since vinyl doesn't support rebootstrap yet.
-test_run:cmd('restart server default with cleanup=1')
+test_run:cleanup_cluster()
 
 --
 -- gh-461: check that a replica refetches the last checkpoint
 -- in case it fell behind the master.
 --
 box.schema.user.grant('guest', 'replication')
-_ = box.schema.space.create('test')
+_ = box.schema.space.create('test', {engine = engine})
 _ = box.space.test:create_index('pk')
 _ = box.space.test:insert{1}
 _ = box.space.test:insert{2}
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 2b609f16..95e94e5a 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -6,7 +6,6 @@
     "wal_off.test.lua": {},
     "hot_standby.test.lua": {},
     "rebootstrap.test.lua": {},
-    "replica_rejoin.test.lua": {},
     "*": {
         "memtx": {"engine": "memtx"},
         "vinyl": {"engine": "vinyl"}
diff --git a/test/vinyl/replica_rejoin.lua b/test/vinyl/replica_rejoin.lua
new file mode 100644
index 00000000..7cb7e09a
--- /dev/null
+++ b/test/vinyl/replica_rejoin.lua
@@ -0,0 +1,13 @@
+#!/usr/bin/env tarantool
+
+local replication = os.getenv("MASTER")
+if arg[1] == 'disable_replication' then
+    replication = nil
+end
+
+box.cfg({
+    replication     = replication,
+    vinyl_memory    = 1024 * 1024,
+})
+
+require('console').listen(os.getenv('ADMIN'))
diff --git a/test/vinyl/replica_rejoin.result b/test/vinyl/replica_rejoin.result
new file mode 100644
index 00000000..bd5d1ed3
--- /dev/null
+++ b/test/vinyl/replica_rejoin.result
@@ -0,0 +1,257 @@
+env = require('test_run')
+---
+...
+test_run = env.new()
+---
+...
+--
+-- gh-461: check that garbage collection works as expected
+-- after rebootstrap.
+--
+box.schema.user.grant('guest', 'replication')
+---
+...
+_ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
+---
+...
+_ = box.space.test:create_index('pk')
+---
+...
+pad = string.rep('x', 15 * 1024)
+---
+...
+for i = 1, 100 do box.space.test:replace{i, pad} end
+---
+...
+box.snapshot()
+---
+- ok
+...
+-- Join a replica. Check its files.
+test_run:cmd("create server replica with rpl_master=default, script='vinyl/replica_rejoin.lua'")
+---
+- true
+...
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000002.index
+  - 9000/0/00000000000000000002.run
+  - 9000/0/00000000000000000004.index
+  - 9000/0/00000000000000000004.run
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.space.test:delete(1)
+---
+...
+box.snapshot()
+---
+- ok
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Rebootstrap the replica. Check that old files are removed
+-- by garbage collector.
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000008.index
+  - 9000/0/00000000000000000008.run
+  - 9000/0/00000000000000000010.index
+  - 9000/0/00000000000000000010.run
+...
+box.space.test:count() -- 99
+---
+- 99
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+---
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.space.test:delete(2)
+---
+...
+box.snapshot()
+---
+- ok
+...
+box.cfg{checkpoint_count = checkpoint_count}
+---
+...
+-- Make the master fail join after sending data. Check that
+-- files written during failed rebootstrap attempt are removed
+-- by garbage collector.
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', true)
+---
+- ok
+...
+test_run:cmd("start server replica with crash_expected=True") -- fail
+---
+- false
+...
+test_run:cmd("start server replica with crash_expected=True") -- fail again
+---
+- false
+...
+test_run:cmd("start server replica with args='disable_replication'")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000008.index
+  - 9000/0/00000000000000000008.run
+  - 9000/0/00000000000000000010.index
+  - 9000/0/00000000000000000010.run
+...
+box.space.test:count() -- 99
+---
+- 99
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', false)
+---
+- ok
+...
+-- Rebootstrap after several failed attempts and make sure
+-- old files are removed.
+test_run:cmd("start server replica")
+---
+- true
+...
+test_run:cmd("switch replica")
+---
+- true
+...
+box.cfg{checkpoint_count = 1}
+---
+...
+box.snapshot()
+---
+- ok
+...
+fio = require('fio')
+---
+...
+fio.chdir(box.cfg.vinyl_dir)
+---
+- true
+...
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+---
+- - 9000/0/00000000000000000022.index
+  - 9000/0/00000000000000000022.run
+  - 9000/0/00000000000000000024.index
+  - 9000/0/00000000000000000024.run
+...
+box.space.test:count() -- 98
+---
+- 98
+...
+test_run:cmd("switch default")
+---
+- true
+...
+test_run:cmd("stop server replica")
+---
+- true
+...
+-- Cleanup.
+test_run:cmd("cleanup server replica")
+---
+- true
+...
+box.space.test:drop()
+---
+...
+box.schema.user.revoke('guest', 'replication')
+---
+...
diff --git a/test/vinyl/replica_rejoin.test.lua b/test/vinyl/replica_rejoin.test.lua
new file mode 100644
index 00000000..972b04e5
--- /dev/null
+++ b/test/vinyl/replica_rejoin.test.lua
@@ -0,0 +1,88 @@
+env = require('test_run')
+test_run = env.new()
+
+--
+-- gh-461: check that garbage collection works as expected
+-- after rebootstrap.
+--
+box.schema.user.grant('guest', 'replication')
+_ = box.schema.space.create('test', { id = 9000, engine = 'vinyl' })
+_ = box.space.test:create_index('pk')
+pad = string.rep('x', 15 * 1024)
+for i = 1, 100 do box.space.test:replace{i, pad} end
+box.snapshot()
+
+-- Join a replica. Check its files.
+test_run:cmd("create server replica with rpl_master=default, script='vinyl/replica_rejoin.lua'")
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+box.space.test:delete(1)
+box.snapshot()
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Rebootstrap the replica. Check that old files are removed
+-- by garbage collector.
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 99
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Invoke garbage collector on the master.
+test_run:cmd("restart server default")
+checkpoint_count = box.cfg.checkpoint_count
+box.cfg{checkpoint_count = 1}
+box.space.test:delete(2)
+box.snapshot()
+box.cfg{checkpoint_count = checkpoint_count}
+
+-- Make the master fail join after sending data. Check that
+-- files written during failed rebootstrap attempt are removed
+-- by garbage collector.
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', true)
+test_run:cmd("start server replica with crash_expected=True") -- fail
+test_run:cmd("start server replica with crash_expected=True") -- fail again
+test_run:cmd("start server replica with args='disable_replication'")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 99
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+box.error.injection.set('ERRINJ_RELAY_FINAL_JOIN', false)
+
+-- Rebootstrap after several failed attempts and make sure
+-- old files are removed.
+test_run:cmd("start server replica")
+test_run:cmd("switch replica")
+box.cfg{checkpoint_count = 1}
+box.snapshot()
+fio = require('fio')
+fio.chdir(box.cfg.vinyl_dir)
+fio.glob(fio.pathjoin(box.space.test.id, 0, '*'))
+box.space.test:count() -- 98
+test_run:cmd("switch default")
+test_run:cmd("stop server replica")
+
+-- Cleanup.
+test_run:cmd("cleanup server replica")
+box.space.test:drop()
+box.schema.user.revoke('guest', 'replication')
diff --git a/test/vinyl/suite.ini b/test/vinyl/suite.ini
index ca964289..b9dae380 100644
--- a/test/vinyl/suite.ini
+++ b/test/vinyl/suite.ini
@@ -2,7 +2,7 @@
 core = tarantool
 description = vinyl integration tests
 script = vinyl.lua
-release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua
+release_disabled = errinj.test.lua errinj_gc.test.lua errinj_vylog.test.lua partial_dump.test.lua quota_timeout.test.lua recovery_quota.test.lua replica_rejoin.test.lua
 config = suite.cfg
 lua_libs = suite.lua stress.lua large.lua txn_proxy.lua ../box/lua/utils.lua
 use_unix_sockets = True
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 01/11] recovery: clean up WAL dir scan code
  2018-07-14 20:49 ` [PATCH v3 01/11] recovery: clean up WAL dir scan code Vladimir Davydov
@ 2018-07-19  7:08   ` Konstantin Osipov
  0 siblings, 0 replies; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:08 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
>  - Remove extra scan of the WAL directory from local_recovery() - we
>    scan the directory in recovery_end_vclock() hence we can skip scan in
>    recover_remaining_wals() by passing scan_dir = false.
> 
>  - Rename recovery_end_vclock() to recovery_scan() to emphasize the fact
>    that this function scans the WAL directory. Write a comment to this
>    function.
> 
>  - Add comments to wal.c explaining why we scan the WAL directory there.

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 02/11] xrow: factor out function for decoding vclock
  2018-07-14 20:49 ` [PATCH v3 02/11] xrow: factor out function for decoding vclock Vladimir Davydov
@ 2018-07-19  7:08   ` Konstantin Osipov
  0 siblings, 0 replies; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:08 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> We will need it in other places.
> ---
>  src/box/xrow.c | 45 +++++++++++++++++++++------------------------
>  1 file changed, 21 insertions(+), 24 deletions(-)

Pushed.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command
  2018-07-14 20:49 ` [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command Vladimir Davydov
@ 2018-07-19  7:10   ` Konstantin Osipov
  2018-07-19  8:17     ` Vladimir Davydov
  2018-07-21 10:25   ` Vladimir Davydov
  1 sibling, 1 reply; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:10 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> The new command is supposed to supersede IPROTO_REQUEST_VOTE, which is
> difficult to extend, because it uses the global iproto key namespace.
> The new command returns a map (IPROTO_STATUS), to which we can add
> various information without polluting the global namespace. Currently,
> the map contains IPROTO_STATUS_IS_RO and IPROTO_STATUS_VCLOCK keys,
> but soon it will be added info needed for replica rebootstrap feature.

Would it be possible to move out struct status from xrow.h and
rename to box_status and move to box.h or some other place?
Perhaps we should remove it altogether, and simply pass multiple
members to xrow_decode_status()?

Other request member combos, such as authentication or call or eval have
not leaked into xrow.h before. 

Otherwise the patch is OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO
  2018-07-14 20:49 ` [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO Vladimir Davydov
@ 2018-07-19  7:10   ` Konstantin Osipov
  2018-07-21 12:07   ` Vladimir Davydov
  1 sibling, 0 replies; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:10 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> Not needed anymore as we now use IPROTO_REQUEST_STATUS instead of
> IPROTO_REQUEST_VOTE. Let's remove it altogether and reuse its code
> for IPROTO_STATUS (they are never decoded together so no conflict
> should happen). Worst that can happen is we choose a read-only master
> when bootstrapping an older version of tarantool.

OK to push.


-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 05/11] gc: keep track of vclocks instead of signatures
  2018-07-14 20:49 ` [PATCH v3 05/11] gc: keep track of vclocks instead of signatures Vladimir Davydov
@ 2018-07-19  7:11   ` Konstantin Osipov
  0 siblings, 0 replies; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:11 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> In order to check if a replica needs to be rebootstrapped, we need to
> know the vclock of the oldest WAL stored on the master, but the garbage
> collector works with signatures and hence can't report the vclock it was
> last called for. Actually, all gc users have a vclock and can pass it
> instead of signature so it's pretty easy to switch garbage collection
> infrastructure to vclock.

Pushed, with a small follow-up.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS
  2018-07-14 20:49 ` [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS Vladimir Davydov
@ 2018-07-19  7:12   ` Konstantin Osipov
  2018-07-21 12:07   ` Vladimir Davydov
  1 sibling, 0 replies; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:12 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> It will be used to check if a replica fell too much behind its peers and
> so needs to be rebootstrapped.

It is OK to push.

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind
  2018-07-14 20:49 ` [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
@ 2018-07-19  7:19   ` Konstantin Osipov
  2018-07-19 10:04     ` Vladimir Davydov
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-19  7:19 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:

> diff --git a/src/box/box.cc b/src/box/box.cc
> index b629a4d8..baf30fce 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1797,6 +1797,9 @@ bootstrap(const struct tt_uuid *instance_uuid,
>  /**
>   * Recover the instance from the local directory.
>   * Enter hot standby if the directory is locked.
> + * Invoke rebootstrap if the instance fell too much
> + * behind its peers in the replica set and needs
> + * to be rebootstrapped.
>   */
>  static void
>  local_recovery(const struct tt_uuid *instance_uuid,
> @@ -1832,6 +1835,12 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  	if (wal_dir_lock >= 0) {
>  		box_listen();
>  		box_sync_replication(replication_connect_timeout, false);
> +
> +		struct replica *master;
> +		if (replicaset_needs_rejoin(&master)) {
> +			say_info("replica is too old, initiating rejoin");
> +			return bootstrap_from_master(master);

This is say_crit() IMHO.

> +bool
> +replicaset_needs_rejoin(struct replica **master)
> +{
> +	replicaset_foreach(replica) {
> +		/*
> +		 * Rebootstrap this instance from a master if:
> +		 * - the oldest vclock stored on the master is greater
> +		 *   than or incomparable with the instance vclock
> +		 *   (so that the instance can't follow the master) and
> +		 * - the instance is strictly behind the master (so
> +		 *   that we won't lose any data by rebootstrapping
> +		 *   this instance)
> +		 */
> +		struct applier *applier = replica->applier;
> +		if (applier != NULL &&
> +		    vclock_compare(&applier->remote_status.gc_vclock,
> +				   &replicaset.vclock) > 0 &&
> +		    vclock_compare(&replicaset.vclock,
> +				   &applier->remote_status.vclock) < 0) {
> +			*master = replica;
> +			return true;

I'd love to see a bit more clarity in the log about this decision
making process. Imagine this function returns 'false' because
vclocks are incomparable and then replication breaks - it would be
very hard to diagnose why this happened. You could add some
logging to this function, but this would change its contract, since
currently this function has no side effects.

Should it set the diagnostics area in case of error? Log the
error? Return an extra status code? Please feel free to choose the
option you think is best.

Thank you for working on this,

-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command
  2018-07-19  7:10   ` Konstantin Osipov
@ 2018-07-19  8:17     ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-19  8:17 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Jul 19, 2018 at 10:10:31AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> > The new command is supposed to supersede IPROTO_REQUEST_VOTE, which is
> > difficult to extend, because it uses the global iproto key namespace.
> > The new command returns a map (IPROTO_STATUS), to which we can add
> > various information without polluting the global namespace. Currently,
> > the map contains IPROTO_STATUS_IS_RO and IPROTO_STATUS_VCLOCK keys,
> > but soon it will be added info needed for replica rebootstrap feature.
> 
> Would it be possible to move out struct status from xrow.h and
> rename to box_status and move to box.h or some other place?

Then xrow.c would have to include box.h ...

> Perhaps we should remove it altogether, and simply pass multiple
> members to xrow_decode_status()?

There are already three members in a reply to a status request and may
be more in future. Passing them all in function arguments would look
cumbersome. We definitely need to group them somehow.

> 
> Other request member combos, such as authentication or call or eval have
> not leaked into xrow.h before. 

greeting, auth_request, call_request, request - they are all defined in
xrow.h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind
  2018-07-19  7:19   ` Konstantin Osipov
@ 2018-07-19 10:04     ` Vladimir Davydov
  2018-07-23 20:19       ` Konstantin Osipov
  0 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-19 10:04 UTC (permalink / raw)
  To: Konstantin Osipov; +Cc: tarantool-patches

On Thu, Jul 19, 2018 at 10:19:03AM +0300, Konstantin Osipov wrote:
> * Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/14 23:50]:
> 
> > diff --git a/src/box/box.cc b/src/box/box.cc
> > index b629a4d8..baf30fce 100644
> > --- a/src/box/box.cc
> > +++ b/src/box/box.cc
> > @@ -1797,6 +1797,9 @@ bootstrap(const struct tt_uuid *instance_uuid,
> >  /**
> >   * Recover the instance from the local directory.
> >   * Enter hot standby if the directory is locked.
> > + * Invoke rebootstrap if the instance fell too much
> > + * behind its peers in the replica set and needs
> > + * to be rebootstrapped.
> >   */
> >  static void
> >  local_recovery(const struct tt_uuid *instance_uuid,
> > @@ -1832,6 +1835,12 @@ local_recovery(const struct tt_uuid *instance_uuid,
> >  	if (wal_dir_lock >= 0) {
> >  		box_listen();
> >  		box_sync_replication(replication_connect_timeout, false);
> > +
> > +		struct replica *master;
> > +		if (replicaset_needs_rejoin(&master)) {
> > +			say_info("replica is too old, initiating rejoin");
> > +			return bootstrap_from_master(master);
> 
> This is say_crit() IMHO.
> 
> > +bool
> > +replicaset_needs_rejoin(struct replica **master)
> > +{
> > +	replicaset_foreach(replica) {
> > +		/*
> > +		 * Rebootstrap this instance from a master if:
> > +		 * - the oldest vclock stored on the master is greater
> > +		 *   than or incomparable with the instance vclock
> > +		 *   (so that the instance can't follow the master) and
> > +		 * - the instance is strictly behind the master (so
> > +		 *   that we won't lose any data by rebootstrapping
> > +		 *   this instance)
> > +		 */
> > +		struct applier *applier = replica->applier;
> > +		if (applier != NULL &&
> > +		    vclock_compare(&applier->remote_status.gc_vclock,
> > +				   &replicaset.vclock) > 0 &&
> > +		    vclock_compare(&replicaset.vclock,
> > +				   &applier->remote_status.vclock) < 0) {
> > +			*master = replica;
> > +			return true;
> 
> I'd love to see a bit more clarity in the log about this decision
> making process. Imagine this function returns 'false' because
> vclocks are incomparable and then replication breaks - it would be
> very hard to diagnose why this happened. You could add some
> logging to this function, but this would change its contract, since
> currently this function has no side effects.
> 
> Should it set the diagnostics area in case of error? Log the
> error? Return an extra status code? Please feel free to choose the
> option you think is best.

What about this?

: 2018-07-19 12:53:45.195 [19455] main/101/replica I> can't follow [::1]:47383: required {1: 8} available {1: 12}
: 2018-07-19 12:53:45.195 [19455] main/101/replica C> replica is too old, initiating rebootstrap
: 2018-07-19 12:53:45.195 [19455] main/101/replica I> bootstrapping replica from [::1]:47383

and

: 2018-07-19 12:53:46.546 [19493] main/101/replica I> can't follow [::1]:47383: required {1: 17, 2: 1} available {1: 20}
: 2018-07-19 12:53:46.546 [19493] main/101/replica I> can't rebootstrap from [::1]:47383: replica has local rows: local {1: 17, 2: 1} remote {1: 23}
: 2018-07-19 12:53:46.546 [19493] main/101/replica I> recovery start

The diff is below. Note, apart from logging I also changed the logic
of the rebootstrap trigger a bit: now we proceed to rebootstrap only
if none of the configured masters has rows needed by the instance.

I haven't pushed the patch to the branch yet. If you're okay with it,
I'll squash it into the original patch, rebase the whole series, and
then update the remote branch.

diff --git a/src/box/box.cc b/src/box/box.cc
index baf30fce..7d752c59 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1838,7 +1838,7 @@ local_recovery(const struct tt_uuid *instance_uuid,
 
 		struct replica *master;
 		if (replicaset_needs_rejoin(&master)) {
-			say_info("replica is too old, initiating rejoin");
+			say_crit("replica is too old, initiating rebootstrap");
 			return bootstrap_from_master(master);
 		}
 	}
diff --git a/src/box/replication.cc b/src/box/replication.cc
index d61a984f..3ea4f578 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -41,6 +41,7 @@
 #include "error.h"
 #include "relay.h"
 #include "vclock.h" /* VCLOCK_MAX */
+#include "sio.h"
 
 uint32_t instance_id = REPLICA_ID_NIL;
 struct tt_uuid INSTANCE_UUID;
@@ -628,28 +629,58 @@ error:
 bool
 replicaset_needs_rejoin(struct replica **master)
 {
+	struct replica *leader = NULL;
 	replicaset_foreach(replica) {
-		/*
-		 * Rebootstrap this instance from a master if:
-		 * - the oldest vclock stored on the master is greater
-		 *   than or incomparable with the instance vclock
-		 *   (so that the instance can't follow the master) and
-		 * - the instance is strictly behind the master (so
-		 *   that we won't lose any data by rebootstrapping
-		 *   this instance)
-		 */
 		struct applier *applier = replica->applier;
-		if (applier != NULL &&
-		    vclock_compare(&applier->remote_status.gc_vclock,
-				   &replicaset.vclock) > 0 &&
-		    vclock_compare(&replicaset.vclock,
-				   &applier->remote_status.vclock) < 0) {
-			*master = replica;
-			return true;
+		if (applier == NULL)
+			continue;
+
+		const struct status *status = &applier->remote_status;
+		if (vclock_compare(&status->gc_vclock, &replicaset.vclock) < 0) {
+			/*
+			 * There's at least one master that still stores
+			 * WALs needed by this instance. Proceed to local
+			 * recovery.
+			 */
+			return false;
 		}
+
+		const char *addr_str = sio_strfaddr(&applier->addr,
+						applier->addr_len);
+		char *local_vclock_str = vclock_to_string(&replicaset.vclock);
+		char *remote_vclock_str = vclock_to_string(&status->vclock);
+		char *gc_vclock_str = vclock_to_string(&status->gc_vclock);
+
+		say_info("can't follow %s: required %s available %s",
+			 addr_str, local_vclock_str, gc_vclock_str);
+
+		if (vclock_compare(&replicaset.vclock, &status->vclock) > 0) {
+			/*
+			 * Replica has some rows that are not present on
+			 * the master. Don't rebootstrap as we don't want
+			 * to lose any data.
+			 */
+			say_info("can't rebootstrap from %s: "
+				 "replica has local rows: local %s remote %s",
+				 addr_str, local_vclock_str, remote_vclock_str);
+			goto next;
+		}
+
+		/* Prefer a master with the max vclock. */
+		if (leader == NULL ||
+		    vclock_sum(&applier->remote_status.vclock) >
+		    vclock_sum(&leader->applier->remote_status.vclock))
+			leader = replica;
+next:
+		free(local_vclock_str);
+		free(remote_vclock_str);
+		free(gc_vclock_str);
 	}
-	*master = NULL;
-	return false;
+	if (leader == NULL)
+		return false;
+
+	*master = leader;
+	return true;
 }
 
 void

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command
  2018-07-14 20:49 ` [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command Vladimir Davydov
  2018-07-19  7:10   ` Konstantin Osipov
@ 2018-07-21 10:25   ` Vladimir Davydov
  1 sibling, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-21 10:25 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

For the record, this patch was pushed to 1.10 by Kostja with
the following changes:

 - IPROTO_REQUEST_STATUS was renamed to IPROTO_VOTE
 - IPROTO_STATUS was renamed to IPROTO_BALLOT
 - struct status was renamed to struct ballot

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new
  2018-07-14 20:49 ` [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
@ 2018-07-21 11:12   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-21 11:12 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

For the record, this patch was pushed to 1.10 by Kostja.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 10/11] Update test-run
  2018-07-14 20:49 ` [PATCH v3 10/11] Update test-run Vladimir Davydov
@ 2018-07-21 11:13   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-21 11:13 UTC (permalink / raw)
  To: tarantool-patches; +Cc: kostja

For the record, this patch was pushed to 1.10 by Kostja.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO
  2018-07-14 20:49 ` [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO Vladimir Davydov
  2018-07-19  7:10   ` Konstantin Osipov
@ 2018-07-21 12:07   ` Vladimir Davydov
  1 sibling, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-21 12:07 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Rebased and pushed to 1.10

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS
  2018-07-14 20:49 ` [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS Vladimir Davydov
  2018-07-19  7:12   ` Konstantin Osipov
@ 2018-07-21 12:07   ` Vladimir Davydov
  1 sibling, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-21 12:07 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Rebased and pushed to 1.10

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind
  2018-07-19 10:04     ` Vladimir Davydov
@ 2018-07-23 20:19       ` Konstantin Osipov
  2018-07-27 16:13         ` [PATCH] replication: print master uuid when (re)bootstrapping Vladimir Davydov
  0 siblings, 1 reply; 31+ messages in thread
From: Konstantin Osipov @ 2018-07-23 20:19 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: tarantool-patches

* Vladimir Davydov <vdavydov.dev@gmail.com> [18/07/19 14:07]:
> 
> What about this?
> 
> : 2018-07-19 12:53:45.195 [19455] main/101/replica I> can't follow [::1]:47383: required {1: 8} available {1: 12}
> : 2018-07-19 12:53:45.195 [19455] main/101/replica C> replica is too old, initiating rebootstrap
> : 2018-07-19 12:53:45.195 [19455] main/101/replica I> bootstrapping replica from [::1]:47383

It would be better if we use uuid or server id as identifiers
(can't follow <guid> or rebootstrapping from <guid>), providing
URIs only as additinal information.

> : 2018-07-19 12:53:46.546 [19493] main/101/replica I> can't follow [::1]:47383: required {1: 17, 2: 1} available {1: 20}
> : 2018-07-19 12:53:46.546 [19493] main/101/replica I> can't rebootstrap from [::1]:47383: replica has local rows: local {1: 17, 2: 1} remote {1: 23}
> : 2018-07-19 12:53:46.546 [19493] main/101/replica I> recovery start
> 
> The diff is below. Note, apart from logging I also changed the logic
> of the rebootstrap trigger a bit: now we proceed to rebootstrap only
> if none of the configured masters has rows needed by the instance.

Ack.

> 
> I haven't pushed the patch to the branch yet. If you're okay with it,
> I'll squash it into the original patch, rebase the whole series, and
> then update the remote branch.

OK, the patch itself is OK.



-- 
Konstantin Osipov, Moscow, Russia, +7 903 626 22 32
http://tarantool.io - www.twitter.com/kostja_osipov

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] replication: print master uuid when (re)bootstrapping
  2018-07-23 20:19       ` Konstantin Osipov
@ 2018-07-27 16:13         ` Vladimir Davydov
  2018-07-31  8:34           ` Vladimir Davydov
  0 siblings, 1 reply; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-27 16:13 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Currently only the remote address is printed. Let's also print the UUID,
because replicas are identified by UUID everywhere in tarantool, not by
the address. An example of the output is below:

  I> can't follow eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: required {1: 8} available {1: 12}
  C> replica is too old, initiating rebootstrap
  I> bootstrapping replica from eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083

  I> can't follow eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: required {1: 17, 2: 1} available {1: 20}
  I> can't rebootstrap from eb81a67e-99ee-40bb-8601-99b03fa20124 at [::1]:58083: replica has local rows: local {1: 17, 2: 1} remote {1: 23}
  I> recovery start

Suggested by @kostja.

Follow-up ea69a0cd12d8 ("replication: rebootstrap instance on startup
if it fell behind").
---
https://github.com/tarantool/tarantool/commits/dv/print-replica-uuid-to-log
https://www.freelists.org/post/tarantool-patches/PATCH-v3-0711-replication-rebootstrap-instance-on-startup-if-it-fell-behind,3

 src/box/box.cc         |  3 ++-
 src/box/replication.cc | 10 ++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index ae4959d6..ee12d573 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1703,7 +1703,8 @@ bootstrap_from_master(struct replica *master)
 	applier_resume_to_state(applier, APPLIER_READY, TIMEOUT_INFINITY);
 	assert(applier->state == APPLIER_READY);
 
-	say_info("bootstrapping replica from %s",
+	say_info("bootstrapping replica from %s at %s",
+		 tt_uuid_str(&master->uuid),
 		 sio_strfaddr(&applier->addr, applier->addr_len));
 
 	/*
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 48956d2e..26bbbe32 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -659,14 +659,15 @@ replicaset_needs_rejoin(struct replica **master)
 			return false;
 		}
 
+		const char *uuid_str = tt_uuid_str(&replica->uuid);
 		const char *addr_str = sio_strfaddr(&applier->addr,
 						applier->addr_len);
 		char *local_vclock_str = vclock_to_string(&replicaset.vclock);
 		char *remote_vclock_str = vclock_to_string(&ballot->vclock);
 		char *gc_vclock_str = vclock_to_string(&ballot->gc_vclock);
 
-		say_info("can't follow %s: required %s available %s",
-			 addr_str, local_vclock_str, gc_vclock_str);
+		say_info("can't follow %s at %s: required %s available %s",
+			 uuid_str, addr_str, local_vclock_str, gc_vclock_str);
 
 		if (vclock_compare(&replicaset.vclock, &ballot->vclock) > 0) {
 			/*
@@ -674,9 +675,10 @@ replicaset_needs_rejoin(struct replica **master)
 			 * the master. Don't rebootstrap as we don't want
 			 * to lose any data.
 			 */
-			say_info("can't rebootstrap from %s: "
+			say_info("can't rebootstrap from %s at %s: "
 				 "replica has local rows: local %s remote %s",
-				 addr_str, local_vclock_str, remote_vclock_str);
+				 uuid_str, addr_str, local_vclock_str,
+				 remote_vclock_str);
 			goto next;
 		}
 
-- 
2.11.0

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 08/11] vinyl: simplify vylog recovery from backup
  2018-07-14 20:49 ` [PATCH v3 08/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
@ 2018-07-31  8:21   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-31  8:21 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Pushed to 1.10 by Kostja.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v3 11/11] vinyl: implement rebootstrap support
  2018-07-14 20:49 ` [PATCH v3 11/11] vinyl: implement rebootstrap support Vladimir Davydov
@ 2018-07-31  8:23   ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-31  8:23 UTC (permalink / raw)
  To: kostja, kyukhin; +Cc: tarantool-patches

Pushed to 1.10 by Kostja.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH] replication: print master uuid when (re)bootstrapping
  2018-07-27 16:13         ` [PATCH] replication: print master uuid when (re)bootstrapping Vladimir Davydov
@ 2018-07-31  8:34           ` Vladimir Davydov
  0 siblings, 0 replies; 31+ messages in thread
From: Vladimir Davydov @ 2018-07-31  8:34 UTC (permalink / raw)
  To: kostja; +Cc: tarantool-patches

Both the original patch and the follow-up have been pushed to 1.10
by Kostja.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2018-07-31  8:34 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-14 20:49 [PATCH v3 00/11] Replica rejoin Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 01/11] recovery: clean up WAL dir scan code Vladimir Davydov
2018-07-19  7:08   ` Konstantin Osipov
2018-07-14 20:49 ` [PATCH v3 02/11] xrow: factor out function for decoding vclock Vladimir Davydov
2018-07-19  7:08   ` Konstantin Osipov
2018-07-14 20:49 ` [PATCH v3 03/11] Introduce IPROTO_REQUEST_STATUS command Vladimir Davydov
2018-07-19  7:10   ` Konstantin Osipov
2018-07-19  8:17     ` Vladimir Davydov
2018-07-21 10:25   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 04/11] Get rid of IPROTO_SERVER_IS_RO Vladimir Davydov
2018-07-19  7:10   ` Konstantin Osipov
2018-07-21 12:07   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 05/11] gc: keep track of vclocks instead of signatures Vladimir Davydov
2018-07-19  7:11   ` Konstantin Osipov
2018-07-14 20:49 ` [PATCH v3 06/11] Include oldest vclock available on the instance in IPROTO_STATUS Vladimir Davydov
2018-07-19  7:12   ` Konstantin Osipov
2018-07-21 12:07   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 07/11] replication: rebootstrap instance on startup if it fell behind Vladimir Davydov
2018-07-19  7:19   ` Konstantin Osipov
2018-07-19 10:04     ` Vladimir Davydov
2018-07-23 20:19       ` Konstantin Osipov
2018-07-27 16:13         ` [PATCH] replication: print master uuid when (re)bootstrapping Vladimir Davydov
2018-07-31  8:34           ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 08/11] vinyl: simplify vylog recovery from backup Vladimir Davydov
2018-07-31  8:21   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 09/11] vinyl: pass flags to vy_recovery_new Vladimir Davydov
2018-07-21 11:12   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 10/11] Update test-run Vladimir Davydov
2018-07-21 11:13   ` Vladimir Davydov
2018-07-14 20:49 ` [PATCH v3 11/11] vinyl: implement rebootstrap support Vladimir Davydov
2018-07-31  8:23   ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox