Tarantool development patches archive
* [Tarantool-patches] [PATCH v4 00/11] Replication from memory
@ 2020-02-12  9:39 Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete Georgy Kirichenko
                   ` (10 more replies)
  0 siblings, 11 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

This is a complete redesign of the previous version of the
feature. The first five patches are refactoring that makes the
corresponding facilities - recovery, coio and xstream -
C-compliant. Two minor changes are split out into their own
patches in order to facilitate review.

The sixth patch extracts xlog batch writing into a separate
routine, which also helps with further review.

The matrix clock is a structure that maintains a set of vclocks
and is used to build an n-majority vclock (a vclock each
component of which is greater than or equal to the corresponding
components of n of the contained vclocks). This is used to
determine a vclock read by all replicas (0-majority) or a vclock
applied by n replicas in the case of synchronous replication.
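
For illustration, a minimal sketch of the n-majority idea as a
per-component order statistic (hypothetical names; the real
struct mclock added in src/box/mclock.c by this series is more
elaborate):

#include <stdint.h>
#include <stdlib.h>

enum { VCLOCK_MAX = 32 };

struct vclock_sketch {
	int64_t lsn[VCLOCK_MAX];
};

static int
cmp_lsn(const void *a, const void *b)
{
	int64_t l = *(const int64_t *)a, r = *(const int64_t *)b;
	return l < r ? -1 : l > r;
}

/*
 * Build a majority vclock from a matrix of per-replica vclocks
 * by taking an order statistic of every column. With `count`
 * vclocks, rank 0 picks the per-component minimum, i.e. a
 * vclock already reached by every replica. The caller is
 * expected to pass 0 <= rank < count.
 */
static void
majority_vclock(const struct vclock_sketch *rows, int count,
		int rank, struct vclock_sketch *out)
{
	for (int i = 0; i < VCLOCK_MAX; i++) {
		int64_t column[count];
		for (int r = 0; r < count; r++)
			column[r] = rows[r].lsn[i];
		qsort(column, count, sizeof(column[0]), cmp_lsn);
		out->lsn[i] = column[rank];
	}
}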

The matrix clock lets wal track relay vclocks and collect
garbage without the tx thread, which is implemented in the next
patch.
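
Continuing the sketch above with hypothetical names, wal could
then derive a garbage-collection boundary from relay
acknowledgements on its own:

static int64_t
vclock_sketch_sum(const struct vclock_sketch *v)
{
	int64_t sum = 0;
	for (int i = 0; i < VCLOCK_MAX; i++)
		sum += v->lsn[i];
	return sum;
}

/*
 * Each relay ack updates one row of the matrix; rank 0 then
 * yields the vclock every replica has read, so wal can compute
 * a GC boundary without asking the tx thread. Anything with a
 * smaller signature is collectible.
 */
static int64_t
wal_gc_signature(struct vclock_sketch *matrix, int count,
		 int replica_id, const struct vclock_sketch *ack)
{
	matrix[replica_id] = *ack;
	struct vclock_sketch gc;
	majority_vclock(matrix, count, 0, &gc);
	return vclock_sketch_sum(&gc);
}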

An xrow buffer is an in-memory data structure that places
encoded xrow data as well as the corresponding xrow headers into
rotating memory buffers. Its main purpose is to let a
transaction live in memory for some time even after the
transaction is finalized. Encoded xrow data is stored in obufs,
whereas headers are stored in arrays. This approach makes it
possible to inspect an xrow header (replica id, lsn and group)
without decoding the blob data, as recovery does now.
Additionally it is possible to scan the xrow headers and build a
large range of already encoded data in order to send it with one
call (not implemented yet).
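
A rough sketch of that layout (illustrative field names,
assuming Tarantool's struct obuf and struct xrow_header; the
actual declarations are in the xrow_buf.h added by this series):

#include "small/obuf.h"
#include "xrow.h"

enum { XROW_BUF_CHUNK_COUNT = 8 };

struct xrow_buf_chunk_sketch {
	/* Headers kept aside: scannable without decoding blobs. */
	struct xrow_header *headers;
	size_t row_count;
	/* Encoded row bodies, appended to an output buffer. */
	struct obuf data;
};

struct xrow_buf_sketch {
	/* Chunks are reused in a circle; old rows rotate out,
	 * which bounds the memory kept after finalization. */
	struct xrow_buf_chunk_sketch chunks[XROW_BUF_CHUNK_COUNT];
	size_t first_chunk_index; /* oldest chunk still alive */
	size_t last_chunk_index;  /* chunk receiving new rows */
};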

The tenth patch refactors wal so that rows go through the xrow
buffer before any actual write.

The last patch implements in-memory replication. From now on a
relay lives in the wal thread (which is inevitable in the case
of synchronous replication) as a pair of fibers - a writer and a
reader. The reader has the same mission as before - to read and
process the replica status vclock. The writer fetches rows from
the wal xrow buffer and then transmits them to a replica. If wal
memory does not contain the required rows, the writer fiber
spawns a cord which reads the logs from files. The relay also
provides a special filter function, used by the writer to
implement the previous relaying logic (row skipping, nops); a
sketch of the writer loop follows.
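
The shape of that writer loop, as a hedged sketch - every helper
name below is a hypothetical stand-in for code in this series,
not the actual relay.cc API:

#include <stdbool.h>

/* Hypothetical stand-ins for the interfaces this series adds. */
struct relay;
struct xrow_header;
struct xrow_buf_cursor { void *pos; };

int relay_mem_cursor_open(struct relay *r, struct xrow_buf_cursor *c);
int relay_mem_cursor_next(struct xrow_buf_cursor *c,
			  struct xrow_header **row);
int relay_recover_from_file(struct relay *r);
bool relay_filter_row(struct relay *r, struct xrow_header *row);
int relay_send(struct relay *r, struct xrow_header *row);
bool fiber_is_cancelled(void);

/*
 * Writer fiber: stream rows from the wal xrow buffer while it
 * still covers the replica's position; once the needed rows
 * have rotated out of memory, fall back to reading xlog files,
 * then retry the in-memory path.
 */
static int
relay_writer_f(struct relay *relay)
{
	while (!fiber_is_cancelled()) {
		struct xrow_buf_cursor cursor;
		if (relay_mem_cursor_open(relay, &cursor) != 0) {
			if (relay_recover_from_file(relay) != 0)
				return -1;
			continue;
		}
		struct xrow_header *row;
		while (relay_mem_cursor_next(&cursor, &row) == 0) {
			/* The filter keeps the old relay logic:
			 * row skipping, nop replacement. */
			if (!relay_filter_row(relay, row))
				continue;
			if (relay_send(relay, row) != 0)
				return -1;
		}
	}
	return 0;
}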

Branch:
https://github.com/tarantool/tarantool/tree/g.kirichenko/gh-3794-memory-replication
Issue: https://github.com/tarantool/tarantool/issues/3794

Georgy Kirichenko (11):
  recovery: do not call recovery_stop_local inside recovery_delete
  recovery: do not throw an error
  coio: do not allow parallel usage of coio
  coio: do not throw an error, minor refactoring
  xstream: get rid of an exception
  wal: extract log write batch into a separate routine
  wal: matrix clock structure
  wal: track relay vclock and collect logs in wal thread
  wal: xrow memory buffer and cursor
  wal: use a xrow buffer object for entry encoding
  replication: use wal memory buffer to fetch rows

 src/box/CMakeLists.txt                        |   6 +-
 src/box/applier.cc                            |  49 +-
 src/box/box.cc                                |  81 +-
 src/box/gc.c                                  | 216 ++---
 src/box/gc.h                                  |  95 +-
 src/box/lua/info.c                            |  33 +-
 src/box/mclock.c                              | 374 ++++++++
 src/box/mclock.h                              | 125 +++
 src/box/recovery.cc                           | 100 ++-
 src/box/recovery.h                            |  14 +-
 src/box/relay.cc                              | 649 ++++----------
 src/box/relay.h                               |   6 +-
 src/box/replication.cc                        |  37 +-
 src/box/wal.c                                 | 829 ++++++++++++++++--
 src/box/wal.h                                 |  97 +-
 src/box/xlog.c                                |  57 +-
 src/box/xlog.h                                |  14 +
 src/box/xrow_buf.c                            | 374 ++++++++
 src/box/xrow_buf.h                            | 197 +++++
 src/box/xrow_io.cc                            |  59 +-
 src/box/xrow_io.h                             |  11 +-
 src/box/xstream.cc                            |  44 -
 src/box/xstream.h                             |   9 +-
 src/lib/core/coio.cc                          | 534 ++++++-----
 src/lib/core/coio.h                           |  19 +-
 src/lib/core/coio_buf.h                       |   8 +
 src/lib/core/errinj.h                         |   1 +
 test/box-py/iproto.test.py                    |   9 +-
 test/box/errinj.result                        | 134 +--
 test/replication/force_recovery.result        |   8 +
 test/replication/force_recovery.test.lua      |   2 +
 test/replication/gc_no_space.result           |  30 +-
 test/replication/gc_no_space.test.lua         |  12 +-
 test/replication/replica_rejoin.result        |   8 +
 test/replication/replica_rejoin.test.lua      |   2 +
 .../show_error_on_disconnect.result           |   8 +
 .../show_error_on_disconnect.test.lua         |   2 +
 test/replication/suite.ini                    |   2 +-
 test/unit/CMakeLists.txt                      |   2 +
 test/unit/mclock.result                       |  18 +
 test/unit/mclock.test.c                       | 160 ++++
 test/xlog/panic_on_wal_error.result           |  12 +
 test/xlog/panic_on_wal_error.test.lua         |   3 +
 test/xlog/suite.ini                           |   2 +-
 44 files changed, 3063 insertions(+), 1389 deletions(-)
 create mode 100644 src/box/mclock.c
 create mode 100644 src/box/mclock.h
 create mode 100644 src/box/xrow_buf.c
 create mode 100644 src/box/xrow_buf.h
 delete mode 100644 src/box/xstream.cc
 create mode 100644 test/unit/mclock.result
 create mode 100644 test/unit/mclock.test.c

-- 
2.25.0


* [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-03-19  7:55   ` Konstantin Osipov
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error Georgy Kirichenko
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

recovery_stop_local() raises an exception in case of a recovery
error, so it is not safe to stop recovery from recovery_delete()
and from the guard inside local_recovery(). Call
recovery_stop_local() manually instead.

Part of #980
---
 src/box/box.cc      | 4 +++-
 src/box/recovery.cc | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 1b2b27d61..68038df18 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2238,8 +2238,10 @@ local_recovery(const struct tt_uuid *instance_uuid,
 		recovery_follow_local(recovery, &wal_stream.base, "hot_standby",
 				      cfg_getd("wal_dir_rescan_delay"));
 		while (true) {
-			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock))
+			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock)) {
+				recovery_stop_local(recovery);
 				diag_raise();
+			}
 			if (wal_dir_lock >= 0)
 				break;
 			fiber_sleep(0.1);
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index 64aa467b1..a1ac2d967 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -216,7 +216,7 @@ gap_error:
 void
 recovery_delete(struct recovery *r)
 {
-	recovery_stop_local(r);
+	assert(r->watcher == NULL);
 
 	trigger_destroy(&r->on_close_log);
 	xdir_destroy(&r->wal_dir);
-- 
2.25.0


* [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-03-19  7:56   ` Konstantin Osipov
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio Georgy Kirichenko
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Relaying from the C-written wal requires recovery to be
C-compliant, so get rid of exceptions in the recovery interface.

Part of #980
---
 src/box/box.cc      | 19 ++++++++--
 src/box/recovery.cc | 89 ++++++++++++++++++++++++++-------------------
 src/box/recovery.h  | 14 +++----
 src/box/relay.cc    | 15 ++++----
 4 files changed, 82 insertions(+), 55 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 68038df18..611100b8b 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -2166,6 +2166,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	recovery = recovery_new(cfg_gets("wal_dir"),
 				cfg_geti("force_recovery"),
 				checkpoint_vclock);
+	if (recovery == NULL)
+		diag_raise();
 
 	/*
 	 * Make sure we report the actual recovery position
@@ -2183,7 +2185,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	 * so we must reflect this in replicaset vclock to
 	 * not attempt to apply these rows twice.
 	 */
-	recovery_scan(recovery, &replicaset.vclock, &gc.vclock);
+	if (recovery_scan(recovery, &replicaset.vclock, &gc.vclock) != 0)
+		diag_raise();
 	say_info("instance vclock %s", vclock_to_string(&replicaset.vclock));
 
 	if (wal_dir_lock >= 0) {
@@ -2226,7 +2229,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	memtx_engine_recover_snapshot_xc(memtx, checkpoint_vclock);
 
 	engine_begin_final_recovery_xc();
-	recover_remaining_wals(recovery, &wal_stream.base, NULL, false);
+	if (recover_remaining_wals(recovery, &wal_stream.base, NULL, false) != 0)
+		diag_raise();
 	engine_end_recovery_xc();
 	/*
 	 * Leave hot standby mode, if any, only after
@@ -2239,6 +2243,10 @@ local_recovery(const struct tt_uuid *instance_uuid,
 				      cfg_getd("wal_dir_rescan_delay"));
 		while (true) {
 			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock)) {
+				/*
+				 * Let recovery_stop_local override
+				 * a path_lock error.
+				 */
 				recovery_stop_local(recovery);
 				diag_raise();
 			}
@@ -2246,8 +2254,11 @@ local_recovery(const struct tt_uuid *instance_uuid,
 				break;
 			fiber_sleep(0.1);
 		}
-		recovery_stop_local(recovery);
-		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
+		if (recovery_stop_local(recovery) != 0)
+			diag_raise();
+		if (recover_remaining_wals(recovery, &wal_stream.base, NULL,
+					   true) != 0)
+			diag_raise();
 		/*
 		 * Advance replica set vclock to reflect records
 		 * applied in hot standby mode.
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index a1ac2d967..e4aad1296 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -87,14 +87,11 @@ recovery_new(const char *wal_dirname, bool force_recovery,
 			calloc(1, sizeof(*r));
 
 	if (r == NULL) {
-		tnt_raise(OutOfMemory, sizeof(*r), "malloc",
-			  "struct recovery");
+		diag_set(OutOfMemory, sizeof(*r), "malloc",
+			 "struct recovery");
+		return NULL;
 	}
 
-	auto guard = make_scoped_guard([=]{
-		free(r);
-	});
-
 	xdir_create(&r->wal_dir, wal_dirname, XLOG, &INSTANCE_UUID,
 		    &xlog_opts_default);
 	r->wal_dir.force_recovery = force_recovery;
@@ -108,27 +105,31 @@ recovery_new(const char *wal_dirname, bool force_recovery,
 	 * UUID, see replication/cluster.test for
 	 * details.
 	 */
-	xdir_check_xc(&r->wal_dir);
+	if (xdir_check(&r->wal_dir) != 0) {
+		xdir_destroy(&r->wal_dir);
+		free(r);
+		return NULL;
+	}
 
 	r->watcher = NULL;
 	rlist_create(&r->on_close_log);
 
-	guard.is_active = false;
 	return r;
 }
 
-void
+int
 recovery_scan(struct recovery *r, struct vclock *end_vclock,
 	      struct vclock *gc_vclock)
 {
-	xdir_scan_xc(&r->wal_dir);
+	if (xdir_scan(&r->wal_dir) != 0)
+		return -1;
 
 	if (xdir_last_vclock(&r->wal_dir, end_vclock) < 0 ||
 	    vclock_compare(end_vclock, &r->vclock) < 0) {
 		/* No xlogs after last checkpoint. */
 		vclock_copy(gc_vclock, &r->vclock);
 		vclock_copy(end_vclock, &r->vclock);
-		return;
+		return 0;
 	}
 
 	if (xdir_first_vclock(&r->wal_dir, gc_vclock) < 0)
@@ -137,11 +138,12 @@ recovery_scan(struct recovery *r, struct vclock *end_vclock,
 	/* Scan the last xlog to find end vclock. */
 	struct xlog_cursor cursor;
 	if (xdir_open_cursor(&r->wal_dir, vclock_sum(end_vclock), &cursor) != 0)
-		return;
+		return 0;
 	struct xrow_header row;
 	while (xlog_cursor_next(&cursor, &row, true) == 0)
 		vclock_follow_xrow(end_vclock, &row);
 	xlog_cursor_close(&cursor, false);
+	return 0;
 }
 
 static inline void
@@ -156,19 +158,21 @@ recovery_close_log(struct recovery *r)
 			 r->cursor.name);
 	}
 	xlog_cursor_close(&r->cursor, false);
-	trigger_run_xc(&r->on_close_log, NULL);
+	/* Suppress a trigger error, if any. */
+	trigger_run(&r->on_close_log, NULL);
 }
 
-static void
+static int
 recovery_open_log(struct recovery *r, const struct vclock *vclock)
 {
-	XlogGapError *e;
 	struct xlog_meta meta = r->cursor.meta;
 	enum xlog_cursor_state state = r->cursor.state;
 
 	recovery_close_log(r);
 
-	xdir_open_cursor_xc(&r->wal_dir, vclock_sum(vclock), &r->cursor);
+	if (xdir_open_cursor(&r->wal_dir, vclock_sum(vclock),
+			     &r->cursor) != 0)
+		return -1;
 
 	if (state == XLOG_CURSOR_NEW &&
 	    vclock_compare(vclock, &r->vclock) > 0) {
@@ -201,14 +205,14 @@ out:
 	 */
 	if (vclock_compare(&r->vclock, vclock) < 0)
 		vclock_copy(&r->vclock, vclock);
-	return;
+	return 0;
 
 gap_error:
-	e = tnt_error(XlogGapError, &r->vclock, vclock);
+	diag_set(XlogGapError, &r->vclock, vclock);
 	if (!r->wal_dir.force_recovery)
-		throw e;
+		return -1;
 	/* Ignore missing WALs if force_recovery is set. */
-	e->log();
+	diag_log();
 	say_warn("ignoring a gap in LSN");
 	goto out;
 }
@@ -217,7 +221,6 @@ void
 recovery_delete(struct recovery *r)
 {
 	assert(r->watcher == NULL);
-
 	trigger_destroy(&r->on_close_log);
 	xdir_destroy(&r->wal_dir);
 	if (xlog_cursor_is_open(&r->cursor)) {
@@ -237,25 +240,26 @@ recovery_delete(struct recovery *r)
  * The reading will be stopped on reaching stop_vclock.
  * Use NULL for boundless recover
  */
-static void
+static int
 recover_xlog(struct recovery *r, struct xstream *stream,
 	     const struct vclock *stop_vclock)
 {
 	struct xrow_header row;
 	uint64_t row_count = 0;
-	while (xlog_cursor_next_xc(&r->cursor, &row,
-				   r->wal_dir.force_recovery) == 0) {
+	int rc;
+	while ((rc = xlog_cursor_next(&r->cursor, &row,
+				      r->wal_dir.force_recovery)) == 0) {
 		/*
 		 * Read the next row from xlog file.
 		 *
-		 * xlog_cursor_next_xc() returns 1 when
+		 * xlog_cursor_next() returns 1 when
 		 * it can not read more rows. This doesn't mean
 		 * the file is fully read: it's fully read only
 		 * when EOF marker has been read, see i.eof_read
 		 */
 		if (stop_vclock != NULL &&
 		    r->vclock.signature >= stop_vclock->signature)
-			return;
+			return 0;
 		int64_t current_lsn = vclock_get(&r->vclock, row.replica_id);
 		if (row.lsn <= current_lsn)
 			continue; /* already applied, skip */
@@ -282,13 +286,16 @@ recover_xlog(struct recovery *r, struct xstream *stream,
 					 row_count / 1000000.);
 		} else {
 			if (!r->wal_dir.force_recovery)
-				diag_raise();
+				return -1;
 
 			say_error("skipping row {%u: %lld}",
 				  (unsigned)row.replica_id, (long long)row.lsn);
 			diag_log();
 		}
 	}
+	if (rc < 0)
+		return -1;
+	return 0;
 }
 
 /**
@@ -302,14 +309,14 @@ recover_xlog(struct recovery *r, struct xstream *stream,
  * This function will not close r->current_wal if
  * recovery was successful.
  */
-void
+int
 recover_remaining_wals(struct recovery *r, struct xstream *stream,
 		       const struct vclock *stop_vclock, bool scan_dir)
 {
 	struct vclock *clock;
 
-	if (scan_dir)
-		xdir_scan_xc(&r->wal_dir);
+	if (scan_dir && xdir_scan(&r->wal_dir) != 0)
+		return -1;
 
 	if (xlog_cursor_is_open(&r->cursor)) {
 		/* If there's a WAL open, recover from it first. */
@@ -343,21 +350,26 @@ recover_remaining_wals(struct recovery *r, struct xstream *stream,
 			continue;
 		}
 
-		recovery_open_log(r, clock);
+		if (recovery_open_log(r, clock) != 0)
+			return -1;
 
 		say_info("recover from `%s'", r->cursor.name);
 
 recover_current_wal:
-		recover_xlog(r, stream, stop_vclock);
+		if (recover_xlog(r, stream, stop_vclock) != 0)
+			return -1;
 	}
 
 	if (xlog_cursor_is_eof(&r->cursor))
 		recovery_close_log(r);
 
-	if (stop_vclock != NULL && vclock_compare(&r->vclock, stop_vclock) != 0)
-		tnt_raise(XlogGapError, &r->vclock, stop_vclock);
+	if (stop_vclock != NULL && vclock_compare(&r->vclock, stop_vclock) != 0) {
+		diag_set(XlogGapError, &r->vclock, stop_vclock);
+		return -1;
+	}
 
 	region_free(&fiber()->gc);
+	return 0;
 }
 
 void
@@ -481,7 +493,9 @@ hot_standby_f(va_list ap)
 		do {
 			start = vclock_sum(&r->vclock);
 
-			recover_remaining_wals(r, stream, NULL, scan_dir);
+			if (recover_remaining_wals(r, stream, NULL,
+						   scan_dir) != 0)
+				diag_raise();
 
 			end = vclock_sum(&r->vclock);
 			/*
@@ -529,7 +543,7 @@ recovery_follow_local(struct recovery *r, struct xstream *stream,
 	fiber_start(r->watcher, r, stream, wal_dir_rescan_delay);
 }
 
-void
+int
 recovery_stop_local(struct recovery *r)
 {
 	if (r->watcher) {
@@ -537,8 +551,9 @@ recovery_stop_local(struct recovery *r)
 		r->watcher = NULL;
 		fiber_cancel(f);
 		if (fiber_join(f) != 0)
-			diag_raise();
+			return -1;
 	}
+	return 0;
 }
 
 /* }}} */
diff --git a/src/box/recovery.h b/src/box/recovery.h
index 6e68abc0b..145d9199e 100644
--- a/src/box/recovery.h
+++ b/src/box/recovery.h
@@ -74,7 +74,7 @@ recovery_delete(struct recovery *r);
  * @gc_vclock is set to the oldest vclock available in the
  * WAL directory.
  */
-void
+int
 recovery_scan(struct recovery *r,  struct vclock *end_vclock,
 	      struct vclock *gc_vclock);
 
@@ -82,16 +82,12 @@ void
 recovery_follow_local(struct recovery *r, struct xstream *stream,
 		      const char *name, ev_tstamp wal_dir_rescan_delay);
 
-void
+int
 recovery_stop_local(struct recovery *r);
 
 void
 recovery_finalize(struct recovery *r);
 
-#if defined(__cplusplus)
-} /* extern "C" */
-#endif /* defined(__cplusplus) */
-
 /**
  * Find out if there are new .xlog files since the current
  * vclock, and read them all up.
@@ -102,8 +98,12 @@ recovery_finalize(struct recovery *r);
  * This function will not close r->current_wal if
  * recovery was successful.
  */
-void
+int
 recover_remaining_wals(struct recovery *r, struct xstream *stream,
 		       const struct vclock *stop_vclock, bool scan_dir);
 
+#if defined(__cplusplus)
+} /* extern "C" */
+#endif /* defined(__cplusplus) */
+
 #endif /* TARANTOOL_RECOVERY_H_INCLUDED */
diff --git a/src/box/relay.cc b/src/box/relay.cc
index b89632273..d5a1c9c68 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -334,8 +334,9 @@ relay_final_join_f(va_list ap)
 
 	/* Send all WALs until stop_vclock */
 	assert(relay->stream.write != NULL);
-	recover_remaining_wals(relay->r, &relay->stream,
-			       &relay->stop_vclock, true);
+	if (recover_remaining_wals(relay->r, &relay->stream,
+				   &relay->stop_vclock, true) != 0)
+		diag_raise();
 	assert(vclock_compare(&relay->r->vclock, &relay->stop_vclock) == 0);
 	return 0;
 }
@@ -491,11 +492,9 @@ relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
 		 */
 		return;
 	}
-	try {
-		recover_remaining_wals(relay->r, &relay->stream, NULL,
-				       (events & WAL_EVENT_ROTATE) != 0);
-	} catch (Exception *e) {
-		relay_set_error(relay, e);
+	if (recover_remaining_wals(relay->r, &relay->stream, NULL,
+				   (events & WAL_EVENT_ROTATE) != 0) != 0) {
+		relay_set_error(relay, diag_last_error(diag_get()));
 		fiber_cancel(fiber());
 	}
 }
@@ -702,6 +701,8 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 	vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);
 	relay->r = recovery_new(cfg_gets("wal_dir"), false,
 			        replica_clock);
+	if (relay->r == NULL)
+		diag_raise();
 	vclock_copy(&relay->tx.vclock, replica_clock);
 	relay->version_id = replica_version_id;
 
-- 
2.25.0


* [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-03-19 18:09   ` Konstantin Osipov
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring Georgy Kirichenko
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Simultaneous usage of one coio object from two or more fibers
could lead to undefined behavior, as the coio routines replace
the awaiting fiber (a data member) and stop the watcher without
checking whether the coio object has any other users. Such
behavior could lead to the applier invalid stream issue #4040.
The proposal is to forbid reuse of an active coio by returning a
fake EINPROGRESS error.

Part of #980
---
 src/lib/core/coio.cc | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/lib/core/coio.cc b/src/lib/core/coio.cc
index e88d724d5..faa7e5bd5 100644
--- a/src/lib/core/coio.cc
+++ b/src/lib/core/coio.cc
@@ -238,6 +238,17 @@ coio_connect_timeout(struct ev_io *coio, struct uri *uri, struct sockaddr *addr,
 	tnt_raise(SocketError, sio_socketname(coio->fd), "connection failed");
 }
 
+/* Do not allow reuse of a coio by a different fiber. */
+static inline void
+check_coio_in_use(struct ev_io *coio)
+{
+	if (ev_is_active(coio)) {
+		errno = EINPROGRESS;
+		tnt_raise(SocketError, sio_socketname(coio->fd),
+			  "already in use");
+	}
+}
+
 /**
  * Wait a client connection on a server socket until
  * timedout.
@@ -249,6 +260,7 @@ coio_accept(struct ev_io *coio, struct sockaddr *addr,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	while (true) {
@@ -302,6 +314,7 @@ coio_read_ahead_timeout(struct ev_io *coio, void *buf, size_t sz,
 
 	ssize_t to_read = (ssize_t) sz;
 
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	while (true) {
@@ -399,6 +412,7 @@ coio_write_timeout(struct ev_io *coio, const void *buf, size_t sz,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	while (true) {
@@ -461,6 +475,7 @@ coio_writev_timeout(struct ev_io *coio, struct iovec *iov, int iovcnt,
 	struct iovec *end = iov + iovcnt;
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	/* Avoid a syscall in case of 0 iovcnt. */
@@ -518,6 +533,7 @@ coio_sendto_timeout(struct ev_io *coio, const void *buf, size_t sz, int flags,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	while (true) {
@@ -563,6 +579,7 @@ coio_recvfrom_timeout(struct ev_io *coio, void *buf, size_t sz, int flags,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
+	check_coio_in_use(coio);
 	CoioGuard coio_guard(coio);
 
 	while (true) {
-- 
2.25.0


* [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (2 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-03-23  6:59   ` Konstantin Osipov
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 05/11] xstream: get rid of an exception Georgy Kirichenko
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Relaying from the C-written wal requires coio and xrow_io to be
C-compliant, so get rid of exceptions in the coio interface.
This patch also includes some minor refactoring (the code would
look ugly without it):
 1. Get rid of the unused size_hint from coio_writev_timeout.
 2. Handle a partial read/write before entering the yield loop.
 3. Do not reset errno to 0 when EOF is read.

Part of #980
---
 src/box/applier.cc      |  49 ++--
 src/box/box.cc          |   9 +-
 src/box/relay.cc        |  11 +-
 src/box/xrow_io.cc      |  59 ++---
 src/box/xrow_io.h       |  11 +-
 src/lib/core/coio.cc    | 535 ++++++++++++++++++++++++----------------
 src/lib/core/coio.h     |  19 +-
 src/lib/core/coio_buf.h |   8 +
 8 files changed, 413 insertions(+), 288 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index ae3d281a5..ad427707a 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -173,8 +173,9 @@ applier_writer_f(va_list ap)
 			continue;
 		try {
 			struct xrow_header xrow;
-			xrow_encode_vclock(&xrow, &replicaset.vclock);
-			coio_write_xrow(&io, &xrow);
+			if (xrow_encode_vclock(&xrow, &replicaset.vclock) != 0 ||
+			    coio_write_xrow(&io, &xrow) < 0)
+				diag_raise();
 		} catch (SocketError *e) {
 			/*
 			 * There is no point trying to send ACKs if
@@ -308,9 +309,11 @@ applier_connect(struct applier *applier)
 	 */
 	applier->addr_len = sizeof(applier->addrstorage);
 	applier_set_state(applier, APPLIER_CONNECT);
-	coio_connect(coio, uri, &applier->addr, &applier->addr_len);
+	if (coio_connect(coio, uri, &applier->addr, &applier->addr_len) != 0)
+		diag_raise();
 	assert(coio->fd >= 0);
-	coio_readn(coio, greetingbuf, IPROTO_GREETING_SIZE);
+	if (coio_readn(coio, greetingbuf, IPROTO_GREETING_SIZE) < 0)
+		diag_raise();
 	applier->last_row_time = ev_monotonic_now(loop());
 
 	/* Decode instance version and name from greeting */
@@ -345,8 +348,9 @@ applier_connect(struct applier *applier)
 	 * election on bootstrap.
 	 */
 	xrow_encode_vote(&row);
-	coio_write_xrow(coio, &row);
-	coio_read_xrow(coio, ibuf, &row);
+	if (coio_write_xrow(coio, &row) < 0 ||
+	    coio_read_xrow(coio, ibuf, &row) < 0)
+		diag_raise();
 	if (row.type == IPROTO_OK) {
 		xrow_decode_ballot_xc(&row, &applier->ballot);
 	} else try {
@@ -376,8 +380,9 @@ applier_connect(struct applier *applier)
 			    uri->login_len,
 			    uri->password != NULL ? uri->password : "",
 			    uri->password_len);
-	coio_write_xrow(coio, &row);
-	coio_read_xrow(coio, ibuf, &row);
+	if (coio_write_xrow(coio, &row) < 0 ||
+	    coio_read_xrow(coio, ibuf, &row) < 0)
+		diag_raise();
 	applier->last_row_time = ev_monotonic_now(loop());
 	if (row.type != IPROTO_OK)
 		xrow_decode_error_xc(&row); /* auth failed */
@@ -401,7 +406,8 @@ applier_wait_snapshot(struct applier *applier)
 	 */
 	if (applier->version_id >= version_id(1, 7, 0)) {
 		/* Decode JOIN/FETCH_SNAPSHOT response */
-		coio_read_xrow(coio, ibuf, &row);
+		if (coio_read_xrow(coio, ibuf, &row) < 0)
+			diag_raise();
 		if (iproto_type_is_error(row.type)) {
 			xrow_decode_error_xc(&row); /* re-throw error */
 		} else if (row.type != IPROTO_OK) {
@@ -422,7 +428,8 @@ applier_wait_snapshot(struct applier *applier)
 	 */
 	uint64_t row_count = 0;
 	while (true) {
-		coio_read_xrow(coio, ibuf, &row);
+		if (coio_read_xrow(coio, ibuf, &row) < 0)
+			diag_raise();
 		applier->last_row_time = ev_monotonic_now(loop());
 		if (iproto_type_is_dml(row.type)) {
 			if (apply_snapshot_row(&row) != 0)
@@ -488,7 +495,8 @@ applier_wait_register(struct applier *applier, uint64_t row_count)
 	 * Receive final data.
 	 */
 	while (true) {
-		coio_read_xrow(coio, ibuf, &row);
+		if (coio_read_xrow(coio, ibuf, &row) < 0)
+			diag_raise();
 		applier->last_row_time = ev_monotonic_now(loop());
 		if (iproto_type_is_dml(row.type)) {
 			vclock_follow_xrow(&replicaset.vclock, &row);
@@ -605,10 +613,13 @@ applier_read_tx_row(struct applier *applier)
 	 * from the master for quite a while the connection is
 	 * broken - the master might just be idle.
 	 */
-	if (applier->version_id < version_id(1, 7, 7))
-		coio_read_xrow(coio, ibuf, row);
-	else
-		coio_read_xrow_timeout_xc(coio, ibuf, row, timeout);
+	if (applier->version_id < version_id(1, 7, 7)) {
+		if (coio_read_xrow(coio, ibuf, row) < 0)
+			diag_raise();
+	} else {
+		if (coio_read_xrow_timeout(coio, ibuf, row, timeout) < 0)
+			diag_raise();
+	}
 
 	applier->lag = ev_now(loop()) - row->tm;
 	applier->last_row_time = ev_monotonic_now(loop());
@@ -868,11 +879,13 @@ applier_subscribe(struct applier *applier)
 	vclock_copy(&vclock, &replicaset.vclock);
 	xrow_encode_subscribe_xc(&row, &REPLICASET_UUID, &INSTANCE_UUID,
 				 &vclock, replication_anon);
-	coio_write_xrow(coio, &row);
+	if (coio_write_xrow(coio, &row) < 0)
+		diag_raise();
 
 	/* Read SUBSCRIBE response */
 	if (applier->version_id >= version_id(1, 6, 7)) {
-		coio_read_xrow(coio, ibuf, &row);
+		if (coio_read_xrow(coio, ibuf, &row) < 0)
+			diag_raise();
 		if (iproto_type_is_error(row.type)) {
 			xrow_decode_error_xc(&row);  /* error */
 		} else if (row.type != IPROTO_OK) {
@@ -1009,7 +1022,7 @@ applier_disconnect(struct applier *applier, enum applier_state state)
 		applier->writer = NULL;
 	}
 
-	coio_close(loop(), &applier->io);
+	coio_destroy(loop(), &applier->io);
 	/* Clear all unparsed input. */
 	ibuf_reinit(&applier->ibuf);
 	fiber_gc();
diff --git a/src/box/box.cc b/src/box/box.cc
index 611100b8b..ca1696383 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1741,7 +1741,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	struct xrow_header row;
 	xrow_encode_vclock_xc(&row, &stop_vclock);
 	row.sync = header->sync;
-	coio_write_xrow(io, &row);
+	if (coio_write_xrow(io, &row) < 0)
+		diag_raise();
 
 	/*
 	 * Final stage: feed replica with WALs in range
@@ -1753,7 +1754,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	/* Send end of WAL stream marker */
 	xrow_encode_vclock_xc(&row, &replicaset.vclock);
 	row.sync = header->sync;
-	coio_write_xrow(io, &row);
+	if (coio_write_xrow(io, &row) < 0)
+		diag_raise();
 
 	/*
 	 * Advance the WAL consumer state to the position where
@@ -1845,7 +1847,8 @@ box_process_subscribe(struct ev_io *io, struct xrow_header *header)
 	assert(self != NULL); /* the local registration is read-only */
 	row.replica_id = self->id;
 	row.sync = header->sync;
-	coio_write_xrow(io, &row);
+	if (coio_write_xrow(io, &row) < 0)
+		diag_raise();
 
 	say_info("subscribed replica %s at %s",
 		 tt_uuid_str(&replica_uuid), sio_socketname(io->fd));
diff --git a/src/box/relay.cc b/src/box/relay.cc
index d5a1c9c68..bb7761b99 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -317,7 +317,8 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
 	struct xrow_header row;
 	xrow_encode_vclock_xc(&row, vclock);
 	row.sync = sync;
-	coio_write_xrow(&relay->io, &row);
+	if (coio_write_xrow(&relay->io, &row) < 0)
+		diag_raise();
 
 	/* Send read view to the replica. */
 	engine_join_xc(&ctx, &relay->stream);
@@ -516,8 +517,9 @@ relay_reader_f(va_list ap)
 	try {
 		while (!fiber_is_cancelled()) {
 			struct xrow_header xrow;
-			coio_read_xrow_timeout_xc(&io, &ibuf, &xrow,
-					replication_disconnect_timeout());
+			if (coio_read_xrow_timeout(&io, &ibuf, &xrow,
+					replication_disconnect_timeout()) < 0)
+				diag_raise();
 			/* vclock is followed while decoding, zeroing it. */
 			vclock_create(&relay->recv_vclock);
 			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
@@ -721,7 +723,8 @@ relay_send(struct relay *relay, struct xrow_header *packet)
 
 	packet->sync = relay->sync;
 	relay->last_row_time = ev_monotonic_now(loop());
-	coio_write_xrow(&relay->io, packet);
+	if (coio_write_xrow(&relay->io, packet) < 0)
+		diag_raise();
 	fiber_gc();
 
 	struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
diff --git a/src/box/xrow_io.cc b/src/box/xrow_io.cc
index 48707982b..4e79cd2f0 100644
--- a/src/box/xrow_io.cc
+++ b/src/box/xrow_io.cc
@@ -35,71 +35,74 @@
 #include "error.h"
 #include "msgpuck/msgpuck.h"
 
-void
+ssize_t
 coio_read_xrow(struct ev_io *coio, struct ibuf *in, struct xrow_header *row)
 {
 	/* Read fixed header */
-	if (ibuf_used(in) < 1)
-		coio_breadn(coio, in, 1);
+	if (ibuf_used(in) < 1 && coio_breadn(coio, in, 1) < 0)
+		return -1;
 
 	/* Read length */
 	if (mp_typeof(*in->rpos) != MP_UINT) {
-		tnt_raise(ClientError, ER_INVALID_MSGPACK,
-			  "packet length");
+		diag_set(ClientError, ER_INVALID_MSGPACK,
+			 "packet length");
+		return -1;
 	}
 	ssize_t to_read = mp_check_uint(in->rpos, in->wpos);
-	if (to_read > 0)
-		coio_breadn(coio, in, to_read);
+	if (to_read > 0 && coio_breadn(coio, in, to_read) < 0)
+		return -1;
 
 	uint32_t len = mp_decode_uint((const char **) &in->rpos);
 
 	/* Read header and body */
 	to_read = len - ibuf_used(in);
-	if (to_read > 0)
-		coio_breadn(coio, in, to_read);
+	if (to_read > 0 && coio_breadn(coio, in, to_read) < 0)
+		return -1;
 
-	xrow_header_decode_xc(row, (const char **) &in->rpos, in->rpos + len,
-			      true);
+	return xrow_header_decode(row, (const char **) &in->rpos, in->rpos + len,
+				  true);
 }
 
-void
-coio_read_xrow_timeout_xc(struct ev_io *coio, struct ibuf *in,
-			  struct xrow_header *row, ev_tstamp timeout)
+ssize_t
+coio_read_xrow_timeout(struct ev_io *coio, struct ibuf *in,
+		       struct xrow_header *row, ev_tstamp timeout)
 {
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 	/* Read fixed header */
-	if (ibuf_used(in) < 1)
-		coio_breadn_timeout(coio, in, 1, delay);
+	if (ibuf_used(in) < 1 && coio_breadn_timeout(coio, in, 1, delay) < 0)
+		return -1;
 	coio_timeout_update(&start, &delay);
 
 	/* Read length */
 	if (mp_typeof(*in->rpos) != MP_UINT) {
-		tnt_raise(ClientError, ER_INVALID_MSGPACK,
-			  "packet length");
+		diag_set(ClientError, ER_INVALID_MSGPACK,
+			 "packet length");
+		return -1;
 	}
 	ssize_t to_read = mp_check_uint(in->rpos, in->wpos);
-	if (to_read > 0)
-		coio_breadn_timeout(coio, in, to_read, delay);
+	if (to_read > 0 && coio_breadn_timeout(coio, in, to_read, delay) < 0)
+		return -1;
 	coio_timeout_update(&start, &delay);
 
 	uint32_t len = mp_decode_uint((const char **) &in->rpos);
 
 	/* Read header and body */
 	to_read = len - ibuf_used(in);
-	if (to_read > 0)
-		coio_breadn_timeout(coio, in, to_read, delay);
+	if (to_read > 0 && coio_breadn_timeout(coio, in, to_read, delay) < 0)
+		return -1;
 
-	xrow_header_decode_xc(row, (const char **) &in->rpos, in->rpos + len,
-			      true);
+	return xrow_header_decode(row, (const char **) &in->rpos, in->rpos + len,
+				  true);
 }
 
-
-void
+ssize_t
 coio_write_xrow(struct ev_io *coio, const struct xrow_header *row)
 {
 	struct iovec iov[XROW_IOVMAX];
-	int iovcnt = xrow_to_iovec_xc(row, iov);
-	coio_writev(coio, iov, iovcnt, 0);
+	int iovcnt = xrow_to_iovec(row, iov);
+	if (iovcnt < 0)
+		return -1;
+	return coio_writev(coio, iov, iovcnt);
 }
 
diff --git a/src/box/xrow_io.h b/src/box/xrow_io.h
index 0eb7a8ace..96c5047b7 100644
--- a/src/box/xrow_io.h
+++ b/src/box/xrow_io.h
@@ -30,6 +30,7 @@
  * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
+#include "unistd.h"
 #if defined(__cplusplus)
 extern "C" {
 #endif
@@ -38,14 +39,14 @@ struct ev_io;
 struct ibuf;
 struct xrow_header;
 
-void
+ssize_t
 coio_read_xrow(struct ev_io *coio, struct ibuf *in, struct xrow_header *row);
 
-void
-coio_read_xrow_timeout_xc(struct ev_io *coio, struct ibuf *in,
-			  struct xrow_header *row, double timeout);
+ssize_t
+coio_read_xrow_timeout(struct ev_io *coio, struct ibuf *in,
+		       struct xrow_header *row, double timeout);
 
-void
+ssize_t
 coio_write_xrow(struct ev_io *coio, const struct xrow_header *row);
 
 
diff --git a/src/lib/core/coio.cc b/src/lib/core/coio.cc
index faa7e5bd5..8ae6930de 100644
--- a/src/lib/core/coio.cc
+++ b/src/lib/core/coio.cc
@@ -41,12 +41,6 @@
 #include "scoped_guard.h"
 #include "coio_task.h" /* coio_resolve() */
 
-struct CoioGuard {
-	struct ev_io *ev_io;
-	CoioGuard(struct ev_io *arg) :ev_io(arg) {}
-	~CoioGuard() { ev_io_stop(loop(), ev_io); }
-};
-
 typedef void (*ev_stat_cb)(ev_loop *, ev_stat *, int);
 
 /** Note: this function does not throw */
@@ -65,12 +59,14 @@ coio_fiber_yield_timeout(struct ev_io *coio, ev_tstamp delay)
 	coio->data = fiber();
 	bool is_timedout = fiber_yield_timeout(delay);
 	coio->data = NULL;
+	if (is_timedout)
+		diag_set(TimedOut);
 	return is_timedout;
 }
 
 /**
  * Connect to a host with a specified timeout.
- * @retval -1 timeout
+ * @retval -1 error or timeout
  * @retval 0 connected
  */
 static int
@@ -79,36 +75,46 @@ coio_connect_addr(struct ev_io *coio, struct sockaddr *addr,
 {
 	ev_loop *loop = loop();
 	if (evio_socket(coio, addr->sa_family, SOCK_STREAM, 0) != 0)
-		diag_raise();
-	auto coio_guard = make_scoped_guard([=]{ evio_close(loop, coio); });
-	if (sio_connect(coio->fd, addr, len) == 0) {
-		coio_guard.is_active = false;
+		return -1;
+	if (sio_connect(coio->fd, addr, len) == 0)
 		return 0;
+	if (errno != EINPROGRESS) {
+		evio_close(loop, coio);
+		return -1;
 	}
-	if (errno != EINPROGRESS)
-		diag_raise();
 	/*
 	 * Wait until socket is ready for writing or
 	 * timed out.
 	 */
 	ev_io_set(coio, coio->fd, EV_WRITE);
 	ev_io_start(loop, coio);
-	bool is_timedout = coio_fiber_yield_timeout(coio, timeout);
+	bool is_timedout;
+	is_timedout = coio_fiber_yield_timeout(coio, timeout);
 	ev_io_stop(loop, coio);
-	fiber_testcancel();
-	if (is_timedout)
-		tnt_raise(TimedOut);
+	if (fiber_is_cancelled()) {
+		diag_set(FiberIsCancelled);
+		evio_close(loop, coio);
+		return -1;
+	}
+	if (is_timedout) {
+		evio_close(loop, coio);
+		return -1;
+	}
 	int error = EINPROGRESS;
 	socklen_t sz = sizeof(error);
 	if (sio_getsockopt(coio->fd, SOL_SOCKET, SO_ERROR,
-		       &error, &sz))
-		diag_raise();
+			   &error, &sz)) {
+		evio_close(loop, coio);
+		return -1;
+	}
 	if (error != 0) {
 		errno = error;
-		tnt_raise(SocketError, sio_socketname(coio->fd), "connect");
+		diag_set(SocketError, sio_socketname(coio->fd), "connect");
+		evio_close(loop, coio);
+		return -1;
 	}
-	coio_guard.is_active = false;
 	return 0;
+
 }
 
 void
@@ -152,7 +158,7 @@ coio_fill_addrinfo(struct addrinfo *ai_local, const char *host,
  * This function also supports UNIX domain sockets if uri->path is not NULL and
  * uri->service is NULL.
  *
- * @retval -1 timeout
+ * @retval -1 error or timeout
  * @retval 0 connected
  */
 int
@@ -201,52 +207,55 @@ coio_connect_timeout(struct ev_io *coio, struct uri *uri, struct sockaddr *addr,
 	    hints.ai_flags = AI_ADDRCONFIG|AI_NUMERICSERV|AI_PASSIVE;
 	    hints.ai_protocol = 0;
 	    int rc = coio_getaddrinfo(host, service, &hints, &ai, delay);
-	    if (rc != 0) {
-		    diag_raise();
-		    panic("unspecified getaddrinfo error");
-	    }
+	    if (rc != 0)
+			return -1;
 	}
-	auto addrinfo_guard = make_scoped_guard([=] {
-		if (!uri->host_hint) freeaddrinfo(ai);
-		else free(ai_local.ai_addr);
-	});
+	struct addrinfo *first_ai = ai;
 	evio_timeout_update(loop(), &start, &delay);
 
 	coio_timeout_init(&start, &delay, timeout);
 	assert(! evio_has_fd(coio));
-	while (ai) {
-		try {
-			if (coio_connect_addr(coio, ai->ai_addr,
-					      ai->ai_addrlen, delay))
-				return -1;
+	while (ai && delay >= 0) {
+		if (coio_connect_addr(coio, ai->ai_addr,
+				      ai->ai_addrlen, delay) == 0) {
 			if (addr != NULL) {
 				assert(addr_len != NULL);
 				*addr_len = MIN(ai->ai_addrlen, *addr_len);
 				memcpy(addr, ai->ai_addr, *addr_len);
 			}
+			if (!uri->host_hint)
+				freeaddrinfo(first_ai);
+			else
+				free(ai_local.ai_addr);
 			return 0; /* connected */
-		} catch (SocketError *e) {
-			if (ai->ai_next == NULL)
-				throw;
-			/* ignore exception and try the next address */
 		}
 		ai = ai->ai_next;
 		ev_now_update(loop);
 		coio_timeout_update(&start, &delay);
 	}
 
-	tnt_raise(SocketError, sio_socketname(coio->fd), "connection failed");
+	/* Set an error if not timed out. */
+	if (delay >= 0)
+		diag_set(SocketError, sio_socketname(coio->fd),
+			 "connection failed");
+	if (!uri->host_hint)
+		freeaddrinfo(first_ai);
+	else
+		free(ai_local.ai_addr);
+	return -1;
 }
 
 /* Do not allow reuse of a coio by a different fiber. */
-static inline void
+static inline int
 check_coio_in_use(struct ev_io *coio)
 {
 	if (ev_is_active(coio)) {
 		errno = EINPROGRESS;
-		tnt_raise(SocketError, sio_socketname(coio->fd),
-			  "already in use");
+		diag_set(SocketError, sio_socketname(coio->fd),
+			 "already in use");
+		return -1;
 	}
+	return 0;
 }
 
 /**
@@ -260,45 +269,61 @@ coio_accept(struct ev_io *coio, struct sockaddr *addr,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
-
-	while (true) {
-		/* Assume that there are waiting clients
-		 * available */
-		int fd = sio_accept(coio->fd, addr, &addrlen);
-		if (fd >= 0) {
-			if (evio_setsockopt_client(fd, addr->sa_family,
-						   SOCK_STREAM) != 0) {
-				close(fd);
-				diag_raise();
-			}
-			return fd;
-		}
-		if (! sio_wouldblock(errno))
-			diag_raise();
-		/* The socket is not ready, yield */
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_READ);
-			ev_io_start(loop(), coio);
+	if (check_coio_in_use(coio) != 0)
+		return -1;
+
+	/* Assume that there are waiting clients available */
+	int fd = sio_accept(coio->fd, addr, &addrlen);
+
+	if (fd >= 0) {
+		if (evio_setsockopt_client(fd, addr->sa_family,
+					   SOCK_STREAM) != 0) {
+			close(fd);
+			return -1;
 		}
+		return fd;
+	}
+
+	if (!sio_wouldblock(errno))
+		return -1;
+
+	/* The socket is not ready, yield */
+	ev_io_set(coio, coio->fd, EV_READ);
+	ev_io_start(loop(), coio);
+
+	do {
 		/*
 		 * Yield control to other fibers until the
 		 * timeout is reached.
 		 */
 		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
-		fiber_testcancel();
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
+		fd = sio_accept(coio->fd, addr, &addrlen);
+	} while (fd < 0 && sio_wouldblock(errno));
+
+	ev_io_stop(loop(), coio);
+
+	if (fd >= 0) {
+		if (evio_setsockopt_client(fd, addr->sa_family,
+					   SOCK_STREAM) != 0) {
+			close(fd);
+			return -1;
+		}
+		return fd;
 	}
+	return -1;
 }
 
 /**
  * Read at least sz bytes from socket with readahead.
  *
- * In case of EOF returns the amount read until eof (possibly 0),
- * and sets errno to 0.
+ * In case of EOF returns the amount read until eof (possibly 0).
  * Can read up to bufsiz bytes.
  *
  * @retval the number of bytes read.
@@ -313,46 +338,68 @@ coio_read_ahead_timeout(struct ev_io *coio, void *buf, size_t sz,
 	coio_timeout_init(&start, &delay, timeout);
 
 	ssize_t to_read = (ssize_t) sz;
+	if (to_read <= 0)
+		return 0;
 
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
+	if (check_coio_in_use(coio) != 0)
+		return -1;
 
-	while (true) {
-		/*
-		 * Sic: assume the socket is ready: since
-		 * the user called read(), some data must
-		 * be expected.
-		 */
-		ssize_t nrd = sio_read(coio->fd, buf, bufsiz);
-		if (nrd > 0) {
-			to_read -= nrd;
-			if (to_read <= 0)
-				return sz - to_read;
-			buf = (char *) buf + nrd;
-			bufsiz -= nrd;
-		} else if (nrd == 0) {
-			errno = 0;
-			return sz - to_read;
-		} else if (! sio_wouldblock(errno)) {
-			diag_raise();
-		}
+	ssize_t nrd;
+	/*
+	 * Sic: assume the socket is ready: since
+	 * the user called read(), some data must
+	 * be expected.
+	 */
+	do {
+		nrd = sio_read(coio->fd, buf, bufsiz);
+		if (nrd <= 0)
+			break;
+		to_read -= nrd;
+		buf = (char *) buf + nrd;
+		bufsiz -= nrd;
+	} while (to_read > 0);
+
+	if (nrd >= 0)
+		return sz - to_read;
+
+	if (!sio_wouldblock(errno))
+		return -1;
 
-		/* The socket is not ready, yield */
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_READ);
-			ev_io_start(loop(), coio);
-		}
+	/* The socket is not ready, yield */
+	ev_io_set(coio, coio->fd, EV_READ);
+	ev_io_start(loop(), coio);
+
+	do {
 		/*
 		 * Yield control to other fibers until the
 		 * timeout is being reached.
 		 */
-		bool is_timedout = coio_fiber_yield_timeout(coio,
-							    delay);
-		fiber_testcancel();
+		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
-	}
+		nrd = sio_read(coio->fd, buf, bufsiz);
+		if (nrd == 0)
+			break;
+		if (nrd < 0) {
+			if (sio_wouldblock(errno))
+				continue;
+			break;
+		}
+		to_read -= nrd;
+		buf = (char *) buf + nrd;
+		bufsiz -= nrd;
+	} while (to_read > 0);
+
+	ev_io_stop(loop(), coio);
+
+	if (nrd < 0)
+		return -1;
+	return sz - to_read;
 }
 
 /**
@@ -366,10 +413,13 @@ ssize_t
 coio_readn_ahead(struct ev_io *coio, void *buf, size_t sz, size_t bufsiz)
 {
 	ssize_t nrd = coio_read_ahead(coio, buf, sz, bufsiz);
+	if (nrd < 0)
+		return -1;
 	if (nrd < (ssize_t)sz) {
 		errno = EPIPE;
-		tnt_raise(SocketError, sio_socketname(coio->fd),
-			  "unexpected EOF when reading from socket");
+		diag_set(SocketError, sio_socketname(coio->fd),
+			 "unexpected EOF when reading from socket");
+		return -1;
 	}
 	return nrd;
 }
@@ -386,10 +436,13 @@ coio_readn_ahead_timeout(struct ev_io *coio, void *buf, size_t sz, size_t bufsiz
 		         ev_tstamp timeout)
 {
 	ssize_t nrd = coio_read_ahead_timeout(coio, buf, sz, bufsiz, timeout);
-	if (nrd < (ssize_t)sz && errno == 0) { /* EOF. */
+	if (nrd < 0)
+		return -1;
+	if (nrd < (ssize_t)sz) { /* EOF. */
 		errno = EPIPE;
-		tnt_raise(SocketError, sio_socketname(coio->fd),
-			  "unexpected EOF when reading from socket");
+		diag_set(SocketError, sio_socketname(coio->fd),
+			 "unexpected EOF when reading from socket");
+		return -1;
 	}
 	return nrd;
 }
@@ -412,43 +465,62 @@ coio_write_timeout(struct ev_io *coio, const void *buf, size_t sz,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
+	if (towrite <= 0)
+		return 0;
+	if (check_coio_in_use(coio) != 0)
+		return -1;
+
+	ssize_t nwr;
+	/*
+	 * Sic: write as much data as possible,
+	 * assuming the socket is ready.
+	 */
+	do {
+		nwr = sio_write(coio->fd, buf, towrite);
+		if (nwr < 0)
+			break;
+		towrite -= nwr;
+		buf = (char *) buf + nwr;
+	} while (towrite > 0);
+
+	if (nwr > 0)
+		return sz;
+
+	if (!sio_wouldblock(errno))
+		return -1;
 
-	while (true) {
-		/*
-		 * Sic: write as much data as possible,
-		 * assuming the socket is ready.
-		 */
-		ssize_t nwr = sio_write(coio->fd, buf, towrite);
-		if (nwr > 0) {
-			/* Go past the data just written. */
-			if (nwr >= towrite)
-				return sz;
-			towrite -= nwr;
-			buf = (char *) buf + nwr;
-		} else if (nwr < 0 && !sio_wouldblock(errno)) {
-			diag_raise();
-		}
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_WRITE);
-			ev_io_start(loop(), coio);
-		}
-		/* Yield control to other fibers. */
-		fiber_testcancel();
+	ev_io_set(coio, coio->fd, EV_WRITE);
+	ev_io_start(loop(), coio);
+
+	do {
 		/*
 		 * Yield control to other fibers until the
 		 * timeout is reached or the socket is
 		 * ready.
 		 */
-		bool is_timedout = coio_fiber_yield_timeout(coio,
-							    delay);
-		fiber_testcancel();
-
+		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
-	}
+		nwr = sio_write(coio->fd, buf, towrite);
+		if (nwr < 0) {
+			if (sio_wouldblock(errno))
+				continue;
+			break;
+		}
+		towrite -= nwr;
+		buf = (char *) buf + nwr;
+	} while (towrite > 0);
+
+	ev_io_stop(loop(), coio);
+
+	if (nwr < 0)
+		return -1;
+	return sz;
 }
 
 /*
@@ -456,66 +528,83 @@ coio_write_timeout(struct ev_io *coio, const void *buf, size_t sz,
  * Put in an own function to workaround gcc bug with @finally
  */
 static inline ssize_t
-coio_flush(int fd, struct iovec *iov, ssize_t offset, int iovcnt)
+coio_flush(int fd, struct iovec **iov, size_t *offset, int iovcnt)
 {
-	sio_add_to_iov(iov, -offset);
-	ssize_t nwr = sio_writev(fd, iov, iovcnt);
-	sio_add_to_iov(iov, offset);
-	if (nwr < 0 && ! sio_wouldblock(errno))
-		diag_raise();
+	sio_add_to_iov(*iov, -*offset);
+	ssize_t nwr = sio_writev(fd, *iov, iovcnt);
+	sio_add_to_iov(*iov, *offset);
+	if (nwr < 0 && !sio_wouldblock(errno))
+		return -1;
+	if (nwr < 0)
+		return 0;
+	/* A successful write: adjust iov and offset. */
+	*iov += sio_move_iov(*iov, nwr, offset);
 	return nwr;
 }
 
 ssize_t
 coio_writev_timeout(struct ev_io *coio, struct iovec *iov, int iovcnt,
-		    size_t size_hint, ev_tstamp timeout)
+		    ev_tstamp timeout)
 {
 	size_t total = 0;
 	size_t iov_len = 0;
 	struct iovec *end = iov + iovcnt;
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
 
+	if (iovcnt == 0)
+		return 0;
+
+	if (check_coio_in_use(coio) != 0)
+		return -1;
+
+	ssize_t nwr;
 	/* Avoid a syscall in case of 0 iovcnt. */
-	while (iov < end) {
-		/* Write as much data as possible. */
-		ssize_t nwr = coio_flush(coio->fd, iov, iov_len,
-					 end - iov);
-		if (nwr >= 0) {
-			total += nwr;
-			/*
-			 * If there was a hint for the total size
-			 * of the vector, use it.
-			 */
-			if (size_hint > 0 && size_hint == total)
-				break;
-
-			iov += sio_move_iov(iov, nwr, &iov_len);
-			if (iov == end) {
-				assert(iov_len == 0);
-				break;
-			}
-		}
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_WRITE);
-			ev_io_start(loop(), coio);
-		}
-		/* Yield control to other fibers. */
-		fiber_testcancel();
+	do {
+		nwr = coio_flush(coio->fd, &iov, &iov_len, end - iov);
+		if (nwr < 0)
+			break;
+		total += nwr;
+
+	} while (iov < end);
+
+	assert(nwr < 0 || iov_len == 0);
+	if (nwr > 0)
+		return total;
+
+	if (!sio_wouldblock(errno))
+		return -1;
+
+	ev_io_set(coio, coio->fd, EV_WRITE);
+	ev_io_start(loop(), coio);
+	do {
 		/*
 		 * Yield control to other fibers until the
 		 * timeout is reached or the socket is
 		 * ready.
 		 */
 		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
-		fiber_testcancel();
-
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
-	}
+		nwr = coio_flush(coio->fd, &iov, &iov_len, end - iov);
+		if (nwr < 0) {
+			if (sio_wouldblock(errno))
+				continue;
+			break;
+		}
+		total += nwr;
+	} while (iov < end);
+
+	ev_io_stop(loop(), coio);
+
+	if (nwr < 0)
+		return -1;
+	assert(iov_len == 0);
 	return total;
 }
 
@@ -533,36 +622,40 @@ coio_sendto_timeout(struct ev_io *coio, const void *buf, size_t sz, int flags,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
+	if (check_coio_in_use(coio) != 0)
+		return -1;
 
-	while (true) {
-		/*
-		 * Sic: write as much data as possible,
-		 * assuming the socket is ready.
-		 */
-		ssize_t nwr = sio_sendto(coio->fd, buf, sz,
-					 flags, dest_addr, addrlen);
-		if (nwr > 0)
-			return nwr;
-		if (nwr < 0 && ! sio_wouldblock(errno))
-			diag_raise();
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_WRITE);
-			ev_io_start(loop(), coio);
-		}
+	/*
+	 * Sic: write as much data as possible,
+	 * assuming the socket is ready.
+	 */
+	ssize_t nwr = sio_sendto(coio->fd, buf, sz, flags, dest_addr, addrlen);
+	if (nwr > 0 || !sio_wouldblock(errno))
+		return nwr;
+
+	ev_io_set(coio, coio->fd, EV_WRITE);
+	ev_io_start(loop(), coio);
+	do {
 		/*
 		 * Yield control to other fibers until
 		 * timeout is reached or the socket is
 		 * ready.
 		 */
-		bool is_timedout = coio_fiber_yield_timeout(coio,
-							    delay);
-		fiber_testcancel();
+		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
-	}
+		nwr = sio_sendto(coio->fd, buf, sz, flags, dest_addr,
+				 addrlen);
+	} while (nwr < 0 && sio_wouldblock(errno));
+
+	ev_io_stop(loop(), coio);
+
+	return nwr;
 }
 
 /**
@@ -579,36 +672,41 @@ coio_recvfrom_timeout(struct ev_io *coio, void *buf, size_t sz, int flags,
 	ev_tstamp start, delay;
 	coio_timeout_init(&start, &delay, timeout);
 
-	check_coio_in_use(coio);
-	CoioGuard coio_guard(coio);
+	if (check_coio_in_use(coio) != 0)
+		return -1;
 
-	while (true) {
-		/*
-		 * Read as much data as possible,
-		 * assuming the socket is ready.
-		 */
-		ssize_t nrd = sio_recvfrom(coio->fd, buf, sz, flags,
-					   src_addr, &addrlen);
-		if (nrd >= 0)
-			return nrd;
-		if (! sio_wouldblock(errno))
-			diag_raise();
-		if (! ev_is_active(coio)) {
-			ev_io_set(coio, coio->fd, EV_READ);
-			ev_io_start(loop(), coio);
-		}
+	/*
+	 * Read as much data as possible,
+	 * assuming the socket is ready.
+	 */
+	ssize_t nrd = sio_recvfrom(coio->fd, buf, sz, flags,
+				   src_addr, &addrlen);
+	if (nrd >= 0 || !sio_wouldblock(errno))
+		return nrd;
+
+	ev_io_set(coio, coio->fd, EV_READ);
+	ev_io_start(loop(), coio);
+	do {
 		/*
 		 * Yield control to other fibers until
 		 * timeout is reached or the socket is
 		 * ready.
 		 */
-		bool is_timedout = coio_fiber_yield_timeout(coio,
-							    delay);
-		fiber_testcancel();
+		bool is_timedout = coio_fiber_yield_timeout(coio, delay);
+		if (fiber_is_cancelled()) {
+			diag_set(FiberIsCancelled);
+			break;
+		}
 		if (is_timedout)
-			tnt_raise(TimedOut);
+			break;
 		coio_timeout_update(&start, &delay);
-	}
+		nrd = sio_recvfrom(coio->fd, buf, sz, flags,
+				   src_addr, &addrlen);
+	} while (nrd < 0 && sio_wouldblock(errno));
+
+	ev_io_stop(loop(), coio);
+
+	return nrd;
 }
 
 static int
@@ -655,12 +753,13 @@ coio_service_init(struct coio_service *service, const char *name,
 	service->handler_param = handler_param;
 }
 
-void
+int
 coio_service_start(struct evio_service *service, const char *uri)
 {
 	if (evio_service_bind(service, uri) != 0 ||
 	    evio_service_listen(service) != 0)
-		diag_raise();
+		return -1;
+	return 0;
 }
 
 void
@@ -678,7 +777,6 @@ coio_stat_stat_timeout(ev_stat *stat, ev_tstamp timeout)
 	coio_timeout_init(&start, &delay, timeout);
 	fiber_yield_timeout(delay);
 	ev_stat_stop(loop(), stat);
-	fiber_testcancel();
 }
 
 typedef void (*ev_child_cb)(ev_loop *, ev_child *, int);
@@ -706,7 +804,6 @@ coio_waitpid(pid_t pid)
 	fiber_set_cancellable(allow_cancel);
 	ev_child_stop(loop(), &cw);
 	int status = cw.rstatus;
-	fiber_testcancel();
 	return status;
 }
 
diff --git a/src/lib/core/coio.h b/src/lib/core/coio.h
index 6a2337689..4267a0459 100644
--- a/src/lib/core/coio.h
+++ b/src/lib/core/coio.h
@@ -33,6 +33,9 @@
 #include "fiber.h"
 #include "trivia/util.h"
 #if defined(__cplusplus)
+extern "C" {
+#endif /* defined(__cplusplus) */
+
 #include "evio.h"
 
 /**
@@ -59,10 +62,6 @@ coio_connect(struct ev_io *coio, struct uri *uri, struct sockaddr *addr,
 	return coio_connect_timeout(coio, uri, addr, addr_len, TIMEOUT_INFINITY);
 }
 
-void
-coio_bind(struct ev_io *coio, struct sockaddr *addr,
-	  socklen_t addrlen);
-
 int
 coio_accept(struct ev_io *coio, struct sockaddr *addr, socklen_t addrlen,
 	    ev_tstamp timeout);
@@ -71,7 +70,7 @@ void
 coio_create(struct ev_io *coio, int fd);
 
 static inline void
-coio_close(ev_loop *loop, struct ev_io *coio)
+coio_destroy(ev_loop *loop, struct ev_io *coio)
 {
 	return evio_close(loop, coio);
 }
@@ -141,12 +140,12 @@ coio_write(struct ev_io *coio, const void *buf, size_t sz)
 
 ssize_t
 coio_writev_timeout(struct ev_io *coio, struct iovec *iov, int iovcnt,
-		    size_t size, ev_tstamp timeout);
+		    ev_tstamp timeout);
 
 static inline ssize_t
-coio_writev(struct ev_io *coio, struct iovec *iov, int iovcnt, size_t size)
+coio_writev(struct ev_io *coio, struct iovec *iov, int iovcnt)
 {
-	return coio_writev_timeout(coio, iov, iovcnt, size, TIMEOUT_INFINITY);
+	return coio_writev_timeout(coio, iov, iovcnt, TIMEOUT_INFINITY);
 }
 
 ssize_t
@@ -164,7 +163,7 @@ coio_service_init(struct coio_service *service, const char *name,
 		  fiber_func handler, void *handler_param);
 
 /** Wait until the service binds to the port. */
-void
+int
 coio_service_start(struct evio_service *service, const char *uri);
 
 void
@@ -185,8 +184,6 @@ coio_stat_stat_timeout(ev_stat *stat, ev_tstamp delay);
 int
 coio_waitpid(pid_t pid);
 
-extern "C" {
-#endif /* defined(__cplusplus) */
 
 /** \cond public */
 
diff --git a/src/lib/core/coio_buf.h b/src/lib/core/coio_buf.h
index 1ad104985..3a83f8fe1 100644
--- a/src/lib/core/coio_buf.h
+++ b/src/lib/core/coio_buf.h
@@ -45,6 +45,8 @@ coio_bread(struct ev_io *coio, struct ibuf *buf, size_t sz)
 {
 	ibuf_reserve_xc(buf, sz);
 	ssize_t n = coio_read_ahead(coio, buf->wpos, sz, ibuf_unused(buf));
+	if (n < 0)
+		return -1;
 	buf->wpos += n;
 	return n;
 }
@@ -61,6 +63,8 @@ coio_bread_timeout(struct ev_io *coio, struct ibuf *buf, size_t sz,
 	ibuf_reserve_xc(buf, sz);
 	ssize_t n = coio_read_ahead_timeout(coio, buf->wpos, sz, ibuf_unused(buf),
 			                    timeout);
+	if (n < 0)
+		return -1;
 	buf->wpos += n;
 	return n;
 }
@@ -71,6 +75,8 @@ coio_breadn(struct ev_io *coio, struct ibuf *buf, size_t sz)
 {
 	ibuf_reserve_xc(buf, sz);
 	ssize_t n = coio_readn_ahead(coio, buf->wpos, sz, ibuf_unused(buf));
+	if (n < 0)
+		return -1;
 	buf->wpos += n;
 	return n;
 }
@@ -87,6 +93,8 @@ coio_breadn_timeout(struct ev_io *coio, struct ibuf *buf, size_t sz,
 	ibuf_reserve_xc(buf, sz);
 	ssize_t n = coio_readn_ahead_timeout(coio, buf->wpos, sz, ibuf_unused(buf),
 			                     timeout);
+	if (n < 0)
+		return -1;
 	buf->wpos += n;
 	return n;
 }
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 05/11] xstream: get rid of an exception
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (3 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 06/11] wal: extract log write batch into a separate routine Georgy Kirichenko
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Refactoring: make xstream C-compliant. The xstream write callback
now returns an int status (0 on success, -1 with a diag set)
instead of throwing an exception, so xstream_write() becomes a
trivial inline wrapper and xstream.cc is removed.

Part of #380
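
For reviewers, a minimal sketch of the new callback convention
(the names are illustrative, not part of the patch):

    /* Return 0 on success, -1 with a diag set on failure. */
    static int
    example_stream_write(struct xstream *stream, struct xrow_header *row)
    {
            (void) stream;
            (void) row;
            /* Process the row; on error: diag_set(...); return -1; */
            return 0;
    }

    /* Errors now propagate through the return code: */
    struct xstream stream;
    xstream_create(&stream, example_stream_write);
    if (xstream_write(&stream, row) != 0)
            return -1; /* the diag is already set by the callback */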
---
 src/box/CMakeLists.txt |  1 -
 src/box/box.cc         |  5 +++--
 src/box/relay.cc       | 25 +++++++++++++-----------
 src/box/xstream.cc     | 44 ------------------------------------------
 src/box/xstream.h      |  9 ++++++---
 5 files changed, 23 insertions(+), 61 deletions(-)
 delete mode 100644 src/box/xstream.cc

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index 56758bd2f..0cc154ba5 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -125,7 +125,6 @@ add_library(box STATIC
     authentication.cc
     replication.cc
     recovery.cc
-    xstream.cc
     applier.cc
     relay.cc
     journal.c
diff --git a/src/box/box.cc b/src/box/box.cc
index ca1696383..c5aaad295 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -327,7 +327,7 @@ recovery_journal_create(struct recovery_journal *journal, struct vclock *v)
 	journal->vclock = v;
 }
 
-static void
+static int
 apply_wal_row(struct xstream *stream, struct xrow_header *row)
 {
 	struct request request;
@@ -336,7 +336,7 @@ apply_wal_row(struct xstream *stream, struct xrow_header *row)
 		struct space *space = space_cache_find_xc(request.space_id);
 		if (box_process_rw(&request, space, NULL) != 0) {
 			say_error("error applying row: %s", request_str(&request));
-			diag_raise();
+			return -1;
 		}
 	}
 	struct wal_stream *xstream =
@@ -347,6 +347,7 @@ apply_wal_row(struct xstream *stream, struct xrow_header *row)
 	 */
 	if (++xstream->rows % WAL_ROWS_PER_YIELD == 0)
 		fiber_sleep(0);
+	return 0;
 }
 
 static void
diff --git a/src/box/relay.cc b/src/box/relay.cc
index bb7761b99..5b0d3f023 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -165,11 +165,11 @@ relay_last_row_time(const struct relay *relay)
 	return relay->last_row_time;
 }
 
-static void
+static int
 relay_send(struct relay *relay, struct xrow_header *packet);
-static void
+static int
 relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row);
-static void
+static int
 relay_send_row(struct xstream *stream, struct xrow_header *row);
 
 struct relay *
@@ -192,7 +192,7 @@ relay_new(struct replica *replica)
 
 static void
 relay_start(struct relay *relay, int fd, uint64_t sync,
-	     void (*stream_write)(struct xstream *, struct xrow_header *))
+	     int (*stream_write)(struct xstream *, struct xrow_header *))
 {
 	xstream_create(&relay->stream, stream_write);
 	/*
@@ -716,7 +716,7 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 		diag_raise();
 }
 
-static void
+static int
 relay_send(struct relay *relay, struct xrow_header *packet)
 {
 	ERROR_INJECT_YIELD(ERRINJ_RELAY_SEND_DELAY);
@@ -724,15 +724,16 @@ relay_send(struct relay *relay, struct xrow_header *packet)
 	packet->sync = relay->sync;
 	relay->last_row_time = ev_monotonic_now(loop());
 	if (coio_write_xrow(&relay->io, packet) < 0)
-		diag_raise();
+		return -1;
 	fiber_gc();
 
 	struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
 	if (inj != NULL && inj->dparam > 0)
 		fiber_sleep(inj->dparam);
+	return 0;
 }
 
-static void
+static int
 relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
 {
 	struct relay *relay = container_of(stream, struct relay, stream);
@@ -741,11 +742,12 @@ relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
 	 * vclock while sending a snapshot.
 	 */
 	if (row->group_id != GROUP_LOCAL)
-		relay_send(relay, row);
+		return relay_send(relay, row);
+	return 0;
 }
 
 /** Send a single row to the client. */
-static void
+static int
 relay_send_row(struct xstream *stream, struct xrow_header *packet)
 {
 	struct relay *relay = container_of(stream, struct relay, stream);
@@ -762,7 +764,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
 		 * skip all these rows.
 		 */
 		if (packet->replica_id == REPLICA_ID_NIL)
-			return;
+			return 0;
 		packet->type = IPROTO_NOP;
 		packet->group_id = GROUP_DEFAULT;
 		packet->bodycnt = 0;
@@ -790,6 +792,7 @@ relay_send_row(struct xstream *stream, struct xrow_header *packet)
 			say_warn("injected broken lsn: %lld",
 				 (long long) packet->lsn);
 		}
-		relay_send(relay, packet);
+		return relay_send(relay, packet);
 	}
+	return 0;
 }
diff --git a/src/box/xstream.cc b/src/box/xstream.cc
deleted file mode 100644
index c77e4360e..000000000
--- a/src/box/xstream.cc
+++ /dev/null
@@ -1,44 +0,0 @@
-/*
- * Copyright 2010-2016, Tarantool AUTHORS, please see AUTHORS file.
- *
- * Redistribution and use in source and binary forms, with or
- * without modification, are permitted provided that the following
- * conditions are met:
- *
- * 1. Redistributions of source code must retain the above
- *    copyright notice, this list of conditions and the
- *    following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above
- *    copyright notice, this list of conditions and the following
- *    disclaimer in the documentation and/or other materials
- *    provided with the distribution.
- *
- * THIS SOFTWARE IS PROVIDED BY AUTHORS ``AS IS'' AND
- * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
- * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
- * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
- * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
- * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
- * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
- * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
- * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
- * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- */
-
-#include "xstream.h"
-#include "exception.h"
-
-int
-xstream_write(struct xstream *stream, struct xrow_header *row)
-{
-	try {
-		stream->write(stream, row);
-	} catch (Exception *e) {
-		return -1;
-	}
-	return 0;
-}
diff --git a/src/box/xstream.h b/src/box/xstream.h
index d29ff4213..ae07c3e22 100644
--- a/src/box/xstream.h
+++ b/src/box/xstream.h
@@ -41,7 +41,7 @@ extern "C" {
 struct xrow_header;
 struct xstream;
 
-typedef void (*xstream_write_f)(struct xstream *, struct xrow_header *);
+typedef int (*xstream_write_f)(struct xstream *, struct xrow_header *);
 
 struct xstream {
 	xstream_write_f write;
@@ -53,8 +53,11 @@ xstream_create(struct xstream *xstream, xstream_write_f write)
 	xstream->write = write;
 }
 
-int
-xstream_write(struct xstream *stream, struct xrow_header *row);
+static inline int
+xstream_write(struct xstream *stream, struct xrow_header *row)
+{
+	return stream->write(stream, row);
+}
 
 #if defined(__cplusplus)
 } /* extern C */
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 06/11] wal: extract log write batch into a separate routine
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (4 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 05/11] xstream: get rid of an exception Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 07/11] wal: matrix clock structure Georgy Kirichenko
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Introduce a routine which transfers journal entries from an input
queue to an output queue while writing them to an xlog file. As soon
as the xlog issues an actual disk write, the routine breaks the
transfer loop and returns the write result code. After that the
output queue contains the entries that were passed to the xlog
(regardless of the disk write status), whereas the input queue
contains the untouched entries. If the whole input queue is processed
without an actual disk write, the xlog file is flushed manually.
This refactoring helps to implement the wal memory buffer.

Part of #980, #3794
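
The resulting caller-side discipline, condensed from the
wal_write_to_disk() changes below (a sketch, not the literal patch
hunks):

    struct stailq input, output;
    stailq_create(&input);
    stailq_concat(&input, &wal_msg->commit);
    stailq_create(&output);
    while (!stailq_empty(&input)) {
            ssize_t rc = wal_write_xlog_batch(writer, &input, &output,
                                              &vclock_diff);
            if (rc < 0) {
                    /* Written and unwritten entries both roll back. */
                    stailq_concat(&wal_msg->rollback, &output);
                    stailq_concat(&wal_msg->rollback, &input);
            } else {
                    /* rc is the flushed byte count; commit the batch. */
                    stailq_concat(&wal_msg->commit, &output);
            }
    }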
---
 src/box/wal.c | 87 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 54 insertions(+), 33 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index 0ae66ff32..ce15cb459 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -958,6 +958,36 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	}
 }
 
+/*
+ * This function shifts entries from the input queue and writes
+ * them to the current log file until the current log flushes
+ * or a write error happens. All touched entries are moved to
+ * the output queue. The function returns the count of written
+ * bytes or -1 in case of error.
+ */
+static ssize_t
+wal_write_xlog_batch(struct wal_writer *writer, struct stailq *input,
+		     struct stailq *output, struct vclock *vclock_diff)
+{
+	struct xlog *l = &writer->current_wal;
+	ssize_t rc;
+	do {
+		struct journal_entry *entry =
+			stailq_shift_entry(input, struct journal_entry, fifo);
+		stailq_add_tail(output, &entry->fifo);
+
+		wal_assign_lsn(vclock_diff, &writer->vclock,
+			       entry->rows, entry->rows + entry->n_rows);
+		entry->res = vclock_sum(vclock_diff) +
+			     vclock_sum(&writer->vclock);
+		rc = xlog_write_entry(l, entry);
+	} while (rc == 0 && !stailq_empty(input));
+	/* If log was not flushed then flush it explicitly. */
+	if (rc == 0)
+		rc = xlog_flush(l);
+	return rc;
+}
+
 static void
 wal_write_to_disk(struct cmsg *msg)
 {
@@ -1017,36 +1047,31 @@ wal_write_to_disk(struct cmsg *msg)
 	 * of request in xlog file is stored inside `struct journal_entry`.
 	 */
 
-	struct xlog *l = &writer->current_wal;
-
-	/*
-	 * Iterate over requests (transactions)
-	 */
-	int rc;
-	struct journal_entry *entry;
-	struct stailq_entry *last_committed = NULL;
-	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
-		wal_assign_lsn(&vclock_diff, &writer->vclock,
-			       entry->rows, entry->rows + entry->n_rows);
-		entry->res = vclock_sum(&vclock_diff) +
-			     vclock_sum(&writer->vclock);
-		rc = xlog_write_entry(l, entry);
-		if (rc < 0)
-			goto done;
-		if (rc > 0) {
+	struct stailq input;
+	stailq_create(&input);
+	stailq_concat(&input, &wal_msg->commit);
+	struct stailq output;
+	stailq_create(&output);
+	while (!stailq_empty(&input)) {
+		ssize_t rc = wal_write_xlog_batch(writer, &input, &output,
+						  &vclock_diff);
+		if (rc < 0) {
+			/*
+			 * Put processed entries and tail of write
+			 * queue to a rollback list.
+			 */
+			stailq_concat(&wal_msg->rollback, &output);
+			stailq_concat(&wal_msg->rollback, &input);
+		} else {
+			/*
+			 * Schedule processed entries to commit
+			 * and update the wal vclock.
+			 */
+			stailq_concat(&wal_msg->commit, &output);
 			writer->checkpoint_wal_size += rc;
-			last_committed = &entry->fifo;
 			vclock_merge(&writer->vclock, &vclock_diff);
 		}
-		/* rc == 0: the write is buffered in xlog_tx */
 	}
-	rc = xlog_flush(l);
-	if (rc < 0)
-		goto done;
-
-	writer->checkpoint_wal_size += rc;
-	last_committed = stailq_last(&wal_msg->commit);
-	vclock_merge(&writer->vclock, &vclock_diff);
 
 	/*
 	 * Notify TX if the checkpoint threshold has been exceeded.
@@ -1070,7 +1095,6 @@ wal_write_to_disk(struct cmsg *msg)
 		}
 	}
 
-done:
 	error = diag_last_error(diag_get());
 	if (error) {
 		/* Until we can pass the error to tx, log it and clear. */
@@ -1090,15 +1114,12 @@ done:
 	 * nothing, and need to start rollback from the first
 	 * request. Otherwise we rollback from the first request.
 	 */
-	struct stailq rollback;
-	stailq_cut_tail(&wal_msg->commit, last_committed, &rollback);
-
-	if (!stailq_empty(&rollback)) {
+	if (!stailq_empty(&wal_msg->rollback)) {
+		struct journal_entry *entry;
 		/* Update status of the successfully committed requests. */
-		stailq_foreach_entry(entry, &rollback, fifo)
+		stailq_foreach_entry(entry, &wal_msg->rollback, fifo)
 			entry->res = -1;
 		/* Rollback unprocessed requests */
-		stailq_concat(&wal_msg->rollback, &rollback);
 		wal_writer_begin_rollback(writer);
 	}
 	fiber_gc();
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 07/11] wal: matrix clock structure
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (5 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 06/11] wal: extract log write batch into a separate routine Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 08/11] wal: track relay vclock and collect logs in wal thread Georgy Kirichenko
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Introduce a matrix clock which maintains a set of vclocks and
keeps their components ordered. The main goal is to be able to
build a vclock in which each lsn is less than or equal to the
corresponding lsn in at least n of the contained vclocks.

The purpose of the matrix clock is to evaluate a vclock which has
already been processed by wal consumers such as relays, or to
obtain a majority vclock for committing journal entries in case of
synchronous replication.

Part of #980, #3794
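
A usage sketch of the API introduced below (error handling omitted;
the replica ids and lsns are made up for illustration):

    struct mclock m;
    mclock_create(&m);

    struct vclock v;
    vclock_create(&v);
    vclock_follow(&v, 1, 10);
    mclock_update(&m, 1, &v); /* replica 1 acknowledged {1: 10} */

    vclock_create(&v);
    vclock_follow(&v, 1, 7);
    mclock_update(&m, 2, &v); /* replica 2 acknowledged {1: 7} */

    struct vclock out;
    /* Offset 0: the biggest lsn per column -> {1: 10}. */
    mclock_get(&m, 0, &out);
    /* Offset -1: the minimal row, i.e. what every registered
     * replica has already seen -> {1: 7}. */
    mclock_get(&m, -1, &out);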
---
 src/box/CMakeLists.txt   |   4 +
 src/box/mclock.c         | 374 +++++++++++++++++++++++++++++++++++++++
 src/box/mclock.h         | 125 +++++++++++++
 test/unit/CMakeLists.txt |   2 +
 test/unit/mclock.result  |  18 ++
 test/unit/mclock.test.c  | 160 +++++++++++++++++
 6 files changed, 683 insertions(+)
 create mode 100644 src/box/mclock.c
 create mode 100644 src/box/mclock.h
 create mode 100644 test/unit/mclock.result
 create mode 100644 test/unit/mclock.test.c

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index 0cc154ba5..32f922dd7 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -32,6 +32,9 @@ target_link_libraries(box_error core stat)
 add_library(vclock STATIC vclock.c)
 target_link_libraries(vclock core)
 
+add_library(mclock STATIC mclock.c)
+target_link_libraries(mclock vclock core)
+
 add_library(xrow STATIC xrow.c iproto_constants.c)
 target_link_libraries(xrow server core small vclock misc box_error
                       scramble ${MSGPUCK_LIBRARIES})
@@ -133,6 +136,7 @@ add_library(box STATIC
     execute.c
     sql_stmt_cache.c
     wal.c
+    mclock.c
     call.c
     merger.c
     ${lua_sources}
diff --git a/src/box/mclock.c b/src/box/mclock.c
new file mode 100644
index 000000000..c05838836
--- /dev/null
+++ b/src/box/mclock.c
@@ -0,0 +1,374 @@
+/*
+ * Copyright 2010-2016, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "mclock.h"
+
+void
+mclock_create(struct mclock *mclock)
+{
+	memset(mclock, 0, sizeof(struct mclock));
+}
+
+void
+mclock_destroy(struct mclock *mclock)
+{
+	memset(mclock, 0, sizeof(struct mclock));
+}
+
+/*
+ * Check if the passed vclock contains unknown replica
+ * identifiers. In case of a new replica identifier the
+ * column map is adjusted and the corresponding ordered
+ * array is rebuilt. The new vclock is placed at the first
+ * position of the ordered array because there are no other
+ * non-zero entries in this column yet.
+ */
+static void
+mclock_adjust_col_map(struct mclock *mclock, uint32_t id,
+		      const struct vclock *vclock)
+{
+	/* Evaluate new matrix column identifiers. */
+	uint32_t new_col_map = vclock->map & ~mclock->col_map;
+	struct bit_iterator col_map_it;
+	bit_iterator_init(&col_map_it, &new_col_map, sizeof(new_col_map), true);
+	for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+	     col_id = bit_iterator_next(&col_map_it)) {
+		/* Register new replica identifier. */
+		mclock->col_map |= (1 << col_id);
+		struct bit_iterator row_map_it;
+		bit_iterator_init(&row_map_it, &mclock->row_map,
+				  sizeof(mclock->row_map), true);
+		/* Rebuild an order map for given column. */
+		mclock->order[col_id][0] = id;
+		for (size_t row_id = bit_iterator_next(&row_map_it), i = 1;
+		     row_id < SIZE_MAX;
+		     row_id = bit_iterator_next(&row_map_it)) {
+			if (row_id != id)
+				mclock->order[col_id][i++] = row_id;
+		}
+	}
+}
+
+/* Fetch an lsn at the given column and position (not row). */
+static inline int64_t
+mclock_get_pos_lsn(const struct mclock *mclock, uint32_t col_id, uint32_t pos)
+{
+	uint32_t row_id = mclock->order[col_id][pos];
+	return vclock_get(mclock->vclock + row_id, col_id);
+}
+
+/*
+ * Locate the range containing the given lsn for the given column.
+ * The function returns two values by pointer, from and to: the
+ * first holds the least position whose lsn is greater than or
+ * equal to the given lsn, the second holds the first position
+ * whose lsn is less than it. So all lsns at positions between
+ * `*from` and `*to - 1` are equal to the given lsn.
+ * For instance, for an lsn array like {12, 10, 10, 7, 6},
+ * lsn == 10 yields *from == 1 and *to == 3, whereas lsn == 8
+ * yields *from == 3 and *to == 3.
+ */
+static inline void
+mclock_find_range(const struct mclock *mclock, uint32_t col_id, int64_t lsn,
+		  uint32_t *from, uint32_t *to)
+{
+	/* Logarithmic search; set up the initial search ranges. */
+	uint32_t b = *from, e = *to;
+	uint32_t b_to = *from, e_to = *to;
+	/* Look for `from' position. */
+	while (e - b > 1) {
+		uint32_t m = (b + e) / 2;
+		int64_t m_lsn = mclock_get_pos_lsn(mclock, col_id, m);
+		if (m_lsn <= lsn)
+			e = m;
+		else
+			b = m;
+		/*
+		 * Optimization: check if we could decrease
+		 * the `to' search range.
+		 */
+		if (m_lsn < lsn)
+			e_to = MIN(m, e_to);
+		else
+			b_to = MAX(m, b_to);
+	}
+	if (mclock_get_pos_lsn(mclock, col_id, b) > lsn)
+		*from = e;
+	else
+		*from = b;
+	/* Look for `to' position. */
+	while (e_to - b_to > 1) {
+		uint32_t m = (b_to + e_to) / 2;
+		int64_t m_lsn = mclock_get_pos_lsn(mclock, col_id, m);
+		if (m_lsn < lsn)
+			e_to = m;
+		else
+			b_to = m;
+	}
+	*to = e_to;
+}
+
+/*
+ * Shift a sequence between old_pos and new_pos one
+ * step up (new_pos > old_pos, erasing head member)
+ * or down (new_pos < old_pos, erasing tail member).
+ */
+static inline void
+mclock_shift(struct mclock *mclock, uint32_t col_id, uint32_t old_pos,
+	     uint32_t new_pos)
+{
+	if (old_pos > new_pos) {
+		memmove(mclock->order[col_id] + new_pos + 1,
+			mclock->order[col_id] + new_pos,
+			(old_pos - new_pos) * sizeof(**mclock->order));
+	} else if (old_pos < new_pos) {
+		memmove(mclock->order[col_id] + old_pos,
+			mclock->order[col_id] + old_pos + 1,
+			(new_pos - old_pos) * sizeof(**mclock->order));
+	}
+}
+
+/*
+ * Update replica vclock and reorder mclock members.
+ */
+static int
+mclock_update_vclock(struct mclock *mclock, uint32_t id, const struct vclock *vclock)
+{
+	uint32_t count = __builtin_popcount(mclock->row_map);
+	mclock_adjust_col_map(mclock, id, vclock);
+	/* Perform reordering for each column. */
+	struct bit_iterator col_map_it;
+	bit_iterator_init(&col_map_it, &mclock->col_map, sizeof(mclock->col_map), true);
+	for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+	     col_id = bit_iterator_next(&col_map_it)) {
+		int64_t new_lsn = vclock_get(vclock, col_id);
+		int64_t old_lsn = vclock_get(mclock->vclock + id, col_id);
+
+		if (old_lsn == new_lsn)
+			continue;
+		/*
+		 * Find a positions range which contains given
+		 * replica id for current column (old lsn position).
+		 */
+		uint32_t from = 0, to = count;
+		mclock_find_range(mclock, col_id, old_lsn, &from, &to);
+		assert(to > from);
+		uint32_t old_pos = from, new_pos;
+		while (old_pos < to) {
+			uint32_t replica_id = mclock->order[col_id][old_pos];
+			if (replica_id == id)
+				break;
+			++old_pos;
+		}
+		/* Replica id should be found. */
+		assert(old_pos < to);
+		if (new_lsn == old_lsn) {
+			/*
+			 * Lsn was not changed, put the replica id on
+			 * the last position in the corresponding range.
+			 */
+			new_pos = to - 1;
+		}
+		else if (new_lsn > mclock_get_pos_lsn(mclock, col_id, 0)) {
+			/*
+			 * New lsn is the biggest one so put on
+			 * the first position in a column.
+			 */
+			new_pos = 0;
+		}
+		else if (new_lsn <= mclock_get_pos_lsn(mclock, col_id,
+						       count - 1)) {
+			/* The least one - the last position. */
+			new_pos = count - 1;
+		}
+		else {
+			/* Find a range of position which contains new lsn. */
+			if (new_lsn > old_lsn)
+				from = 0;
+			else
+				to = count;
+			mclock_find_range(mclock, col_id, new_lsn, &from, &to);
+			/* Mind the position shift: towards the head
+			 * or the tail of the column order map.
+			 */
+			new_pos = to - (new_lsn <= old_lsn? 1: 0);
+		}
+		assert(new_pos < count);
+		if (old_pos == new_pos)
+			continue;
+		mclock_shift(mclock, col_id, old_pos, new_pos);
+		mclock->order[col_id][new_pos] = id;
+	}
+	vclock_copy(&mclock->vclock[id], vclock);
+	return 0;
+}
+
+/*
+ * Delete replica vclock.
+ */
+static int
+mclock_delete_vclock(struct mclock *mclock, uint32_t id)
+{
+	uint32_t count = __builtin_popcount(mclock->row_map);
+	/* Perform reordering for each column. */
+	struct bit_iterator col_map_it;
+	bit_iterator_init(&col_map_it, &mclock->col_map, sizeof(mclock->col_map), true);
+	for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+	     col_id = bit_iterator_next(&col_map_it)) {
+		int64_t old_lsn = vclock_get(mclock->vclock + id, col_id);
+		/*
+		 * Find a positions range which contains given
+		 * replica id for current column (old lsn position).
+		 */
+		uint32_t from = 0, to = count;
+		mclock_find_range(mclock, col_id, old_lsn, &from, &to);
+		assert(to > from);
+		uint32_t old_pos = from, new_pos = count - 1;
+		while (old_pos < to) {
+			uint32_t replica_id = mclock->order[col_id][old_pos];
+			if (replica_id == id)
+				break;
+			++old_pos;
+		}
+		/* Replica id should be found. */
+		assert(old_pos < to);
+		new_pos = count - 1;
+
+		if (old_pos == new_pos)
+			continue;
+		mclock_shift(mclock, col_id, old_pos, new_pos);
+	}
+	mclock->row_map ^= (1 << id);
+	return 0;
+}
+int
+mclock_update(struct mclock *mclock, uint32_t id, const struct vclock *vclock)
+{
+	/*
+	 * The given id is not registered and the
+	 * vclock is zero - nothing to do.
+	 */
+	if ((mclock->row_map & (1 << id)) == 0 && vclock_sum(vclock) == 0)
+		return 0;
+	if (vclock_sum(vclock) == 0) {
+		mclock_delete_vclock(mclock, id);
+		return 0;
+	}
+	/*
+	 * The given replica id is not yet attached, so
+	 * put a zero vclock at the last position with
+	 * the corresponding replica identifier.
+	 */
+	if ((mclock->row_map & (1 << id)) == 0) {
+		vclock_create(&mclock->vclock[id]);
+		/* Put the given vclock at the last position. */
+		mclock->row_map |= 1 << id;
+		uint32_t count = __builtin_popcount(mclock->row_map);
+		struct bit_iterator col_map_it;
+		bit_iterator_init(&col_map_it, &mclock->col_map, sizeof(mclock->col_map), true);
+		for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+		     col_id = bit_iterator_next(&col_map_it)) {
+			mclock->order[col_id][count - 1] = id;
+		}
+	}
+	mclock_update_vclock(mclock, id, vclock);
+	return 0;
+}
+
+int
+mclock_get(struct mclock *mclock, int32_t offset, struct vclock *vclock)
+{
+	int32_t count = __builtin_popcount(mclock->row_map);
+	/* Check if given offset is out of mclock range. */
+	if (offset >= count || offset < -count) {
+		vclock_create(vclock);
+		return -1;
+	}
+	offset = (offset + count) % count;
+	vclock_create(vclock);
+	/* Fetch lsn for each known replica identifier. */
+	struct bit_iterator col_map_it;
+	bit_iterator_init(&col_map_it, &mclock->col_map, sizeof(mclock->col_map), true);
+	for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+	     col_id = bit_iterator_next(&col_map_it)) {
+		int64_t lsn = mclock_get_pos_lsn(mclock, col_id, offset);
+		if (lsn > 0)
+			vclock_follow(vclock, col_id, lsn);
+	}
+	return 0;
+}
+
+int
+mclock_extract_row(struct mclock *mclock, uint32_t id, struct vclock *vclock)
+{
+	if ((mclock->row_map & (1 << id)) == 0) {
+		vclock_create(vclock);
+		return -1;
+	}
+	vclock_copy(vclock, mclock->vclock + id);
+	return 0;
+}
+
+int
+mclock_extract_col(struct mclock *mclock, uint32_t id, struct vclock *vclock)
+{
+	vclock_create(vclock);
+	if ((mclock->col_map & (1 << id)) == 0)
+		return -1;
+
+	struct bit_iterator row_map_it;
+	bit_iterator_init(&row_map_it, &mclock->row_map,
+			  sizeof(mclock->row_map), true);
+	for (size_t row_id = bit_iterator_next(&row_map_it);
+	     row_id < SIZE_MAX; row_id = bit_iterator_next(&row_map_it)) {
+		int64_t lsn = vclock_get(mclock->vclock + row_id, id);
+		if (lsn == 0)
+			continue;
+		vclock_follow(vclock, row_id, lsn);
+	}
+
+	return 0;
+}
+
+bool
+mclock_check(struct mclock *mclock)
+{
+	uint32_t count = __builtin_popcount(mclock->row_map);
+	struct bit_iterator col_map_it;
+	bit_iterator_init(&col_map_it, &mclock->col_map, sizeof(mclock->col_map), true);
+	for (size_t col_id = bit_iterator_next(&col_map_it); col_id < SIZE_MAX;
+	     col_id = bit_iterator_next(&col_map_it)) {
+		for (uint32_t n = 0; n + 1 < count; ++n)
+			if (mclock_get_pos_lsn(mclock, col_id, n) <
+			    mclock_get_pos_lsn(mclock, col_id, n + 1))
+				return false;
+	}
+	return true;
+}
diff --git a/src/box/mclock.h b/src/box/mclock.h
new file mode 100644
index 000000000..759cfda05
--- /dev/null
+++ b/src/box/mclock.h
@@ -0,0 +1,125 @@
+#ifndef INCLUDES_TARANTOOL_MCLOCK_H
+#define INCLUDES_TARANTOOL_MCLOCK_H
+/*
+ * Copyright 2010-2016, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include <stdlib.h>
+
+#include "vclock.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif /* defined(__cplusplus) */
+
+/**
+ * Matrix clock structure contains vclocks identified
+ * by replica identifiers int it rows and maintains
+ * the column order for each known replica identifier.
+ */
+struct mclock {
+	/** Map of attached replica vclocks. */
+	unsigned int row_map;
+	/** Map of known replica identifiers. */
+	unsigned int col_map;
+	/**
+	 * Contained vclock array addressable by
+	 * corresponding replica identifier.
+	 */
+	struct vclock vclock[VCLOCK_MAX];
+	/**
+	 * Per-column order map. Each row describes
+	 * an ordered map of attached replica identifiers
+	 * where the biggest lsn is on the first position.
+	 * In case of a sequence of equal lsns the latest
+	 * one is on the last position in the sequence.
+	 * For instance if we have vclock like:
+	 *  1: {1: 10, 2: 12, 3: 0}
+	 *  2: {1: 10, 2: 14, 3: 1}
+	 *  3: {1: 0, 2: 8, 3: 4}
+	 * The order map will look like:
+	 *  {{1, 2, 3}, {2, 1, 3}, {3, 2, 1}}
+	 */
+	uint8_t order[VCLOCK_MAX][VCLOCK_MAX];
+};
+
+/** Create a mclock structure. */
+void
+mclock_create(struct mclock *mclock);
+
+/** Release allocated resources. */
+void
+mclock_destroy(struct mclock *mclock);
+
+/**
+ * Update a vclock identified by replica id and
+ * sort mclock members.
+ */
+int
+mclock_update(struct mclock *mclock, uint32_t id, const struct vclock *vclock);
+
+/**
+ * Build a vclock each component of which is the (offset + 1)-th
+ * biggest (or the (count + offset + 1)-th biggest if offset < 0)
+ * among the corresponding components of the contained vclocks. So
+ * mclock_get(mclock, 0) selects the biggest lsn for each column.
+ * For instance if we have mclock like:
+ *  1: {1: 10, 2: 12, 3: 0}
+ *  2: {1: 10, 2: 14, 3: 1}
+ *  3: {1: 0, 2: 8, 3: 4}
+ * Then mclock_get(0) builds {1: 10, 2: 14, 3: 4}
+ * whereas mclock_get(2) builds {1: 0, 2: 8, 3: 0}.
+ */
+int
+mclock_get(struct mclock *mclock, int32_t offset, struct vclock *vclock);
+
+/**
+ * Fetch a row from a matrix clock.
+ */
+int
+mclock_extract_row(struct mclock *mclock, uint32_t id, struct vclock *vclock);
+
+/**
+ * Extract a column from a matrix clock. Such a column describes
+ * the lsns of replica id as seen by the cluster members.
+ */
+int
+mclock_extract_col(struct mclock *mclock, uint32_t id, struct vclock *vclock);
+
+/**
+ * Check the matrix clock consistency.
+ */
+bool
+mclock_check(struct mclock *mclock);
+
+#if defined(__cplusplus)
+} /* extern "C" */
+#endif /* defined(__cplusplus) */
+
+#endif /* INCLUDES_TARANTOOL_MCLOCK_H */
diff --git a/test/unit/CMakeLists.txt b/test/unit/CMakeLists.txt
index 4a57597e9..40db199e0 100644
--- a/test/unit/CMakeLists.txt
+++ b/test/unit/CMakeLists.txt
@@ -65,6 +65,8 @@ add_executable(bloom.test bloom.cc)
 target_link_libraries(bloom.test salad)
 add_executable(vclock.test vclock.cc)
 target_link_libraries(vclock.test vclock unit)
+add_executable(mclock.test mclock.test.c)
+target_link_libraries(mclock.test mclock vclock bit unit)
 add_executable(xrow.test xrow.cc)
 target_link_libraries(xrow.test xrow unit)
 add_executable(decimal.test decimal.c)
diff --git a/test/unit/mclock.result b/test/unit/mclock.result
new file mode 100644
index 000000000..eb3aa649d
--- /dev/null
+++ b/test/unit/mclock.result
@@ -0,0 +1,18 @@
+1..2
+    1..1
+	*** test_random_stress ***
+    ok 1 - random stress
+	*** test_random_stress: done ***
+ok 1 - subtests
+    1..8
+	*** test_func ***
+    ok 1 - consistency 1
+    ok 2 - first vclock 1
+    ok 3 - last vclock 1
+    ok 4 - second vclock
+    ok 5 - consistency 2
+    ok 6 - consistency 3
+    ok 7 - first vclock - 2
+    ok 8 - last vclock - 2
+	*** test_func: done ***
+ok 2 - subtests
diff --git a/test/unit/mclock.test.c b/test/unit/mclock.test.c
new file mode 100644
index 000000000..cd3a6538e
--- /dev/null
+++ b/test/unit/mclock.test.c
@@ -0,0 +1,160 @@
+/*
+ * Copyright 2010-2015, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include "unit.h"
+#include <stdarg.h>
+#include <time.h>
+#include "box/mclock.h"
+
+static void
+test_random_stress(void)
+{
+	srand(time(NULL));
+	plan(1);
+	header();
+	struct mclock mclock;
+	mclock_create(&mclock);
+	bool ok = true;
+	for (int i = 0; i < 50000; ++i) {
+		struct vclock vclock;
+		vclock_create(&vclock);
+		uint32_t id = rand() % 31 + 1;
+		/* Count of non-zero items. */
+		int tm = rand() % 31;
+		for (int t = 0; t < tm;) {
+			uint32_t j = rand() % 31 + 1;
+			if (vclock_get(&vclock, j) > 0)
+				continue;
+			vclock_follow(&vclock, j, rand() + 1);
+			++t;
+		}
+		mclock_update(&mclock, id, &vclock);
+		if (!(ok = mclock_check(&mclock)))
+			break;
+	}
+	struct vclock vclock;
+	vclock_create(&vclock);
+	for (int i = 1; i < 32; ++i)
+		mclock_update(&mclock, i, &vclock);
+	mclock_destroy(&mclock);
+	is(ok, true, "random stress");
+	footer();
+	check_plan();
+}
+
+static void
+test_func(void)
+{
+	plan(8);
+	header();
+	struct mclock mclock;
+	mclock_create(&mclock);
+	struct vclock v1, v2, v3;
+	vclock_create(&v1);
+	vclock_follow(&v1, 1, 11);
+	vclock_follow(&v1, 2, 21);
+	vclock_follow(&v1, 3, 31);
+	vclock_create(&v2);
+	vclock_follow(&v2, 1, 22);
+	vclock_follow(&v2, 2, 12);
+	vclock_follow(&v2, 3, 30);
+	vclock_create(&v3);
+	vclock_follow(&v3, 2, 32);
+	vclock_follow(&v3, 3, 2);
+	vclock_follow(&v3, 4, 5);
+	mclock_update(&mclock, 1, &v1);
+	mclock_update(&mclock, 2, &v2);
+	mclock_update(&mclock, 3, &v3);
+	is(mclock_check(&mclock), true, "consistency 1");
+
+	struct vclock v, t;
+	vclock_create(&t);
+	vclock_follow(&t, 1, 22);
+	vclock_follow(&t, 2, 32);
+	vclock_follow(&t, 3, 31);
+	vclock_follow(&t, 4, 5);
+
+	mclock_get(&mclock, 0, &v);
+	is(vclock_compare(&v, &t), 0, "first vclock 1");
+
+	vclock_create(&t);
+	vclock_follow(&t, 2, 12);
+	vclock_follow(&t, 3, 2);
+
+	mclock_get(&mclock, -1, &v);
+	is(vclock_compare(&v, &t), 0, "last vclock 1");
+
+	vclock_create(&t);
+	vclock_follow(&t, 1, 11);
+	vclock_follow(&t, 2, 21);
+	vclock_follow(&t, 3, 30);
+
+	mclock_get(&mclock, 1, &v);
+	is(vclock_compare(&v, &t), 0, "second vclock");
+
+	vclock_follow(&v1, 1, 40);
+	vclock_follow(&v1, 4, 10);
+	mclock_update(&mclock, 1, &v1);
+	is(mclock_check(&mclock), true, "consistency 2");
+	vclock_follow(&v2, 2, 35);
+	vclock_follow(&v2, 4, 3);
+	mclock_update(&mclock, 2, &v2);
+	is(mclock_check(&mclock), true, "consistency 3");
+
+	vclock_create(&t);
+	vclock_follow(&t, 1, 40);
+	vclock_follow(&t, 2, 35);
+	vclock_follow(&t, 3, 31);
+	vclock_follow(&t, 4, 10);
+
+	mclock_get(&mclock, 0, &v);
+	is(vclock_compare(&v, &t), 0, "first vclock - 2");
+
+	vclock_create(&t);
+	vclock_follow(&t, 2, 21);
+	vclock_follow(&t, 3, 2);
+	vclock_follow(&t, 4, 3);
+
+	mclock_get(&mclock, -1, &v);
+	is(vclock_compare(&v, &t), 0, "last vclock - 2");
+
+	footer();
+	check_plan();
+
+}
+
+int main(void)
+{
+	plan(2);
+	test_random_stress();
+	test_func();
+	check_plan();
+}
+
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 08/11] wal: track relay vclock and collect logs in wal thread
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (6 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 07/11] wal: matrix clock structure Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 09/11] wal: xrow memory buffer and cursor Georgy Kirichenko
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Wal uses a matrix clock (mclock) to track the vclocks reported by
relays. This allows wal to build the minimal boundary vclock, which
can be used to collect unused wal files. Box protects logs from
collection using the wal_set_gc_first_vclock() call.
To preserve logs while a replica is joining, gc tracks all join
vclocks as checkpoints with a special mark: is_join_readview set to
true.

Also there is no longer a gc consumer in the tx thread or gc
consumer info in the box.info output; the corresponding lines were
commented out in the tests.

Part of #3795, #980
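
A sketch of the intended wal-side flow, assuming the writer owns an
mclock as in the wal.c changes of this patch (simplified
pseudo-wiring, not the literal code):

    /* On each relay status report, in the wal thread: */
    mclock_update(&writer->mclock, replica_id, &reported_vclock);

    /* The log collection boundary: the minimal vclock seen by
     * every registered relay (offset -1, the last row). */
    struct vclock gc_vclock;
    mclock_get(&writer->mclock, -1, &gc_vclock);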
---
 src/box/box.cc                        |  39 ++--
 src/box/gc.c                          | 216 ++++++---------------
 src/box/gc.h                          |  95 ++-------
 src/box/lua/info.c                    |  29 +--
 src/box/relay.cc                      | 129 +-----------
 src/box/replication.cc                |  34 +---
 src/box/wal.c                         | 269 +++++++++++++++++++++++---
 src/box/wal.h                         |  17 +-
 test/replication/gc_no_space.result   |  30 +--
 test/replication/gc_no_space.test.lua |  12 +-
 10 files changed, 366 insertions(+), 504 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index c5aaad295..17495a211 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1566,11 +1566,12 @@ box_process_register(struct ev_io *io, struct xrow_header *header)
 			  "wal_mode = 'none'");
 	}
 
-	struct gc_consumer *gc = gc_consumer_register(&replicaset.vclock,
-				"replica %s", tt_uuid_str(&instance_uuid));
-	if (gc == NULL)
-		diag_raise();
-	auto gc_guard = make_scoped_guard([&] { gc_consumer_unregister(gc); });
+	struct vclock register_vclock;
+	vclock_copy(&register_vclock, &replicaset.vclock);
+	gc_add_join_readview(&register_vclock);
+	auto gc_guard = make_scoped_guard([&] {
+		gc_del_join_readview(&register_vclock);
+	});
 
 	say_info("registering replica %s at %s",
 		 tt_uuid_str(&instance_uuid), sio_socketname(io->fd));
@@ -1609,12 +1610,8 @@ box_process_register(struct ev_io *io, struct xrow_header *header)
 	 * registration was complete and assign it to the
 	 * replica.
 	 */
-	gc_consumer_advance(gc, &stop_vclock);
 	replica = replica_by_uuid(&instance_uuid);
-	if (replica->gc != NULL)
-		gc_consumer_unregister(replica->gc);
-	replica->gc = gc;
-	gc_guard.is_active = false;
+	wal_relay_status_update(replica->id, &stop_vclock);
 }
 
 void
@@ -1708,11 +1705,12 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	 * Register the replica as a WAL consumer so that
 	 * it can resume FINAL JOIN where INITIAL JOIN ends.
 	 */
-	struct gc_consumer *gc = gc_consumer_register(&replicaset.vclock,
-				"replica %s", tt_uuid_str(&instance_uuid));
-	if (gc == NULL)
-		diag_raise();
-	auto gc_guard = make_scoped_guard([&] { gc_consumer_unregister(gc); });
+	struct vclock join_vclock;
+	vclock_copy(&join_vclock, &replicaset.vclock);
+	gc_add_join_readview(&join_vclock);
+	auto gc_guard = make_scoped_guard([&] {
+		gc_del_join_readview(&join_vclock);
+	});
 
 	say_info("joining replica %s at %s",
 		 tt_uuid_str(&instance_uuid), sio_socketname(io->fd));
@@ -1757,17 +1755,8 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	row.sync = header->sync;
 	if (coio_write_xrow(io, &row) < 0)
 		diag_raise();
-
-	/*
-	 * Advance the WAL consumer state to the position where
-	 * FINAL JOIN ended and assign it to the replica.
-	 */
-	gc_consumer_advance(gc, &stop_vclock);
 	replica = replica_by_uuid(&instance_uuid);
-	if (replica->gc != NULL)
-		gc_consumer_unregister(replica->gc);
-	replica->gc = gc;
-	gc_guard.is_active = false;
+	wal_relay_status_update(replica->id, &stop_vclock);
 }
 
 void
diff --git a/src/box/gc.c b/src/box/gc.c
index f5c387f9d..eb0548aa2 100644
--- a/src/box/gc.c
+++ b/src/box/gc.c
@@ -65,35 +65,6 @@ gc_cleanup_fiber_f(va_list);
 static int
 gc_checkpoint_fiber_f(va_list);
 
-/**
- * Comparator used for ordering gc_consumer objects by signature
- * in a binary tree.
- */
-static inline int
-gc_consumer_cmp(const struct gc_consumer *a, const struct gc_consumer *b)
-{
-	if (vclock_sum(&a->vclock) < vclock_sum(&b->vclock))
-		return -1;
-	if (vclock_sum(&a->vclock) > vclock_sum(&b->vclock))
-		return 1;
-	if ((intptr_t)a < (intptr_t)b)
-		return -1;
-	if ((intptr_t)a > (intptr_t)b)
-		return 1;
-	return 0;
-}
-
-rb_gen(MAYBE_UNUSED static inline, gc_tree_, gc_tree_t,
-       struct gc_consumer, node, gc_consumer_cmp);
-
-/** Free a consumer object. */
-static void
-gc_consumer_delete(struct gc_consumer *consumer)
-{
-	TRASH(consumer);
-	free(consumer);
-}
-
 /** Free a checkpoint object. */
 static void
 gc_checkpoint_delete(struct gc_checkpoint *checkpoint)
@@ -110,7 +81,6 @@ gc_init(void)
 
 	vclock_create(&gc.vclock);
 	rlist_create(&gc.checkpoints);
-	gc_tree_new(&gc.consumers);
 	fiber_cond_create(&gc.cleanup_cond);
 	checkpoint_schedule_cfg(&gc.checkpoint_schedule, 0, 0);
 	engine_collect_garbage(&gc.vclock);
@@ -142,15 +112,6 @@ gc_free(void)
 				 next_checkpoint) {
 		gc_checkpoint_delete(checkpoint);
 	}
-	/* Free all registered consumers. */
-	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
-	while (consumer != NULL) {
-		struct gc_consumer *next = gc_tree_next(&gc.consumers,
-							consumer);
-		gc_tree_remove(&gc.consumers, consumer);
-		gc_consumer_delete(consumer);
-		consumer = next;
-	}
 }
 
 /**
@@ -161,7 +122,6 @@ gc_free(void)
 static void
 gc_run_cleanup(void)
 {
-	bool run_wal_gc = false;
 	bool run_engine_gc = false;
 
 	/*
@@ -170,10 +130,11 @@ gc_run_cleanup(void)
 	 * checkpoints, plus we can't remove checkpoints that
 	 * are still in use.
 	 */
-	struct gc_checkpoint *checkpoint = NULL;
-	while (true) {
-		checkpoint = rlist_first_entry(&gc.checkpoints,
-				struct gc_checkpoint, in_checkpoints);
+	struct gc_checkpoint *checkpoint = NULL, *tmp;
+	rlist_foreach_entry_safe(checkpoint, &gc.checkpoints,
+				 in_checkpoints, tmp) {
+		if (checkpoint->is_join_readview)
+			continue;
 		if (gc.checkpoint_count <= gc.min_checkpoint_count)
 			break;
 		if (!rlist_empty(&checkpoint->refs))
@@ -187,23 +148,7 @@ gc_run_cleanup(void)
 	/* At least one checkpoint must always be available. */
 	assert(checkpoint != NULL);
 
-	/*
-	 * Find the vclock of the oldest WAL row to keep.
-	 * Note, we must keep all WALs created after the
-	 * oldest checkpoint, even if no consumer needs them.
-	 */
-	const struct vclock *vclock = (gc_tree_empty(&gc.consumers) ? NULL :
-				       &gc_tree_first(&gc.consumers)->vclock);
-	if (vclock == NULL ||
-	    vclock_sum(vclock) > vclock_sum(&checkpoint->vclock))
-		vclock = &checkpoint->vclock;
-
-	if (vclock_sum(vclock) > vclock_sum(&gc.vclock)) {
-		vclock_copy(&gc.vclock, vclock);
-		run_wal_gc = true;
-	}
-
-	if (!run_engine_gc && !run_wal_gc)
+	if (!run_engine_gc)
 		return; /* nothing to do */
 
 	/*
@@ -219,10 +164,10 @@ gc_run_cleanup(void)
 	 * we never remove the last checkpoint and the following
 	 * WALs so this may only affect backup checkpoints.
 	 */
-	if (run_engine_gc)
-		engine_collect_garbage(&checkpoint->vclock);
-	if (run_wal_gc)
-		wal_collect_garbage(vclock);
+	engine_collect_garbage(&checkpoint->vclock);
+	checkpoint = rlist_first_entry(&gc.checkpoints,
+					struct gc_checkpoint, in_checkpoints);
+	wal_set_gc_first_vclock(&checkpoint->vclock);
 }
 
 static int
@@ -278,28 +223,10 @@ void
 gc_advance(const struct vclock *vclock)
 {
 	/*
-	 * In case of emergency ENOSPC, the WAL thread may delete
-	 * WAL files needed to restore from backup checkpoints,
-	 * which would be kept by the garbage collector otherwise.
-	 * Bring the garbage collector vclock up to date.
+	 * Bring the garbage collector up to date with the oldest
+	 * wal xlog file.
 	 */
 	vclock_copy(&gc.vclock, vclock);
-
-	struct gc_consumer *consumer = gc_tree_first(&gc.consumers);
-	while (consumer != NULL &&
-	       vclock_sum(&consumer->vclock) < vclock_sum(vclock)) {
-		struct gc_consumer *next = gc_tree_next(&gc.consumers,
-							consumer);
-		assert(!consumer->is_inactive);
-		consumer->is_inactive = true;
-		gc_tree_remove(&gc.consumers, consumer);
-
-		say_crit("deactivated WAL consumer %s at %s", consumer->name,
-			 vclock_to_string(&consumer->vclock));
-
-		consumer = next;
-	}
-	gc_schedule_cleanup();
 }
 
 void
@@ -329,6 +256,10 @@ void
 gc_add_checkpoint(const struct vclock *vclock)
 {
 	struct gc_checkpoint *last_checkpoint = gc_last_checkpoint();
+	while (last_checkpoint != NULL && last_checkpoint->is_join_readview) {
+		last_checkpoint = rlist_prev_entry(last_checkpoint,
+						   in_checkpoints);
+	}
 	if (last_checkpoint != NULL &&
 	    vclock_sum(&last_checkpoint->vclock) == vclock_sum(vclock)) {
 		/*
@@ -351,6 +282,8 @@ gc_add_checkpoint(const struct vclock *vclock)
 	if (checkpoint == NULL)
 		panic("out of memory");
 
+	if (rlist_empty(&gc.checkpoints))
+		wal_set_gc_first_vclock(vclock);
 	rlist_create(&checkpoint->refs);
 	vclock_copy(&checkpoint->vclock, vclock);
 	rlist_add_tail_entry(&gc.checkpoints, checkpoint, in_checkpoints);
@@ -359,6 +292,47 @@ gc_add_checkpoint(const struct vclock *vclock)
 	gc_schedule_cleanup();
 }
 
+void
+gc_add_join_readview(const struct vclock *vclock)
+{
+	struct gc_checkpoint *checkpoint = calloc(1, sizeof(*checkpoint));
+	/*
+	 * It is not a fatal error if we could not protect the
+	 * subsequent logs from collection.
+	 */
+	if (checkpoint == NULL) {
+		say_error("GC: could not add a join readview");
+		return;
+	}
+	if (rlist_empty(&gc.checkpoints))
+		wal_set_gc_first_vclock(vclock);
+	checkpoint->is_join_readview = true;
+	rlist_create(&checkpoint->refs);
+	vclock_copy(&checkpoint->vclock, vclock);
+	rlist_add_tail_entry(&gc.checkpoints, checkpoint, in_checkpoints);
+}
+
+void
+gc_del_join_readview(const struct vclock *vclock)
+{
+	struct gc_checkpoint *checkpoint;
+	rlist_foreach_entry(checkpoint, &gc.checkpoints, in_checkpoints) {
+		if (!checkpoint->is_join_readview ||
+		    vclock_compare(&checkpoint->vclock, vclock) != 0)
+			continue;
+		rlist_del(&checkpoint->in_checkpoints);
+		free(checkpoint);
+		checkpoint = rlist_first_entry(&gc.checkpoints,
+					       struct gc_checkpoint,
+					       in_checkpoints);
+		wal_set_gc_first_vclock(&checkpoint->vclock);
+		return;
+	}
+	/* A join readview was not found. */
+	say_error("GC: could not delete a join readview");
+}
+
+
 static int
 gc_do_checkpoint(void)
 {
@@ -513,75 +487,3 @@ gc_unref_checkpoint(struct gc_checkpoint_ref *ref)
 	rlist_del_entry(ref, in_refs);
 	gc_schedule_cleanup();
 }
-
-struct gc_consumer *
-gc_consumer_register(const struct vclock *vclock, const char *format, ...)
-{
-	struct gc_consumer *consumer = calloc(1, sizeof(*consumer));
-	if (consumer == NULL) {
-		diag_set(OutOfMemory, sizeof(*consumer),
-			 "malloc", "struct gc_consumer");
-		return NULL;
-	}
-
-	va_list ap;
-	va_start(ap, format);
-	vsnprintf(consumer->name, GC_NAME_MAX, format, ap);
-	va_end(ap);
-
-	vclock_copy(&consumer->vclock, vclock);
-	gc_tree_insert(&gc.consumers, consumer);
-	return consumer;
-}
-
-void
-gc_consumer_unregister(struct gc_consumer *consumer)
-{
-	if (!consumer->is_inactive) {
-		gc_tree_remove(&gc.consumers, consumer);
-		gc_schedule_cleanup();
-	}
-	gc_consumer_delete(consumer);
-}
-
-void
-gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock)
-{
-	if (consumer->is_inactive)
-		return;
-
-	int64_t signature = vclock_sum(vclock);
-	int64_t prev_signature = vclock_sum(&consumer->vclock);
-
-	assert(signature >= prev_signature);
-	if (signature == prev_signature)
-		return; /* nothing to do */
-
-	/*
-	 * Do not update the tree unless the tree invariant
-	 * is violated.
-	 */
-	struct gc_consumer *next = gc_tree_next(&gc.consumers, consumer);
-	bool update_tree = (next != NULL &&
-			    signature >= vclock_sum(&next->vclock));
-
-	if (update_tree)
-		gc_tree_remove(&gc.consumers, consumer);
-
-	vclock_copy(&consumer->vclock, vclock);
-
-	if (update_tree)
-		gc_tree_insert(&gc.consumers, consumer);
-
-	gc_schedule_cleanup();
-}
-
-struct gc_consumer *
-gc_consumer_iterator_next(struct gc_consumer_iterator *it)
-{
-	if (it->curr != NULL)
-		it->curr = gc_tree_next(&gc.consumers, it->curr);
-	else
-		it->curr = gc_tree_first(&gc.consumers);
-	return it->curr;
-}
diff --git a/src/box/gc.h b/src/box/gc.h
index 827a5db8e..49fb05fd4 100644
--- a/src/box/gc.h
+++ b/src/box/gc.h
@@ -45,12 +45,9 @@ extern "C" {
 #endif /* defined(__cplusplus) */
 
 struct fiber;
-struct gc_consumer;
 
 enum { GC_NAME_MAX = 64 };
 
-typedef rb_node(struct gc_consumer) gc_node_t;
-
 /**
  * Garbage collector keeps track of all preserved checkpoints.
  * The following structure represents a checkpoint.
@@ -60,6 +57,8 @@ struct gc_checkpoint {
 	struct rlist in_checkpoints;
 	/** VClock of the checkpoint. */
 	struct vclock vclock;
+	/** True when it is a join readview. */
+	bool is_join_readview;
 	/**
 	 * List of checkpoint references, linked by
 	 * gc_checkpoint_ref::in_refs.
@@ -81,26 +80,6 @@ struct gc_checkpoint_ref {
 	char name[GC_NAME_MAX];
 };
 
-/**
- * The object of this type is used to prevent garbage
- * collection from removing WALs that are still in use.
- */
-struct gc_consumer {
-	/** Link in gc_state::consumers. */
-	gc_node_t node;
-	/** Human-readable name. */
-	char name[GC_NAME_MAX];
-	/** The vclock tracked by this consumer. */
-	struct vclock vclock;
-	/**
-	 * This flag is set if a WAL needed by this consumer was
-	 * deleted by the WAL thread on ENOSPC.
-	 */
-	bool is_inactive;
-};
-
-typedef rb_tree(struct gc_consumer) gc_tree_t;
-
 /** Garbage collection state. */
 struct gc_state {
 	/** VClock of the oldest WAL row available on the instance. */
@@ -121,8 +100,6 @@ struct gc_state {
 	 * to the tail. Linked by gc_checkpoint::in_checkpoints.
 	 */
 	struct rlist checkpoints;
-	/** Registered consumers, linked by gc_consumer::node. */
-	gc_tree_t consumers;
 	/** Fiber responsible for periodic checkpointing. */
 	struct fiber *checkpoint_fiber;
 	/** Schedule of periodic checkpoints. */
@@ -208,7 +185,6 @@ gc_free(void);
 
 /**
  * Advance the garbage collector vclock to the given position.
- * Deactivate WAL consumers that need older data.
  */
 void
 gc_advance(const struct vclock *vclock);
@@ -219,7 +195,7 @@ gc_advance(const struct vclock *vclock);
  *
  * Note, this function doesn't run garbage collector so
  * changes will take effect only after a new checkpoint
- * is created or a consumer is unregistered.
+ * is created.
  */
 void
 gc_set_min_checkpoint_count(int min_checkpoint_count);
@@ -239,6 +215,19 @@ gc_set_checkpoint_interval(double interval);
 void
 gc_add_checkpoint(const struct vclock *vclock);
 
+/**
+ * Register a join readview in the garbage collector state in order
+ * to prevent collection of the subsequent logs.
+ */
+void
+gc_add_join_readview(const struct vclock *vclock);
+
+/**
+ * Unregister a join readview from the garbage collector state.
+ */
+void
+gc_del_join_readview(const struct vclock *vclock);
+
 /**
  * Make a *manual* checkpoint.
  * This is entry point for box.snapshot() and SIGUSR1 signal
@@ -283,58 +272,6 @@ gc_ref_checkpoint(struct gc_checkpoint *checkpoint,
 void
 gc_unref_checkpoint(struct gc_checkpoint_ref *ref);
 
-/**
- * Register a consumer.
- *
- * This will stop garbage collection of WAL files newer than
- * @vclock until the consumer is unregistered or advanced.
- * @format... specifies a human-readable name of the consumer,
- * it will be used for listing the consumer in box.info.gc().
- *
- * Returns a pointer to the new consumer object or NULL on
- * memory allocation failure.
- */
-CFORMAT(printf, 2, 3)
-struct gc_consumer *
-gc_consumer_register(const struct vclock *vclock, const char *format, ...);
-
-/**
- * Unregister a consumer and invoke garbage collection
- * if needed.
- */
-void
-gc_consumer_unregister(struct gc_consumer *consumer);
-
-/**
- * Advance the vclock tracked by a consumer and
- * invoke garbage collection if needed.
- */
-void
-gc_consumer_advance(struct gc_consumer *consumer, const struct vclock *vclock);
-
-/**
- * Iterator over registered consumers. The iterator is valid
- * as long as the caller doesn't yield.
- */
-struct gc_consumer_iterator {
-	struct gc_consumer *curr;
-};
-
-/** Init an iterator over consumers. */
-static inline void
-gc_consumer_iterator_init(struct gc_consumer_iterator *it)
-{
-	it->curr = NULL;
-}
-
-/**
- * Iterate to the next registered consumer. Return a pointer
- * to the next consumer object or NULL if there is no more
- * consumers.
- */
-struct gc_consumer *
-gc_consumer_iterator_next(struct gc_consumer_iterator *it);
-
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index c004fad27..aba9a4b7c 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -399,6 +399,8 @@ lbox_info_gc_call(struct lua_State *L)
 	count = 0;
 	struct gc_checkpoint *checkpoint;
 	gc_foreach_checkpoint(checkpoint) {
+		if (checkpoint->is_join_readview)
+			continue;
 		lua_createtable(L, 0, 2);
 
 		lua_pushstring(L, "vclock");
@@ -423,33 +425,6 @@ lbox_info_gc_call(struct lua_State *L)
 	}
 	lua_settable(L, -3);
 
-	lua_pushstring(L, "consumers");
-	lua_newtable(L);
-
-	struct gc_consumer_iterator consumers;
-	gc_consumer_iterator_init(&consumers);
-
-	count = 0;
-	struct gc_consumer *consumer;
-	while ((consumer = gc_consumer_iterator_next(&consumers)) != NULL) {
-		lua_createtable(L, 0, 3);
-
-		lua_pushstring(L, "name");
-		lua_pushstring(L, consumer->name);
-		lua_settable(L, -3);
-
-		lua_pushstring(L, "vclock");
-		lbox_pushvclock(L, &consumer->vclock);
-		lua_settable(L, -3);
-
-		lua_pushstring(L, "signature");
-		luaL_pushint64(L, vclock_sum(&consumer->vclock));
-		lua_settable(L, -3);
-
-		lua_rawseti(L, -2, ++count);
-	}
-	lua_settable(L, -3);
-
 	return 1;
 }
 
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 5b0d3f023..13c8f4c28 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -66,23 +66,6 @@ struct relay_status_msg {
 	struct vclock vclock;
 };
 
-/**
- * Cbus message to update replica gc state in tx thread.
- */
-struct relay_gc_msg {
-	/** Parent */
-	struct cmsg msg;
-	/**
-	 * Link in the list of pending gc messages,
-	 * see relay::pending_gc.
-	 */
-	struct stailq_entry in_pending;
-	/** Relay instance */
-	struct relay *relay;
-	/** Vclock to advance to */
-	struct vclock vclock;
-};
-
 /** State of a replication relay. */
 struct relay {
 	/** The thread in which we relay data to the replica. */
@@ -123,11 +106,6 @@ struct relay {
 	struct cpipe relay_pipe;
 	/** Status message */
 	struct relay_status_msg status_msg;
-	/**
-	 * List of garbage collection messages awaiting
-	 * confirmation from the replica.
-	 */
-	struct stailq pending_gc;
 	/** Time when last row was sent to peer. */
 	double last_row_time;
 	/** Relay sync state. */
@@ -185,7 +163,6 @@ relay_new(struct replica *replica)
 	relay->last_row_time = ev_monotonic_now(loop());
 	fiber_cond_create(&relay->reader_cond);
 	diag_create(&relay->diag);
-	stailq_create(&relay->pending_gc);
 	relay->state = RELAY_OFF;
 	return relay;
 }
@@ -241,12 +218,6 @@ relay_exit(struct relay *relay)
 static void
 relay_stop(struct relay *relay)
 {
-	struct relay_gc_msg *gc_msg, *next_gc_msg;
-	stailq_foreach_entry_safe(gc_msg, next_gc_msg,
-				  &relay->pending_gc, in_pending) {
-		free(gc_msg);
-	}
-	stailq_create(&relay->pending_gc);
 	if (relay->r != NULL)
 		recovery_delete(relay->r);
 	relay->r = NULL;
@@ -383,7 +354,9 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock,
 static void
 relay_status_update(struct cmsg *msg)
 {
+	struct relay_status_msg *status = (struct relay_status_msg *)msg;
 	msg->route = NULL;
+	fiber_cond_signal(&status->relay->reader_cond);
 }
 
 /**
@@ -393,6 +366,8 @@ static void
 tx_status_update(struct cmsg *msg)
 {
 	struct relay_status_msg *status = (struct relay_status_msg *)msg;
+	if (!status->relay->replica->anon)
+		wal_relay_status_update(status->relay->replica->id, &status->vclock);
 	vclock_copy(&status->relay->tx.vclock, &status->vclock);
 	static const struct cmsg_hop route[] = {
 		{relay_status_update, NULL}
@@ -401,74 +376,6 @@ tx_status_update(struct cmsg *msg)
 	cpipe_push(&status->relay->relay_pipe, msg);
 }
 
-/**
- * Update replica gc state in tx thread.
- */
-static void
-tx_gc_advance(struct cmsg *msg)
-{
-	struct relay_gc_msg *m = (struct relay_gc_msg *)msg;
-	gc_consumer_advance(m->relay->replica->gc, &m->vclock);
-	free(m);
-}
-
-static int
-relay_on_close_log_f(struct trigger *trigger, void * /* event */)
-{
-	static const struct cmsg_hop route[] = {
-		{tx_gc_advance, NULL}
-	};
-	struct relay *relay = (struct relay *)trigger->data;
-	struct relay_gc_msg *m = (struct relay_gc_msg *)malloc(sizeof(*m));
-	if (m == NULL) {
-		say_warn("failed to allocate relay gc message");
-		return 0;
-	}
-	cmsg_init(&m->msg, route);
-	m->relay = relay;
-	vclock_copy(&m->vclock, &relay->r->vclock);
-	/*
-	 * Do not invoke garbage collection until the replica
-	 * confirms that it has received data stored in the
-	 * sent xlog.
-	 */
-	stailq_add_tail_entry(&relay->pending_gc, m, in_pending);
-	return 0;
-}
-
-/**
- * Invoke pending garbage collection requests.
- *
- * This function schedules the most recent gc message whose
- * vclock is less than or equal to the given one. Older
- * messages are discarded as their job will be done by the
- * scheduled message anyway.
- */
-static inline void
-relay_schedule_pending_gc(struct relay *relay, const struct vclock *vclock)
-{
-	struct relay_gc_msg *curr, *next, *gc_msg = NULL;
-	stailq_foreach_entry_safe(curr, next, &relay->pending_gc, in_pending) {
-		/*
-		 * We may delete a WAL file only if its vclock is
-		 * less than or equal to the vclock acknowledged by
-		 * the replica. Even if the replica's signature is
-		 * is greater, but the vclocks are incomparable, we
-		 * must not delete the WAL, because there may still
-		 * be rows not applied by the replica in it while
-		 * the greater signatures is due to changes pulled
-		 * from other members of the cluster.
-		 */
-		if (vclock_compare(&curr->vclock, vclock) > 0)
-			break;
-		stailq_shift(&relay->pending_gc);
-		free(gc_msg);
-		gc_msg = curr;
-	}
-	if (gc_msg != NULL)
-		cpipe_push(&relay->tx_pipe, &gc_msg->msg);
-}
-
 static void
 relay_set_error(struct relay *relay, struct error *e)
 {
@@ -569,17 +476,6 @@ relay_subscribe_f(va_list ap)
 	cbus_pair("tx", relay->endpoint.name, &relay->tx_pipe,
 		  &relay->relay_pipe, NULL, NULL, cbus_process);
 
-	/*
-	 * Setup garbage collection trigger.
-	 * Not needed for anonymous replicas, since they
-	 * aren't registered with gc at all.
-	 */
-	struct trigger on_close_log = {
-		RLIST_LINK_INITIALIZER, relay_on_close_log_f, relay, NULL
-	};
-	if (!relay->replica->anon)
-		trigger_add(&r->on_close_log, &on_close_log);
-
 	/* Setup WAL watcher for sending new rows to the replica. */
 	wal_set_watcher(&relay->wal_watcher, relay->endpoint.name,
 			relay_process_wal_event, cbus_process);
@@ -643,8 +539,6 @@ relay_subscribe_f(va_list ap)
 		vclock_copy(&relay->status_msg.vclock, send_vclock);
 		relay->status_msg.relay = relay;
 		cpipe_push(&relay->tx_pipe, &relay->status_msg.msg);
-		/* Collect xlog files received by the replica. */
-		relay_schedule_pending_gc(relay, send_vclock);
 	}
 
 	/*
@@ -657,8 +551,6 @@ relay_subscribe_f(va_list ap)
 	say_crit("exiting the relay loop");
 
 	/* Clear garbage collector trigger and WAL watcher. */
-	if (!relay->replica->anon)
-		trigger_clear(&on_close_log);
 	wal_clear_watcher(&relay->wal_watcher, cbus_process);
 
 	/* Join ack reader fiber. */
@@ -682,17 +574,8 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 	assert(replica->anon || replica->id != REPLICA_ID_NIL);
 	struct relay *relay = replica->relay;
 	assert(relay->state != RELAY_FOLLOW);
-	/*
-	 * Register the replica with the garbage collector
-	 * unless it has already been registered by initial
-	 * join.
-	 */
-	if (replica->gc == NULL && !replica->anon) {
-		replica->gc = gc_consumer_register(replica_clock, "replica %s",
-						   tt_uuid_str(&replica->uuid));
-		if (replica->gc == NULL)
-			diag_raise();
-	}
+	if (!replica->anon)
+		wal_relay_status_update(replica->id, replica_clock);
 
 	relay_start(relay, fd, sync, relay_send_row);
 	auto relay_guard = make_scoped_guard([=] {
diff --git a/src/box/replication.cc b/src/box/replication.cc
index e7bfa22ab..869177656 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -37,6 +37,7 @@
 #include <small/mempool.h>
 
 #include "box.h"
+#include "wal.h"
 #include "gc.h"
 #include "error.h"
 #include "relay.h"
@@ -191,8 +192,6 @@ replica_delete(struct replica *replica)
 	assert(replica_is_orphan(replica));
 	if (replica->relay != NULL)
 		relay_delete(replica->relay);
-	if (replica->gc != NULL)
-		gc_consumer_unregister(replica->gc);
 	TRASH(replica);
 	free(replica);
 }
@@ -235,15 +234,6 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 		/* Assign local replica id */
 		assert(instance_id == REPLICA_ID_NIL);
 		instance_id = replica_id;
-	} else if (replica->anon) {
-		/*
-		 * Set replica gc on its transition from
-		 * anonymous to a normal one.
-		 */
-		assert(replica->gc == NULL);
-		replica->gc = gc_consumer_register(&replicaset.vclock,
-						   "replica %s",
-						   tt_uuid_str(&replica->uuid));
 	}
 	replicaset.replica_by_id[replica_id] = replica;
 
@@ -271,22 +261,13 @@ replica_clear_id(struct replica *replica)
 		assert(replicaset.is_joining);
 		instance_id = REPLICA_ID_NIL;
 	}
+	uint32_t replica_id = replica->id;
 	replica->id = REPLICA_ID_NIL;
 	say_info("removed replica %s", tt_uuid_str(&replica->uuid));
 
-	/*
-	 * The replica will never resubscribe so we don't need to keep
-	 * WALs for it anymore. Unregister it with the garbage collector
-	 * if the relay thread is stopped. In case the relay thread is
-	 * still running, it may need to access replica->gc so leave the
-	 * job to replica_on_relay_stop, which will be called as soon as
-	 * the relay thread exits.
-	 */
-	if (replica->gc != NULL &&
-	    relay_get_state(replica->relay) != RELAY_FOLLOW) {
-		gc_consumer_unregister(replica->gc);
-		replica->gc = NULL;
-	}
+	if (replica_id != REPLICA_ID_NIL)
+		wal_relay_delete(replica_id);
+
 	if (replica_is_orphan(replica)) {
 		replica_hash_remove(&replicaset.hash, replica);
 		replica_delete(replica);
@@ -896,10 +877,7 @@ replica_on_relay_stop(struct replica *replica)
 	 * collector then. See also replica_clear_id.
 	 */
 	if (replica->id == REPLICA_ID_NIL) {
-		if (!replica->anon) {
-			gc_consumer_unregister(replica->gc);
-			replica->gc = NULL;
-		} else {
+		if (replica->anon) {
 			assert(replica->gc == NULL);
 			/*
 			 * We do not replicate from anonymous
diff --git a/src/box/wal.c b/src/box/wal.c
index ce15cb459..ad3c79a8a 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -43,6 +43,7 @@
 #include "cbus.h"
 #include "coio_task.h"
 #include "replication.h"
+#include "mclock.h"
 
 enum {
 	/**
@@ -154,6 +155,25 @@ struct wal_writer
 	 * Used for replication relays.
 	 */
 	struct rlist watchers;
+	/**
+	 * Matrix clock with all wal consumer vclocks.
+	 */
+	struct mclock mclock;
+	/**
+	 * Fiber condition signaled when the matrix clock is updated.
+	 */
+	struct fiber_cond wal_gc_cond;
+	/**
+	 * Minimal known xlog vclock used to decide when wal gc
+	 * should be invoked. It is a cached value of the second
+	 * vclock in the wal vclockset.
+	 */
+	const struct vclock *gc_wal_vclock;
+	/**
+	 * Vclock which prevents subsequent logs from being
+	 * collected. Ignored in case of a no-space error.
+	 */
+	struct vclock gc_first_vclock;
 };
 
 struct wal_msg {
@@ -335,6 +355,44 @@ tx_notify_checkpoint(struct cmsg *msg)
 	free(msg);
 }
 
+/*
+ * Shortcut function which returns the second vclock from a wal
+ * directory. If the gc vclock is greater than or equal to the second
+ * one in the wal directory then there is at least one file to collect.
+ */
+static inline const struct vclock *
+second_vclock(struct wal_writer *writer)
+{
+	struct vclock *first_vclock = vclockset_first(&writer->wal_dir.index);
+	struct vclock *second_vclock = NULL;
+	if (first_vclock != NULL)
+		second_vclock = vclockset_next(&writer->wal_dir.index,
+					       first_vclock);
+	if (first_vclock != NULL && second_vclock == NULL &&
+	    first_vclock->signature != writer->vclock.signature) {
+		/* New xlog could be not created yet. */
+		second_vclock = &writer->vclock;
+	}
+	return second_vclock;
+}
+
+/*
+ * Shortcut function which compares three vclocks and returns
+ * true if the old vclock has not yet reached the target one
+ * whereas the new vclock has. Used in order to decide when
+ * wal gc should be signaled.
+ */
+static inline bool
+vclock_order_changed(const struct vclock *old, const struct vclock *target,
+		     const struct vclock *new)
+{
+	int rc = vclock_compare(old, target);
+	if (rc > 0 && rc != VCLOCK_ORDER_UNDEFINED)
+		return false;
+	rc = vclock_compare(new, target);
+	return rc >= 0 && rc != VCLOCK_ORDER_UNDEFINED;
+}
+
 /**
  * Initialize WAL writer context. Even though it's a singleton,
  * encapsulate the details just in case we may use
@@ -375,6 +433,12 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 
 	mempool_create(&writer->msg_pool, &cord()->slabc,
 		       sizeof(struct wal_msg));
+
+	mclock_create(&writer->mclock);
+
+	fiber_cond_create(&writer->wal_gc_cond);
+	writer->gc_wal_vclock = NULL;
+	vclock_create(&writer->gc_first_vclock);
 }
 
 /** Destroy a WAL writer structure. */
@@ -494,6 +558,7 @@ wal_enable(void)
 	 */
 	if (xdir_scan(&writer->wal_dir))
 		return -1;
+	writer->gc_wal_vclock = second_vclock(writer);
 
 	/* Open the most recent WAL file. */
 	if (wal_open(writer) != 0)
@@ -592,6 +657,8 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
 		/*
 		 * The next WAL will be created on the first write.
 		 */
+		if (writer->gc_wal_vclock == NULL)
+			writer->gc_wal_vclock = second_vclock(writer);
 	}
 	vclock_copy(&msg->vclock, &writer->vclock);
 	msg->wal_size = writer->checkpoint_wal_size;
@@ -695,20 +762,35 @@ wal_set_checkpoint_threshold(int64_t threshold)
 	fiber_set_cancellable(cancellable);
 }
 
-struct wal_gc_msg
+static void
+wal_gc_advance(struct wal_writer *writer)
 {
-	struct cbus_call_msg base;
-	const struct vclock *vclock;
-};
+	static struct cmsg_hop route[] = {
+		{ tx_notify_gc, NULL },
+	};
+	struct tx_notify_gc_msg *msg = malloc(sizeof(*msg));
+	if (msg != NULL) {
+		if (xdir_first_vclock(&writer->wal_dir, &msg->vclock) < 0)
+			vclock_copy(&msg->vclock, &writer->vclock);
+		cmsg_init(&msg->base, route);
+		cpipe_push(&writer->tx_prio_pipe, &msg->base);
+	} else
+		say_warn("failed to allocate gc notification message");
+}
 
 static int
-wal_collect_garbage_f(struct cbus_call_msg *data)
+wal_collect_garbage(struct wal_writer *writer)
 {
-	struct wal_writer *writer = &wal_writer_singleton;
-	const struct vclock *vclock = ((struct wal_gc_msg *)data)->vclock;
+	struct vclock *collect_vclock = &writer->gc_first_vclock;
+	struct vclock relay_min_vclock;
+	if (mclock_get(&writer->mclock, -1, &relay_min_vclock) == 0) {
+		int rc = vclock_compare(collect_vclock, &relay_min_vclock);
+		if (rc > 0 || rc == VCLOCK_ORDER_UNDEFINED)
+			collect_vclock = &relay_min_vclock;
+	}
 
 	if (!xlog_is_open(&writer->current_wal) &&
-	    vclock_sum(vclock) >= vclock_sum(&writer->vclock)) {
+	    vclock_sum(collect_vclock) >= vclock_sum(&writer->vclock)) {
 		/*
 		 * The last available WAL file has been sealed and
 		 * all registered consumers have done reading it.
@@ -720,27 +802,54 @@ wal_collect_garbage_f(struct cbus_call_msg *data)
 		 * required by registered consumers and delete all
 		 * older WAL files.
 		 */
-		vclock = vclockset_psearch(&writer->wal_dir.index, vclock);
+		collect_vclock = vclockset_match(&writer->wal_dir.index,
+						 collect_vclock);
+	}
+	if (collect_vclock != NULL) {
+		xdir_collect_garbage(&writer->wal_dir,
+				     vclock_sum(collect_vclock), XDIR_GC_ASYNC);
+		writer->gc_wal_vclock = second_vclock(writer);
+		wal_gc_advance(writer);
 	}
-	if (vclock != NULL)
-		xdir_collect_garbage(&writer->wal_dir, vclock_sum(vclock),
-				     XDIR_GC_ASYNC);
 
 	return 0;
 }
 
+struct wal_set_gc_first_vclock_msg {
+	struct cbus_call_msg base;
+	const struct vclock *vclock;
+};
+
+static int
+wal_set_gc_first_vclock_f(struct cbus_call_msg *base)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_set_gc_first_vclock_msg *msg =
+		container_of(base, struct wal_set_gc_first_vclock_msg, base);
+	if (writer->gc_wal_vclock != NULL &&
+	    vclock_order_changed(&writer->gc_first_vclock,
+				 writer->gc_wal_vclock, msg->vclock))
+		fiber_cond_signal(&writer->wal_gc_cond);
+	vclock_copy(&writer->gc_first_vclock, msg->vclock);
+	return 0;
+}
+
 void
-wal_collect_garbage(const struct vclock *vclock)
+wal_set_gc_first_vclock(const struct vclock *vclock)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
-	if (writer->wal_mode == WAL_NONE)
+	if (writer->wal_mode == WAL_NONE) {
+		vclock_copy(&writer->gc_first_vclock, vclock);
 		return;
-	struct wal_gc_msg msg;
+	}
+	struct wal_set_gc_first_vclock_msg msg;
 	msg.vclock = vclock;
 	bool cancellable = fiber_set_cancellable(false);
-	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe, &msg.base,
-		  wal_collect_garbage_f, NULL, TIMEOUT_INFINITY);
+	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
+		  &msg.base, wal_set_gc_first_vclock_f, NULL,
+		  TIMEOUT_INFINITY);
 	fiber_set_cancellable(cancellable);
+
 }
 
 static void
@@ -790,7 +899,8 @@ wal_opt_rotate(struct wal_writer *writer)
 	 * collection, see wal_collect_garbage().
 	 */
 	xdir_add_vclock(&writer->wal_dir, &writer->vclock);
-
+	if (writer->gc_wal_vclock == NULL)
+		writer->gc_wal_vclock = second_vclock(writer);
 	wal_notify_watchers(writer, WAL_EVENT_ROTATE);
 	return 0;
 }
@@ -845,6 +955,10 @@ retry:
 	}
 
 	xdir_collect_garbage(&writer->wal_dir, gc_lsn, XDIR_GC_REMOVE_ONE);
+	writer->gc_wal_vclock = second_vclock(writer);
+	if (vclock_compare(&writer->gc_first_vclock,
+			   writer->gc_wal_vclock) < 0)
+		vclock_copy(&writer->gc_first_vclock, writer->gc_wal_vclock);
 	notify_gc = true;
 	goto retry;
 error:
@@ -861,20 +975,8 @@ out:
 	 * event and a failure to send this message isn't really
 	 * critical.
 	 */
-	if (notify_gc) {
-		static struct cmsg_hop route[] = {
-			{ tx_notify_gc, NULL },
-		};
-		struct tx_notify_gc_msg *msg = malloc(sizeof(*msg));
-		if (msg != NULL) {
-			if (xdir_first_vclock(&writer->wal_dir,
-					      &msg->vclock) < 0)
-				vclock_copy(&msg->vclock, &writer->vclock);
-			cmsg_init(&msg->base, route);
-			cpipe_push(&writer->tx_prio_pipe, &msg->base);
-		} else
-			say_warn("failed to allocate gc notification message");
-	}
+	if (notify_gc)
+		wal_gc_advance(writer);
 	return rc;
 }
 
@@ -1126,6 +1228,26 @@ wal_write_to_disk(struct cmsg *msg)
 	wal_notify_watchers(writer, WAL_EVENT_WRITE);
 }
 
+
+/*
+ * WAL garbage collection fiber.
+ * The fiber waits until the writer mclock is updated
+ * and then compares the mclock lower bound against
+ * the oldest wal file.
+ */
+static int
+wal_gc_f(va_list ap)
+{
+	struct wal_writer *writer = va_arg(ap, struct wal_writer *);
+
+	while (!fiber_is_cancelled()) {
+		fiber_cond_wait(&writer->wal_gc_cond);
+		wal_collect_garbage(writer);
+	}
+
+	return 0;
+}
+
 /** WAL writer main loop.  */
 static int
 wal_writer_f(va_list ap)
@@ -1145,8 +1267,15 @@ wal_writer_f(va_list ap)
 	 */
 	cpipe_create(&writer->tx_prio_pipe, "tx_prio");
 
+	struct fiber *wal_gc_fiber = fiber_new("wal_gc", wal_gc_f);
+	fiber_set_joinable(wal_gc_fiber, true);
+	fiber_start(wal_gc_fiber, writer);
+
 	cbus_loop(&endpoint);
 
+	fiber_cancel(wal_gc_fiber);
+	fiber_join(wal_gc_fiber);
+
 	/*
 	 * Create a new empty WAL on shutdown so that we don't
 	 * have to rescan the last WAL to find the instance vclock.
@@ -1436,6 +1565,82 @@ wal_notify_watchers(struct wal_writer *writer, unsigned events)
 		wal_watcher_notify(watcher, events);
 }
 
+struct wal_relay_status_update_msg {
+	struct cbus_call_msg base;
+	uint32_t replica_id;
+	struct vclock vclock;
+};
+
+static int
+wal_relay_status_update_f(struct cbus_call_msg *base)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_relay_status_update_msg *msg =
+		container_of(base, struct wal_relay_status_update_msg, base);
+	struct vclock old_vclock;
+	mclock_extract_row(&writer->mclock, msg->replica_id, &old_vclock);
+	if (writer->gc_wal_vclock != NULL &&
+	    vclock_order_changed(&old_vclock, writer->gc_wal_vclock,
+				 &msg->vclock))
+		fiber_cond_signal(&writer->wal_gc_cond);
+	mclock_update(&writer->mclock, msg->replica_id, &msg->vclock);
+	return 0;
+}
+
+void
+wal_relay_status_update(uint32_t replica_id, const struct vclock *vclock)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_relay_status_update_msg msg;
+	/*
+	 * We do not take anonymous replicas into account. There is
+	 * no way to distinguish them, and an anonymous replica could
+	 * be rebootstrapped at any time.
+	 */
+	if (replica_id == 0)
+		return;
+	msg.replica_id = replica_id;
+	vclock_copy(&msg.vclock, vclock);
+	bool cancellable = fiber_set_cancellable(false);
+	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
+		  &msg.base, wal_relay_status_update_f, NULL,
+		  TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+}
+
+struct wal_relay_delete_msg {
+	struct cmsg base;
+	uint32_t replica_id;
+};
+
+static void
+wal_relay_delete_f(struct cmsg *base)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_relay_delete_msg *msg =
+		container_of(base, struct wal_relay_delete_msg, base);
+	struct vclock vclock;
+	vclock_create(&vclock);
+	mclock_update(&writer->mclock, msg->replica_id, &vclock);
+	fiber_cond_signal(&writer->wal_gc_cond);
+	free(msg);
+}
+
+void
+wal_relay_delete(uint32_t replica_id)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_relay_delete_msg *msg =
+		(struct wal_relay_delete_msg *)malloc(sizeof(*msg));
+	if (msg == NULL) {
+		say_error("Could not allocate relay delete message");
+		return;
+	}
+	static struct cmsg_hop route[] = {{wal_relay_delete_f, NULL}};
+	cmsg_init(&msg->base, route);
+	msg->replica_id = replica_id;
+	cpipe_push(&writer->wal_pipe, &msg->base);
+}
 
 /**
  * After fork, the WAL writer thread disappears.
diff --git a/src/box/wal.h b/src/box/wal.h
index 76b44941a..86887656d 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -223,6 +223,12 @@ wal_begin_checkpoint(struct wal_checkpoint *checkpoint);
 void
 wal_commit_checkpoint(struct wal_checkpoint *checkpoint);
 
+/**
+ * Prevent wal from collecting logs after the given vclock.
+ */
+void
+wal_set_gc_first_vclock(const struct vclock *vclock);
+
 /**
  * Set the WAL size threshold exceeding which will trigger
  * checkpointing in TX.
@@ -231,11 +237,16 @@ void
 wal_set_checkpoint_threshold(int64_t threshold);
 
 /**
- * Remove WAL files that are not needed by consumers reading
- * rows at @vclock or newer.
+ * Update a wal consumer vclock position.
+ */
+void
+wal_relay_status_update(uint32_t replica_id, const struct vclock *vclock);
+
+/**
+ * Unregister a wal consumer.
  */
 void
-wal_collect_garbage(const struct vclock *vclock);
+wal_relay_delete(uint32_t replica_id);
 
 void
 wal_init_vy_log();
diff --git a/test/replication/gc_no_space.result b/test/replication/gc_no_space.result
index e860ab00f..f295cb16b 100644
--- a/test/replication/gc_no_space.result
+++ b/test/replication/gc_no_space.result
@@ -162,18 +162,12 @@ check_snap_count(2)
 gc = box.info.gc()
 ---
 ...
-#gc.consumers -- 3
----
-- 3
-...
+--#gc.consumers -- 3
 #gc.checkpoints -- 2
 ---
 - 2
 ...
-gc.signature == gc.consumers[1].signature
----
-- true
-...
+--gc.signature == gc.consumers[1].signature
 --
 -- Inject a ENOSPC error and check that the WAL thread deletes
 -- old WAL files to prevent the user from seeing the error.
@@ -201,18 +195,12 @@ check_snap_count(2)
 gc = box.info.gc()
 ---
 ...
-#gc.consumers -- 1
----
-- 1
-...
+--#gc.consumers -- 1
 #gc.checkpoints -- 2
 ---
 - 2
 ...
-gc.signature == gc.consumers[1].signature
----
-- true
-...
+--gc.signature == gc.consumers[1].signature
 --
 -- Check that the WAL thread never deletes WAL files that are
 -- needed for recovery from the last checkpoint, but may delete
@@ -242,10 +230,7 @@ check_snap_count(2)
 gc = box.info.gc()
 ---
 ...
-#gc.consumers -- 0
----
-- 0
-...
+--#gc.consumers -- 0
 #gc.checkpoints -- 2
 ---
 - 2
@@ -272,7 +257,4 @@ gc = box.info.gc()
 ---
 - 2
 ...
-gc.signature == gc.checkpoints[2].signature
----
-- true
-...
+--gc.signature == gc.checkpoints[2].signature
diff --git a/test/replication/gc_no_space.test.lua b/test/replication/gc_no_space.test.lua
index 98ccd401b..c28bc0710 100644
--- a/test/replication/gc_no_space.test.lua
+++ b/test/replication/gc_no_space.test.lua
@@ -72,9 +72,9 @@ s:auto_increment{}
 check_wal_count(5)
 check_snap_count(2)
 gc = box.info.gc()
-#gc.consumers -- 3
+--#gc.consumers -- 3
 #gc.checkpoints -- 2
-gc.signature == gc.consumers[1].signature
+--gc.signature == gc.consumers[1].signature
 
 --
 -- Inject a ENOSPC error and check that the WAL thread deletes
@@ -87,9 +87,9 @@ errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 check_wal_count(3)
 check_snap_count(2)
 gc = box.info.gc()
-#gc.consumers -- 1
+--#gc.consumers -- 1
 #gc.checkpoints -- 2
-gc.signature == gc.consumers[1].signature
+--gc.signature == gc.consumers[1].signature
 
 --
 -- Check that the WAL thread never deletes WAL files that are
@@ -104,7 +104,7 @@ errinj.info()['ERRINJ_WAL_FALLOCATE'].state -- 0
 check_wal_count(1)
 check_snap_count(2)
 gc = box.info.gc()
-#gc.consumers -- 0
+--#gc.consumers -- 0
 #gc.checkpoints -- 2
 gc.signature == gc.checkpoints[2].signature
 
@@ -116,4 +116,4 @@ test_run:cleanup_cluster()
 test_run:cmd("restart server default")
 gc = box.info.gc()
 #gc.checkpoints -- 2
-gc.signature == gc.checkpoints[2].signature
+--gc.signature == gc.checkpoints[2].signature
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 09/11] wal: xrow memory buffer and cursor
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (7 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 08/11] wal: track relay vclock and collect logs in wal thread Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 10/11] wal: use a xrow buffer object for entry encoding Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 11/11] replication: use wal memory buffer to fetch rows Georgy Kirichenko
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

* xrow buffer structure
  Introduce an xrow buffer which keeps encoded xrows in memory
  even after the transaction is finished. Wal uses an xrow buffer
  object to encode transactions and then writes the encoded data
  to a log file. An xrow buffer consists of not more than
  XROW_BUF_CHUNK_COUNT rotating chunks organized in a ring. The
  rotation thresholds and XROW_BUF_CHUNK_COUNT are empiric values
  for now.

* xrow buffer cursor
  This structure allows to locate the first xrow buffer row past
  a given vclock and then fetch rows one by one forwards to the
  last appended row. An xrow buffer cursor is essential for
  replication from memory: a relay will use it to fetch all logged
  rows stored in wal memory (implemented as an xrow buffer) from a
  given position and then follow all new changes (see the usage
  sketch below).

Part of #3794, #980
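
A minimal usage sketch of the new API (a sketch only: the rows
array, n_rows, the vclocks and the send_row() sink are hypothetical
placeholders; the xrow_buf functions come from this patch):

    /* Writer side: encode one transaction into the buffer. */
    struct xrow_buf buf;
    xrow_buf_create(&buf);
    xrow_buf_tx_begin(&buf, &current_vclock);
    struct iovec *iov;
    int iov_cnt = xrow_buf_write(&buf, rows, rows + n_rows, &iov);
    if (iov_cnt < 0)
            xrow_buf_tx_rollback(&buf);
    else
            xrow_buf_tx_commit(&buf);

    /* Reader side: follow the buffer from a known vclock. */
    struct xrow_buf_cursor cursor;
    if (xrow_buf_cursor_create(&buf, &cursor, &from_vclock) != 0)
            return -1; /* Requested rows were already discarded. */
    struct xrow_header *row;
    void *data;
    size_t size;
    while (xrow_buf_cursor_next(&buf, &cursor, &row, &data, &size) == 0)
            send_row(row, data, size); /* hypothetical sink */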
---
 src/box/CMakeLists.txt |   1 +
 src/box/xrow_buf.c     | 374 +++++++++++++++++++++++++++++++++++++++++
 src/box/xrow_buf.h     | 198 ++++++++++++++++++++++
 3 files changed, 573 insertions(+)
 create mode 100644 src/box/xrow_buf.c
 create mode 100644 src/box/xrow_buf.h

diff --git a/src/box/CMakeLists.txt b/src/box/CMakeLists.txt
index 32f922dd7..303ad9c6e 100644
--- a/src/box/CMakeLists.txt
+++ b/src/box/CMakeLists.txt
@@ -137,6 +137,7 @@ add_library(box STATIC
     sql_stmt_cache.c
     wal.c
     mclock.c
+    xrow_buf.c
     call.c
     merger.c
     ${lua_sources}
diff --git a/src/box/xrow_buf.c b/src/box/xrow_buf.c
new file mode 100644
index 000000000..4c3c87fcf
--- /dev/null
+++ b/src/box/xrow_buf.c
@@ -0,0 +1,374 @@
+/*
+ * Copyright 2010-2019, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "xrow_buf.h"
+#include "fiber.h"
+
+/* Xrow buffer chunk options (empirical values). */
+enum {
+	/* Chunk row info array capacity increment */
+	XROW_BUF_CHUNK_CAPACITY_INCREMENT = 16384,
+	/* Initial size for raw data storage. */
+	XROW_BUF_CHUNK_INITIAL_DATA_SIZE = 65536,
+	/* How many rows we will place in one chunk. */
+	XROW_BUF_CHUNK_ROW_COUNT_THRESHOLD = 8192,
+	/* How much data we will place in one chunk. */
+	XROW_BUF_CHUNK_DATA_SIZE_THRESHOLD = 1 << 19,
+};
+
+
+/*
+ * Save the current xrow buffer chunk state, which consists of
+ * two values: the index and the position where the next row
+ * header and raw data would be placed. This state is used to
+ * track the starting boundary of the next transaction.
+ */
+static inline void
+xrow_buf_save_state(struct xrow_buf *xrow_buf)
+{
+	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
+		xrow_buf->last_chunk_index % XROW_BUF_CHUNK_COUNT;
+	/* Save the current xrow buffer state. */
+	xrow_buf->tx_first_row_index = chunk->row_count;
+	xrow_buf->tx_first_row_svp = obuf_create_svp(&chunk->data);
+}
+
+void
+xrow_buf_create(struct xrow_buf *xrow_buf)
+{
+	for (int i = 0; i < XROW_BUF_CHUNK_COUNT; ++i) {
+		xrow_buf->chunk[i].row_info = NULL;
+		xrow_buf->chunk[i].row_info_capacity = 0;
+		xrow_buf->chunk[i].row_count = 0;
+		obuf_create(&xrow_buf->chunk[i].data, &cord()->slabc,
+			    XROW_BUF_CHUNK_INITIAL_DATA_SIZE);
+	}
+	xrow_buf->last_chunk_index = 0;
+	xrow_buf->first_chunk_index = 0;
+	xrow_buf_save_state(xrow_buf);
+}
+
+void
+xrow_buf_destroy(struct xrow_buf *xrow_buf)
+{
+	for (int i = 0; i < XROW_BUF_CHUNK_COUNT; ++i) {
+		if (xrow_buf->chunk[i].row_info_capacity > 0)
+			slab_put(&cord()->slabc,
+				 slab_from_data(xrow_buf->chunk[i].row_info));
+		obuf_destroy(&xrow_buf->chunk[i].data);
+	}
+}
+
+/*
+ * If the current chunk data limits are reached then this function
+ * switches the xrow buffer to the next chunk. If there are no
+ * free chunks in the xrow buffer ring then the oldest one is
+ * truncated, and after truncation it is reused to store new
+ * data.
+ */
+static struct xrow_buf_chunk *
+xrow_buf_rotate(struct xrow_buf *xrow_buf)
+{
+	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
+		xrow_buf->last_chunk_index % XROW_BUF_CHUNK_COUNT;
+	/* Check if the current chunk could accept new data. */
+	if (chunk->row_count < XROW_BUF_CHUNK_ROW_COUNT_THRESHOLD &&
+	    obuf_size(&chunk->data) < XROW_BUF_CHUNK_DATA_SIZE_THRESHOLD)
+		return chunk;
+
+	/*
+	 * Increase the last chunk index and fetch
+	 * corresponding chunk from the ring buffer.
+	 */
+	++xrow_buf->last_chunk_index;
+	chunk = xrow_buf->chunk + xrow_buf->last_chunk_index %
+				  XROW_BUF_CHUNK_COUNT;
+	/*
+	 * Check if the next chunk has data and discard
+	 * the data if required.
+	 */
+	if (xrow_buf->last_chunk_index - xrow_buf->first_chunk_index >=
+	    XROW_BUF_CHUNK_COUNT) {
+		chunk->row_count = 0;
+		obuf_reset(&chunk->data);
+		++xrow_buf->first_chunk_index;
+	}
+	/*
+	 * The xrow_buffer current chunk was changed so update
+	 * the xrow buffer state.
+	 */
+	xrow_buf_save_state(xrow_buf);
+	return chunk;
+}
+
+void
+xrow_buf_tx_begin(struct xrow_buf *xrow_buf, const struct vclock *vclock)
+{
+	/*
+	 * An xrow buffer places a transaction in one chunk and does
+	 * not rotate chunks while a transaction is in progress. So
+	 * check the current chunk thresholds and rotate if required.
+	 */
+	struct xrow_buf_chunk *chunk = xrow_buf_rotate(xrow_buf);
+	/*
+	 * Check if the current chunk is empty and a vclock for
+	 * the chunk should be set.
+	 */
+	if (chunk->row_count == 0)
+		vclock_copy(&chunk->vclock, vclock);
+}
+
+void
+xrow_buf_tx_rollback(struct xrow_buf *xrow_buf)
+{
+	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
+		xrow_buf->last_chunk_index % XROW_BUF_CHUNK_COUNT;
+	chunk->row_count = xrow_buf->tx_first_row_index;
+	obuf_rollback_to_svp(&chunk->data, &xrow_buf->tx_first_row_svp);
+}
+
+void
+xrow_buf_tx_commit(struct xrow_buf *xrow_buf)
+{
+	/* Save the current xrow buffer state. */
+	xrow_buf_save_state(xrow_buf);
+}
+
+int
+xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
+	       struct xrow_header **end, struct iovec **iovec)
+{
+	struct xrow_buf_chunk *chunk = xrow_buf->chunk +
+		xrow_buf->last_chunk_index % XROW_BUF_CHUNK_COUNT;
+
+	/* Save a data buffer svp to restore the buffer in case of an error. */
+	struct obuf_svp data_svp = obuf_create_svp(&chunk->data);
+
+	size_t row_count = chunk->row_count + (end - begin);
+	/* Allocate space for new row information members if required. */
+	if (row_count > chunk->row_info_capacity) {
+		/* Round allocation up to XROW_BUF_CHUNK_CAPACITY_INCREMENT. */
+		uint32_t capacity = XROW_BUF_CHUNK_CAPACITY_INCREMENT *
+				    ((row_count +
+				      XROW_BUF_CHUNK_CAPACITY_INCREMENT - 1) /
+				     XROW_BUF_CHUNK_CAPACITY_INCREMENT);
+
+		struct slab *row_info_slab =
+			slab_get(&cord()->slabc,
+				 sizeof(struct xrow_buf_row_info) * capacity);
+		if (row_info_slab == NULL) {
+			diag_set(OutOfMemory, capacity *
+					      sizeof(struct xrow_buf_row_info),
+				 "region", "row info array");
+			goto error;
+		}
+		if (chunk->row_info_capacity > 0) {
+			memcpy(slab_data(row_info_slab), chunk->row_info,
+			       sizeof(struct xrow_buf_row_info) *
+			       chunk->row_count);
+			slab_put(&cord()->slabc,
+				 slab_from_data(chunk->row_info));
+		}
+		chunk->row_info = slab_data(row_info_slab);
+		chunk->row_info_capacity = capacity;
+	}
+
+	/* Encode rows. */
+	for (struct xrow_header **row = begin; row < end; ++row) {
+		/* Reserve space for raw encoded data. */
+		char *data = obuf_reserve(&chunk->data, xrow_approx_len(*row));
+		if (data == NULL) {
+			diag_set(OutOfMemory, xrow_approx_len(*row),
+				 "region", "wal memory data");
+			goto error;
+		}
+
+		/*
+		 * The xrow header itself is encoded onto the fiber gc
+		 * memory region and the first member of the resulting
+		 * iovec points to this data. Row bodies are attached
+		 * to the resulting iovec consecutively.
+		 */
+		struct iovec iov[XROW_BODY_IOVMAX];
+		int iov_cnt = xrow_header_encode(*row, 0, iov, 0);
+		if (iov_cnt < 0)
+			goto error;
+
+		/*
+		 * Now we have xrow header encoded representation
+		 * so place it onto chunk data buffer starting
+		 * from xrow header and then bodies.
+		 */
+		data = obuf_alloc(&chunk->data, iov[0].iov_len);
+		memcpy(data, iov[0].iov_base, iov[0].iov_len);
+		/*
+		 * Initialize row info from xrow header and
+		 * the row header encoded data location.
+		 */
+		struct xrow_buf_row_info *row_info =
+			chunk->row_info + chunk->row_count + (row - begin);
+		row_info->xrow = **row;
+		row_info->data = data;
+		row_info->size = iov[0].iov_len;
+
+		for (int i = 1; i < iov_cnt; ++i) {
+			data = obuf_alloc(&chunk->data, iov[i].iov_len);
+			memcpy(data, iov[i].iov_base, iov[i].iov_len);
+			/*
+			 * Adjust stored row body location as we just
+			 * copied it to a chunk data buffer.
+			 */
+			row_info->xrow.body[i - 1].iov_base = data;
+			row_info->size += iov[i].iov_len;
+		}
+	}
+
+	/* Return an iovec which points to the encoded data. */
+	int iov_cnt = 1 + obuf_iovcnt(&chunk->data) - data_svp.pos;
+	*iovec = region_alloc(&fiber()->gc, sizeof(struct iovec) * iov_cnt);
+	if (*iovec == NULL) {
+		diag_set(OutOfMemory, sizeof(struct iovec) * iov_cnt,
+			 "region", "xrow_buf iovec");
+		goto error;
+	}
+	memcpy(*iovec, chunk->data.iov + data_svp.pos,
+	       sizeof(struct iovec) * iov_cnt);
+	/* Adjust first iovec member to data starting location. */
+	(*iovec)[0].iov_base += data_svp.iov_len;
+	(*iovec)[0].iov_len -= data_svp.iov_len;
+
+	/* Update chunk row count. */
+	chunk->row_count = row_count;
+	return iov_cnt;
+
+error:
+	/* Restore data buffer state. */
+	obuf_rollback_to_svp(&chunk->data, &data_svp);
+	return -1;
+}
+
+/*
+ * Return the index of the first row after the given vclock
+ * in a chunk.
+ */
+static int
+xrow_buf_chunk_locate_vclock(struct xrow_buf_chunk *chunk,
+			     struct vclock *vclock)
+{
+	for (uint32_t row_index = 0; row_index < chunk->row_count;
+	     ++row_index) {
+		struct xrow_header *row = &chunk->row_info[row_index].xrow;
+		if (vclock_get(vclock, row->replica_id) < row->lsn)
+			return row_index;
+	}
+	/*
+	 * We did not find any row above the given vclock so
+	 * return the index just after the last one.
+	 */
+	return chunk->row_count;
+}
+
+int
+xrow_buf_cursor_create(struct xrow_buf *xrow_buf,
+		       struct xrow_buf_cursor *xrow_buf_cursor,
+		       struct vclock *vclock)
+{
+	/* Check if a buffer has requested data. */
+	struct xrow_buf_chunk *chunk =
+			xrow_buf->chunk + xrow_buf->first_chunk_index %
+					  XROW_BUF_CHUNK_COUNT;
+	int rc = vclock_compare(&chunk->vclock, vclock);
+	if (rc > 0 || rc == VCLOCK_ORDER_UNDEFINED) {
+		/* The requested data was discarded. */
+		return -1;
+	}
+	uint32_t index = xrow_buf->first_chunk_index;
+	while (index < xrow_buf->last_chunk_index) {
+		chunk = xrow_buf->chunk + (index + 1) % XROW_BUF_CHUNK_COUNT;
+		int rc = vclock_compare(&chunk->vclock, vclock);
+		if (rc > 0 || rc == VCLOCK_ORDER_UNDEFINED) {
+			/* Next chunk has younger rows than requested vclock. */
+			break;
+		}
+		++index;
+	}
+	chunk = xrow_buf->chunk + index % XROW_BUF_CHUNK_COUNT;
+	xrow_buf_cursor->chunk_index = index;
+	xrow_buf_cursor->row_index = xrow_buf_chunk_locate_vclock(chunk, vclock);
+	return 0;
+}
+
+int
+xrow_buf_cursor_next(struct xrow_buf *xrow_buf,
+		     struct xrow_buf_cursor *xrow_buf_cursor,
+		     struct xrow_header **row, void **data, size_t *size)
+{
+	if (xrow_buf->first_chunk_index > xrow_buf_cursor->chunk_index) {
+		/* A cursor current chunk was discarded by a buffer. */
+		return -1;
+	}
+
+	struct xrow_buf_chunk *chunk;
+next_chunk:
+	chunk = xrow_buf->chunk + xrow_buf_cursor->chunk_index %
+				  XROW_BUF_CHUNK_COUNT;
+	size_t chunk_row_count = chunk->row_count;
+	if (chunk_row_count == xrow_buf_cursor->row_index) {
+		/*
+		 * No more rows in the current chunk, but there are
+		 * two possibilities:
+		 *  1. The cursor's current chunk is the last one and
+		 * there are no more rows in the cursor.
+		 *  2. There is a chunk after the current one
+		 * so we can switch to it.
+		 */
+		if (xrow_buf->last_chunk_index ==
+		    xrow_buf_cursor->chunk_index) {
+			/*
+			 * The current chunk is the last one -
+			 * no more rows in a buffer.
+			 */
+			return 1;
+		}
+		/* Switch to the next chunk. */
+		xrow_buf_cursor->row_index = 0;
+		++xrow_buf_cursor->chunk_index;
+		goto next_chunk;
+	}
+	/* Return row header and data pointers and data size. */
+	struct xrow_buf_row_info *row_info = chunk->row_info +
+					     xrow_buf_cursor->row_index;
+	*row = &row_info->xrow;
+	*data = row_info->data;
+	*size = row_info->size;
+	++xrow_buf_cursor->row_index;
+	return 0;
+}
diff --git a/src/box/xrow_buf.h b/src/box/xrow_buf.h
new file mode 100644
index 000000000..c5f624d45
--- /dev/null
+++ b/src/box/xrow_buf.h
@@ -0,0 +1,198 @@
+#ifndef TARANTOOL_XROW_BUF_H_INCLUDED
+#define TARANTOOL_XROW_BUF_H_INCLUDED
+/*
+ * Copyright 2010-2019, Tarantool AUTHORS, please see AUTHORS file.
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above
+ *    copyright notice, this list of conditions and the
+ *    following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above
+ *    copyright notice, this list of conditions and the following
+ *    disclaimer in the documentation and/or other materials
+ *    provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY <COPYRIGHT HOLDER> ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+ * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
+ * <COPYRIGHT HOLDER> OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ * INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
+ * THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#include <stdint.h>
+
+#include "small/obuf.h"
+#include "small/rlist.h"
+#include "xrow.h"
+#include "vclock.h"
+
+enum {
+	/*
+	 * An xrow buffer contains a fixed count of rotating data
+	 * chunks. Every rotation discards an estimated 1/(COUNT OF
+	 * CHUNKS) of the stored rows. The bigger this value is, the
+	 * more frequent rotation becomes, but the decrease is
+	 * smoother and the xrow buffer size more stable.
+	 */
+	XROW_BUF_CHUNK_COUNT = 16,
+};
+
+/**
+ * Xrow_info structure describes a row stored in an xrow buffer.
+ * Xrow info contains an xrow_header structure plus the pointer
+ * to and size of the row's encoded representation. The xrow
+ * header allows to filter rows by replica_id, lsn or replication
+ * group while the encoded representation can be written out
+ * without any further encoding.
+ */
+struct xrow_buf_row_info {
+	/** Stored row header. */
+	struct xrow_header xrow;
+	/** Pointer to row encoded raw data. */
+	void *data;
+	/** Row encoded raw data size. */
+	size_t size;
+};
+
+/**
+ * An xrow buffer data chunk stores a continuous sequence of xrows
+ * written to the buffer. A chunk contains the vclock of the last
+ * row just before the first row stored in the chunk, the count of
+ * rows, their encoded raw data, and an array of stored row info.
+ * The vclock is used to track the lower vclock boundary of the chunk.
+ */
+struct xrow_buf_chunk {
+	/** Vclock just before the first row in this chunk. */
+	struct vclock vclock;
+	/** Count of stored rows. */
+	size_t row_count;
+	/** Stored rows information array. */
+	struct xrow_buf_row_info *row_info;
+	/** Capacity of stored rows information array. */
+	size_t row_info_capacity;
+	/** Data storage for encoded rows data. */
+	struct obuf data;
+};
+
+/**
+ * An xrow buffer encodes and stores a continuous sequence of
+ * xrows (both headers and their binary encoded representation).
+ * Storage is organized as a range of globally indexed chunks.
+ * New rows are appended to the last (youngest) chunk. If the
+ * last chunk is already full then the next chunk is used. An
+ * xrow buffer maintains not more than XROW_BUF_CHUNK_COUNT
+ * chunks, so when the buffer is full the first chunk is
+ * discarded before a new one can be used. All chunks form a
+ * ring of size XROW_BUF_CHUNK_COUNT, so a chunk's in-ring index
+ * is derived from its global index with the modulo operation.
+ */
+struct xrow_buf {
+	/** Global index of the first used chunk (the oldest one). */
+	size_t first_chunk_index;
+	/** Global index of the last used chunk (the youngest one). */
+	size_t last_chunk_index;
+	/** Ring array containing chunks. */
+	struct xrow_buf_chunk chunk[XROW_BUF_CHUNK_COUNT];
+	/**
+	 * An xrow_buf transaction is recorded in one chunk only.
+	 * When a transaction is started, the current row count and
+	 * the data buffer svp of the current (last) chunk are
+	 * remembered in order to be able to restore the chunk
+	 * state in case of rollback.
+	 */
+	struct {
+		/** The current transaction first row index. */
+		uint32_t tx_first_row_index;
+		/** The current transaction encoded data start svp. */
+		struct obuf_svp tx_first_row_svp;
+	};
+};
+
+/** Create a wal memory buffer. */
+void
+xrow_buf_create(struct xrow_buf *xrow_buf);
+
+/** Destroy a wal memory buffer. */
+void
+xrow_buf_destroy(struct xrow_buf *xrow_buf);
+
+/**
+ * Begin an xrow buffer transaction. This function may rotate the
+ * last data chunk and use the vclock parameter as the new chunk's
+ * starting vclock.
+ */
+void
+xrow_buf_tx_begin(struct xrow_buf *xrow_buf, const struct vclock *vclock);
+
+/** Discard all data written after the last transaction. */
+void
+xrow_buf_tx_rollback(struct xrow_buf *xrow_buf);
+
+/** Commit a xrow buffer transaction. */
+void
+xrow_buf_tx_commit(struct xrow_buf *xrow_buf);
+
+/**
+ * Append an xrow array to the wal memory. The array is placed into
+ * one xrow buffer data chunk and each row takes a continuous
+ * space in the data buffer. Pointers to the raw encoded data are
+ * returned via a gc-allocated iovec array.
+ *
+ * @retval count of written iovec members on success
+ * @retval -1 in case of error
+ */
+int
+xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
+	       struct xrow_header **end,
+	       struct iovec **iovec);
+
+/**
+ * Xrow buffer cursor used to search for a position in a buffer
+ * and then fetch rows one by one from that position toward the
+ * buffer's last appended row.
+ */
+struct xrow_buf_cursor {
+	/** Current chunk global index. */
+	uint32_t chunk_index;
+	/** Row index in the current chunk. */
+	uint32_t row_index;
+};
+
+/**
+ * Create an xrow buffer cursor and set its position to
+ * the first row after the passed vclock value.
+ *
+ * @retval 0 cursor was created
+ * @retval -1 if the requested data was already discarded
+ */
+int
+xrow_buf_cursor_create(struct xrow_buf *xrow_buf,
+		       struct xrow_buf_cursor *xrow_buf_cursor,
+		       struct vclock *vclock);
+
+/**
+ * Fetch the next row from an xrow buffer cursor and return the
+ * row header and encoded data pointers as well as the encoded
+ * data size in the corresponding out parameters.
+ *
+ * @retval 0 in case of success
+ * @retval 1 if there are no more rows in the buffer
+ * @retval -1 if the cursor position was discarded by the buffer
+ */
+int
+xrow_buf_cursor_next(struct xrow_buf *xrow_buf,
+		     struct xrow_buf_cursor *xrow_buf_cursor,
+		     struct xrow_header **row, void **data, size_t *size);
+
+#endif /* TARANTOOL_XROW_BUF_H_INCLUDED */
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 10/11] wal: use a xrow buffer object for entry encoding
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (8 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 09/11] wal: xrow memory buffer and cursor Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 11/11] replication: use wal memory buffer to fetch rows Georgy Kirichenko
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Wal uses an xrow buffer object to encode transactions and then
writes the encoded data to a log file, so the encoded data still
lives in memory for some time after a transaction is finished.

Part of #3794, #980
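
Condensed sketch of the new write path, stitched together from the
hunks below (error injection and the rollback-queue handling are
omitted; all names come from this series):

    /* One xrow_buf transaction brackets a whole batch of entries. */
    xrow_buf_tx_begin(&writer->xrow_buf, &writer->vclock);
    ssize_t rc = wal_write_xlog_batch(writer, &input, &output,
                                      &vclock_diff);
    if (rc < 0)
            xrow_buf_tx_rollback(&writer->xrow_buf);
    else
            xrow_buf_tx_commit(&writer->xrow_buf);

    /* Per entry inside the batch: encode once into the buffer and
     * write the very same iovec to the xlog file. */
    struct iovec *iov;
    int iov_cnt = xrow_buf_write(&writer->xrow_buf, entry->rows,
                                 entry->rows + entry->n_rows, &iov);
    if (iov_cnt < 0)
            return -1;
    xlog_tx_begin(&writer->current_wal);
    if (xlog_write_iov(&writer->current_wal, iov, iov_cnt,
                       entry->n_rows) < 0)
            return -1;
    return xlog_tx_commit(&writer->current_wal);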
---
 src/box/wal.c      | 51 ++++++++++++++++++++++++++++++++++++++++-
 src/box/xlog.c     | 57 +++++++++++++++++++++++++++++-----------------
 src/box/xlog.h     | 14 ++++++++++++
 src/box/xrow_buf.h |  3 +--
 4 files changed, 101 insertions(+), 24 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index ad3c79a8a..b483b8cc4 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -44,6 +44,7 @@
 #include "coio_task.h"
 #include "replication.h"
 #include "mclock.h"
+#include "xrow_buf.h"
 
 enum {
 	/**
@@ -174,6 +175,14 @@ struct wal_writer
 	 * collecting. Ignored in case of no space error.
 	 */
 	struct vclock gc_first_vclock;
+	/**
+	 * In-memory WAL write buffer used to encode transaction rows
+	 * and write them to an xlog file. The in-memory buffer
+	 * preserves xrows after transaction processing is finished.
+	 * Replication will use this buffer to fetch xrows from
+	 * memory without accessing the xlog files.
+	 */
+	struct xrow_buf xrow_buf;
 };
 
 struct wal_msg {
@@ -1040,6 +1049,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	int64_t tsn = 0;
 	/** Assign LSN to all local rows. */
 	for ( ; row < end; row++) {
+		(*row)->tm = ev_now(loop());
 		if ((*row)->replica_id == 0) {
 			(*row)->lsn = vclock_inc(vclock_diff, instance_id) +
 				      vclock_get(base, instance_id);
@@ -1067,6 +1077,35 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
  * the output queue. The function returns count of written
  * bytes or -1 in case of error.
  */
+static ssize_t
+wal_encode_write_entry(struct wal_writer *writer, struct journal_entry *entry)
+{
+	struct errinj *inj = errinj(ERRINJ_WAL_BREAK_LSN, ERRINJ_INT);
+	if (inj != NULL) {
+		for (struct xrow_header **row = entry->rows;
+		     row < entry->rows + entry->n_rows; ++row) {
+			if (inj->iparam == (*row)->lsn) {
+				(*row)->lsn = inj->iparam - 1;
+				say_warn("injected broken lsn: %lld",
+					 (long long) (*row)->lsn);
+				break;
+			}
+		}
+	}
+
+	struct iovec *iov;
+	int iov_cnt = xrow_buf_write(&writer->xrow_buf, entry->rows,
+				     entry->rows + entry->n_rows, &iov);
+	if (iov_cnt < 0)
+		return -1;
+	xlog_tx_begin(&writer->current_wal);
+	ssize_t rc = xlog_write_iov(&writer->current_wal, iov, iov_cnt,
+				    entry->n_rows);
+	if (rc < 0)
+		return rc;
+	return xlog_tx_commit(&writer->current_wal);
+}
+
 static ssize_t
 wal_write_xlog_batch(struct wal_writer *writer, struct stailq *input,
 		     struct stailq *output, struct vclock *vclock_diff)
@@ -1082,7 +1121,7 @@ wal_write_xlog_batch(struct wal_writer *writer, struct stailq *input,
 			       entry->rows, entry->rows + entry->n_rows);
 		entry->res = vclock_sum(vclock_diff) +
 			     vclock_sum(&writer->vclock);
-		rc = xlog_write_entry(l, entry);
+		rc = wal_encode_write_entry(writer, entry);
 	} while (rc == 0 && !stailq_empty(input));
 	/* If log was not flushed then flush it explicitly. */
 	if (rc == 0)
@@ -1155,9 +1194,12 @@ wal_write_to_disk(struct cmsg *msg)
 	struct stailq output;
 	stailq_create(&output);
 	while (!stailq_empty(&input)) {
+		/* Start a wal memory buffer transaction. */
+		xrow_buf_tx_begin(&writer->xrow_buf, &writer->vclock);
 		ssize_t rc = wal_write_xlog_batch(writer, &input, &output,
 						  &vclock_diff);
 		if (rc < 0) {
+			xrow_buf_tx_rollback(&writer->xrow_buf);
 			/*
 			 * Put processed entries and tail of write
 			 * queue to a rollback list.
@@ -1165,6 +1207,7 @@ wal_write_to_disk(struct cmsg *msg)
 			stailq_concat(&wal_msg->rollback, &output);
 			stailq_concat(&wal_msg->rollback, &input);
 		} else {
+			xrow_buf_tx_commit(&writer->xrow_buf);
 			/*
 			 * Schedule processed entries to commit
 			 * and update the wal vclock.
@@ -1254,6 +1297,11 @@ wal_writer_f(va_list ap)
 {
 	(void) ap;
 	struct wal_writer *writer = &wal_writer_singleton;
+	/*
+	 * Initialize writer memory buffer here because it
+	 * should be done in the wal thread.
+	 */
+	xrow_buf_create(&writer->xrow_buf);
 
 	/** Initialize eio in this thread */
 	coio_enable();
@@ -1300,6 +1348,7 @@ wal_writer_f(va_list ap)
 		xlog_close(&vy_log_writer.xlog, false);
 
 	cpipe_destroy(&writer->tx_prio_pipe);
+	xrow_buf_destroy(&writer->xrow_buf);
 	return 0;
 }
 
diff --git a/src/box/xlog.c b/src/box/xlog.c
index 8254cce20..2fcf7f0df 100644
--- a/src/box/xlog.c
+++ b/src/box/xlog.c
@@ -1275,14 +1275,8 @@ xlog_tx_write(struct xlog *log)
 	return written;
 }
 
-/*
- * Add a row to a log and possibly flush the log.
- *
- * @retval  -1 error, check diag.
- * @retval >=0 the number of bytes written to buffer.
- */
-ssize_t
-xlog_write_row(struct xlog *log, const struct xrow_header *packet)
+static int
+xlog_write_prepare(struct xlog *log)
 {
 	/*
 	 * Automatically reserve space for a fixheader when adding
@@ -1296,17 +1290,20 @@ xlog_write_row(struct xlog *log, const struct xrow_header *packet)
 			return -1;
 		}
 	}
+	return 0;
+}
 
-	struct obuf_svp svp = obuf_create_svp(&log->obuf);
-	size_t page_offset = obuf_size(&log->obuf);
-	/** encode row into iovec */
-	struct iovec iov[XROW_IOVMAX];
-	/** don't write sync to the disk */
-	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
-	if (iovcnt < 0) {
-		obuf_rollback_to_svp(&log->obuf, &svp);
+/*
+ * Write an iovec with encoded rows to an xlog.
+ */
+ssize_t
+xlog_write_iov(struct xlog *log, struct iovec *iov, int iovcnt, int row_count)
+{
+	if (xlog_write_prepare(log) != 0)
 		return -1;
-	}
+
+	struct obuf_svp svp = obuf_create_svp(&log->obuf);
+	size_t old_size = obuf_size(&log->obuf);
 	for (int i = 0; i < iovcnt; ++i) {
 		struct errinj *inj = errinj(ERRINJ_WAL_WRITE_PARTIAL,
 					    ERRINJ_INT);
@@ -1325,16 +1322,34 @@ xlog_write_row(struct xlog *log, const struct xrow_header *packet)
 			return -1;
 		}
 	}
-	assert(iovcnt <= XROW_IOVMAX);
-	log->tx_rows++;
+	log->tx_rows += row_count;
 
-	size_t row_size = obuf_size(&log->obuf) - page_offset;
+	ssize_t written = obuf_size(&log->obuf) - old_size;
 	if (log->is_autocommit &&
 	    obuf_size(&log->obuf) >= XLOG_TX_AUTOCOMMIT_THRESHOLD &&
 	    xlog_tx_write(log) < 0)
 		return -1;
 
-	return row_size;
+	return written;
+}
+
+/*
+ * Add a row to a log and possibly flush the log.
+ *
+ * @retval  -1 error, check diag.
+ * @retval >=0 the number of bytes written to buffer.
+ */
+ssize_t
+xlog_write_row(struct xlog *log, const struct xrow_header *packet)
+{
+	/** encode row into an iovec */
+	struct iovec iov[XROW_IOVMAX];
+	/** don't write sync to the disk */
+	int iovcnt = xrow_header_encode(packet, 0, iov, 0);
+	if (iovcnt < 0)
+		return -1;
+	assert(iovcnt <= XROW_IOVMAX);
+	return xlog_write_iov(log, iov, iovcnt, 1);
 }
 
 /**
diff --git a/src/box/xlog.h b/src/box/xlog.h
index a48b05fc4..d0f97f0e4 100644
--- a/src/box/xlog.h
+++ b/src/box/xlog.h
@@ -501,6 +501,20 @@ xlog_fallocate(struct xlog *log, size_t size);
 ssize_t
 xlog_write_row(struct xlog *log, const struct xrow_header *packet);
 
+/**
+ * Write an iov vector with encoded rows into an xlog.
+ *
+ * @param xlog an xlog file to write into
+ * @param iov an iovec with encoded row data
+ * @param iovcnt count of iovec members
+ * @param row_count count of encoded rows
+ *
+ * @retval count of written bytes
+ * @retval -1 for error
+ */
+ssize_t
+xlog_write_iov(struct xlog *xlog, struct iovec *iov, int iovcnt, int row_count);
+
 /**
 * Prevent xlog row buffer offloading, should be used
 * at transaction start to write a transaction in one xlog tx
diff --git a/src/box/xrow_buf.h b/src/box/xrow_buf.h
index c5f624d45..d286a94d3 100644
--- a/src/box/xrow_buf.h
+++ b/src/box/xrow_buf.h
@@ -154,8 +154,7 @@ xrow_buf_tx_commit(struct xrow_buf *xrow_buf);
  */
 int
 xrow_buf_write(struct xrow_buf *xrow_buf, struct xrow_header **begin,
-	       struct xrow_header **end,
-	       struct iovec **iovec);
+	       struct xrow_header **end, struct iovec **iovec);
 
 /**
  * Xrow buffer cursor used to search a position in a buffer
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH v4 11/11] replication: use wal memory buffer to fetch rows
  2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
                   ` (9 preceding siblings ...)
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 10/11] wal: use a xrow buffer object for entry encoding Georgy Kirichenko
@ 2020-02-12  9:39 ` Georgy Kirichenko
  10 siblings, 0 replies; 16+ messages in thread
From: Georgy Kirichenko @ 2020-02-12  9:39 UTC (permalink / raw)
  To: tarantool-patches

Fetch data from wal in-memory buffer. Wal allows to start a fiber
which creates a xrow buffer cursor with given vclock and then
fetches row from the xrow buffer one by one and calls given callback
for each row. Also the wal relaying fiber send a heartbeat message if
all rows were processed there were no rows written for replication
timeout period.
In case of an outdated vclock (wal could not create a cursor or
fetch the next row from it) the relay switches to reading logged
data from file up to the current vclock and then makes the next
attempt to fetch data from wal memory.
In file mode there is always data to send to the replica, so the
relay does not have to send heartbeat messages.
From this point the relay creates a cord only when it switches to
reading from file; the two modes alternate as sketched after this
list. Frequent memory-file oscillation is not very likely for two
reasons:
 1. If a replica is too slow (slower than the master writes), it
    will switch to disk and then fall behind.
 2. If a replica is fast enough, it will catch up with memory and
    then consume the memory buffer before its rotation.
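The resulting main loop in the wal relay fiber is a plain
alternation of the two modes; a zero return from the memory pass
means the cursor fell out of the buffer, and a zero return from
the file pass means the replica has been caught up from files, so
memory is retried:

    while (wal_relay_from_memory(writer, wal_relay) == 0 &&
           wal_relay_from_file(writer, wal_relay) == 0)
            ;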
In order to split wal and relay logic, a relay filter callback was
introduced; it is passed in when a relay attaches to wal, see the
sketch below.
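A filter receives a pointer to the row pointer and reports through
its return code whether the row is passed as is
(WAL_RELAY_FILTER_PASS), replaced (WAL_RELAY_FILTER_ROW), skipped
(WAL_RELAY_FILTER_SKIP) or failed (WAL_RELAY_FILTER_ERR). A minimal
illustrative filter (not the one in this patch, which rewrites
local rows into NOPs instead of skipping them):

    static ssize_t
    skip_local_filter(struct wal_relay *wal_relay,
                      struct xrow_header **row)
    {
            (void) wal_relay;
            /* Drop replica-local rows instead of relaying them. */
            if ((*row)->group_id == GROUP_LOCAL)
                    return WAL_RELAY_FILTER_SKIP;
            return WAL_RELAY_FILTER_PASS;
    }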

Note: wal exit is not graceful: tx sends a break-loop message and
wal just stops cbus processing without any care for other fibers
which could still be using cbus. To overcome this there is a
special trigger which is signaled just before the cbus pipe is
destroyed.
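The relay fiber hooks that trigger right at start, and wal runs it
on exit (condensed from this patch):

    struct trigger on_wal_exit;
    trigger_create(&on_wal_exit, wal_relay_on_wal_exit, wal_relay, NULL);
    trigger_add(&writer->on_wal_exit, &on_wal_exit);
    /* ... relaying runs here ... */

    /* In wal_writer_f, just before the cbus pipe is destroyed: */
    trigger_run(&writer->on_wal_exit, NULL);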

Close #3794
Part of #980
---
 src/box/box.cc                                |   9 -
 src/box/lua/info.c                            |   4 +-
 src/box/recovery.cc                           |  17 +-
 src/box/relay.cc                              | 517 +++++-------------
 src/box/relay.h                               |   6 +-
 src/box/replication.cc                        |   3 +
 src/box/wal.c                                 | 510 +++++++++++++++--
 src/box/wal.h                                 |  92 +++-
 src/lib/core/errinj.h                         |   1 +
 test/box-py/iproto.test.py                    |   9 +-
 test/box/errinj.result                        | 134 ++---
 test/replication/force_recovery.result        |   8 +
 test/replication/force_recovery.test.lua      |   2 +
 test/replication/replica_rejoin.result        |   8 +
 test/replication/replica_rejoin.test.lua      |   2 +
 .../show_error_on_disconnect.result           |   8 +
 .../show_error_on_disconnect.test.lua         |   2 +
 test/replication/suite.ini                    |   2 +-
 test/xlog/panic_on_wal_error.result           |  12 +
 test/xlog/panic_on_wal_error.test.lua         |   3 +
 test/xlog/suite.ini                           |   2 +-
 21 files changed, 838 insertions(+), 513 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 17495a211..f629abe70 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1604,14 +1604,6 @@ box_process_register(struct ev_io *io, struct xrow_header *header)
 	xrow_encode_vclock_xc(&row, &replicaset.vclock);
 	row.sync = header->sync;
 	coio_write_xrow(io, &row);
-
-	/*
-	 * Advance the WAL consumer state to the position where
-	 * registration was complete and assign it to the
-	 * replica.
-	 */
-	replica = replica_by_uuid(&instance_uuid);
-	wal_relay_status_update(replica->id, &stop_vclock);
 }
 
 void
@@ -1756,7 +1748,6 @@ box_process_join(struct ev_io *io, struct xrow_header *header)
 	if (coio_write_xrow(io, &row) < 0)
 		diag_raise();
 	replica = replica_by_uuid(&instance_uuid);
-	wal_relay_status_update(replica->id, &stop_vclock);
 }
 
 void
diff --git a/src/box/lua/info.c b/src/box/lua/info.c
index aba9a4b7c..60ef5b0b0 100644
--- a/src/box/lua/info.c
+++ b/src/box/lua/info.c
@@ -135,7 +135,9 @@ lbox_pushrelay(lua_State *L, struct relay *relay)
 		lua_pushstring(L, "follow");
 		lua_settable(L, -3);
 		lua_pushstring(L, "vclock");
-		lbox_pushvclock(L, relay_vclock(relay));
+		struct vclock vclock;
+		relay_vclock(relay, &vclock);
+		lbox_pushvclock(L, &vclock);
 		lua_settable(L, -3);
 		lua_pushstring(L, "idle");
 		lua_pushnumber(L, ev_monotonic_now(loop()) -
diff --git a/src/box/recovery.cc b/src/box/recovery.cc
index e4aad1296..657c55d10 100644
--- a/src/box/recovery.cc
+++ b/src/box/recovery.cc
@@ -257,9 +257,11 @@ recover_xlog(struct recovery *r, struct xstream *stream,
 		 * the file is fully read: it's fully read only
 		 * when EOF marker has been read, see i.eof_read
 		 */
-		if (stop_vclock != NULL &&
-		    r->vclock.signature >= stop_vclock->signature)
-			return 0;
+		if (stop_vclock != NULL) {
+			int rc = vclock_compare(&r->vclock, stop_vclock);
+			if (rc >= 0 && rc != VCLOCK_ORDER_UNDEFINED)
+				return 0;
+		}
 		int64_t current_lsn = vclock_get(&r->vclock, row.replica_id);
 		if (row.lsn <= current_lsn)
 			continue; /* already applied, skip */
@@ -363,9 +365,12 @@ recover_current_wal:
 	if (xlog_cursor_is_eof(&r->cursor))
 		recovery_close_log(r);
 
-	if (stop_vclock != NULL && vclock_compare(&r->vclock, stop_vclock) != 0) {
-		diag_set(XlogGapError, &r->vclock, stop_vclock);
-		return -1;
+	if (stop_vclock != NULL) {
+		int rc = vclock_compare(&r->vclock, stop_vclock);
+		if (rc < 0 || rc == VCLOCK_ORDER_UNDEFINED) {
+			diag_set(XlogGapError, &r->vclock, stop_vclock);
+			return -1;
+		}
 	}
 
 	region_free(&fiber()->gc);
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 13c8f4c28..980e05b2f 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -44,7 +44,6 @@
 #include "engine.h"
 #include "gc.h"
 #include "iproto_constants.h"
-#include "recovery.h"
 #include "replication.h"
 #include "trigger.h"
 #include "vclock.h"
@@ -54,38 +53,20 @@
 #include "xstream.h"
 #include "wal.h"
 
-/**
- * Cbus message to send status updates from relay to tx thread.
- */
-struct relay_status_msg {
-	/** Parent */
-	struct cmsg msg;
-	/** Relay instance */
-	struct relay *relay;
-	/** Replica vclock. */
-	struct vclock vclock;
-};
-
 /** State of a replication relay. */
 struct relay {
-	/** The thread in which we relay data to the replica. */
-	struct cord cord;
 	/** Replica connection */
 	struct ev_io io;
 	/** Request sync */
 	uint64_t sync;
-	/** Recovery instance to read xlog from the disk */
-	struct recovery *r;
 	/** Xstream argument to recovery */
 	struct xstream stream;
 	/** Vclock to stop playing xlogs */
 	struct vclock stop_vclock;
 	/** Remote replica */
 	struct replica *replica;
-	/** WAL event watcher. */
-	struct wal_watcher wal_watcher;
-	/** Relay reader cond. */
-	struct fiber_cond reader_cond;
+	/** WAL memory relay. */
+	struct wal_relay wal_relay;
 	/** Relay diagnostics. */
 	struct diag diag;
 	/** Vclock recieved from replica. */
@@ -98,25 +79,10 @@ struct relay {
 	 */
 	struct vclock local_vclock_at_subscribe;
 
-	/** Relay endpoint */
-	struct cbus_endpoint endpoint;
-	/** A pipe from 'relay' thread to 'tx' */
-	struct cpipe tx_pipe;
-	/** A pipe from 'tx' thread to 'relay' */
-	struct cpipe relay_pipe;
-	/** Status message */
-	struct relay_status_msg status_msg;
-	/** Time when last row was sent to peer. */
-	double last_row_time;
 	/** Relay sync state. */
 	enum relay_state state;
-
-	struct {
-		/* Align to prevent false-sharing with tx thread */
-		alignas(CACHELINE_SIZE)
-		/** Known relay vclock. */
-		struct vclock vclock;
-	} tx;
+	/** Fiber processing this relay. */
+	struct fiber *fiber;
 };
 
 struct diag*
@@ -131,24 +97,22 @@ relay_get_state(const struct relay *relay)
 	return relay->state;
 }
 
-const struct vclock *
-relay_vclock(const struct relay *relay)
+void
+relay_vclock(const struct relay *relay, struct vclock *vclock)
 {
-	return &relay->tx.vclock;
+	wal_relay_vclock(&relay->wal_relay, vclock);
 }
 
 double
 relay_last_row_time(const struct relay *relay)
 {
-	return relay->last_row_time;
+	return wal_relay_last_row_time(&relay->wal_relay);
 }
 
 static int
 relay_send(struct relay *relay, struct xrow_header *packet);
 static int
 relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row);
-static int
-relay_send_row(struct xstream *stream, struct xrow_header *row);
 
 struct relay *
 relay_new(struct replica *replica)
@@ -160,18 +124,14 @@ relay_new(struct replica *replica)
 		return NULL;
 	}
 	relay->replica = replica;
-	relay->last_row_time = ev_monotonic_now(loop());
-	fiber_cond_create(&relay->reader_cond);
 	diag_create(&relay->diag);
 	relay->state = RELAY_OFF;
 	return relay;
 }
 
 static void
-relay_start(struct relay *relay, int fd, uint64_t sync,
-	     int (*stream_write)(struct xstream *, struct xrow_header *))
+relay_start(struct relay *relay, int fd, uint64_t sync)
 {
-	xstream_create(&relay->stream, stream_write);
 	/*
 	 * Clear the diagnostics at start, in case it has the old
 	 * error message which we keep around to display in
@@ -181,18 +141,15 @@ relay_start(struct relay *relay, int fd, uint64_t sync,
 	coio_create(&relay->io, fd);
 	relay->sync = sync;
 	relay->state = RELAY_FOLLOW;
-	relay->last_row_time = ev_monotonic_now(loop());
+	relay->fiber = fiber();
 }
 
 void
 relay_cancel(struct relay *relay)
 {
 	/* Check that the thread is running first. */
-	if (relay->cord.id != 0) {
-		if (tt_pthread_cancel(relay->cord.id) == ESRCH)
-			return;
-		tt_pthread_join(relay->cord.id, NULL);
-	}
+	if (relay->fiber != NULL)
+		fiber_cancel(relay->fiber);
 }
 
 /**
@@ -201,33 +158,17 @@ relay_cancel(struct relay *relay)
 static void
 relay_exit(struct relay *relay)
 {
+	(void) relay;
 	struct errinj *inj = errinj(ERRINJ_RELAY_EXIT_DELAY, ERRINJ_DOUBLE);
 	if (inj != NULL && inj->dparam > 0)
 		fiber_sleep(inj->dparam);
-
-	/*
-	 * Destroy the recovery context. We MUST do it in
-	 * the relay thread, because it contains an xlog
-	 * cursor, which must be closed in the same thread
-	 * that opened it (it uses cord's slab allocator).
-	 */
-	recovery_delete(relay->r);
-	relay->r = NULL;
 }
 
 static void
 relay_stop(struct relay *relay)
 {
-	if (relay->r != NULL)
-		recovery_delete(relay->r);
-	relay->r = NULL;
 	relay->state = RELAY_STOPPED;
-	/*
-	 * Needed to track whether relay thread is running or not
-	 * for relay_cancel(). Id is reset to a positive value
-	 * upon cord_create().
-	 */
-	relay->cord.id = 0;
+	relay->fiber = NULL;
 }
 
 void
@@ -235,27 +176,11 @@ relay_delete(struct relay *relay)
 {
 	if (relay->state == RELAY_FOLLOW)
 		relay_stop(relay);
-	fiber_cond_destroy(&relay->reader_cond);
 	diag_destroy(&relay->diag);
 	TRASH(relay);
 	free(relay);
 }
 
-static void
-relay_set_cord_name(int fd)
-{
-	char name[FIBER_NAME_MAX];
-	struct sockaddr_storage peer;
-	socklen_t addrlen = sizeof(peer);
-	if (getpeername(fd, ((struct sockaddr*)&peer), &addrlen) == 0) {
-		snprintf(name, sizeof(name), "relay/%s",
-			 sio_strfaddr((struct sockaddr *)&peer, addrlen));
-	} else {
-		snprintf(name, sizeof(name), "relay/<unknown>");
-	}
-	cord_set_name(name);
-}
-
 void
 relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
 {
@@ -263,7 +188,7 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
 	if (relay == NULL)
 		diag_raise();
 
-	relay_start(relay, fd, sync, relay_send_initial_join_row);
+	relay_start(relay, fd, sync);
 	auto relay_guard = make_scoped_guard([=] {
 		relay_stop(relay);
 		relay_delete(relay);
@@ -291,26 +216,58 @@ relay_initial_join(int fd, uint64_t sync, struct vclock *vclock)
 	if (coio_write_xrow(&relay->io, &row) < 0)
 		diag_raise();
 
+	xstream_create(&relay->stream, relay_send_initial_join_row);
 	/* Send read view to the replica. */
 	engine_join_xc(&ctx, &relay->stream);
 }
 
-int
-relay_final_join_f(va_list ap)
+/*
+ * Filter callback function used by the wal relay in order to
+ * transform all local rows into NOPs.
+ */
+static ssize_t
+relay_final_join_filter(struct wal_relay *wal_relay, struct xrow_header **row)
 {
-	struct relay *relay = va_arg(ap, struct relay *);
-	auto guard = make_scoped_guard([=] { relay_exit(relay); });
-
-	coio_enable();
-	relay_set_cord_name(relay->io.fd);
-
-	/* Send all WALs until stop_vclock */
-	assert(relay->stream.write != NULL);
-	if (recover_remaining_wals(relay->r, &relay->stream,
-				   &relay->stop_vclock, true) != 0)
-		diag_raise();
-	assert(vclock_compare(&relay->r->vclock, &relay->stop_vclock) == 0);
-	return 0;
+	(void) wal_relay;
+	ssize_t rc = WAL_RELAY_FILTER_PASS;
+	struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
+				    ERRINJ_INT);
+	if (inj != NULL && (*row)->lsn == inj->iparam) {
+		struct xrow_header *filtered_row = (struct xrow_header *)
+			region_alloc(&fiber()->gc, sizeof(*filtered_row));
+		if (filtered_row == NULL) {
+			diag_set(OutOfMemory, sizeof(struct xrow_header),
+				 "region", "struct xrow_header");
+			return WAL_RELAY_FILTER_ERR;
+		}
+		*filtered_row = **row;
+		filtered_row->lsn = inj->iparam - 1;
+		say_warn("injected broken lsn: %lld",
+			 (long long) filtered_row->lsn);
+		*row = filtered_row;
+		rc = WAL_RELAY_FILTER_ROW;
+	}
+	/*
+	 * Transform replica local requests to IPROTO_NOP so as to
+	 * promote vclock on the replica without actually modifying
+	 * any data.
+	 */
+	if ((*row)->group_id == GROUP_LOCAL) {
+		struct xrow_header *filtered_row = (struct xrow_header *)
+			region_alloc(&fiber()->gc, sizeof(*filtered_row));
+		if (filtered_row == NULL) {
+			diag_set(OutOfMemory, sizeof(struct xrow_header),
+				 "region", "struct xrow_header");
+			return WAL_RELAY_FILTER_ERR;
+		}
+		*filtered_row = **row;
+		filtered_row->type = IPROTO_NOP;
+		filtered_row->group_id = GROUP_DEFAULT;
+		filtered_row->bodycnt = 0;
+		*row = filtered_row;
+		rc = WAL_RELAY_FILTER_ROW;
+	}
+	return rc;
 }
 
 void
@@ -321,21 +278,16 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock,
 	if (relay == NULL)
 		diag_raise();
 
-	relay_start(relay, fd, sync, relay_send_row);
+	relay_start(relay, fd, sync);
 	auto relay_guard = make_scoped_guard([=] {
 		relay_stop(relay);
 		relay_delete(relay);
 	});
 
-	relay->r = recovery_new(cfg_gets("wal_dir"), false,
-			       start_vclock);
 	vclock_copy(&relay->stop_vclock, stop_vclock);
 
-	int rc = cord_costart(&relay->cord, "final_join",
-			      relay_final_join_f, relay);
-	if (rc == 0)
-		rc = cord_cojoin(&relay->cord);
-	if (rc != 0)
+	if (wal_relay(&relay->wal_relay, start_vclock, stop_vclock,
+		      relay_final_join_filter, fd, relay->replica) != 0)
 		diag_raise();
 
 	ERROR_INJECT(ERRINJ_RELAY_FINAL_JOIN,
@@ -347,35 +299,6 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock,
 	});
 }
 
-/**
- * The message which updated tx thread with a new vclock has returned back
- * to the relay.
- */
-static void
-relay_status_update(struct cmsg *msg)
-{
-	struct relay_status_msg *status = (struct relay_status_msg *)msg;
-	msg->route = NULL;
-	fiber_cond_signal(&status->relay->reader_cond);
-}
-
-/**
- * Deliver a fresh relay vclock to tx thread.
- */
-static void
-tx_status_update(struct cmsg *msg)
-{
-	struct relay_status_msg *status = (struct relay_status_msg *)msg;
-	if (!status->relay->replica->anon)
-		wal_relay_status_update(status->relay->replica->id, &status->vclock);
-	vclock_copy(&status->relay->tx.vclock, &status->vclock);
-	static const struct cmsg_hop route[] = {
-		{relay_status_update, NULL}
-	};
-	cmsg_init(msg, route);
-	cpipe_push(&status->relay->relay_pipe, msg);
-}
-
 static void
 relay_set_error(struct relay *relay, struct error *e)
 {
@@ -384,186 +307,88 @@ relay_set_error(struct relay *relay, struct error *e)
 		diag_add_error(&relay->diag, e);
 }
 
-static void
-relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
+/*
+ * Filter callback function used during the subscribe phase.
+ */
+static ssize_t
+relay_subscribe_filter(struct wal_relay *wal_relay, struct xrow_header **row)
 {
-	struct relay *relay = container_of(watcher, struct relay, wal_watcher);
-	if (fiber_is_cancelled()) {
+	if ((*row)->type != IPROTO_OK) {
+		assert(iproto_type_is_dml((*row)->type));
 		/*
-		 * The relay is exiting. Rescanning the WAL at this
-		 * point would be pointless and even dangerous,
-		 * because the relay could have written a packet
-		 * fragment to the socket before being cancelled
-		 * so that writing another row to the socket would
-		 * lead to corrupted replication stream and, as
-		 * a result, permanent replication breakdown.
+		 * Because of asynchronous replication both master
+		 * and replica may have a different transaction
+		 * order in their logs. As we start relaying
+		 * transactions from the first unknown one, there
+		 * could be some others already known by the replica
+		 * and there is no point in sending them.
 		 */
-		return;
-	}
-	if (recover_remaining_wals(relay->r, &relay->stream, NULL,
-				   (events & WAL_EVENT_ROTATE) != 0) != 0) {
-		relay_set_error(relay, diag_last_error(diag_get()));
-		fiber_cancel(fiber());
+		if (vclock_get(&wal_relay->vclock, (*row)->replica_id) >=
+		    (*row)->lsn)
+			return WAL_RELAY_FILTER_SKIP;
 	}
-}
-
-/*
- * Relay reader fiber function.
- * Read xrow encoded vclocks sent by the replica.
- */
-int
-relay_reader_f(va_list ap)
-{
-	struct relay *relay = va_arg(ap, struct relay *);
-	struct fiber *relay_f = va_arg(ap, struct fiber *);
-
-	struct ibuf ibuf;
-	struct ev_io io;
-	coio_create(&io, relay->io.fd);
-	ibuf_create(&ibuf, &cord()->slabc, 1024);
-	try {
-		while (!fiber_is_cancelled()) {
-			struct xrow_header xrow;
-			if (coio_read_xrow_timeout(&io, &ibuf, &xrow,
-					replication_disconnect_timeout()) < 0)
-				diag_raise();
-			/* vclock is followed while decoding, zeroing it. */
-			vclock_create(&relay->recv_vclock);
-			xrow_decode_vclock_xc(&xrow, &relay->recv_vclock);
-			fiber_cond_signal(&relay->reader_cond);
+	ssize_t rc = WAL_RELAY_FILTER_PASS;
+
+	struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
+				    ERRINJ_INT);
+	if (inj != NULL && (*row)->lsn == inj->iparam) {
+		struct xrow_header *filtered_row = (struct xrow_header *)
+			region_alloc(&fiber()->gc, sizeof(*filtered_row));
+		if (filtered_row == NULL) {
+			diag_set(OutOfMemory, sizeof(struct xrow_header),
+				 "region", "struct xrow_header");
+			return WAL_RELAY_FILTER_ERR;
 		}
-	} catch (Exception *e) {
-		relay_set_error(relay, e);
-		fiber_cancel(relay_f);
+		*filtered_row = **row;
+		filtered_row->lsn = inj->iparam - 1;
+		say_warn("injected broken lsn: %lld",
+			 (long long) filtered_row->lsn);
+		*row = filtered_row;
+		rc = WAL_RELAY_FILTER_ROW;
 	}
-	ibuf_destroy(&ibuf);
-	return 0;
-}
-
-/**
- * Send a heartbeat message over a connected relay.
- */
-static void
-relay_send_heartbeat(struct relay *relay)
-{
-	struct xrow_header row;
-	xrow_encode_timestamp(&row, instance_id, ev_now(loop()));
-	try {
-		relay_send(relay, &row);
-	} catch (Exception *e) {
-		relay_set_error(relay, e);
-		fiber_cancel(fiber());
-	}
-}
-
-/**
- * A libev callback invoked when a relay client socket is ready
- * for read. This currently only happens when the client closes
- * its socket, and we get an EOF.
- */
-static int
-relay_subscribe_f(va_list ap)
-{
-	struct relay *relay = va_arg(ap, struct relay *);
-	struct recovery *r = relay->r;
-
-	coio_enable();
-	relay_set_cord_name(relay->io.fd);
-
-	/* Create cpipe to tx for propagating vclock. */
-	cbus_endpoint_create(&relay->endpoint, tt_sprintf("relay_%p", relay),
-			     fiber_schedule_cb, fiber());
-	cbus_pair("tx", relay->endpoint.name, &relay->tx_pipe,
-		  &relay->relay_pipe, NULL, NULL, cbus_process);
-
-	/* Setup WAL watcher for sending new rows to the replica. */
-	wal_set_watcher(&relay->wal_watcher, relay->endpoint.name,
-			relay_process_wal_event, cbus_process);
-
-	/* Start fiber for receiving replica acks. */
-	char name[FIBER_NAME_MAX];
-	snprintf(name, sizeof(name), "%s:%s", fiber()->name, "reader");
-	struct fiber *reader = fiber_new_xc(name, relay_reader_f);
-	fiber_set_joinable(reader, true);
-	fiber_start(reader, relay, fiber());
 
 	/*
-	 * If the replica happens to be up to date on subscribe,
-	 * don't wait for timeout to happen - send a heartbeat
-	 * message right away to update the replication lag as
-	 * soon as possible.
-	 */
-	relay_send_heartbeat(relay);
-
-	/*
-	 * Run the event loop until the connection is broken
-	 * or an error occurs.
+	 * Transform replica local requests to IPROTO_NOP so as to
+	 * promote vclock on the replica without actually modifying
+	 * any data.
 	 */
-	while (!fiber_is_cancelled()) {
-		double timeout = replication_timeout;
-		struct errinj *inj = errinj(ERRINJ_RELAY_REPORT_INTERVAL,
-					    ERRINJ_DOUBLE);
-		if (inj != NULL && inj->dparam != 0)
-			timeout = inj->dparam;
-
-		fiber_cond_wait_deadline(&relay->reader_cond,
-					 relay->last_row_time + timeout);
-
-		/*
-		 * The fiber can be woken by IO cancel, by a timeout of
-		 * status messaging or by an acknowledge to status message.
-		 * Handle cbus messages first.
-		 */
-		cbus_process(&relay->endpoint);
-		/* Check for a heartbeat timeout. */
-		if (ev_monotonic_now(loop()) - relay->last_row_time > timeout)
-			relay_send_heartbeat(relay);
-		/*
-		 * Check that the vclock has been updated and the previous
-		 * status message is delivered
-		 */
-		if (relay->status_msg.msg.route != NULL)
-			continue;
-		struct vclock *send_vclock;
-		if (relay->version_id < version_id(1, 7, 4))
-			send_vclock = &r->vclock;
-		else
-			send_vclock = &relay->recv_vclock;
-		if (vclock_sum(&relay->status_msg.vclock) ==
-		    vclock_sum(send_vclock))
-			continue;
-		static const struct cmsg_hop route[] = {
-			{tx_status_update, NULL}
-		};
-		cmsg_init(&relay->status_msg.msg, route);
-		vclock_copy(&relay->status_msg.vclock, send_vclock);
-		relay->status_msg.relay = relay;
-		cpipe_push(&relay->tx_pipe, &relay->status_msg.msg);
+	if ((*row)->group_id == GROUP_LOCAL) {
+		if ((*row)->replica_id == 0)
+			return WAL_RELAY_FILTER_SKIP;
+		struct xrow_header *filtered_row = (struct xrow_header *)
+			region_alloc(&fiber()->gc, sizeof(*filtered_row));
+		if (filtered_row == NULL) {
+			diag_set(OutOfMemory, sizeof(struct xrow_header),
+				 "region", "struct xrow_header");
+			return WAL_RELAY_FILTER_ERR;
+		}
+		*filtered_row = **row;
+		filtered_row->type = IPROTO_NOP;
+		filtered_row->group_id = GROUP_DEFAULT;
+		filtered_row->bodycnt = 0;
+		*row = filtered_row;
+		rc = WAL_RELAY_FILTER_ROW;
 	}
-
 	/*
-	 * Log the error that caused the relay to break the loop.
-	 * Don't clear the error for status reporting.
+	 * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
+	 * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
+	 * we must relay all rows, even those originating from the replica
+	 * itself (there may be such rows if this is a rebootstrap). If this
+	 * is SUBSCRIBE, only send a row if it is not from the same replica
+	 * (i.e. don't send replica's own rows back) or if this row is
+	 * missing on the other side (i.e. in case of sudden power-loss,
+	 * data was not written to WAL, so remote master can't recover
+	 * it). In the latter case packet's LSN is less than or equal to
+	 * local master's LSN at the moment it received 'SUBSCRIBE' request.
 	 */
-	assert(!diag_is_empty(&relay->diag));
-	diag_add_error(diag_get(), diag_last_error(&relay->diag));
-	diag_log();
-	say_crit("exiting the relay loop");
-
-	/* Clear garbage collector trigger and WAL watcher. */
-	wal_clear_watcher(&relay->wal_watcher, cbus_process);
-
-	/* Join ack reader fiber. */
-	fiber_cancel(reader);
-	fiber_join(reader);
-
-	/* Destroy cpipe to tx. */
-	cbus_unpair(&relay->tx_pipe, &relay->relay_pipe,
-		    NULL, NULL, cbus_process);
-	cbus_endpoint_destroy(&relay->endpoint, cbus_process);
-
-	relay_exit(relay);
-	return -1;
+	struct relay *relay = container_of(wal_relay, struct relay, wal_relay);
+	if (wal_relay->replica == NULL ||
+	    (*row)->replica_id != wal_relay->replica->id ||
+	    (*row)->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
+				      (*row)->replica_id)) {
+		return rc;
+	}
+	return WAL_RELAY_FILTER_SKIP;
 }
 
 /** Replication acceptor fiber handler. */
@@ -574,29 +399,21 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
 	assert(replica->anon || replica->id != REPLICA_ID_NIL);
 	struct relay *relay = replica->relay;
 	assert(relay->state != RELAY_FOLLOW);
-	if (!replica->anon)
-		wal_relay_status_update(replica->id, replica_clock);
 
-	relay_start(relay, fd, sync, relay_send_row);
+	relay_start(relay, fd, sync);
 	auto relay_guard = make_scoped_guard([=] {
 		relay_stop(relay);
 		replica_on_relay_stop(replica);
 	});
 
 	vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);
-	relay->r = recovery_new(cfg_gets("wal_dir"), false,
-			        replica_clock);
-	if (relay->r == NULL)
-		diag_raise();
-	vclock_copy(&relay->tx.vclock, replica_clock);
 	relay->version_id = replica_version_id;
 
-	int rc = cord_costart(&relay->cord, "subscribe",
-			      relay_subscribe_f, relay);
-	if (rc == 0)
-		rc = cord_cojoin(&relay->cord);
-	if (rc != 0)
-		diag_raise();
+	if (wal_relay(&relay->wal_relay, replica_clock, NULL,
+		      relay_subscribe_filter, fd, relay->replica) != 0)
+		relay_set_error(relay, diag_last_error(&fiber()->diag));
+	relay_exit(relay);
+	diag_raise();
 }
 
 static int
@@ -605,7 +422,6 @@ relay_send(struct relay *relay, struct xrow_header *packet)
 	ERROR_INJECT_YIELD(ERRINJ_RELAY_SEND_DELAY);
 
 	packet->sync = relay->sync;
-	relay->last_row_time = ev_monotonic_now(loop());
 	if (coio_write_xrow(&relay->io, packet) < 0)
 		return -1;
 	fiber_gc();
@@ -628,54 +444,3 @@ relay_send_initial_join_row(struct xstream *stream, struct xrow_header *row)
 		return relay_send(relay, row);
 	return 0;
 }
-
-/** Send a single row to the client. */
-static int
-relay_send_row(struct xstream *stream, struct xrow_header *packet)
-{
-	struct relay *relay = container_of(stream, struct relay, stream);
-	assert(iproto_type_is_dml(packet->type));
-	/*
-	 * Transform replica local requests to IPROTO_NOP so as to
-	 * promote vclock on the replica without actually modifying
-	 * any data.
-	 */
-	if (packet->group_id == GROUP_LOCAL) {
-		/*
-		 * Replica-local requests generated while replica
-		 * was anonymous have a zero instance id. Just
-		 * skip all these rows.
-		 */
-		if (packet->replica_id == REPLICA_ID_NIL)
-			return 0;
-		packet->type = IPROTO_NOP;
-		packet->group_id = GROUP_DEFAULT;
-		packet->bodycnt = 0;
-	}
-	/*
-	 * We're feeding a WAL, thus responding to FINAL JOIN or SUBSCRIBE
-	 * request. If this is FINAL JOIN (i.e. relay->replica is NULL),
-	 * we must relay all rows, even those originating from the replica
-	 * itself (there may be such rows if this is rebootstrap). If this
-	 * SUBSCRIBE, only send a row if it is not from the same replica
-	 * (i.e. don't send replica's own rows back) or if this row is
-	 * missing on the other side (i.e. in case of sudden power-loss,
-	 * data was not written to WAL, so remote master can't recover
-	 * it). In the latter case packet's LSN is less than or equal to
-	 * local master's LSN at the moment it received 'SUBSCRIBE' request.
-	 */
-	if (relay->replica == NULL ||
-	    packet->replica_id != relay->replica->id ||
-	    packet->lsn <= vclock_get(&relay->local_vclock_at_subscribe,
-				      packet->replica_id)) {
-		struct errinj *inj = errinj(ERRINJ_RELAY_BREAK_LSN,
-					    ERRINJ_INT);
-		if (inj != NULL && packet->lsn == inj->iparam) {
-			packet->lsn = inj->iparam - 1;
-			say_warn("injected broken lsn: %lld",
-				 (long long) packet->lsn);
-		}
-		return relay_send(relay, packet);
-	}
-	return 0;
-}
diff --git a/src/box/relay.h b/src/box/relay.h
index e1782d78f..43d4e7ab3 100644
--- a/src/box/relay.h
+++ b/src/box/relay.h
@@ -80,10 +80,10 @@ relay_get_state(const struct relay *relay);
 /**
  * Returns relay's vclock
  * @param relay relay
- * @returns relay's vclock
+ * @param vclock where to store relay's vclock
  */
-const struct vclock *
-relay_vclock(const struct relay *relay);
+void
+relay_vclock(const struct relay *relay, struct vclock *vclock);
 
 /**
  * Returns relay's last_row_time
diff --git a/src/box/replication.cc b/src/box/replication.cc
index 869177656..a7a513fe1 100644
--- a/src/box/replication.cc
+++ b/src/box/replication.cc
@@ -228,6 +228,9 @@ replica_set_id(struct replica *replica, uint32_t replica_id)
 {
 	assert(replica_id < VCLOCK_MAX);
 	assert(replica->id == REPLICA_ID_NIL); /* replica id is read-only */
+	/* If the replica was anonymous then unregister it from wal. */
+	if (replica->anon)
+		wal_relay_delete(0);
 	replica->id = replica_id;
 
 	if (tt_uuid_is_equal(&INSTANCE_UUID, &replica->uuid)) {
diff --git a/src/box/wal.c b/src/box/wal.c
index b483b8cc4..886663e0c 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -45,6 +45,9 @@
 #include "replication.h"
 #include "mclock.h"
 #include "xrow_buf.h"
+#include "recovery.h"
+#include "coio.h"
+#include "xrow_io.h"
 
 enum {
 	/**
@@ -183,6 +186,14 @@ struct wal_writer
 	 * without xlog files access.
 	 */
 	struct xrow_buf xrow_buf;
+	/** Xrow buffer condition signaled when a buffer write is done. */
+	struct fiber_cond xrow_buf_cond;
+	/**
+	 * Wal exit is not graceful, so there is a helper trigger
+	 * which is used in order to inform all relays that wal was
+	 * destroyed.
+	 */
+	struct rlist on_wal_exit;
 };
 
 struct wal_msg {
@@ -448,6 +459,8 @@ wal_writer_create(struct wal_writer *writer, enum wal_mode wal_mode,
 	fiber_cond_create(&writer->wal_gc_cond);
 	writer->gc_wal_vclock = NULL;
 	vclock_create(&writer->gc_first_vclock);
+
+	rlist_create(&writer->on_wal_exit);
 }
 
 /** Destroy a WAL writer structure. */
@@ -1208,6 +1221,7 @@ wal_write_to_disk(struct cmsg *msg)
 			stailq_concat(&wal_msg->rollback, &input);
 		} else {
 			xrow_buf_tx_commit(&writer->xrow_buf);
+			fiber_cond_signal(&writer->xrow_buf_cond);
 			/*
 			 * Schedule processed entries to commit
 			 * and update the wal vclock.
@@ -1302,6 +1316,7 @@ wal_writer_f(va_list ap)
 	 * should be done in the wal thread.
 	 */
 	xrow_buf_create(&writer->xrow_buf);
+	fiber_cond_create(&writer->xrow_buf_cond);
 
 	/** Initialize eio in this thread */
 	coio_enable();
@@ -1347,6 +1362,9 @@ wal_writer_f(va_list ap)
 	if (xlog_is_open(&vy_log_writer.xlog))
 		xlog_close(&vy_log_writer.xlog, false);
 
+	/* Inform relays that wal is exiting. */
+	trigger_run(&writer->on_wal_exit, NULL);
+
 	cpipe_destroy(&writer->tx_prio_pipe);
 	xrow_buf_destroy(&writer->xrow_buf);
 	return 0;
@@ -1614,49 +1632,6 @@ wal_notify_watchers(struct wal_writer *writer, unsigned events)
 		wal_watcher_notify(watcher, events);
 }
 
-struct wal_relay_status_update_msg {
-	struct cbus_call_msg base;
-	uint32_t replica_id;
-	struct vclock vclock;
-};
-
-static int
-wal_relay_status_update_f(struct cbus_call_msg *base)
-{
-	struct wal_writer *writer = &wal_writer_singleton;
-	struct wal_relay_status_update_msg *msg =
-		container_of(base, struct wal_relay_status_update_msg, base);
-	struct vclock old_vclock;
-	mclock_extract_row(&writer->mclock, msg->replica_id, &old_vclock);
-	if (writer->gc_wal_vclock != NULL &&
-	    vclock_order_changed(&old_vclock, writer->gc_wal_vclock,
-				 &msg->vclock))
-		fiber_cond_signal(&writer->wal_gc_cond);
-	mclock_update(&writer->mclock, msg->replica_id, &msg->vclock);
-	return 0;
-}
-
-void
-wal_relay_status_update(uint32_t replica_id, const struct vclock *vclock)
-{
-	struct wal_writer *writer = &wal_writer_singleton;
-	struct wal_relay_status_update_msg msg;
-	/*
-	 * We do not take anonymous replica in account. There is
-	 * no way to distinguish them but anonynous replica could
-	 * be rebootstrapped at any time.
-	 */
-	if (replica_id == 0)
-		return;
-	msg.replica_id = replica_id;
-	vclock_copy(&msg.vclock, vclock);
-	bool cancellable = fiber_set_cancellable(false);
-	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
-		  &msg.base, wal_relay_status_update_f, NULL,
-		  TIMEOUT_INFINITY);
-	fiber_set_cancellable(cancellable);
-}
-
 struct wal_relay_delete_msg {
 	struct cmsg base;
 	uint32_t replica_id;
@@ -1705,3 +1680,452 @@ wal_atfork()
 	if (xlog_is_open(&vy_log_writer.xlog))
 		xlog_atfork(&vy_log_writer.xlog);
 }
+
+/*
+ * Relay reader fiber function.
+ * Read xrow encoded vclocks sent by the replica.
+ */
+static int
+wal_relay_reader_f(va_list ap)
+{
+	struct wal_writer *writer = va_arg(ap, struct wal_writer *);
+	struct wal_relay *wal_relay = va_arg(ap, struct wal_relay *);
+	uint32_t replica_id = wal_relay->replica->id;
+
+	mclock_update(&writer->mclock, replica_id, &wal_relay->replica_vclock);
+	fiber_cond_signal(&writer->wal_gc_cond);
+
+	struct ibuf ibuf;
+	struct ev_io io;
+	coio_create(&io, wal_relay->fd);
+	ibuf_create(&ibuf, &cord()->slabc, 1024);
+	while (!fiber_is_cancelled()) {
+		struct xrow_header row;
+		if (coio_read_xrow_timeout(&io, &ibuf, &row,
+					   replication_disconnect_timeout()) < 0) {
+			if (diag_is_empty(&wal_relay->diag))
+				diag_move(&fiber()->diag, &wal_relay->diag);
+			break;
+		}
+
+		struct vclock cur_vclock;
+		/* The vclock is followed while decoding, so zero it first. */
+		vclock_create(&cur_vclock);
+		if (xrow_decode_vclock(&row, &cur_vclock) < 0)
+			break;
+
+		if (writer->gc_wal_vclock != NULL &&
+		    vclock_order_changed(&wal_relay->replica_vclock,
+					 writer->gc_wal_vclock, &cur_vclock))
+			fiber_cond_signal(&writer->wal_gc_cond);
+		vclock_copy(&wal_relay->replica_vclock, &cur_vclock);
+		mclock_update(&writer->mclock, replica_id, &cur_vclock);
+	}
+	ibuf_destroy(&ibuf);
+	fiber_cancel(wal_relay->fiber);
+	return 0;
+}
+
+struct wal_relay_stream {
+	struct xstream stream;
+	struct wal_relay *wal_relay;
+	struct ev_io io;
+};
+
+static int
+wal_relay_stream_write(struct xstream *stream, struct xrow_header *row)
+{
+	struct wal_relay_stream *wal_relay_stream =
+		container_of(stream, struct wal_relay_stream, stream);
+	struct wal_relay *wal_relay = wal_relay_stream->wal_relay;
+	/*
+	 * Remember the original row because the filter could
+	 * replace it.
+	 */
+	struct xrow_header *orig_row = row;
+	switch (wal_relay->on_filter(wal_relay, &row)) {
+	case WAL_RELAY_FILTER_PASS:
+	case WAL_RELAY_FILTER_ROW:
+		break;
+	case WAL_RELAY_FILTER_SKIP:
+		return 0;
+	case WAL_RELAY_FILTER_ERR:
+		return -1;
+	}
+	ERROR_INJECT_YIELD(ERRINJ_RELAY_SEND_DELAY);
+
+	vclock_follow_xrow(&wal_relay->vclock, orig_row);
+	int rc = coio_write_xrow(&wal_relay_stream->io, row);
+	struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
+	if (inj != NULL && inj->dparam > 0)
+		fiber_sleep(inj->dparam);
+
+	return rc >= 0 ? 0 : -1;
+}
+
+/* Structure to provide arguments for file relaying cord. */
+struct wal_relay_from_file_args {
+	/* Wal writer. */
+	struct wal_writer *writer;
+	/* Wal relay structure. */
+	struct wal_relay *wal_relay;
+	/* Vclock to stop relaying on. */
+	struct vclock stop_vclock;
+};
+
+/*
+ * Relay from file cord function. This cord reads logs and
+ * sends data to the replica.
+ */
+static int
+wal_relay_from_file_f(va_list ap)
+{
+	struct wal_relay_from_file_args *args =
+		va_arg(ap, struct wal_relay_from_file_args *);
+	/* Recover xlogs from files. */
+	struct recovery *recovery = recovery_new(args->writer->wal_dir.dirname,
+						 false,
+						 &args->wal_relay->vclock);
+	if (recovery == NULL)
+		return -1;
+	struct wal_relay_stream wal_relay_stream;
+	xstream_create(&wal_relay_stream.stream, wal_relay_stream_write);
+	wal_relay_stream.wal_relay = args->wal_relay;
+	coio_create(&wal_relay_stream.io, args->wal_relay->fd);
+
+	if (recover_remaining_wals(recovery, &wal_relay_stream.stream,
+	    &args->stop_vclock, true) != 0) {
+		recovery_delete(recovery);
+		return -1;
+	}
+	recovery_delete(recovery);
+	return 0;
+}
+
+static int
+wal_relay_from_file(struct wal_writer *writer, struct wal_relay *wal_relay)
+{
+	struct wal_relay_from_file_args args;
+	args.writer = writer;
+	args.wal_relay = wal_relay;
+
+	vclock_create(&args.stop_vclock);
+	if (vclock_is_set(&wal_relay->stop_vclock))
+		vclock_copy(&args.stop_vclock, &wal_relay->stop_vclock);
+	else
+		vclock_copy(&args.stop_vclock, &writer->vclock);
+
+	int rc = cord_costart(&wal_relay->cord, "file relay",
+			      wal_relay_from_file_f, &args);
+	if (rc == 0)
+		rc = cord_cojoin(&wal_relay->cord);
+	return rc;
+}
+
+static int
+wal_relay_send_hearthbeat(struct ev_io *io)
+{
+	struct xrow_header hearthbeat;
+	xrow_encode_timestamp(&hearthbeat, instance_id, ev_now(loop()));
+	return coio_write_xrow(io, &hearthbeat);
+}
+
+/* Relay rows from the wal in-memory buffer. */
+static int
+wal_relay_from_memory(struct wal_writer *writer, struct wal_relay *wal_relay)
+{
+	double last_row_time = 0;
+	struct xrow_buf_cursor cursor;
+	if (xrow_buf_cursor_create(&writer->xrow_buf, &cursor,
+				   &wal_relay->vclock) != 0)
+		return 0;
+	struct ev_io io;
+	coio_create(&io, wal_relay->fd);
+	/* The cursor was created, so we can process rows one by one. */
+	while (!fiber_is_cancelled()) {
+		if (vclock_is_set(&wal_relay->stop_vclock)) {
+			int rc = vclock_compare(&wal_relay->stop_vclock,
+						 &wal_relay->vclock);
+			if (rc <= 0 && rc != VCLOCK_ORDER_UNDEFINED)
+				return 1;
+		}
+		struct xrow_header *row;
+		void *data;
+		size_t size;
+		int rc = xrow_buf_cursor_next(&writer->xrow_buf, &cursor,
+					     &row, &data, &size);
+		if (rc < 0) {
+			/*
+			 * Wal memory buffer was rotated and we are not in
+			 * memory.
+			 */
+			return 0;
+		}
+		if (rc > 0) {
+			/*
+			 * There are no more rows in the buffer. Wait
+			 * until wal writes new ones or, if the timeout
+			 * is exceeded, send a heartbeat message.
+			 */
+			double timeout = replication_timeout;
+			struct errinj *inj = errinj(ERRINJ_RELAY_REPORT_INTERVAL,
+						    ERRINJ_DOUBLE);
+			if (inj != NULL && inj->dparam != 0)
+				timeout = inj->dparam;
+
+			fiber_cond_wait_deadline(&writer->xrow_buf_cond,
+						 last_row_time + timeout);
+			if (ev_monotonic_now(loop()) - last_row_time >
+			    timeout) {
+				/* Timeout was exceeded - send a heartbeat. */
+				if (wal_relay_send_hearthbeat(&io) < 0)
+					return -1;
+				last_row_time = ev_monotonic_now(loop());
+			}
+			continue;
+		}
+		ERROR_INJECT(ERRINJ_WAL_MEM_IGNORE, return 0);
+		/*
+		 * Remember the original row because the filter could
+		 * replace it.
+		 */
+		struct xrow_header *orig_row = row;
+		switch (wal_relay->on_filter(wal_relay, &row)) {
+		case WAL_RELAY_FILTER_PASS:
+		case WAL_RELAY_FILTER_ROW:
+			break;
+		case WAL_RELAY_FILTER_SKIP:
+			continue;
+		case WAL_RELAY_FILTER_ERR:
+			return -1;
+		}
+
+		ERROR_INJECT(ERRINJ_RELAY_SEND_DELAY, { return 0;});
+
+		last_row_time = ev_monotonic_now(loop());
+		if (coio_write_xrow(&io, row) < 0)
+			return -1;
+		vclock_follow_xrow(&wal_relay->vclock, orig_row);
+		struct errinj *inj = errinj(ERRINJ_RELAY_TIMEOUT, ERRINJ_DOUBLE);
+		if (inj != NULL && inj->dparam > 0)
+			fiber_sleep(inj->dparam);
+	}
+	return -1;
+}
+
+static int
+wal_relay_on_wal_exit(struct trigger *trigger, void *event)
+{
+	(void) event;
+	struct wal_relay *wal_relay = (struct wal_relay *)trigger->data;
+	if (wal_relay->cord.id > 0)
+		pthread_cancel(wal_relay->cord.id);
+	fiber_cancel(wal_relay->fiber);
+	wal_relay->is_wal_exit = true;
+	return 0;
+}
+
+/* Wake the relay up when wal_relay has finished. */
+static void
+wal_relay_done(struct cmsg *base)
+{
+	struct wal_relay *msg =
+		container_of(base, struct wal_relay, base);
+	msg->done = true;
+	fiber_cond_signal(&msg->done_cond);
+}
+
+static int
+wal_relay_f(va_list ap)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	struct wal_relay *wal_relay = va_arg(ap, struct wal_relay *);
+
+	struct trigger on_wal_exit;
+	trigger_create(&on_wal_exit, wal_relay_on_wal_exit, wal_relay, NULL);
+	trigger_add(&writer->on_wal_exit, &on_wal_exit);
+
+	struct fiber *reader = NULL;
+	if (wal_relay->replica != NULL && wal_relay->replica->id != REPLICA_ID_NIL) {
+		/* Start fiber for receiving replica acks. */
+		char name[FIBER_NAME_MAX];
+		snprintf(name, sizeof(name), "%s:%s", fiber()->name, "reader");
+		reader = fiber_new(name, wal_relay_reader_f);
+		if (reader == NULL) {
+			diag_move(&fiber()->diag, &wal_relay->diag);
+			return 0;
+		}
+		fiber_set_joinable(reader, true);
+		fiber_start(reader, writer, wal_relay);
+
+		struct ev_io io;
+		coio_create(&io, wal_relay->fd);
+		if (wal_relay_send_hearthbeat(&io) < 0)
+			goto done;
+	}
+
+	while (wal_relay_from_memory(writer, wal_relay) == 0 &&
+	       wal_relay_from_file(writer, wal_relay) == 0);
+
+done:
+	if (wal_relay->is_wal_exit)
+		return 0;
+	trigger_clear(&on_wal_exit);
+	if (diag_is_empty(&wal_relay->diag))
+		diag_move(&fiber()->diag, &wal_relay->diag);
+
+	if (reader != NULL) {
+		/* Join ack reader fiber. */
+		fiber_cancel(reader);
+		fiber_join(reader);
+	}
+	if (wal_relay->is_wal_exit)
+		return 0;
+
+	static struct cmsg_hop done_route[] = {
+		{wal_relay_done, NULL}
+	};
+	cmsg_init(&wal_relay->base, done_route);
+	cpipe_push(&writer->tx_prio_pipe, &wal_relay->base);
+	wal_relay->fiber = NULL;
+	return 0;
+}
+
+static void
+wal_relay_attach(struct cmsg *base)
+{
+	struct wal_relay *wal_relay = container_of(base, struct wal_relay, base);
+	wal_relay->fiber = fiber_new("wal relay fiber", wal_relay_f);
+	wal_relay->cord.id = 0;
+	wal_relay->is_wal_exit = false;
+	fiber_start(wal_relay->fiber, wal_relay);
+}
+
+static void
+wal_relay_cancel(struct cmsg *base)
+{
+	struct wal_relay *wal_relay = container_of(base, struct wal_relay,
+						 cancel_msg);
+	/*
+	 * The relay was cancelled, so cancel the corresponding
+	 * fiber in the wal thread if it is still alive.
+	 */
+	if (wal_relay->fiber != NULL)
+		fiber_cancel(wal_relay->fiber);
+}
+
+int
+wal_relay(struct wal_relay *wal_relay, const struct vclock *vclock,
+	  const struct vclock *stop_vclock,  wal_relay_filter_cb on_filter, int fd,
+	  struct replica *replica)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+	vclock_copy(&wal_relay->vclock, vclock);
+	vclock_create(&wal_relay->stop_vclock);
+	if (stop_vclock != NULL)
+		vclock_copy(&wal_relay->stop_vclock, stop_vclock);
+	else
+		vclock_clear(&wal_relay->stop_vclock);
+	wal_relay->on_filter = on_filter;
+	wal_relay->fd = fd;
+	wal_relay->replica = replica;
+	diag_create(&wal_relay->diag);
+	wal_relay->cancel_msg.route = NULL;
+
+	fiber_cond_create(&wal_relay->done_cond);
+	wal_relay->done = false;
+
+	static struct cmsg_hop route[] = {
+		{wal_relay_attach, NULL}
+	};
+	cmsg_init(&wal_relay->base, route);
+	cpipe_push(&writer->wal_pipe, &wal_relay->base);
+
+	/*
+	 * We do not use cbus_call because we must be able to
+	 * handle cancellation of this fiber and send a cancel
+	 * request to the wal cord to force wal detach.
+	 */
+	while (!wal_relay->done) {
+		if (fiber_is_cancelled() &&
+		    wal_relay->cancel_msg.route == NULL) {
+			/* Send a cancel message to a wal relay fiber. */
+			static struct cmsg_hop cancel_route[] = {
+				{wal_relay_cancel, NULL}};
+			cmsg_init(&wal_relay->cancel_msg, cancel_route);
+			cpipe_push(&writer->wal_pipe, &wal_relay->cancel_msg);
+		}
+		fiber_cond_wait(&wal_relay->done_cond);
+	}
+
+	if (!diag_is_empty(&wal_relay->diag)) {
+		diag_move(&wal_relay->diag, &fiber()->diag);
+		return -1;
+	}
+	if (fiber_is_cancelled()) {
+		diag_set(FiberIsCancelled);
+		return -1;
+	}
+	return 0;
+}
+
+struct wal_relay_vclock_msg {
+	struct cbus_call_msg base;
+	const struct wal_relay *wal_relay;
+	struct vclock *vclock;
+};
+
+static int
+wal_relay_vclock_f(struct cbus_call_msg *base)
+{
+	struct wal_relay_vclock_msg *msg =
+		container_of(base, struct wal_relay_vclock_msg, base);
+	vclock_copy(msg->vclock, &msg->wal_relay->replica_vclock);
+	return 0;
+}
+
+int
+wal_relay_vclock(const struct wal_relay *wal_relay, struct vclock *vclock)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+
+	struct wal_relay_vclock_msg msg;
+	msg.wal_relay = wal_relay;
+	msg.vclock = vclock;
+	bool cancellable = fiber_set_cancellable(false);
+	int rc = cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
+			   &msg.base, wal_relay_vclock_f, NULL,
+			   TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+	return rc;
+}
+
+struct wal_relay_last_row_time_msg {
+	struct cbus_call_msg base;
+	const struct wal_relay *wal_relay;
+	double last_row_time;
+};
+
+static int
+wal_relay_last_row_time_f(struct cbus_call_msg *base)
+{
+	struct wal_relay_last_row_time_msg *msg =
+		container_of(base, struct wal_relay_last_row_time_msg, base);
+	msg->last_row_time = msg->wal_relay->last_row_time;
+	return 0;
+}
+
+double
+wal_relay_last_row_time(const struct wal_relay *wal_relay)
+{
+	struct wal_writer *writer = &wal_writer_singleton;
+
+	struct wal_relay_last_row_time_msg msg;
+	msg.wal_relay = wal_relay;
+	bool cancellable = fiber_set_cancellable(false);
+	cbus_call(&writer->wal_pipe, &writer->tx_prio_pipe,
+		  &msg.base, wal_relay_last_row_time_f, NULL,
+		  TIMEOUT_INFINITY);
+	fiber_set_cancellable(cancellable);
+	return msg.last_row_time;
+}
diff --git a/src/box/wal.h b/src/box/wal.h
index 86887656d..a84c976c7 100644
--- a/src/box/wal.h
+++ b/src/box/wal.h
@@ -36,6 +36,7 @@
 #include "cbus.h"
 #include "journal.h"
 #include "vclock.h"
+#include "xstream.h"
 
 struct fiber;
 struct wal_writer;
@@ -236,12 +237,6 @@ wal_set_gc_first_vclock(const struct vclock *vclock);
 void
 wal_set_checkpoint_threshold(int64_t threshold);
 
-/**
- * Update a wal consumer vclock position.
- */
-void
-wal_relay_status_update(uint32_t replica_id, const struct vclock *vclock);
-
 /**
  * Unregister a wal consumer.
  */
@@ -263,6 +258,91 @@ wal_write_vy_log(struct journal_entry *req);
 void
 wal_rotate_vy_log();
 
+struct replica;
+struct wal_relay;
+
+#define WAL_RELAY_FILTER_ERR		-1
+#define WAL_RELAY_FILTER_PASS		0
+#define WAL_RELAY_FILTER_ROW		1
+#define WAL_RELAY_FILTER_SKIP		2
+
+typedef ssize_t (*wal_relay_filter_cb)(struct wal_relay *wal_relay,
+				       struct xrow_header **row);
+
+/**
+ * Wal relay maintains wal memory tracking and allows
+ * retrieving logged xrows directly from the wal memory.
+ */
+struct wal_relay {
+	struct cmsg base;
+	/** Current wal relay position. */
+	struct vclock vclock;
+	/** Vclock to stop relaying. */
+	struct vclock stop_vclock;
+	/** Replica socket handle. */
+	int fd;
+	/**
+	 * Filter callback deciding whether a row should be
+	 * passed to the replica as is, replaced by a NOP or
+	 * another row, or skipped.
+	 */
+	wal_relay_filter_cb on_filter;
+	/**
+	 * The relay working fiber, preserved so that it can be
+	 * cancelled when relaying is cancelled.
+	 */
+	struct fiber *fiber;
+	/** Message to cancel relaying fiber. */
+	struct cmsg cancel_msg;
+	/** Fiber condition signalled when relaying is stopped. */
+	struct fiber_cond done_cond;
+	/** Set to true when relaying is stopped. */
+	bool done;
+	/** Return code. */
+	int rc;
+	/** Diagnostic area. */
+	struct diag diag;
+	/** Replica which consumes relayed logs. */
+	struct replica *replica;
+	/** Vclock reported by replica. */
+	struct vclock replica_vclock;
+	/** Last transmission time. */
+	double last_row_time;
+	/** Cord spawned to relay from files. */
+	struct cord cord;
+	/** True if the relay was signalled about wal exit. */
+	bool is_wal_exit;
+};
+
+/**
+ * Start fetching rows directly from the wal memory buffer.
+ * This function attaches to the wal, starts a fiber which
+ * handles a wal memory cursor, and yields until that fiber
+ * exits because the cursor became outdated or a row sending
+ * error occurred. If the fiber which called this function is
+ * cancelled, a special cancel message is sent in order to
+ * stop the relaying fiber.
+ *
+ * @param wal_relay a wal relay structure to keep all temporary
+ * values in
+ * @param vclock a vclock to start relaying from
+ * @param stop_vclock a vclock to stop relaying at
+ * @param on_filter a callback to patch relayed rows
+ * @param fd replica socket handle
+ * @param replica client replica which consumes the logs
+ * @retval 0 relaying finished because the cursor is out of date
+ * @retval -1 relaying finished because of an error.
+ */
+int
+wal_relay(struct wal_relay *wal_relay, const struct vclock *vclock,
+	  const struct vclock *stop_vclock,  wal_relay_filter_cb on_filter,
+	  int fd, struct replica *replica);
+
+int
+wal_relay_vclock(const struct wal_relay *wal_relay, struct vclock *vclock);
+
+double
+wal_relay_last_row_time(const struct wal_relay *wal_relay);
+
 #if defined(__cplusplus)
 } /* extern "C" */
 #endif /* defined(__cplusplus) */
diff --git a/src/lib/core/errinj.h b/src/lib/core/errinj.h
index 672da2119..c0025000a 100644
--- a/src/lib/core/errinj.h
+++ b/src/lib/core/errinj.h
@@ -135,6 +135,7 @@ struct errinj {
 	_(ERRINJ_COIO_SENDFILE_CHUNK, ERRINJ_INT, {.iparam = -1}) \
 	_(ERRINJ_SWIM_FD_ONLY, ERRINJ_BOOL, {.bparam = false}) \
 	_(ERRINJ_DYN_MODULE_COUNT, ERRINJ_INT, {.iparam = 0}) \
+	_(ERRINJ_WAL_MEM_IGNORE, ERRINJ_BOOL, {.bparam = false}) \
 
 ENUM0(errinj_id, ERRINJ_LIST);
 extern struct errinj errinjs[];
diff --git a/test/box-py/iproto.test.py b/test/box-py/iproto.test.py
index 77637d8ed..788df58f8 100644
--- a/test/box-py/iproto.test.py
+++ b/test/box-py/iproto.test.py
@@ -293,9 +293,13 @@ uuid = '0d5bd431-7f3e-4695-a5c2-82de0a9cbc95'
 header = { IPROTO_CODE: REQUEST_TYPE_JOIN, IPROTO_SYNC: 2334 }
 body = { IPROTO_SERVER_UUID: uuid }
 resp = test_request(header, body)
+# In-memory replication does not support the sync
+# field for rows replied during final join or
+# subscribe because that would require re-encoding
+# each row stored in the memory buffer.
 if resp['header'][IPROTO_SYNC] == 2334:
     i = 1
-    while i < 3:
+    while i < 2:
         resp = receive_response()
         if resp['header'][IPROTO_SYNC] != 2334:
             print 'Bad sync on response with number ', i
@@ -306,6 +310,9 @@ if resp['header'][IPROTO_SYNC] == 2334:
         print 'Sync ok'
 else:
     print 'Bad first sync'
+# read until the third OK
+while receive_response()['header'][IPROTO_CODE] != REQUEST_TYPE_OK:
+    pass
 
 #
 # Try incorrect JOIN. SYNC must be also returned.
diff --git a/test/box/errinj.result b/test/box/errinj.result
index babe36b1b..62c0832ef 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -23,132 +23,134 @@ errinj.info()
 ---
 - ERRINJ_VY_RUN_WRITE_STMT_TIMEOUT:
     state: 0
-  ERRINJ_WAL_WRITE:
-    state: false
-  ERRINJ_RELAY_BREAK_LSN:
+  ERRINJ_WAL_BREAK_LSN:
     state: -1
-  ERRINJ_HTTPC_EXECUTE:
-    state: false
   ERRINJ_VYRUN_DATA_READ:
     state: false
-  ERRINJ_SWIM_FD_ONLY:
-    state: false
-  ERRINJ_SQL_NAME_NORMALIZATION:
-    state: false
   ERRINJ_VY_SCHED_TIMEOUT:
     state: 0
-  ERRINJ_COIO_SENDFILE_CHUNK:
-    state: -1
   ERRINJ_HTTP_RESPONSE_ADD_WAIT:
     state: false
-  ERRINJ_WAL_WRITE_PARTIAL:
-    state: -1
-  ERRINJ_VY_GC:
-    state: false
-  ERRINJ_WAL_DELAY:
-    state: false
-  ERRINJ_INDEX_ALLOC:
-    state: false
   ERRINJ_WAL_WRITE_EOF:
     state: false
-  ERRINJ_WAL_SYNC:
-    state: false
-  ERRINJ_BUILD_INDEX:
-    state: -1
   ERRINJ_BUILD_INDEX_DELAY:
     state: false
-  ERRINJ_VY_RUN_FILE_RENAME:
-    state: false
-  ERRINJ_VY_COMPACTION_DELAY:
-    state: false
-  ERRINJ_VY_DUMP_DELAY:
-    state: false
   ERRINJ_VY_DELAY_PK_LOOKUP:
     state: false
-  ERRINJ_VY_TASK_COMPLETE:
-    state: false
-  ERRINJ_PORT_DUMP:
+  ERRINJ_VY_POINT_ITER_WAIT:
     state: false
-  ERRINJ_WAL_BREAK_LSN:
-    state: -1
   ERRINJ_WAL_IO:
     state: false
-  ERRINJ_WAL_FALLOCATE:
-    state: 0
-  ERRINJ_DYN_MODULE_COUNT:
-    state: 0
   ERRINJ_VY_INDEX_FILE_RENAME:
     state: false
   ERRINJ_TUPLE_FORMAT_COUNT:
     state: -1
   ERRINJ_TUPLE_ALLOC:
     state: false
-  ERRINJ_VY_RUN_WRITE_DELAY:
+  ERRINJ_VY_RUN_FILE_RENAME:
     state: false
   ERRINJ_VY_READ_PAGE:
     state: false
   ERRINJ_RELAY_REPORT_INTERVAL:
     state: 0
-  ERRINJ_VY_LOG_FILE_RENAME:
-    state: false
-  ERRINJ_VY_READ_PAGE_TIMEOUT:
-    state: 0
+  ERRINJ_RELAY_BREAK_LSN:
+    state: -1
   ERRINJ_XLOG_META:
     state: false
-  ERRINJ_SIO_READ_MAX:
-    state: -1
   ERRINJ_SNAP_COMMIT_DELAY:
     state: false
-  ERRINJ_WAL_WRITE_DISK:
+  ERRINJ_VY_RUN_WRITE:
     state: false
-  ERRINJ_SNAP_WRITE_DELAY:
+  ERRINJ_BUILD_INDEX:
+    state: -1
+  ERRINJ_RELAY_FINAL_JOIN:
+    state: false
+  ERRINJ_REPLICA_JOIN_DELAY:
     state: false
   ERRINJ_LOG_ROTATE:
     state: false
-  ERRINJ_VY_RUN_WRITE:
+  ERRINJ_MEMTX_DELAY_GC:
     state: false
-  ERRINJ_CHECK_FORMAT_DELAY:
+  ERRINJ_XLOG_GARBAGE:
+    state: false
+  ERRINJ_VY_READ_PAGE_DELAY:
+    state: false
+  ERRINJ_SWIM_FD_ONLY:
     state: false
+  ERRINJ_WAL_WRITE:
+    state: false
+  ERRINJ_HTTPC_EXECUTE:
+    state: false
+  ERRINJ_SQL_NAME_NORMALIZATION:
+    state: false
+  ERRINJ_WAL_WRITE_PARTIAL:
+    state: -1
+  ERRINJ_VY_GC:
+    state: false
+  ERRINJ_WAL_DELAY:
+    state: false
+  ERRINJ_XLOG_READ:
+    state: -1
+  ERRINJ_WAL_SYNC:
+    state: false
+  ERRINJ_VY_TASK_COMPLETE:
+    state: false
+  ERRINJ_PORT_DUMP:
+    state: false
+  ERRINJ_COIO_SENDFILE_CHUNK:
+    state: -1
+  ERRINJ_DYN_MODULE_COUNT:
+    state: 0
+  ERRINJ_SIO_READ_MAX:
+    state: -1
+  ERRINJ_WAL_MEM_IGNORE:
+    state: false
+  ERRINJ_RELAY_TIMEOUT:
+    state: 0
+  ERRINJ_VY_DUMP_DELAY:
+    state: false
+  ERRINJ_VY_SQUASH_TIMEOUT:
+    state: 0
   ERRINJ_VY_LOG_FLUSH_DELAY:
     state: false
-  ERRINJ_RELAY_FINAL_JOIN:
+  ERRINJ_RELAY_SEND_DELAY:
     state: false
-  ERRINJ_REPLICA_JOIN_DELAY:
+  ERRINJ_VY_COMPACTION_DELAY:
     state: false
-  ERRINJ_RELAY_FINAL_SLEEP:
+  ERRINJ_VY_LOG_FILE_RENAME:
     state: false
   ERRINJ_VY_RUN_DISCARD:
     state: false
   ERRINJ_WAL_ROTATE:
     state: false
-  ERRINJ_RELAY_EXIT_DELAY:
+  ERRINJ_VY_READ_PAGE_TIMEOUT:
     state: 0
-  ERRINJ_VY_POINT_ITER_WAIT:
+  ERRINJ_VY_INDEX_DUMP:
+    state: -1
+  ERRINJ_TUPLE_FIELD:
     state: false
-  ERRINJ_MEMTX_DELAY_GC:
+  ERRINJ_SNAP_WRITE_DELAY:
     state: false
   ERRINJ_IPROTO_TX_DELAY:
     state: false
-  ERRINJ_XLOG_READ:
-    state: -1
-  ERRINJ_TUPLE_FIELD:
+  ERRINJ_RELAY_EXIT_DELAY:
+    state: 0
+  ERRINJ_RELAY_FINAL_SLEEP:
     state: false
-  ERRINJ_XLOG_GARBAGE:
+  ERRINJ_WAL_WRITE_DISK:
     state: false
-  ERRINJ_VY_INDEX_DUMP:
-    state: -1
-  ERRINJ_VY_READ_PAGE_DELAY:
+  ERRINJ_CHECK_FORMAT_DELAY:
     state: false
   ERRINJ_TESTING:
     state: false
-  ERRINJ_RELAY_SEND_DELAY:
+  ERRINJ_VY_RUN_WRITE_DELAY:
     state: false
-  ERRINJ_VY_SQUASH_TIMEOUT:
+  ERRINJ_WAL_FALLOCATE:
     state: 0
   ERRINJ_VY_LOG_FLUSH:
     state: false
-  ERRINJ_RELAY_TIMEOUT:
-    state: 0
+  ERRINJ_INDEX_ALLOC:
+    state: false
 ...
 errinj.set("some-injection", true)
 ---
diff --git a/test/replication/force_recovery.result b/test/replication/force_recovery.result
index f50452858..e48c12657 100644
--- a/test/replication/force_recovery.result
+++ b/test/replication/force_recovery.result
@@ -16,6 +16,10 @@ _ = box.space.test:create_index('primary')
 box.schema.user.grant('guest', 'replication')
 ---
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 -- Deploy a replica.
 test_run:cmd("create server test with rpl_master=default, script='replication/replica.lua'")
 ---
@@ -86,6 +90,10 @@ test_run:cmd("switch default")
 box.cfg{force_recovery = false}
 ---
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
+---
+- ok
+...
 -- Cleanup.
 test_run:cmd("stop server test")
 ---
diff --git a/test/replication/force_recovery.test.lua b/test/replication/force_recovery.test.lua
index 54307814b..c08bb9c02 100644
--- a/test/replication/force_recovery.test.lua
+++ b/test/replication/force_recovery.test.lua
@@ -8,6 +8,7 @@ _ = box.schema.space.create('test')
 _ = box.space.test:create_index('primary')
 box.schema.user.grant('guest', 'replication')
 
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 -- Deploy a replica.
 test_run:cmd("create server test with rpl_master=default, script='replication/replica.lua'")
 test_run:cmd("start server test")
@@ -33,6 +34,7 @@ box.space.test:select()
 box.info.replication[1].upstream.status == 'stopped' or box.info
 test_run:cmd("switch default")
 box.cfg{force_recovery = false}
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
 
 -- Cleanup.
 test_run:cmd("stop server test")
diff --git a/test/replication/replica_rejoin.result b/test/replication/replica_rejoin.result
index f71292da1..187634c62 100644
--- a/test/replication/replica_rejoin.result
+++ b/test/replication/replica_rejoin.result
@@ -184,6 +184,10 @@ test_run:cmd("stop server replica")
 - true
 ...
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 checkpoint_count = box.cfg.checkpoint_count
 ---
 ...
@@ -368,6 +372,10 @@ test_run:cmd("switch default")
 ---
 - true
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
+---
+- ok
+...
 box.cfg{replication = ''}
 ---
 ...
diff --git a/test/replication/replica_rejoin.test.lua b/test/replication/replica_rejoin.test.lua
index 22a91d8d7..3ee98bc85 100644
--- a/test/replication/replica_rejoin.test.lua
+++ b/test/replication/replica_rejoin.test.lua
@@ -70,6 +70,7 @@ box.space.test:replace{1, 2, 3} -- bumps LSN on the replica
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 checkpoint_count = box.cfg.checkpoint_count
 box.cfg{checkpoint_count = 1}
 for i = 1, 3 do box.space.test:delete{i * 10} end
@@ -135,6 +136,7 @@ box.space.test:replace{2}
 
 -- Cleanup.
 test_run:cmd("switch default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
 box.cfg{replication = ''}
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
diff --git a/test/replication/show_error_on_disconnect.result b/test/replication/show_error_on_disconnect.result
index 48003db06..e6920c160 100644
--- a/test/replication/show_error_on_disconnect.result
+++ b/test/replication/show_error_on_disconnect.result
@@ -20,6 +20,10 @@ test_run:cmd("switch master_quorum1")
 ---
 - true
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 repl = box.cfg.replication
 ---
 ...
@@ -30,6 +34,10 @@ test_run:cmd("switch master_quorum2")
 ---
 - true
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 box.space.test:insert{1}
 ---
 - [1]
diff --git a/test/replication/show_error_on_disconnect.test.lua b/test/replication/show_error_on_disconnect.test.lua
index 1b0ea4373..2a944dfc3 100644
--- a/test/replication/show_error_on_disconnect.test.lua
+++ b/test/replication/show_error_on_disconnect.test.lua
@@ -10,9 +10,11 @@ SERVERS = {'master_quorum1', 'master_quorum2'}
 test_run:create_cluster(SERVERS)
 test_run:wait_fullmesh(SERVERS)
 test_run:cmd("switch master_quorum1")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 repl = box.cfg.replication
 box.cfg{replication = ""}
 test_run:cmd("switch master_quorum2")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 box.space.test:insert{1}
 box.snapshot()
 box.space.test:insert{2}
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index ed1de3140..23be12528 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua force_recovery.test.lua show_error_on_disconnect.test.lua replica_rejoin.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
diff --git a/test/xlog/panic_on_wal_error.result b/test/xlog/panic_on_wal_error.result
index 22f14f912..897116b3b 100644
--- a/test/xlog/panic_on_wal_error.result
+++ b/test/xlog/panic_on_wal_error.result
@@ -19,6 +19,10 @@ _ = box.space.test:create_index('pk')
 -- reopen xlog
 --
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 box.space.test ~= nil
 ---
 - true
@@ -68,6 +72,10 @@ test_run:cmd("stop server replica")
 - true
 ...
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
+---
+- ok
+...
 box.space.test:auto_increment{'after snapshot'}
 ---
 - [2, 'after snapshot']
@@ -153,6 +161,10 @@ test_run:cmd("switch default")
 ---
 - true
 ...
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
+---
+- ok
+...
 test_run:cmd("stop server replica")
 ---
 - true
diff --git a/test/xlog/panic_on_wal_error.test.lua b/test/xlog/panic_on_wal_error.test.lua
index 2e95431c6..d973a00ff 100644
--- a/test/xlog/panic_on_wal_error.test.lua
+++ b/test/xlog/panic_on_wal_error.test.lua
@@ -10,6 +10,7 @@ _ = box.space.test:create_index('pk')
 -- reopen xlog
 --
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 box.space.test ~= nil
 -- insert some stuff
 -- 
@@ -32,6 +33,7 @@ box.space.test:select{}
 test_run:cmd("switch default")
 test_run:cmd("stop server replica")
 test_run:cmd("restart server default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", true)
 box.space.test:auto_increment{'after snapshot'}
 box.space.test:auto_increment{'after snapshot - one more row'}
 --
@@ -67,6 +69,7 @@ box.space.test:select{}
 --
 --
 test_run:cmd("switch default")
+box.error.injection.set("ERRINJ_WAL_MEM_IGNORE", false)
 test_run:cmd("stop server replica")
 test_run:cmd("cleanup server replica")
 --
diff --git a/test/xlog/suite.ini b/test/xlog/suite.ini
index 689d2b871..c208c73c4 100644
--- a/test/xlog/suite.ini
+++ b/test/xlog/suite.ini
@@ -4,7 +4,7 @@ description = tarantool write ahead log tests
 script = xlog.lua
 disabled = snap_io_rate.test.lua upgrade.test.lua
 valgrind_disabled =
-release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua panic_on_broken_lsn.test.lua checkpoint_threshold.test.lua
+release_disabled = errinj.test.lua panic_on_lsn_gap.test.lua panic_on_broken_lsn.test.lua checkpoint_threshold.test.lua panic_on_wal_error.test.lua
 config = suite.cfg
 use_unix_sockets = True
 use_unix_sockets_iproto = True
-- 
2.25.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete Georgy Kirichenko
@ 2020-03-19  7:55   ` Konstantin Osipov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Osipov @ 2020-03-19  7:55 UTC (permalink / raw)
  To: Georgy Kirichenko, sergepetrenko; +Cc: tarantool-patches

Hello,

* Georgy Kirichenko <georgy@tarantool.org> [20/02/12 13:09]:

> recovery_stop_local() raises an exception in case of a recovery
> error, so it is not safe to stop recovery inside recovery_delete()
> or inside the guard in local_recovery(). So call
> recovery_stop_local() manually.

I suggest you try/catch the exception in recovery_delete() instead,
or add a nothrow flag to recovery_stop_local(), or move
diag_raise() out of recovery_stop_local() and make sure
recovery_stop_local() returns 0 or -1.
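
A minimal sketch of the first option, keeping the names from the
quoted patch; treat it as an illustration rather than the final fix
(the catch-by-pointer style matches the surrounding code):

void
recovery_delete(struct recovery *r)
{
	/*
	 * recovery_stop_local() may throw, and a delete routine
	 * must not, so log and swallow any error here.
	 */
	try {
		recovery_stop_local(r);
	} catch (Exception *e) {
		e->log();
	}
	trigger_destroy(&r->on_close_log);
	xdir_destroy(&r->wal_dir);
	if (xlog_cursor_is_open(&r->cursor))
		xlog_cursor_close(&r->cursor, false);
	free(r);
}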

It would be nice to explain in the changeset comment the exact event
chain that leads to an exception from recovery_stop_local() - and,
even better, to add a test.

> Part of #980
> ---
>  src/box/box.cc      | 4 +++-
>  src/box/recovery.cc | 2 +-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 1b2b27d61..68038df18 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2238,8 +2238,10 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  		recovery_follow_local(recovery, &wal_stream.base, "hot_standby",
>  				      cfg_getd("wal_dir_rescan_delay"));
>  		while (true) {
> -			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock))
> +			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock)) {
> +				recovery_stop_local(recovery);
>  				diag_raise();
> +			}
>  			if (wal_dir_lock >= 0)
>  				break;
>  			fiber_sleep(0.1);
> diff --git a/src/box/recovery.cc b/src/box/recovery.cc
> index 64aa467b1..a1ac2d967 100644
> --- a/src/box/recovery.cc
> +++ b/src/box/recovery.cc
> @@ -216,7 +216,7 @@ gap_error:
>  void
>  recovery_delete(struct recovery *r)
>  {
> -	recovery_stop_local(r);
> +	assert(r->watcher == NULL);
>  
>  	trigger_destroy(&r->on_close_log);
>  	xdir_destroy(&r->wal_dir);
> -- 
> 2.25.0

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error Georgy Kirichenko
@ 2020-03-19  7:56   ` Konstantin Osipov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Osipov @ 2020-03-19  7:56 UTC (permalink / raw)
  To: Georgy Kirichenko, sergepetrenko; +Cc: tarantool-patches

* Georgy Kirichenko <georgy@tarantool.org> [20/02/12 13:09]:
> Relaying from the C-written WAL requires recovery to be C-compliant,
> so get rid of exceptions in the recovery interface.

LGTM, but please solicit another review.

Let's not cook this any longer. Sergey, it would be really great
if you could finish this patch and push it.

Thanks!
> 
> Part of #980
> ---
>  src/box/box.cc      | 19 ++++++++--
>  src/box/recovery.cc | 89 ++++++++++++++++++++++++++-------------------
>  src/box/recovery.h  | 14 +++----
>  src/box/relay.cc    | 15 ++++----
>  4 files changed, 82 insertions(+), 55 deletions(-)
> 
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 68038df18..611100b8b 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -2166,6 +2166,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  	recovery = recovery_new(cfg_gets("wal_dir"),
>  				cfg_geti("force_recovery"),
>  				checkpoint_vclock);
> +	if (recovery == NULL)
> +		diag_raise();
>  
>  	/*
>  	 * Make sure we report the actual recovery position
> @@ -2183,7 +2185,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  	 * so we must reflect this in replicaset vclock to
>  	 * not attempt to apply these rows twice.
>  	 */
> -	recovery_scan(recovery, &replicaset.vclock, &gc.vclock);
> +	if (recovery_scan(recovery, &replicaset.vclock, &gc.vclock) != 0)
> +		diag_raise();
>  	say_info("instance vclock %s", vclock_to_string(&replicaset.vclock));
>  
>  	if (wal_dir_lock >= 0) {
> @@ -2226,7 +2229,8 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  	memtx_engine_recover_snapshot_xc(memtx, checkpoint_vclock);
>  
>  	engine_begin_final_recovery_xc();
> -	recover_remaining_wals(recovery, &wal_stream.base, NULL, false);
> +	if (recover_remaining_wals(recovery, &wal_stream.base, NULL, false) != 0)
> +		diag_raise();
>  	engine_end_recovery_xc();
>  	/*
>  	 * Leave hot standby mode, if any, only after
> @@ -2239,6 +2243,10 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  				      cfg_getd("wal_dir_rescan_delay"));
>  		while (true) {
>  			if (path_lock(cfg_gets("wal_dir"), &wal_dir_lock)) {
> +				/*
> +				 * Let recovery_stop_local override
> +				 * a path_lock error.
> +				 */
>  				recovery_stop_local(recovery);
>  				diag_raise();
>  			}
> @@ -2246,8 +2254,11 @@ local_recovery(const struct tt_uuid *instance_uuid,
>  				break;
>  			fiber_sleep(0.1);
>  		}
> -		recovery_stop_local(recovery);
> -		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
> +		if (recovery_stop_local(recovery) != 0)
> +			diag_raise();
> +		if (recover_remaining_wals(recovery, &wal_stream.base, NULL,
> +					   true) != 0)
> +			diag_raise();
>  		/*
>  		 * Advance replica set vclock to reflect records
>  		 * applied in hot standby mode.
> diff --git a/src/box/recovery.cc b/src/box/recovery.cc
> index a1ac2d967..e4aad1296 100644
> --- a/src/box/recovery.cc
> +++ b/src/box/recovery.cc
> @@ -87,14 +87,11 @@ recovery_new(const char *wal_dirname, bool force_recovery,
>  			calloc(1, sizeof(*r));
>  
>  	if (r == NULL) {
> -		tnt_raise(OutOfMemory, sizeof(*r), "malloc",
> -			  "struct recovery");
> +		diag_set(OutOfMemory, sizeof(*r), "malloc",
> +			 "struct recovery");
> +		return NULL;
>  	}
>  
> -	auto guard = make_scoped_guard([=]{
> -		free(r);
> -	});
> -
>  	xdir_create(&r->wal_dir, wal_dirname, XLOG, &INSTANCE_UUID,
>  		    &xlog_opts_default);
>  	r->wal_dir.force_recovery = force_recovery;
> @@ -108,27 +105,31 @@ recovery_new(const char *wal_dirname, bool force_recovery,
>  	 * UUID, see replication/cluster.test for
>  	 * details.
>  	 */
> -	xdir_check_xc(&r->wal_dir);
> +	if (xdir_check(&r->wal_dir) != 0) {
> +		xdir_destroy(&r->wal_dir);
> +		free(r);
> +		return NULL;
> +	}
>  
>  	r->watcher = NULL;
>  	rlist_create(&r->on_close_log);
>  
> -	guard.is_active = false;
>  	return r;
>  }
>  
> -void
> +int
>  recovery_scan(struct recovery *r, struct vclock *end_vclock,
>  	      struct vclock *gc_vclock)
>  {
> -	xdir_scan_xc(&r->wal_dir);
> +	if (xdir_scan(&r->wal_dir) != 0)
> +		return -1;
>  
>  	if (xdir_last_vclock(&r->wal_dir, end_vclock) < 0 ||
>  	    vclock_compare(end_vclock, &r->vclock) < 0) {
>  		/* No xlogs after last checkpoint. */
>  		vclock_copy(gc_vclock, &r->vclock);
>  		vclock_copy(end_vclock, &r->vclock);
> -		return;
> +		return 0;
>  	}
>  
>  	if (xdir_first_vclock(&r->wal_dir, gc_vclock) < 0)
> @@ -137,11 +138,12 @@ recovery_scan(struct recovery *r, struct vclock *end_vclock,
>  	/* Scan the last xlog to find end vclock. */
>  	struct xlog_cursor cursor;
>  	if (xdir_open_cursor(&r->wal_dir, vclock_sum(end_vclock), &cursor) != 0)
> -		return;
> +		return 0;
>  	struct xrow_header row;
>  	while (xlog_cursor_next(&cursor, &row, true) == 0)
>  		vclock_follow_xrow(end_vclock, &row);
>  	xlog_cursor_close(&cursor, false);
> +	return 0;
>  }
>  
>  static inline void
> @@ -156,19 +158,21 @@ recovery_close_log(struct recovery *r)
>  			 r->cursor.name);
>  	}
>  	xlog_cursor_close(&r->cursor, false);
> -	trigger_run_xc(&r->on_close_log, NULL);
> +	/* Suppress a trigger error if happened. */
> +	trigger_run(&r->on_close_log, NULL);
>  }
>  
> -static void
> +static int
>  recovery_open_log(struct recovery *r, const struct vclock *vclock)
>  {
> -	XlogGapError *e;
>  	struct xlog_meta meta = r->cursor.meta;
>  	enum xlog_cursor_state state = r->cursor.state;
>  
>  	recovery_close_log(r);
>  
> -	xdir_open_cursor_xc(&r->wal_dir, vclock_sum(vclock), &r->cursor);
> +	if (xdir_open_cursor(&r->wal_dir, vclock_sum(vclock),
> +			     &r->cursor) != 0)
> +		return -1;
>  
>  	if (state == XLOG_CURSOR_NEW &&
>  	    vclock_compare(vclock, &r->vclock) > 0) {
> @@ -201,14 +205,14 @@ out:
>  	 */
>  	if (vclock_compare(&r->vclock, vclock) < 0)
>  		vclock_copy(&r->vclock, vclock);
> -	return;
> +	return 0;
>  
>  gap_error:
> -	e = tnt_error(XlogGapError, &r->vclock, vclock);
> +	diag_set(XlogGapError, &r->vclock, vclock);
>  	if (!r->wal_dir.force_recovery)
> -		throw e;
> +		return -1;
>  	/* Ignore missing WALs if force_recovery is set. */
> -	e->log();
> +	diag_log();
>  	say_warn("ignoring a gap in LSN");
>  	goto out;
>  }
> @@ -217,7 +221,6 @@ void
>  recovery_delete(struct recovery *r)
>  {
>  	assert(r->watcher == NULL);
> -
>  	trigger_destroy(&r->on_close_log);
>  	xdir_destroy(&r->wal_dir);
>  	if (xlog_cursor_is_open(&r->cursor)) {
> @@ -237,25 +240,26 @@ recovery_delete(struct recovery *r)
>   * The reading will be stopped on reaching stop_vclock.
>   * Use NULL for boundless recover
>   */
> -static void
> +static int
>  recover_xlog(struct recovery *r, struct xstream *stream,
>  	     const struct vclock *stop_vclock)
>  {
>  	struct xrow_header row;
>  	uint64_t row_count = 0;
> -	while (xlog_cursor_next_xc(&r->cursor, &row,
> -				   r->wal_dir.force_recovery) == 0) {
> +	int rc;
> +	while ((rc = xlog_cursor_next(&r->cursor, &row,
> +				      r->wal_dir.force_recovery)) == 0) {
>  		/*
>  		 * Read the next row from xlog file.
>  		 *
> -		 * xlog_cursor_next_xc() returns 1 when
> +		 * xlog_cursor_next() returns 1 when
>  		 * it can not read more rows. This doesn't mean
>  		 * the file is fully read: it's fully read only
>  		 * when EOF marker has been read, see i.eof_read
>  		 */
>  		if (stop_vclock != NULL &&
>  		    r->vclock.signature >= stop_vclock->signature)
> -			return;
> +			return 0;
>  		int64_t current_lsn = vclock_get(&r->vclock, row.replica_id);
>  		if (row.lsn <= current_lsn)
>  			continue; /* already applied, skip */
> @@ -282,13 +286,16 @@ recover_xlog(struct recovery *r, struct xstream *stream,
>  					 row_count / 1000000.);
>  		} else {
>  			if (!r->wal_dir.force_recovery)
> -				diag_raise();
> +				return -1;
>  
>  			say_error("skipping row {%u: %lld}",
>  				  (unsigned)row.replica_id, (long long)row.lsn);
>  			diag_log();
>  		}
>  	}
> +	if (rc < 0)
> +		return -1;
> +	return 0;
>  }
>  
>  /**
> @@ -302,14 +309,14 @@ recover_xlog(struct recovery *r, struct xstream *stream,
>   * This function will not close r->current_wal if
>   * recovery was successful.
>   */
> -void
> +int
>  recover_remaining_wals(struct recovery *r, struct xstream *stream,
>  		       const struct vclock *stop_vclock, bool scan_dir)
>  {
>  	struct vclock *clock;
>  
> -	if (scan_dir)
> -		xdir_scan_xc(&r->wal_dir);
> +	if (scan_dir && xdir_scan(&r->wal_dir) != 0)
> +		return -1;
>  
>  	if (xlog_cursor_is_open(&r->cursor)) {
>  		/* If there's a WAL open, recover from it first. */
> @@ -343,21 +350,26 @@ recover_remaining_wals(struct recovery *r, struct xstream *stream,
>  			continue;
>  		}
>  
> -		recovery_open_log(r, clock);
> +		if (recovery_open_log(r, clock) != 0)
> +			return -1;
>  
>  		say_info("recover from `%s'", r->cursor.name);
>  
>  recover_current_wal:
> -		recover_xlog(r, stream, stop_vclock);
> +		if (recover_xlog(r, stream, stop_vclock) != 0)
> +			return -1;
>  	}
>  
>  	if (xlog_cursor_is_eof(&r->cursor))
>  		recovery_close_log(r);
>  
> -	if (stop_vclock != NULL && vclock_compare(&r->vclock, stop_vclock) != 0)
> -		tnt_raise(XlogGapError, &r->vclock, stop_vclock);
> +	if (stop_vclock != NULL && vclock_compare(&r->vclock, stop_vclock) != 0) {
> +		diag_set(XlogGapError, &r->vclock, stop_vclock);
> +		return -1;
> +	}
>  
>  	region_free(&fiber()->gc);
> +	return 0;
>  }
>  
>  void
> @@ -481,7 +493,9 @@ hot_standby_f(va_list ap)
>  		do {
>  			start = vclock_sum(&r->vclock);
>  
> -			recover_remaining_wals(r, stream, NULL, scan_dir);
> +			if (recover_remaining_wals(r, stream, NULL,
> +						   scan_dir) != 0)
> +				diag_raise();
>  
>  			end = vclock_sum(&r->vclock);
>  			/*
> @@ -529,7 +543,7 @@ recovery_follow_local(struct recovery *r, struct xstream *stream,
>  	fiber_start(r->watcher, r, stream, wal_dir_rescan_delay);
>  }
>  
> -void
> +int
>  recovery_stop_local(struct recovery *r)
>  {
>  	if (r->watcher) {
> @@ -537,8 +551,9 @@ recovery_stop_local(struct recovery *r)
>  		r->watcher = NULL;
>  		fiber_cancel(f);
>  		if (fiber_join(f) != 0)
> -			diag_raise();
> +			return -1;
>  	}
> +	return 0;
>  }
>  
>  /* }}} */
> diff --git a/src/box/recovery.h b/src/box/recovery.h
> index 6e68abc0b..145d9199e 100644
> --- a/src/box/recovery.h
> +++ b/src/box/recovery.h
> @@ -74,7 +74,7 @@ recovery_delete(struct recovery *r);
>   * @gc_vclock is set to the oldest vclock available in the
>   * WAL directory.
>   */
> -void
> +int
>  recovery_scan(struct recovery *r,  struct vclock *end_vclock,
>  	      struct vclock *gc_vclock);
>  
> @@ -82,16 +82,12 @@ void
>  recovery_follow_local(struct recovery *r, struct xstream *stream,
>  		      const char *name, ev_tstamp wal_dir_rescan_delay);
>  
> -void
> +int
>  recovery_stop_local(struct recovery *r);
>  
>  void
>  recovery_finalize(struct recovery *r);
>  
> -#if defined(__cplusplus)
> -} /* extern "C" */
> -#endif /* defined(__cplusplus) */
> -
>  /**
>   * Find out if there are new .xlog files since the current
>   * vclock, and read them all up.
> @@ -102,8 +98,12 @@ recovery_finalize(struct recovery *r);
>   * This function will not close r->current_wal if
>   * recovery was successful.
>   */
> -void
> +int
>  recover_remaining_wals(struct recovery *r, struct xstream *stream,
>  		       const struct vclock *stop_vclock, bool scan_dir);
>  
> +#if defined(__cplusplus)
> +} /* extern "C" */
> +#endif /* defined(__cplusplus) */
> +
>  #endif /* TARANTOOL_RECOVERY_H_INCLUDED */
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index b89632273..d5a1c9c68 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -334,8 +334,9 @@ relay_final_join_f(va_list ap)
>  
>  	/* Send all WALs until stop_vclock */
>  	assert(relay->stream.write != NULL);
> -	recover_remaining_wals(relay->r, &relay->stream,
> -			       &relay->stop_vclock, true);
> +	if (recover_remaining_wals(relay->r, &relay->stream,
> +				   &relay->stop_vclock, true) != 0)
> +		diag_raise();
>  	assert(vclock_compare(&relay->r->vclock, &relay->stop_vclock) == 0);
>  	return 0;
>  }
> @@ -491,11 +492,9 @@ relay_process_wal_event(struct wal_watcher *watcher, unsigned events)
>  		 */
>  		return;
>  	}
> -	try {
> -		recover_remaining_wals(relay->r, &relay->stream, NULL,
> -				       (events & WAL_EVENT_ROTATE) != 0);
> -	} catch (Exception *e) {
> -		relay_set_error(relay, e);
> +	if (recover_remaining_wals(relay->r, &relay->stream, NULL,
> +				   (events & WAL_EVENT_ROTATE) != 0) != 0) {
> +		relay_set_error(relay, diag_last_error(diag_get()));
>  		fiber_cancel(fiber());
>  	}
>  }
> @@ -702,6 +701,8 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
>  	vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);
>  	relay->r = recovery_new(cfg_gets("wal_dir"), false,
>  			        replica_clock);
> +	if (relay->r == NULL)
> +		diag_raise();
>  	vclock_copy(&relay->tx.vclock, replica_clock);
>  	relay->version_id = replica_version_id;
>  
> -- 
> 2.25.0

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio Georgy Kirichenko
@ 2020-03-19 18:09   ` Konstantin Osipov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Osipov @ 2020-03-19 18:09 UTC (permalink / raw)
  To: Georgy Kirichenko; +Cc: tarantool-patches

* Georgy Kirichenko <georgy@tarantool.org> [20/02/12 13:09]:
> Simultaneous usage of one coio by two or more fibers could lead to
> undefined behavior, as the coio routines replace the awaiting fiber
> (a data member) and stop the watcher regardless of whether the coio
> object has any other users. Such behavior could lead to an invalid
> applier stream, issue #4040.
> The proposal is to forbid reuse of an active coio by returning a
> fake EINPROGRESS error.

I am not aware of any cases when coio is used by multiple fibers:
it's a violation of the coio contract.

If you suspect it is violated in some cases, please add the
scenario description to the commit comment. Then, to better
identify the case, your patch should add an assert in debug mode.
For release mode, I suggest we set a clearer error than EINPROGRESS,
so that it is easy to spot in the log file and act upon.
Something like ERR_OPEN_A_BUG.
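
A minimal sketch of that debug/release split, reusing the shape of
check_coio_in_use() from the patch below; the error message is only a
placeholder:

#include <assert.h>

static inline void
check_coio_in_use(struct ev_io *coio)
{
	/* Using one coio from two fibers is a bug: trap it in debug. */
	assert(!ev_is_active(coio));
	if (ev_is_active(coio)) {
		/* In release, fail with an unambiguous, greppable error. */
		tnt_raise(SocketError, sio_socketname(coio->fd),
			  "coio is already in use by another fiber (bug)");
	}
}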

> 
> Part of #980
> ---
>  src/lib/core/coio.cc | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/src/lib/core/coio.cc b/src/lib/core/coio.cc
> index e88d724d5..faa7e5bd5 100644
> --- a/src/lib/core/coio.cc
> +++ b/src/lib/core/coio.cc
> @@ -238,6 +238,17 @@ coio_connect_timeout(struct ev_io *coio, struct uri *uri, struct sockaddr *addr,
>  	tnt_raise(SocketError, sio_socketname(coio->fd), "connection failed");
>  }
>  
> +/* Do not allow to reuse coio by different fiber. */
> +static inline void
> +check_coio_in_use(struct ev_io *coio)
> +{
> +	if (ev_is_active(coio)) {
> +		errno = EINPROGRESS;
> +		tnt_raise(SocketError, sio_socketname(coio->fd),
> +			  "already in use");
> +	}
> +}
> +
>  /**
>   * Wait a client connection on a server socket until
>   * timedout.
> @@ -249,6 +260,7 @@ coio_accept(struct ev_io *coio, struct sockaddr *addr,
>  	ev_tstamp start, delay;
>  	coio_timeout_init(&start, &delay, timeout);
>  
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	while (true) {
> @@ -302,6 +314,7 @@ coio_read_ahead_timeout(struct ev_io *coio, void *buf, size_t sz,
>  
>  	ssize_t to_read = (ssize_t) sz;
>  
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	while (true) {
> @@ -399,6 +412,7 @@ coio_write_timeout(struct ev_io *coio, const void *buf, size_t sz,
>  	ev_tstamp start, delay;
>  	coio_timeout_init(&start, &delay, timeout);
>  
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	while (true) {
> @@ -461,6 +475,7 @@ coio_writev_timeout(struct ev_io *coio, struct iovec *iov, int iovcnt,
>  	struct iovec *end = iov + iovcnt;
>  	ev_tstamp start, delay;
>  	coio_timeout_init(&start, &delay, timeout);
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	/* Avoid a syscall in case of 0 iovcnt. */
> @@ -518,6 +533,7 @@ coio_sendto_timeout(struct ev_io *coio, const void *buf, size_t sz, int flags,
>  	ev_tstamp start, delay;
>  	coio_timeout_init(&start, &delay, timeout);
>  
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	while (true) {
> @@ -563,6 +579,7 @@ coio_recvfrom_timeout(struct ev_io *coio, void *buf, size_t sz, int flags,
>  	ev_tstamp start, delay;
>  	coio_timeout_init(&start, &delay, timeout);
>  
> +	check_coio_in_use(coio);
>  	CoioGuard coio_guard(coio);
>  
>  	while (true) {
> -- 
> 2.25.0

-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring
  2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring Georgy Kirichenko
@ 2020-03-23  6:59   ` Konstantin Osipov
  0 siblings, 0 replies; 16+ messages in thread
From: Konstantin Osipov @ 2020-03-23  6:59 UTC (permalink / raw)
  To: Georgy Kirichenko; +Cc: tarantool-patches

* Georgy Kirichenko <georgy@tarantool.org> [20/02/12 13:09]:
> Relaying from the C-written WAL requires coio and xrow_io to be
> C-compliant, so get rid of exceptions in the coio interface.
> This patch also includes some minor refactoring (the code looks
> ugly without it):
>  1. Get rid of the unused size_hint from coio_writev_timeout.
>  2. Handle a partial read/write before the yield loop.
>  3. Do not reset errno to 0 when reading EOF.
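
For item 2, the shape could be roughly as follows; this is only a
sketch, with coio_wait_read() as a hypothetical yield helper and the
fd assumed non-blocking, as it is in coio. It also shows item 3's
point: errno stays untouched when EOF is hit.

#include <errno.h>
#include <unistd.h>

static ssize_t
read_ahead(int fd, char *buf, size_t sz)
{
	size_t done = 0;
	for (;;) {
		/* Consume whatever is readable right away, no yield yet. */
		ssize_t n = read(fd, buf + done, sz - done);
		if (n > 0) {
			done += n;
			if (done >= sz)
				return done;
			continue;
		}
		if (n == 0)
			return done; /* EOF: leave errno untouched. */
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;
		/* Partial data is already consumed; only now yield. */
		coio_wait_read(fd); /* hypothetical helper */
	}
}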

This patch is LGTM but requires a more careful second review.


-- 
Konstantin Osipov, Moscow, Russia

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-03-23  6:59 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-12  9:39 [Tarantool-patches] [PATCH v4 00/11] Replication from memory Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 01/11] recovery: do not call recovery_stop_local inside recovery_delete Georgy Kirichenko
2020-03-19  7:55   ` Konstantin Osipov
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 02/11] recovery: do not throw an error Georgy Kirichenko
2020-03-19  7:56   ` Konstantin Osipov
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 03/11] coio: do not allow parallel usage of coio Georgy Kirichenko
2020-03-19 18:09   ` Konstantin Osipov
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 04/11] coio: do not throw an error, minor refactoring Georgy Kirichenko
2020-03-23  6:59   ` Konstantin Osipov
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 05/11] xstream: get rid of an exception Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 06/11] wal: extract log write batch into a separate routine Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 07/11] wal: matrix clock structure Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 08/11] wal: track relay vclock and collect logs in wal thread Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 09/11] wal: xrow memory buffer and cursor Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 10/11] wal: use a xrow buffer object for entry encoding Georgy Kirichenko
2020-02-12  9:39 ` [Tarantool-patches] [PATCH v4 11/11] replication: use wal memory buffer to fetch rows Georgy Kirichenko
