Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
@ 2021-04-14 14:17 Serge Petrenko via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
                   ` (11 more replies)
  0 siblings, 12 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Changes in v3:
  - fix gh-5445-leader-inconsistency.test.lua flakiness
  - fixes as per review from Cyrill Gorcunov
  - minor fixes and rewordings
  - rebased on top of current master
  - added patch 9/10 (remove parameter from clear_synchro_queue)

Changes in v2:
  - Added tests for patches 1, 6, 9
  - Minor typo fixes and bugfixes.

https://github.com/tarantool/tarantool/tree/sp/gh-5445-election-fixes
https://github.com/tarantool/tarantool/issues/5445
https://github.com/tarantool/tarantool/issues/3055

Serge Petrenko (10):
  wal: enrich row's meta information with sync replication flags
  xrow: introduce a PROMOTE entry
  box: actualise iproto_key_type array
  box: make clear_synchro_queue() write a PROMOTE entry instead of
    CONFIRM + ROLLBACK
  box: write PROMOTE even for empty limbo
  raft: keep track of greatest known term and filter replication sources
    based on that
  replication: introduce a new election mode: "manual"
  Support manual elections in `box.ctl.clear_synchro_queue()`
  box: remove parameter from clear_synchro_queue
  box.ctl: rename clear_synchro_queue to promote

 changelogs/unreleased/box-ctl-promote.md      |   8 +
 ...very => qsync-multi-statement-recovery.md} |   0
 changelogs/unreleased/raft-promote.md         |   4 +
 src/box/applier.cc                            |  28 ++
 src/box/box.cc                                | 165 ++++++++--
 src/box/box.h                                 |   2 +-
 src/box/errcode.h                             |   3 +
 src/box/iproto_constants.c                    |  58 ++++
 src/box/iproto_constants.h                    |  31 +-
 src/box/journal.h                             |   2 +
 src/box/lua/ctl.c                             |   8 +-
 src/box/raft.c                                |  37 ++-
 src/box/raft.h                                |  20 ++
 src/box/txn.c                                 |   9 +
 src/box/txn_limbo.c                           |  81 ++---
 src/box/txn_limbo.h                           |  17 +-
 src/box/wal.c                                 |  24 +-
 src/box/xrow.c                                |  68 +++-
 src/box/xrow.h                                |  64 +++-
 src/lib/raft/raft.c                           |  13 +-
 src/lib/raft/raft.h                           |  48 ++-
 test/box/error.result                         |   3 +
 test/replication/election_basic.result        |  29 +-
 test/replication/election_basic.test.lua      |  10 +
 .../gh-3055-election-promote.result           | 105 +++++++
 .../gh-3055-election-promote.test.lua         |  43 +++
 .../gh-5445-leader-inconsistency.result       | 291 ++++++++++++++++++
 .../gh-5445-leader-inconsistency.test.lua     | 128 ++++++++
 test/replication/suite.cfg                    |   2 +
 test/unit/raft_test_utils.c                   |   4 +-
 test/unit/xrow.cc                             | 104 +++++--
 test/unit/xrow.result                         | 133 +++++++-
 32 files changed, 1380 insertions(+), 162 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md
 create mode 100644 test/replication/gh-3055-election-promote.result
 create mode 100644 test/replication/gh-3055-election-promote.test.lua
 create mode 100644 test/replication/gh-5445-leader-inconsistency.result
 create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:18   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
These flags are set for rows belonging to synchronous transactions in
addition to `is_commit`.

The new flags help to define whether the rows belong to a synchronous
transaction or not without parsing them all and checking whether any of
the rows touches a synchronous space.

This will be used in applier once it is taught to filter synchronous
transactions based on whether they are coming from a raft leader or not.

P.S. These flags will also be useful once we allow to turn any transaction
synchronous. Once this is done, the flags in row header will be the only
source of information on whether the transaction is synchronous or not.

Prerequisite #5445

@TarantoolBot document
Title: new values for IPROTO_FLAGS field

IPROTO_FLAGS bitfield is enriched with two new constant:
IPROTO_FLAG_WAIT_SYNC = 0x02
IPROTO_FLAG_WAIT_ACK = 0x04

IPROTO_FLAG_WAIT_SYNC is set for the last row of a transaction which
cannot be committed immediately: either because it is synchronous or
because it waits for other synchronous transactions to complete.
IPROTO_FLAG_WAIT_ACK is set for the last synchronous transaction row.
---
 src/box/iproto_constants.h |   5 ++
 src/box/journal.h          |   2 +
 src/box/txn.c              |   9 +++
 src/box/wal.c              |  24 ++++---
 src/box/xrow.c             |  13 ++--
 src/box/xrow.h             |  30 ++++++---
 test/unit/xrow.cc          | 104 +++++++++++++++++++++++------
 test/unit/xrow.result      | 133 ++++++++++++++++++++++++++++++++++---
 8 files changed, 263 insertions(+), 57 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index d3738c705..f7f46088f 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -49,9 +49,14 @@ enum {
 	XLOG_FIXHEADER_SIZE = 19
 };
 
+/** IPROTO_FLAGS bitfield constants. */
 enum {
 	/** Set for the last xrow in a transaction. */
 	IPROTO_FLAG_COMMIT = 0x01,
+	/** Set for the last row of a tx residing in limbo. */
+	IPROTO_FLAG_WAIT_SYNC = 0x02,
+	/** Set for the last row of a synchronous tx. */
+	IPROTO_FLAG_WAIT_ACK = 0x04,
 };
 
 enum iproto_key {
diff --git a/src/box/journal.h b/src/box/journal.h
index 76c70c19f..3ce9c869e 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -63,6 +63,7 @@ struct journal_entry {
 	 * A journal entry completion callback argument.
 	 */
 	void *complete_data;
+	uint8_t opt_flags;
 	/**
 	 * Asynchronous write completion function.
 	 */
@@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
 	entry->approx_len	= approx_len;
 	entry->n_rows		= n_rows;
 	entry->res		= -1;
+	entry->opt_flags	= 0;
 }
 
 /**
diff --git a/src/box/txn.c b/src/box/txn.c
index 40061ff09..34e8ce026 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
 		row->lsn = 0;
 		row->sync = 0;
 		row->tm = 0;
+		row->opt_flags = 0;
 	}
 	/*
 	 * Group ID should be set both for requests not having a
@@ -681,6 +682,14 @@ txn_journal_entry_new(struct txn *txn)
 		--req->n_rows;
 	}
 
+	static const uint8_t opt_flags_map[] = {
+		[TXN_WAIT_SYNC] = IPROTO_FLAG_WAIT_SYNC,
+		[TXN_WAIT_ACK] = IPROTO_FLAG_WAIT_ACK,
+	};
+
+	req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_SYNC];
+	req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_ACK];
+
 	return req;
 }
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 34af0bda6..4ec8034a3 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -962,14 +962,14 @@ out:
  */
 static void
 wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
-	       struct xrow_header **row,
-	       struct xrow_header **end)
+	       struct journal_entry *entry)
 {
 	int64_t tsn = 0;
-	struct xrow_header **start = row;
-	struct xrow_header **first_glob_row = row;
+	struct xrow_header **start = entry->rows;
+	struct xrow_header **end = entry->rows + entry->n_rows;
+	struct xrow_header **first_glob_row = entry->rows;
 	/** Assign LSN to all local rows. */
-	for ( ; row < end; row++) {
+	for (struct xrow_header **row = start; row < end; row++) {
 		if ((*row)->replica_id == 0) {
 			/*
 			 * All rows representing local space data
@@ -996,7 +996,11 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 				first_glob_row = row;
 			}
 			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
-			(*row)->is_commit = row == end - 1;
+			/* Tx meta is stored in the last tx row. */
+			if (row == end - 1) {
+				(*row)->opt_flags = entry->opt_flags;
+				(*row)->is_commit = true;
+			}
 		} else {
 			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
 			if (diff <= vclock_get(vclock_diff,
@@ -1020,7 +1024,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	 * the first global row. tsn was yet unknown when those
 	 * rows were processed.
 	 */
-	for (row = start; row < first_glob_row; row++)
+	for (struct xrow_header **row = start; row < first_glob_row; row++)
 		(*row)->tsn = tsn;
 }
 
@@ -1098,8 +1102,7 @@ wal_write_to_disk(struct cmsg *msg)
 	struct journal_entry *entry;
 	struct stailq_entry *last_committed = NULL;
 	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
-		wal_assign_lsn(&vclock_diff, &writer->vclock,
-			       entry->rows, entry->rows + entry->n_rows);
+		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 		entry->res = vclock_sum(&vclock_diff) +
 			     vclock_sum(&writer->vclock);
 		rc = xlog_write_entry(l, entry);
@@ -1319,8 +1322,7 @@ wal_write_none_async(struct journal *journal,
 	struct vclock vclock_diff;
 
 	vclock_create(&vclock_diff);
-	wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
-		       entry->rows + entry->n_rows);
+	wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 	vclock_merge(&writer->vclock, &vclock_diff);
 	vclock_copy(&replicaset.vclock, &writer->vclock);
 	entry->res = vclock_sum(&writer->vclock);
diff --git a/src/box/xrow.c b/src/box/xrow.c
index bc06738ad..cc8e43ed4 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -183,7 +183,7 @@ error:
 			break;
 		case IPROTO_FLAGS:
 			flags = mp_decode_uint(pos);
-			header->is_commit = flags & IPROTO_FLAG_COMMIT;
+			header->opt_flags = flags;
 			break;
 		default:
 			/* unknown header */
@@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 	 *   flag to find transaction boundary (last row in the
 	 *   transaction stream).
 	 */
+	uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
 	if (header->tsn != 0) {
 		if (header->tsn != header->lsn || !header->is_commit) {
 			/*
@@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 			map_size++;
 		}
 		if (header->is_commit && header->tsn != header->lsn) {
-			/* Setup last row for multi row transaction. */
-			d = mp_encode_uint(d, IPROTO_FLAGS);
-			d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
-			map_size++;
+			flags_to_encode |= IPROTO_FLAG_COMMIT;
 		}
 	}
+	if (flags_to_encode != 0) {
+		d = mp_encode_uint(d, IPROTO_FLAGS);
+		d = mp_encode_uint(d, flags_to_encode);
+		map_size++;
+	}
 	assert(d <= data + XROW_HEADER_LEN_MAX);
 	mp_encode_map(data, map_size);
 	out->iov_len = d - (char *) out->iov_base;
diff --git a/src/box/xrow.h b/src/box/xrow.h
index fde8f9474..1bb0964dc 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -80,14 +80,28 @@ struct xrow_header {
 	 * transaction.
 	 */
 	int64_t tsn;
-	/**
-	 * True for the last row in a multi-statement transaction,
-	 * or single-statement transaction. Is only encoded in the
-	 * write ahead log for multi-statement transactions.
-	 * Single-statement transactions do not encode
-	 * tsn and is_commit flag to save space.
-	 */
-	bool is_commit;
+	/** Transaction meta flags set only in the last transaction row. */
+	union {
+		uint8_t opt_flags;
+		struct {
+			/**
+			 * Is only encoded in the write ahead log for
+			 * multi-statement transactions. Single-statement
+			 * transactions do not encode tsn and is_commit flag to
+			 * save space.
+			 */
+			bool is_commit : 1;
+			/**
+			 * True for any transaction that would enter the limbo
+			 * (not necessarily a synchronous one).
+			 */
+			bool wait_sync : 1;
+			/**
+			 * True for a synchronous transaction.
+			 */
+			bool wait_ack  : 1;
+		};
+	};
 
 	int bodycnt;
 	uint32_t schema_version;
diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index 9fd154719..3d7d8bee1 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -204,7 +204,9 @@ test_greeting()
 void
 test_xrow_header_encode_decode()
 {
-	plan(10);
+	/* Test all possible 3-bit combinations. */
+	const int bit_comb_count = 1 << 3;
+	plan(1 + bit_comb_count);
 	struct xrow_header header;
 	char buffer[2048];
 	char *pos = mp_encode_uint(buffer, 300);
@@ -217,27 +219,47 @@ test_xrow_header_encode_decode()
 	header.tm = 123.456;
 	header.bodycnt = 0;
 	header.tsn = header.lsn;
-	header.is_commit = true;
 	uint64_t sync = 100500;
-	struct iovec vec[1];
-	is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
-	int fixheader_len = 200;
-	pos = (char *)vec[0].iov_base + fixheader_len;
-	is(mp_decode_map((const char **)&pos), 5, "header map size");
-
-	struct xrow_header decoded_header;
-	const char *begin = (const char *)vec[0].iov_base;
-	begin += fixheader_len;
-	const char *end = (const char *)vec[0].iov_base;
-	end += vec[0].iov_len;
-	is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
-	   "header decode");
-	is(header.type, decoded_header.type, "decoded type");
-	is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
-	is(header.lsn, decoded_header.lsn, "decoded lsn");
-	is(header.tm, decoded_header.tm, "decoded tm");
-	is(decoded_header.sync, sync, "decoded sync");
-	is(decoded_header.bodycnt, 0, "decoded bodycnt");
+	for (int opt_idx = 0; opt_idx < bit_comb_count; opt_idx++) {
+		plan(12);
+		header.is_commit = opt_idx & 0x01;
+		header.wait_sync = opt_idx >> 1 & 0x01;
+		header.wait_ack = opt_idx >> 2 & 0x01;
+		struct iovec vec[1];
+		is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
+		int fixheader_len = 200;
+		pos = (char *)vec[0].iov_base + fixheader_len;
+		uint32_t exp_map_size = 5;
+		/*
+		 * header.is_commit flag isn't encoded, since this row looks
+		 * like a single-statement transaction.
+		 */
+		if (header.wait_sync || header.wait_ack)
+			exp_map_size += 1;
+		/* tsn is encoded explicitly in this case. */
+		if (!header.is_commit)
+			exp_map_size += 1;
+		uint32_t size = mp_decode_map((const char **)&pos);
+		is(size, exp_map_size, "header map size");
+
+		struct xrow_header decoded_header;
+		const char *begin = (const char *)vec[0].iov_base;
+		begin += fixheader_len;
+		const char *end = (const char *)vec[0].iov_base;
+		end += vec[0].iov_len;
+		is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
+		   "header decode");
+		is(header.is_commit, decoded_header.is_commit, "decoded is_commit");
+		is(header.wait_sync, decoded_header.wait_sync, "decoded wait_sync");
+		is(header.wait_ack, decoded_header.wait_ack, "decoded wait_ack");
+		is(header.type, decoded_header.type, "decoded type");
+		is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
+		is(header.lsn, decoded_header.lsn, "decoded lsn");
+		is(header.tm, decoded_header.tm, "decoded tm");
+		is(decoded_header.sync, sync, "decoded sync");
+		is(decoded_header.bodycnt, 0, "decoded bodycnt");
+		check_plan();
+	}
 
 	check_plan();
 }
@@ -275,12 +297,49 @@ test_request_str()
 	check_plan();
 }
 
+/**
+ * The compiler doesn't have to preserve bitfields order,
+ * still we rely on it for convenience sake.
+ */
+static void
+test_xrow_opt_field()
+{
+	plan(6);
+
+	struct xrow_header header;
+
+	memset(&header, 0, sizeof(header));
+
+	header.is_commit = true;
+	is(header.opt_flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
+	header.is_commit = false;
+
+	header.wait_sync = true;
+	is(header.opt_flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> WAIT_SYNC");
+	header.wait_sync = false;
+
+	header.wait_ack = true;
+	is(header.opt_flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> WAIT_ACK");
+	header.wait_ack = false;
+
+	header.opt_flags = IPROTO_FLAG_COMMIT;
+	ok(header.is_commit && !header.wait_sync && !header.wait_ack, "COMMIT -> header.is_commit");
+
+	header.opt_flags = IPROTO_FLAG_WAIT_SYNC;
+	ok(!header.is_commit && header.wait_sync && !header.wait_ack, "WAIT_SYNC -> header.wait_sync");
+
+	header.opt_flags = IPROTO_FLAG_WAIT_ACK;
+	ok(!header.is_commit && !header.wait_sync && header.wait_ack, "WAIT_ACK -> header.wait_ack");
+
+	check_plan();
+}
+
 int
 main(void)
 {
 	memory_init();
 	fiber_init(fiber_c_invoke);
-	plan(3);
+	plan(4);
 
 	random_init();
 
@@ -288,6 +347,7 @@ main(void)
 	test_greeting();
 	test_xrow_header_encode_decode();
 	test_request_str();
+	test_xrow_opt_field();
 
 	random_free();
 	fiber_free();
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index 5ee92ad7b..3b705d5ba 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -1,4 +1,4 @@
-1..3
+1..4
     1..40
     ok 1 - round trip
     ok 2 - roundtrip.version_id
@@ -41,18 +41,129 @@
     ok 39 - invalid 10
     ok 40 - invalid 11
 ok 1 - subtests
-    1..10
+    1..9
     ok 1 - bad msgpack end
-    ok 2 - encode
-    ok 3 - header map size
-    ok 4 - header decode
-    ok 5 - decoded type
-    ok 6 - decoded replica_id
-    ok 7 - decoded lsn
-    ok 8 - decoded tm
-    ok 9 - decoded sync
-    ok 10 - decoded bodycnt
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 2 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 3 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 4 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 5 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 6 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 7 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 8 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 9 - subtests
 ok 2 - subtests
     1..1
     ok 1 - request_str
 ok 3 - subtests
+    1..6
+    ok 1 - header.is_commit -> COMMIT
+    ok 2 - header.wait_sync -> WAIT_SYNC
+    ok 3 - header.wait_ack -> WAIT_ACK
+    ok 4 - COMMIT -> header.is_commit
+    ok 5 - WAIT_SYNC -> header.wait_sync
+    ok 6 - WAIT_ACK -> header.wait_ack
+ok 4 - subtests
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:19   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 03/10] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A PROMOTE entry combines effect of CONFIRM, ROLLBACK and RAFT_TERM
entries with some additional semantics on top.

PROMOTE carries the following arguments:

1) former_leader_id - the id of previous limbo owner whose entries we
   want to confirm.
2) confirm_lsn - the lsn of the last former leader's transaction to be
   confirmed. In this sense PROMOTE(confirm_lsn) replaces
   CONFIRM(confirm_lsn) + ROLLBACK(confirm_lsn + 1).
3) replica_id - id of the instance issuing
   `box.ctl.clear_synchro_queue()`
4) term - the new term the instance issuing
   `box.ctl.clear_synchro_queue()` has just entered.

This entry will be written to WAL instead of the usual CONFIRM +
ROLLBACK pair on a successful `box.ctl.clear_synchro_queue()` call.

Note, the ususal CONFIRM and ROLLBACK occurrences (after a confirmed or
rolled back synchronous transaction) are here to stay.

Part of #5445
---
 src/box/iproto_constants.h | 26 ++++++++++++++++--
 src/box/xrow.c             | 55 +++++++++++++++++++++++++++++++++-----
 src/box/xrow.h             | 34 ++++++++++++++++++++++-
 3 files changed, 105 insertions(+), 10 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index f7f46088f..7d39b0d61 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -132,6 +132,18 @@ enum iproto_key {
 	IPROTO_REPLICA_ANON = 0x50,
 	IPROTO_ID_FILTER = 0x51,
 	IPROTO_ERROR = 0x52,
+	/**
+	 * Term. Has the same meaning as IPROTO_RAFT_TERM, but is an iproto
+	 * key, rather than a raft key. Used for PROMOTE request, which needs
+	 * both iproto (e.g. REPLICA_ID) and raft (RAFT_TERM) keys.
+	 */
+	IPROTO_TERM = 0x53,
+	/*
+	 * Be careful to not extend iproto_key values over 0x7f.
+	 * iproto_keys are encoded in msgpack as positive fixnum, which ends at
+	 * 0x7f, and we rely on this in some places by allocating a uint8_t to
+	 * hold a msgpack-encoded key value.
+	 */
 	IPROTO_KEY_MAX
 };
 
@@ -226,6 +238,8 @@ enum iproto_type {
 	IPROTO_TYPE_STAT_MAX,
 
 	IPROTO_RAFT = 30,
+	/** PROMOTE request. */
+	IPROTO_PROMOTE = 31,
 
 	/** A confirmation message for synchronous transactions. */
 	IPROTO_CONFIRM = 40,
@@ -340,11 +354,19 @@ dml_request_key_map(uint32_t type)
 	return iproto_body_key_map[type];
 }
 
-/** CONFIRM/ROLLBACK entries for synchronous replication. */
+/** Synchronous replication entries: CONFIRM/ROLLBACK/PROMOTE. */
 static inline bool
 iproto_type_is_synchro_request(uint32_t type)
 {
-	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK;
+	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK ||
+	       type == IPROTO_PROMOTE;
+}
+
+/** PROMOTE entry (synchronous replication and leader elections). */
+static inline bool
+iproto_type_is_promote_request(uint32_t type)
+{
+       return type == IPROTO_PROMOTE;
 }
 
 static inline bool
diff --git a/src/box/xrow.c b/src/box/xrow.c
index cc8e43ed4..5d515ce92 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -884,28 +884,63 @@ xrow_encode_dml(const struct request *request, struct region *region,
 	return iovcnt;
 }
 
-void
-xrow_encode_synchro(struct xrow_header *row,
-		    struct synchro_body_bin *body,
-		    const struct synchro_request *req)
+static void
+xrow_encode_synchro_body(struct synchro_body_bin *body,
+		         const struct synchro_request *req)
 {
 	/*
-	 * A map with two elements. We don't compress
+	 * A map with two or three elements. We don't compress
 	 * numbers to have this structure constant in size,
 	 * which allows us to preallocate it on stack.
 	 */
-	body->m_body = 0x80 | 2;
+	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
 	body->k_replica_id = IPROTO_REPLICA_ID;
 	body->m_replica_id = 0xce;
 	body->v_replica_id = mp_bswap_u32(req->replica_id);
 	body->k_lsn = IPROTO_LSN;
 	body->m_lsn = 0xcf;
 	body->v_lsn = mp_bswap_u64(req->lsn);
+}
+
+void
+xrow_encode_synchro(struct xrow_header *row,
+		    struct synchro_body_bin *body,
+		    const struct synchro_request *req)
+{
+	assert(req->type == IPROTO_CONFIRM || req->type == IPROTO_ROLLBACK);
+
+	xrow_encode_synchro_body(body, req);
 
 	memset(row, 0, sizeof(*row));
+	row->type = req->type;
+	row->body[0].iov_base = body;
+	row->body[0].iov_len = sizeof(*body);
+	row->bodycnt = 1;
+}
+
+static inline void
+xrow_encode_promote_body(struct promote_body_bin *body,
+			 const struct synchro_request *req)
+{
+	xrow_encode_synchro_body(&body->base, req);
+
+	body->k_term = IPROTO_TERM;
+	body->m_term = 0xcf;
+	body->v_term = mp_bswap_u64(req->term);
+}
+
 
+void
+xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+		    const struct synchro_request *req)
+{
+	assert(req->type == IPROTO_PROMOTE);
+
+	xrow_encode_promote_body(body, req);
+
+	memset(row, 0, sizeof(*row));
 	row->type = req->type;
-	row->body[0].iov_base = (void *)body;
+	row->body[0].iov_base = body;
 	row->body[0].iov_len = sizeof(*body);
 	row->bodycnt = 1;
 }
@@ -952,11 +987,17 @@ xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req)
 		case IPROTO_LSN:
 			req->lsn = mp_decode_uint(&d);
 			break;
+		case IPROTO_TERM:
+			req->term = mp_decode_uint(&d);
+			break;
 		default:
 			mp_next(&d);
 		}
 	}
+
 	req->type = row->type;
+	req->origin_id = row->replica_id;
+
 	return 0;
 }
 
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 1bb0964dc..51442f9b6 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -226,7 +226,10 @@ xrow_encode_dml(const struct request *request, struct region *region,
  * pending synchronous transactions.
  */
 struct synchro_request {
-	/** Operation type - IPROTO_ROLLBACK or IPROTO_CONFIRM. */
+	/**
+	 * Operation type - either IPROTO_ROLLBACK or IPROTO_CONFIRM or
+	 * IPROTO_PROMOTE
+	 */
 	uint32_t type;
 	/**
 	 * ID of the instance owning the pending transactions.
@@ -236,14 +239,25 @@ struct synchro_request {
 	 * finish transactions of an old master.
 	 */
 	uint32_t replica_id;
+	/**
+	 * Id of the instance which has issued this request. Only filled on
+	 * decoding, and left blank when encoding a request.
+	 */
+	uint32_t origin_id;
 	/**
 	 * Operation LSN.
 	 * In case of CONFIRM it means 'confirm all
 	 * transactions with lsn <= this value'.
 	 * In case of ROLLBACK it means 'rollback all transactions
 	 * with lsn >= this value'.
+	 * In case of PROMOTE it means CONFIRM(lsn) + ROLLBACK(lsn+1)
 	 */
 	int64_t lsn;
+	/**
+	 * The new term the instance issuing this request is in. Only used for
+	 * PROMOTE request.
+	 */
+	uint64_t term;
 };
 
 /** Synchro request xrow's body in MsgPack format. */
@@ -257,6 +271,14 @@ struct PACKED synchro_body_bin {
 	uint64_t v_lsn;
 };
 
+/** PROMOTE request's xrow body in MsgPack format. */
+struct PACKED promote_body_bin {
+	struct synchro_body_bin base;
+	uint8_t k_term;
+	uint8_t m_term;
+	uint64_t v_term;
+};
+
 /**
  * Encode synchronous replication request.
  * @param row xrow header.
@@ -268,6 +290,16 @@ xrow_encode_synchro(struct xrow_header *row,
 		    struct synchro_body_bin *body,
 		    const struct synchro_request *req);
 
+/**
+ * Encode a promote request.
+ * @param row xrow header.
+ * @param body A place to encode promote body.
+ * @param req Request parameters.
+ */
+void
+xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+		    const struct synchro_request *req);
+
 /**
  * Decode synchronous replication request.
  * @param row xrow header.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 03/10] box: actualise iproto_key_type array
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

iproto_key_type array is used while validating incoming requests, but it
was only half-filled. The last initialized field was 0x2b, while
IPROTO_KEY_MAX is currently 0x54.

We got away with it, since the array is only  used in xrow_header_decode(),
xrow_decode_dml() and xrow_decode_synchro(), and all the keys usually present
in these requests were present in the array. This is not true anymore,
so it's time to make array contents up to date with all the IPROTO_KEY_*
constants we have.

Part of #5445
---
 src/box/iproto_constants.c | 58 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 029d9888c..addda39dc 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -90,6 +90,64 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x2a */	MP_MAP, /* IPROTO_TUPLE_META */
 	/* 0x2b */	MP_MAP, /* IPROTO_OPTIONS */
 	/* }}} */
+
+	/* {{{ unused */
+	/* 0x2c */	MP_UINT,
+	/* 0x2d */	MP_UINT,
+	/* 0x2e */	MP_UINT,
+	/* 0x2f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- response keys */
+	/* 0x30 */	MP_ARRAY, /* IPROTO_DATA */
+	/* 0x31 */	MP_STR, /* IPROTO_ERROR_24 */
+	/* 0x32 */	MP_ARRAY, /* IPROTO_METADATA */
+	/* 0x33 */	MP_ARRAY, /* IPROTO_BIND_METADATA */
+	/* 0x34 */	MP_UINT, /* IIPROTO_BIND_COUNT */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x35 */	MP_UINT,
+	/* 0x36 */	MP_UINT,
+	/* 0x37 */	MP_UINT,
+	/* 0x38 */	MP_UINT,
+	/* 0x39 */	MP_UINT,
+	/* 0x3a */	MP_UINT,
+	/* 0x3b */	MP_UINT,
+	/* 0x3c */	MP_UINT,
+	/* 0x3d */	MP_UINT,
+	/* 0x3e */	MP_UINT,
+	/* 0x3f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- sql keys */
+	/* 0x40 */	MP_STR, /* IPROTO_SQL_TEXT */
+	/* 0x41 */	MP_ARRAY, /* IPROTO_SQL_BIND */
+	/* 0x42 */	MP_MAP, /* IPROTO_SQL_INFO */
+	/* 0x43 */	MP_UINT, /* IPROTO_STMT_ID */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x44 */	MP_UINT,
+	/* 0x45 */	MP_UINT,
+	/* 0x46 */	MP_UINT,
+	/* 0x47 */	MP_UINT,
+	/* 0x48 */	MP_UINT,
+	/* 0x49 */	MP_UINT,
+	/* 0x4a */	MP_UINT,
+	/* 0x4b */	MP_UINT,
+	/* 0x4c */	MP_UINT,
+	/* 0x4d */	MP_UINT,
+	/* 0x4e */	MP_UINT,
+	/* 0x4f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- additional request keys */
+	/* 0x50 */	MP_BOOL, /* IPROTO_REPLICA_ANON */
+	/* 0x51 */	MP_ARRAY, /* IPROTO_ID_FILTER */
+	/* 0x52 */	MP_MAP, /* IPROTO_ERROR */
+	/* 0x53 */	MP_UINT, /* IPROTO_TERM */
+	/* }}} */
 };
 
 const char *iproto_type_strs[] =
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 03/10] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A successful box_clear_synchro_queue() call results in writing
CONFIRM(N) ROLLBACK(N+1) pair, where N is  the confirmed lsn.

Let's write a single PROMOTE(N) entry instead. It'll have  the same
meaning as CONFIRM + ROLLBACK and it will give followers some additional
information regarding leader state change later.

Part of #5445
---
 src/box/box.cc      | 14 +++++++-
 src/box/txn_limbo.c | 81 +++++++++++++++++++++++++--------------------
 src/box/txn_limbo.h | 17 ++++++----
 3 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 70b325180..9adb6ba46 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
-			txn_limbo_force_empty(&txn_limbo, wait_lsn);
+			/*
+			 * Term parameter is unused now, We'll pass
+			 * box_raft()->term there later.
+			 */
+			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			struct synchro_request req = {
+				.type = 0, /* unused */
+				.replica_id = 0, /* unused */
+				.origin_id = instance_id,
+				.lsn = wait_lsn,
+				.term = 0, /* unused */
+			};
+			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
 		}
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index d29722ef7..bfe0ad302 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -317,21 +317,24 @@ txn_limbo_write_cb(struct journal_entry *entry)
 }
 
 static void
-txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
+txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn,
+			uint64_t term)
 {
-	assert(lsn > 0);
+	assert(lsn >= 0);
 
 	struct synchro_request req = {
 		.type		= type,
 		.replica_id	= limbo->owner_id,
 		.lsn		= lsn,
+		.term		= term,
 	};
 
 	/*
-	 * This is a synchronous commit so we can
-	 * allocate everything on a stack.
+	 * This is a synchronous commit so we can allocate everything on a
+	 * stack. Note, that promote body includes synchro body.
 	 */
-	struct synchro_body_bin body;
+	struct promote_body_bin body_bin;
+
 	struct xrow_header row;
 	char buf[sizeof(struct journal_entry) +
 		 sizeof(struct xrow_header *)];
@@ -339,7 +342,10 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
 	struct journal_entry *entry = (struct journal_entry *)buf;
 	entry->rows[0] = &row;
 
-	xrow_encode_synchro(&row, &body, &req);
+	if (type == IPROTO_PROMOTE)
+		xrow_encode_promote(&row, &body_bin, &req);
+	else
+		xrow_encode_synchro(&row, &body_bin.base, &req);
 
 	journal_entry_create(entry, 1, xrow_approx_len(&row),
 			     txn_limbo_write_cb, fiber());
@@ -371,14 +377,14 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->confirmed_lsn = lsn;
-	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn, 0);
 }
 
 /** Confirm all the entries <= @a lsn. */
 static void
 txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
@@ -425,7 +431,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->is_in_rollback = true;
-	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn, 0);
 	limbo->is_in_rollback = false;
 }
 
@@ -433,7 +439,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 static void
 txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	struct txn_limbo_entry *last_rollback = NULL;
@@ -464,6 +470,32 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 		box_update_ro_summary();
 }
 
+void
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
+{
+	limbo->confirmed_lsn = lsn;
+	/*
+	 * We make sure that promote is only written once everything this
+	 * instance has may be confirmed.
+	 */
+	struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
+	assert(e == NULL || e->lsn <= lsn);
+	(void) e;
+	txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
+	limbo->is_in_rollback = false;
+}
+
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req)
+{
+	txn_limbo_read_confirm(limbo, req->lsn);
+	txn_limbo_read_rollback(limbo, req->lsn + 1);
+	assert(txn_limbo_is_empty(&txn_limbo));
+	limbo->owner_id = req->origin_id;
+	limbo->confirmed_lsn = 0;
+}
+
 void
 txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 {
@@ -626,38 +658,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	case IPROTO_ROLLBACK:
 		txn_limbo_read_rollback(limbo, req->lsn);
 		break;
+	case IPROTO_PROMOTE:
+		txn_limbo_read_promote(limbo, req);
+		break;
 	default:
 		unreachable();
 	}
 	return;
 }
 
-void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t confirm_lsn)
-{
-	struct txn_limbo_entry *e, *last_quorum = NULL;
-	struct txn_limbo_entry *rollback = NULL;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
-		if (txn_has_flag(e->txn, TXN_WAIT_ACK)) {
-			if (e->lsn <= confirm_lsn) {
-				last_quorum = e;
-			} else {
-				rollback = e;
-				break;
-			}
-		}
-	}
-
-	if (last_quorum != NULL) {
-		txn_limbo_write_confirm(limbo, last_quorum->lsn);
-		txn_limbo_read_confirm(limbo, last_quorum->lsn);
-	}
-	if (rollback != NULL) {
-		txn_limbo_write_rollback(limbo, rollback->lsn);
-		txn_limbo_read_rollback(limbo, rollback->lsn);
-	}
-}
-
 void
 txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index f2a98c8bb..4a1c43856 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -272,14 +272,19 @@ int
 txn_limbo_wait_confirm(struct txn_limbo *limbo);
 
 /**
- * Make txn_limbo confirm all the entries with lsn less than or
- * equal to the given one, and rollback all the following entries.
- * The function makes txn_limbo write CONFIRM and ROLLBACK
- * messages for appropriate lsns, and then process the messages
- * immediately.
+ * Write a PROMOTE request, which has the same effect as CONFIRM(@a lsn) and
+ * ROLLBACK(@a lsn + 1) combined.
  */
 void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t last_confirm);
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term);
+
+/**
+ * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and rollback all
+ * entries > @req.lsn.
+ */
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req);
 
 /**
  * Update qsync parameters dynamically.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:21   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

PROMOTE entry will be used to mark limbo ownership transition besides
emptying the limbo. So it has to be written every time
`box.ctl.clear_synchro_queue()` succeeds. Even when the limbo was
already empty.

Part of #5445
---
 src/box/box.cc | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 9adb6ba46..722fc23b7 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1502,19 +1502,18 @@ box_clear_synchro_queue(bool try_wait)
 			 "simultaneous invocations");
 		return -1;
 	}
-	/*
-	 * XXX: we may want to write confirm + rollback even when the limbo is
-	 * empty for the sake of limbo ownership transition.
-	 */
-	if (!is_box_configured || txn_limbo_is_empty(&txn_limbo))
+
+	if (!is_box_configured)
 		return 0;
 	uint32_t former_leader_id = txn_limbo.owner_id;
-	assert(former_leader_id != REPLICA_ID_NIL);
-	if (former_leader_id == instance_id)
-		return 0;
-
+	int64_t wait_lsn = txn_limbo.confirmed_lsn;
+	int rc = 0;
+	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (txn_limbo_is_empty(&txn_limbo))
+		goto promote;
+
 	if (try_wait) {
 		/* Wait until pending confirmations/rollbacks reach us. */
 		double timeout = 2 * replication_synchro_timeout;
@@ -1528,8 +1527,11 @@ box_clear_synchro_queue(bool try_wait)
 		 * Our mission was to clear the limbo from former leader's
 		 * transactions. Exit in case someone did that for us.
 		 */
-		if (txn_limbo_is_empty(&txn_limbo) ||
-		    former_leader_id != txn_limbo.owner_id) {
+		if (former_leader_id != txn_limbo.owner_id) {
+			/*
+			 * TODO: error once we see someone else has become the
+			 * leader already.
+			 */
 			in_clear_synchro_queue = false;
 			return 0;
 		}
@@ -1540,11 +1542,10 @@ box_clear_synchro_queue(bool try_wait)
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
-	int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
+	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
 	assert(wait_lsn > 0);
 
-	int quorum = replication_synchro_quorum;
-	int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
+	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
 				 replication_synchro_timeout);
 	if (rc == 0) {
 		if (quorum < replication_synchro_quorum) {
@@ -1556,6 +1557,7 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
+promote:
 			/*
 			 * Term parameter is unused now, We'll pass
 			 * box_raft()->term there later.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Start writing the actual leader term together with the PROMOTE request
and process terms in PROMOTE requests on receiver side.

Make applier only apply synchronous transactions from the instance which
has the greatest term as received in PROMOTE requests.

Closes #5445
---
 ...very => qsync-multi-statement-recovery.md} |   0
 changelogs/unreleased/raft-promote.md         |   4 +
 src/box/applier.cc                            |  28 ++
 src/box/box.cc                                |  24 +-
 src/lib/raft/raft.c                           |   1 +
 src/lib/raft/raft.h                           |  46 +++
 .../gh-5445-leader-inconsistency.result       | 291 ++++++++++++++++++
 .../gh-5445-leader-inconsistency.test.lua     | 128 ++++++++
 test/replication/suite.cfg                    |   1 +
 9 files changed, 516 insertions(+), 7 deletions(-)
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md
 create mode 100644 test/replication/gh-5445-leader-inconsistency.result
 create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua

diff --git a/changelogs/unreleased/qsync-multi-statement-recovery b/changelogs/unreleased/qsync-multi-statement-recovery.md
similarity index 100%
rename from changelogs/unreleased/qsync-multi-statement-recovery
rename to changelogs/unreleased/qsync-multi-statement-recovery.md
diff --git a/changelogs/unreleased/raft-promote.md b/changelogs/unreleased/raft-promote.md
new file mode 100644
index 000000000..e5dac599c
--- /dev/null
+++ b/changelogs/unreleased/raft-promote.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a bug in synchronous replication when rolled back transactions could
+  reappear once a sufficiently old instance reconnected (gh-5445).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 4898f9f7b..3fb864686 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -790,6 +790,12 @@ apply_synchro_row_cb(struct journal_entry *entry)
 		applier_rollback_by_wal_io();
 	} else {
 		txn_limbo_process(&txn_limbo, synchro_entry->req);
+		if (iproto_type_is_promote_request(synchro_entry->req->type)) {
+			raft_source_update_term(box_raft(),
+						synchro_entry->req->origin_id,
+						synchro_entry->req->term);
+
+		}
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
 	/* The fiber is the same on final join. */
@@ -1027,6 +1033,28 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		}
 	}
 
+	/*
+	 * When elections are enabled we must filter out synchronous rows coming
+	 * from an instance that fell behind the current leader. This includes
+	 * both synchronous tx rows and rows for txs following unconfirmed
+	 * synchronous transactions.
+	 * The rows are replaced with NOPs to preserve the vclock consistency.
+	 */
+	struct applier_tx_row *item;
+	if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&
+	    (last_row->wait_sync ||
+	     (iproto_type_is_synchro_request(first_row->type) &&
+	     !iproto_type_is_promote_request(first_row->type)))) {
+		stailq_foreach_entry(item, rows, next) {
+			struct xrow_header *row = &item->row;
+			row->type = IPROTO_NOP;
+			/*
+			 * Row body is saved to fiber's region and will be freed
+			 * on next fiber_gc() call.
+			 */
+			row->bodycnt = 0;
+		}
+	}
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		/*
 		 * Synchro messages are not transactions, in terms
diff --git a/src/box/box.cc b/src/box/box.cc
index 722fc23b7..f44dd0e54 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
 		return -1;
 	}
 	txn_limbo_process(&txn_limbo, &syn_req);
+	if (syn_req.type == IPROTO_PROMOTE) {
+			raft_source_update_term(box_raft(), syn_req.origin_id,
+						syn_req.term);
+	}
 	return 0;
 }
 
@@ -1503,7 +1507,12 @@ box_clear_synchro_queue(bool try_wait)
 		return -1;
 	}
 
-	if (!is_box_configured)
+	/*
+	 * Do nothing when box isn't configured and when PROMOTE was already
+	 * written for this term.
+	 */
+	if (!is_box_configured ||
+	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
 		return 0;
 	uint32_t former_leader_id = txn_limbo.owner_id;
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
@@ -1558,20 +1567,21 @@ box_clear_synchro_queue(bool try_wait)
 			rc = -1;
 		} else {
 promote:
-			/*
-			 * Term parameter is unused now, We'll pass
-			 * box_raft()->term there later.
-			 */
-			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			/* We cannot possibly get here in a volatile state. */
+			assert(box_raft()->volatile_term == box_raft()->term);
+			txn_limbo_write_promote(&txn_limbo, wait_lsn,
+						box_raft()->term);
 			struct synchro_request req = {
 				.type = 0, /* unused */
 				.replica_id = 0, /* unused */
 				.origin_id = instance_id,
 				.lsn = wait_lsn,
-				.term = 0, /* unused */
+				.term = box_raft()->term,
 			};
 			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
+			raft_source_update_term(box_raft(), req.origin_id,
+						req.term);
 		}
 	}
 	in_clear_synchro_queue = false;
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 4ea4fc3f8..e9ce8cade 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -985,6 +985,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
 		.death_timeout = 5,
 		.vtab = vtab,
 	};
+	vclock_create(&raft->term_map);
 	raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
 			   0, 0);
 	raft->timer.data = raft;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e447f6634..01f548fee 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -207,6 +207,19 @@ struct raft {
 	 * subsystems, such as Raft.
 	 */
 	const struct vclock *vclock;
+	/**
+	 * The biggest term seen by this instance and persisted in WAL as part
+	 * of a PROMOTE request. May be smaller than @a term, while there are
+	 * ongoing elections, or the leader is already known, but this instance
+	 * hasn't read its PROMOTE request yet.
+	 * During other times must be equal to @a term.
+	 */
+	uint64_t greatest_known_term;
+	/**
+	 * Latest terms received with PROMOTE entries from remote instances.
+	 * Raft uses them to determine data from which sources may be applied.
+	 */
+	struct vclock term_map;
 	/** State machine timed event trigger. */
 	struct ev_timer timer;
 	/** Configured election timeout in seconds. */
@@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
 	return !raft->is_enabled || raft->leader == source_id;
 }
 
+/**
+ * Return the latest term as seen in PROMOTE requests from instance with id
+ * @a source_id.
+ */
+static inline uint64_t
+raft_source_term(const struct raft *raft, uint32_t source_id)
+{
+	assert(source_id != 0 && source_id < VCLOCK_MAX);
+	return vclock_get(&raft->term_map, source_id);
+}
+
+/**
+ * Check whether replica with id @a source_id is too old to apply synchronous
+ * data from it. The check is only valid when elections are enabled.
+ */
+static inline bool
+raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
+{
+	uint64_t source_term = vclock_get(&raft->term_map, source_id);
+	return raft->is_enabled && source_term < raft->greatest_known_term;
+}
+
+/** Remember the last term seen for replica  with id @a source_id. */
+static inline void
+raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term)
+{
+	if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)
+		return;
+	vclock_follow(&raft->term_map, source_id, term);
+	if (term > raft->greatest_known_term)
+		raft->greatest_known_term = term;
+}
+
 /** Check if Raft is enabled. */
 static inline bool
 raft_is_enabled(const struct raft *raft)
diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
new file mode 100644
index 000000000..ff3104de5
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.result
@@ -0,0 +1,291 @@
+-- test-run result file version 2
+test_run = require("test_run").new()
+ | ---
+ | ...
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+ | ---
+ | ...
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+ | ---
+ | - true
+ | ...
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+ | ---
+ | ...
+
+function name(id)
+    return 'election_replica'..id
+end;
+ | ---
+ | ...
+test_run:cmd('setopt delimiter ""');
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+ | ---
+ | ...
+leader_nr = get_leader(is_possible_leader)
+ | ---
+ | ...
+leader = name(leader_nr)
+ | ---
+ | ...
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+ | ---
+ | ...
+next_leader = name(next_leader_nr)
+ | ---
+ | ...
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+ | ---
+ | ...
+other = name(other_nr)
+ | ---
+ | ...
+
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+box.ctl.wait_rw()
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_sync=true})
+ | ---
+ | ...
+_ = box.space.test:create_index('pk')
+ | ---
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..next_leader)
+ | ---
+ | - true
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+ | ---
+ | ...
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+ | ---
+ | ...
+fib:status()
+ | ---
+ | - suspended
+ | ...
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+-- Old leader is gone.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..leader)
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = false
+ | ---
+ | ...
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+ | ---
+ | - true
+ | ...
+assert(get_leader(is_possible_leader) == next_leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+-- No signs of the unconfirmed transaction.
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+-- Note, it wins the elections fairly.
+test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn(leader, next_leader)
+ | ---
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='candidate'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+-- Resign to make old leader win the elections.
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = true
+ | ---
+ | ...
+assert(get_leader(is_possible_leader) == leader_nr)
+ | ---
+ | - true
+ | ...
+
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status='follow'})
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-5445-leader-inconsistency.test.lua b/test/replication/gh-5445-leader-inconsistency.test.lua
new file mode 100644
index 000000000..bf8b31886
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.test.lua
@@ -0,0 +1,128 @@
+test_run = require("test_run").new()
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+
+function name(id)
+    return 'election_replica'..id
+end;
+test_run:cmd('setopt delimiter ""');
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+test_run:wait_fullmesh(SERVERS)
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+leader_nr = get_leader(is_possible_leader)
+leader = name(leader_nr)
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+next_leader = name(next_leader_nr)
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+other = name(other_nr)
+
+test_run:switch(other)
+box.cfg{election_mode='voter'}
+test_run:switch('default')
+
+test_run:switch(next_leader)
+box.cfg{election_mode='voter'}
+test_run:switch('default')
+
+test_run:switch(leader)
+box.ctl.wait_rw()
+_ = box.schema.space.create('test', {is_sync=true})
+_ = box.space.test:create_index('pk')
+box.space.test:insert{1}
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+test_run:cmd('stop server '..next_leader)
+test_run:switch(leader)
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+fib:status()
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+test_run:switch(other)
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+-- Old leader is gone.
+test_run:switch('default')
+test_run:cmd('stop server '..leader)
+is_possible_leader[leader_nr] = false
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+assert(get_leader(is_possible_leader) == next_leader_nr)
+test_run:switch(other)
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+
+test_run:switch('default')
+test_run:switch(next_leader)
+-- No signs of the unconfirmed transaction.
+box.space.test:select{} -- 1
+
+test_run:switch('default')
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+-- Note, it wins the elections fairly.
+test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
+test_run:wait_lsn(leader, next_leader)
+test_run:switch(leader)
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+box.cfg{election_mode='candidate'}
+
+test_run:switch('default')
+test_run:switch(next_leader)
+-- Resign to make old leader win the elections.
+box.cfg{election_mode='voter'}
+
+test_run:switch('default')
+is_possible_leader[leader_nr] = true
+assert(get_leader(is_possible_leader) == leader_nr)
+
+test_run:switch(next_leader)
+test_run:wait_upstream(1, {status='follow'})
+box.space.test:select{} -- 1
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 4a9ca0a46..8b185ce7e 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -19,6 +19,7 @@
     "gh-5213-qsync-applier-order-3.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
     "gh-5433-election-restart-recovery.test.lua": {},
+    "gh-5445-leader-inconsistency.test.lua": {},
     "gh-5506-election-on-off.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual"
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

When an instance is configured in "manual" election mode, it behaves as
a voter for most of the time, until `box.ctl.promote()` is called.

Once `box.ctl.promote()` is called the instance will behave as a
candidate for a full election round, e.g. until the leader is known. If
the instance wins the elections, it remains in `leader` state until the
next elections. Otherwise the instance returns to `voter` state.

Follow-up #5445
Part of #3055
---
 src/box/box.cc                         | 37 +++++++++++++++++---------
 src/box/raft.c                         |  2 ++
 src/box/raft.h                         | 17 ++++++++++++
 test/replication/election_basic.result |  4 +--
 4 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index f44dd0e54..3729ed997 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -680,17 +680,27 @@ box_check_uri(const char *source, const char *option_name)
 	}
 }
 
-static const char *
+static enum election_mode
 box_check_election_mode(void)
 {
 	const char *mode = cfg_gets("election_mode");
-	if (mode == NULL || (strcmp(mode, "off") != 0 &&
-	    strcmp(mode, "voter") != 0 && strcmp(mode, "candidate") != 0)) {
-		diag_set(ClientError, ER_CFG, "election_mode", "the value must "
-			 "be a string 'off' or 'voter' or 'candidate'");
-		return NULL;
-	}
-	return mode;
+	if (mode == NULL)
+		goto error;
+
+	if (strcmp(mode, "off") == 0)
+		return ELECTION_MODE_OFF;
+	else if (strcmp(mode, "voter") == 0)
+		return ELECTION_MODE_VOTER;
+	else if (strcmp(mode, "manual") == 0)
+		return ELECTION_MODE_MANUAL;
+	else if (strcmp(mode, "candidate") == 0)
+		return ELECTION_MODE_CANDIDATE;
+
+error:
+	diag_set(ClientError, ER_CFG, "election_mode",
+		"the value must be one of the following strings: "
+		"'off', 'voter', 'candidate', 'manual'");
+	return ELECTION_MODE_INVALID;
 }
 
 static double
@@ -1113,7 +1123,7 @@ box_check_config(void)
 	box_check_uri(cfg_gets("listen"), "listen");
 	box_check_instance_uuid(&uuid);
 	box_check_replicaset_uuid(&uuid);
-	if (box_check_election_mode() == NULL)
+	if (box_check_election_mode() == ELECTION_MODE_INVALID)
 		diag_raise();
 	if (box_check_election_timeout() < 0)
 		diag_raise();
@@ -1147,11 +1157,12 @@ box_check_config(void)
 int
 box_set_election_mode(void)
 {
-	const char *mode = box_check_election_mode();
-	if (mode == NULL)
+	enum election_mode mode = box_check_election_mode();
+	if (mode == ELECTION_MODE_INVALID)
 		return -1;
-	raft_cfg_is_candidate(box_raft(), strcmp(mode, "candidate") == 0);
-	raft_cfg_is_enabled(box_raft(), strcmp(mode, "off") != 0);
+	box_election_mode = mode;
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
 
diff --git a/src/box/raft.c b/src/box/raft.c
index cfd898db0..285dbe4fd 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -44,6 +44,8 @@ struct raft box_raft_global = {
 	.state = 0,
 };
 
+enum election_mode box_election_mode = ELECTION_MODE_INVALID;
+
 /**
  * A trigger executed each time the Raft state machine updates any
  * of its visible attributes.
diff --git a/src/box/raft.h b/src/box/raft.h
index 1c59f17e6..15f4e80d9 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,8 +35,25 @@
 extern "C" {
 #endif
 
+enum election_mode {
+	ELECTION_MODE_INVALID = -1,
+	ELECTION_MODE_OFF = 0,
+	ELECTION_MODE_VOTER = 1,
+	ELECTION_MODE_MANUAL = 2,
+	ELECTION_MODE_CANDIDATE = 3,
+};
+
 struct raft_request;
 
+/**
+ * box_election_mode - current mode of operation for raft. Some modes correspond
+ * to RAFT operation modes directly, like CANDIDATE, VOTER and OFF.
+ * There's a mode which does not map to raft operation mode directly:
+ * MANUAL. In this mode RAFT usually operates as a voter, but it may become a
+ * candidate for some period of time when user calls `box.ctl.promote()`
+ */
+extern enum election_mode box_election_mode;
+
 /** Raft state of this instance. */
 static inline struct raft *
 box_raft(void)
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index 4d7d33f2b..d5320b3ff 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -22,8 +22,8 @@ box.cfg{election_mode = 100}
  | ...
 box.cfg{election_mode = '100'}
  | ---
- | - error: 'Incorrect value for option ''election_mode'': the value must be a string
- |     ''off'' or ''voter'' or ''candidate'''
+ | - error: 'Incorrect value for option ''election_mode'': the value must be one of the
+ |     following strings: ''off'', ''voter'', ''candidate'', ''manual'''
  | ...
 box.cfg{election_timeout = -1}
  | ---
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (6 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:30   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 09/10] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

This patch adds support for manual elections from
`box.ctl.clear_synchro_queue()`. When an instance is in
`election_mode='manual'`, calling `clear_synchro_queue()` will make it
start a new election round.

Follow-up #5445
Part of #3055

@TarantoolBot document
Title: describe election_mode='manual'

Manual election mode is introduced. It may be used when the user wants to
control which instance is the leader explicitly instead of relying on
Raft election algorithm.

When an instance is configured with `election_mode='manual'`, it behaves
as follows:
 1) By default, the instance acts like a voter: it is read-only and may
    vote for other instances that are candidates.
 2) Once `box.ctl.clear_synchro_queue()` is called, the instance becomes a
    candidate and starts a new election round. If the instance wins the
    elections, it remains leader, but won't participate in any new elections.
---
 src/box/box.cc              | 73 ++++++++++++++++++++++++++++++++++---
 src/box/errcode.h           |  3 ++
 src/box/raft.c              | 30 +++++++++++++--
 src/box/raft.h              |  3 ++
 src/lib/raft/raft.c         | 12 +++++-
 src/lib/raft/raft.h         |  2 +-
 test/box/error.result       |  3 ++
 test/unit/raft_test_utils.c |  4 +-
 8 files changed, 115 insertions(+), 15 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 3729ed997..6c7c8968a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1161,7 +1161,8 @@ box_set_election_mode(void)
 	if (mode == ELECTION_MODE_INVALID)
 		return -1;
 	box_election_mode = mode;
-	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE,
+			      true);
 	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
@@ -1525,12 +1526,74 @@ box_clear_synchro_queue(bool try_wait)
 	if (!is_box_configured ||
 	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
 		return 0;
+
+	bool run_elections = false;
+
+	switch (box_election_mode) {
+	case ELECTION_MODE_OFF:
+		break;
+	case ELECTION_MODE_VOTER:
+		assert(box_raft()->state == RAFT_STATE_FOLLOWER);
+		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
+			 "manual elections");
+		return -1;
+	case ELECTION_MODE_MANUAL:
+		assert(box_raft()->state != RAFT_STATE_CANDIDATE);
+		if (box_raft()->state == RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_ALREADY_LEADER);
+			return -1;
+		}
+		run_elections = true;
+		try_wait = false;
+		break;
+	case ELECTION_MODE_CANDIDATE:
+		/*
+		 * Leader elections are enabled, and this instance is allowed to
+		 * promote only if it's already an elected leader. No manual
+		 * elections.
+		 */
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
+				 "'candidate'", "manual elections");
+			return -1;
+		}
+		break;
+	default:
+		unreachable();
+	}
+
 	uint32_t former_leader_id = txn_limbo.owner_id;
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (run_elections) {
+		/*
+		 * Make this instance a candidate and run until some leader, not
+		 * necessarily this instance, emerges.
+		 */
+		raft_cfg_is_candidate(box_raft(), true, false);
+		/*
+		 * Trigger new elections without waiting for an old leader to
+		 * disappear.
+		 */
+		raft_new_term(box_raft());
+		box_raft_wait_leader_found();
+		raft_cfg_is_candidate(box_raft(), false, false);
+		if (!box_raft()->is_enabled) {
+			diag_set(ClientError, ER_RAFT_DISABLED);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 box_raft()->leader);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+	}
+
 	if (txn_limbo_is_empty(&txn_limbo))
 		goto promote;
 
@@ -1548,12 +1611,10 @@ box_clear_synchro_queue(bool try_wait)
 		 * transactions. Exit in case someone did that for us.
 		 */
 		if (former_leader_id != txn_limbo.owner_id) {
-			/*
-			 * TODO: error once we see someone else has become the
-			 * leader already.
-			 */
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 txn_limbo.owner_id);
 			in_clear_synchro_queue = false;
-			return 0;
+			return -1;
 		}
 	}
 
diff --git a/src/box/errcode.h b/src/box/errcode.h
index c63191fb6..df36085db 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -275,6 +275,9 @@ struct errcode_record {
 	/*220 */_(ER_TOO_EARLY_SUBSCRIBE,	"Can't subscribe non-anonymous replica %s until join is done") \
 	/*221 */_(ER_SQL_CANT_ADD_AUTOINC,	"Can't add AUTOINCREMENT: space %s can't feature more than one AUTOINCREMENT field") \
 	/*222 */_(ER_QUORUM_WAIT,		"Couldn't wait for quorum %d: %s") \
+	/*223 */_(ER_INTERFERING_PROMOTE,	"Instance with replica id %u was promoted first") \
+	/*224 */_(ER_RAFT_DISABLED,		"Elections were turned off while running box.ctl.promote()")\
+	/*225 */_(ER_ALREADY_LEADER,		"Can't promote an existing leader")\
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/raft.c b/src/box/raft.c
index 285dbe4fd..c7dc79f9b 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -91,11 +91,11 @@ box_raft_update_synchro_queue(struct raft *raft)
 	 * If the node became a leader, it means it will ignore all records from
 	 * all the other nodes, and won't get late CONFIRM messages anyway. Can
 	 * clear the queue without waiting for confirmations.
-	 * It's alright that the user may have called clear_synchro_queue
-	 * manually. In this case the call below will exit immediately and we'll
-	 * simply log a warning.
+	 * In case these are manual elections, we are already in the middle of a
+	 * `clear_synchro_queue` call. No need to call it once again.
 	 */
-	if (raft->state == RAFT_STATE_LEADER) {
+	if (raft->state == RAFT_STATE_LEADER &&
+	    box_election_mode != ELECTION_MODE_MANUAL) {
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
@@ -336,6 +336,28 @@ fail:
 	panic("Could not write a raft request to WAL\n");
 }
 
+static int
+box_raft_wait_leader_found_trig(struct trigger *trig, void *event)
+{
+	struct raft *raft = (struct raft *)event;
+	assert(raft == box_raft());
+	struct fiber *waiter = (struct fiber *)trig->data;
+	if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
+		fiber_wakeup(waiter);
+	return 0;
+}
+
+void
+box_raft_wait_leader_found(void)
+{
+	struct trigger trig;
+	trigger_create(&trig, box_raft_wait_leader_found_trig, fiber(), NULL);
+	raft_on_update(box_raft(), &trig);
+	fiber_yield();
+	assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
+	trigger_clear(&trig);
+}
+
 void
 box_raft_init(void)
 {
diff --git a/src/box/raft.h b/src/box/raft.h
index 15f4e80d9..8fce423e1 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
 int
 box_raft_process(struct raft_request *req, uint32_t source);
 
+void
+box_raft_wait_leader_found();
+
 void
 box_raft_init(void);
 
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e9ce8cade..7b77e05ea 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
 }
 
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)
 {
 	bool old_is_candidate = raft->is_candidate;
 	raft->is_cfg_candidate = is_candidate;
@@ -874,8 +874,16 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
 			raft_ev_timer_stop(raft_loop(), &raft->timer);
 		}
 		if (raft->state != RAFT_STATE_FOLLOWER) {
-			if (raft->state == RAFT_STATE_LEADER)
+			if (raft->state == RAFT_STATE_LEADER) {
+				if (!demote) {
+					/*
+					 * Remain leader until someone
+					 * triggers new elections.
+					 */
+					return;
+				}
 				raft->leader = 0;
+			}
 			raft->state = RAFT_STATE_FOLLOWER;
 			/* State is visible and changed - broadcast. */
 			raft_schedule_broadcast(raft);
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 01f548fee..390ea8ed8 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -325,7 +325,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
  * the node still can vote, when Raft is enabled.
  */
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote);
 
 /** Configure Raft leader election timeout. */
 void
diff --git a/test/box/error.result b/test/box/error.result
index 7761c6949..dad6a21d3 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -441,6 +441,9 @@ t;
  |   220: box.error.TOO_EARLY_SUBSCRIBE
  |   221: box.error.SQL_CANT_ADD_AUTOINC
  |   222: box.error.QUORUM_WAIT
+ |   223: box.error.INTERFERING_PROMOTE
+ |   224: box.error.RAFT_DISABLED
+ |   225: box.error.ALREADY_LEADER
  | ...
 
 test_run:cmd("setopt delimiter ''");
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b8735f373..a10ccae6a 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -360,7 +360,7 @@ raft_node_start(struct raft_node *node)
 		raft_process_recovery(&node->raft, &node->journal.rows[i]);
 
 	raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
-	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
+	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate, true);
 	raft_cfg_election_timeout(&node->raft, node->cfg_election_timeout);
 	raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
 	raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
@@ -402,7 +402,7 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
 {
 	node->cfg_is_candidate = value;
 	if (raft_node_is_started(node)) {
-		raft_cfg_is_candidate(&node->raft, value);
+		raft_cfg_is_candidate(&node->raft, value, true);
 		raft_run_async_work();
 	}
 }
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 09/10] box: remove parameter from clear_synchro_queue
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (7 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:17 ` Serge Petrenko via Tarantool-patches
  2021-04-14 14:18 ` [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:17 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

The `try_wait` parameter became redundant with the inroduction of manual
elections concept. It may be determined whether the node should wait for
pending confirmations or not by looking at election mode, so remove the
parameter.

Part of #3055
---
 src/box/box.cc    | 5 +++--
 src/box/box.h     | 2 +-
 src/box/lua/ctl.c | 2 +-
 src/box/raft.c    | 5 +----
 4 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 6c7c8968a..9b663f54a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1509,7 +1509,7 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 }
 
 int
-box_clear_synchro_queue(bool try_wait)
+box_clear_synchro_queue(void)
 {
 	/* A guard to block multiple simultaneous function invocations. */
 	static bool in_clear_synchro_queue = false;
@@ -1528,9 +1528,11 @@ box_clear_synchro_queue(bool try_wait)
 		return 0;
 
 	bool run_elections = false;
+	bool try_wait = false;
 
 	switch (box_election_mode) {
 	case ELECTION_MODE_OFF:
+		try_wait = true;
 		break;
 	case ELECTION_MODE_VOTER:
 		assert(box_raft()->state == RAFT_STATE_FOLLOWER);
@@ -1544,7 +1546,6 @@ box_clear_synchro_queue(bool try_wait)
 			return -1;
 		}
 		run_elections = true;
-		try_wait = false;
 		break;
 	case ELECTION_MODE_CANDIDATE:
 		/*
diff --git a/src/box/box.h b/src/box/box.h
index e2321b9b0..90facd189 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
 typedef struct tuple box_tuple_t;
 
 int
-box_clear_synchro_queue(bool try_wait);
+box_clear_synchro_queue(void);
 
 /* box_select is private and used only by FFI */
 API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index d039a059f..5b8d0d0e4 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -84,7 +84,7 @@ lbox_ctl_on_schema_init(struct lua_State *L)
 static int
 lbox_ctl_clear_synchro_queue(struct lua_State *L)
 {
-	if (box_clear_synchro_queue(true) != 0)
+	if (box_clear_synchro_queue() != 0)
 		return luaT_error(L);
 	return 0;
 }
diff --git a/src/box/raft.c b/src/box/raft.c
index c7dc79f9b..e8c9f3d2c 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -88,9 +88,6 @@ box_raft_update_synchro_queue(struct raft *raft)
 {
 	assert(raft == box_raft());
 	/*
-	 * If the node became a leader, it means it will ignore all records from
-	 * all the other nodes, and won't get late CONFIRM messages anyway. Can
-	 * clear the queue without waiting for confirmations.
 	 * In case these are manual elections, we are already in the middle of a
 	 * `clear_synchro_queue` call. No need to call it once again.
 	 */
@@ -99,7 +96,7 @@ box_raft_update_synchro_queue(struct raft *raft)
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
-			rc = box_clear_synchro_queue(false);
+			rc = box_clear_synchro_queue();
 			if (rc != 0) {
 				struct error *err = diag_last_error(diag_get());
 				errcode = box_error_code(err);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (8 preceding siblings ...)
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 09/10] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
@ 2021-04-14 14:18 ` Serge Petrenko via Tarantool-patches
  2021-04-15 23:31   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-14 18:21 ` [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches
  2021-04-15 23:16 ` Vladislav Shpilevoy via Tarantool-patches
  11 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14 14:18 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

New function name will be `box.ctl.promote()`. It's much shorter and
closer to the function's now enriched functionality.

Old name `box.ctl.clear_synchro_queue()` remains in Lua for the sake of
backward compatibility.

Follow-up #5445
Closes #3055

@TarantoolBot document
Title: deprecate `box.ctl.clear_synchro_queue()` in favor of `box.ctl.promote()`

Replace all the mentions of `box.ctl.clear_synchro_queue()` with
`box.ctl.promote()` and add a note that `box.ctl.clear_synchro_queue()`
is a deprecated alias to `box.ctl.promote()`
---
 changelogs/unreleased/box-ctl-promote.md      |   8 ++
 src/box/box.cc                                |  20 ++--
 src/box/box.h                                 |   2 +-
 src/box/lua/ctl.c                             |   8 +-
 src/box/raft.c                                |   4 +-
 test/replication/election_basic.result        |  25 +++++
 test/replication/election_basic.test.lua      |  10 ++
 .../gh-3055-election-promote.result           | 105 ++++++++++++++++++
 .../gh-3055-election-promote.test.lua         |  43 +++++++
 test/replication/suite.cfg                    |   1 +
 10 files changed, 210 insertions(+), 16 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md
 create mode 100644 test/replication/gh-3055-election-promote.result
 create mode 100644 test/replication/gh-3055-election-promote.test.lua

diff --git a/changelogs/unreleased/box-ctl-promote.md b/changelogs/unreleased/box-ctl-promote.md
new file mode 100644
index 000000000..15f6fb206
--- /dev/null
+++ b/changelogs/unreleased/box-ctl-promote.md
@@ -0,0 +1,8 @@
+## feature/replication
+
+* Introduce `box.ctl.promote()` and the concept of manual elections (enabled
+  with `election_mode='manual'`). Once the instance is in `manual` election
+  mode, it acts like a `voter` most of the time, but may trigger elections and
+  become a leader, once `box.ctl.promote()` is called.
+  When `election_mode ~= 'manual'`, `box.ctl.promote()` replaces
+  `box.ctl.clear_synchro_queue()`, which is now deprecated (gh-3055).
diff --git a/src/box/box.cc b/src/box/box.cc
index 9b663f54a..0b0c38cd5 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1509,12 +1509,12 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 }
 
 int
-box_clear_synchro_queue(void)
+box_promote(void)
 {
 	/* A guard to block multiple simultaneous function invocations. */
-	static bool in_clear_synchro_queue = false;
-	if (in_clear_synchro_queue) {
-		diag_set(ClientError, ER_UNSUPPORTED, "clear_synchro_queue",
+	static bool in_promote = false;
+	if (in_promote) {
+		diag_set(ClientError, ER_UNSUPPORTED, "box.ctl.promote",
 			 "simultaneous invocations");
 		return -1;
 	}
@@ -1567,7 +1567,7 @@ box_clear_synchro_queue(void)
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
-	in_clear_synchro_queue = true;
+	in_promote = true;
 
 	if (run_elections) {
 		/*
@@ -1584,13 +1584,13 @@ box_clear_synchro_queue(void)
 		raft_cfg_is_candidate(box_raft(), false, false);
 		if (!box_raft()->is_enabled) {
 			diag_set(ClientError, ER_RAFT_DISABLED);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 		if (box_raft()->state != RAFT_STATE_LEADER) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 box_raft()->leader);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
@@ -1614,13 +1614,13 @@ box_clear_synchro_queue(void)
 		if (former_leader_id != txn_limbo.owner_id) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 txn_limbo.owner_id);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
 
 	/*
-	 * clear_synchro_queue() is a no-op on the limbo owner, so all the rows
+	 * promote() is a no-op on the limbo owner, so all the rows
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
@@ -1657,7 +1657,7 @@ promote:
 						req.term);
 		}
 	}
-	in_clear_synchro_queue = false;
+	in_promote = false;
 	return rc;
 }
 
diff --git a/src/box/box.h b/src/box/box.h
index 90facd189..04bdd397d 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
 typedef struct tuple box_tuple_t;
 
 int
-box_clear_synchro_queue(void);
+box_promote(void);
 
 /* box_select is private and used only by FFI */
 API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index 5b8d0d0e4..368b9ab60 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -82,9 +82,9 @@ lbox_ctl_on_schema_init(struct lua_State *L)
 }
 
 static int
-lbox_ctl_clear_synchro_queue(struct lua_State *L)
+lbox_ctl_promote(struct lua_State *L)
 {
-	if (box_clear_synchro_queue() != 0)
+	if (box_promote() != 0)
 		return luaT_error(L);
 	return 0;
 }
@@ -124,7 +124,9 @@ static const struct luaL_Reg lbox_ctl_lib[] = {
 	{"wait_rw", lbox_ctl_wait_rw},
 	{"on_shutdown", lbox_ctl_on_shutdown},
 	{"on_schema_init", lbox_ctl_on_schema_init},
-	{"clear_synchro_queue", lbox_ctl_clear_synchro_queue},
+	{"promote", lbox_ctl_promote},
+	/* An old alias. */
+	{"clear_synchro_queue", lbox_ctl_promote},
 	{"is_recovery_finished", lbox_ctl_is_recovery_finished},
 	{"set_on_shutdown_timeout", lbox_ctl_set_on_shutdown_timeout},
 	{NULL, NULL}
diff --git a/src/box/raft.c b/src/box/raft.c
index e8c9f3d2c..e357772a5 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -89,14 +89,14 @@ box_raft_update_synchro_queue(struct raft *raft)
 	assert(raft == box_raft());
 	/*
 	 * In case these are manual elections, we are already in the middle of a
-	 * `clear_synchro_queue` call. No need to call it once again.
+	 * `promote` call. No need to call it once again.
 	 */
 	if (raft->state == RAFT_STATE_LEADER &&
 	    box_election_mode != ELECTION_MODE_MANUAL) {
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
-			rc = box_clear_synchro_queue();
+			rc = box_promote();
 			if (rc != 0) {
 				struct error *err = diag_last_error(diag_get());
 				errcode = box_error_code(err);
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index d5320b3ff..78c911245 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -108,6 +108,31 @@ assert(box.info.election.leader == box.info.id)
  | - true
  | ...
 
+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+ | ---
+ | ...
+
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - error: assertion failed!
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - error: assertion failed!
+ | ...
+
 box.cfg{                                                                        \
     election_mode = 'off',                                                      \
     election_timeout = old_election_timeout                                     \
diff --git a/test/replication/election_basic.test.lua b/test/replication/election_basic.test.lua
index 821f73cea..5fc398848 100644
--- a/test/replication/election_basic.test.lua
+++ b/test/replication/election_basic.test.lua
@@ -39,6 +39,16 @@ assert(box.info.election.term > term)
 assert(box.info.election.vote == box.info.id)
 assert(box.info.election.leader == box.info.id)
 
+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(box.info.election.term > term)
+
 box.cfg{                                                                        \
     election_mode = 'off',                                                      \
     election_timeout = old_election_timeout                                     \
diff --git a/test/replication/gh-3055-election-promote.result b/test/replication/gh-3055-election-promote.result
new file mode 100644
index 000000000..6f5af13bc
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.result
@@ -0,0 +1,105 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+ | ---
+ | ...
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+ | ---
+ | ...
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+ | ---
+ | ...
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+ | ---
+ | - true
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+assert(box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.leader ~= 0)
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-3055-election-promote.test.lua b/test/replication/gh-3055-election-promote.test.lua
new file mode 100644
index 000000000..cbc3ed206
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.test.lua
@@ -0,0 +1,43 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+test_run:wait_fullmesh(SERVERS)
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+term = box.info.election.term
+assert(box.info.election.state == 'follower')
+assert(box.info.ro)
+assert(box.info.election.leader ~= 0)
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8b185ce7e..dc39e2f74 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -2,6 +2,7 @@
     "anon.test.lua": {},
     "anon_register_gap.test.lua": {},
     "gh-2991-misc-asserts-on-update.test.lua": {},
+    "gh-3055-election-promote.test.lua": {},
     "gh-3111-misc-rebootstrap-from-ro-master.test.lua": {},
     "gh-3160-misc-heartbeats-on-master-changes.test.lua": {},
     "gh-3247-misc-iproto-sequence-value-not-replicated.test.lua": {},
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (9 preceding siblings ...)
  2021-04-14 14:18 ` [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-14 18:21 ` Cyrill Gorcunov via Tarantool-patches
  2021-04-15 23:16 ` Vladislav Shpilevoy via Tarantool-patches
  11 siblings, 0 replies; 36+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-14 18:21 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Wed, Apr 14, 2021 at 05:17:10PM +0300, Serge Petrenko wrote:
> Changes in v3:
>   - fix gh-5445-leader-inconsistency.test.lua flakiness
>   - fixes as per review from Cyrill Gorcunov
>   - minor fixes and rewordings
>   - rebased on top of current master
>   - added patch 9/10 (remove parameter from clear_synchro_queue)
> 
> Changes in v2:
>   - Added tests for patches 1, 6, 9
>   - Minor typo fixes and bugfixes.

Thanks! Ack for code structure, I can't comment much on
architectural details, lets wait for Vlad's opinion.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
  2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (10 preceding siblings ...)
  2021-04-14 18:21 ` [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches
@ 2021-04-15 23:16 ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 16:35   ` Serge Petrenko via Tarantool-patches
  11 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:16 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Hi! Thanks for the patchset!

I see the new test does not pass in CI in one of the jobs:
https://github.com/tarantool/tarantool/runs/2343885212

Is it still flaky?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:18   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16  7:08     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:18 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Good job on the patch!

See 3 comments below.

> diff --git a/src/box/journal.h b/src/box/journal.h
> index 76c70c19f..3ce9c869e 100644
> --- a/src/box/journal.h
> +++ b/src/box/journal.h
> @@ -63,6 +63,7 @@ struct journal_entry {
>  	 * A journal entry completion callback argument.
>  	 */
>  	void *complete_data;
> +	uint8_t opt_flags;

1. I propose to call them just flags. There is no a third value
like 'no flag'. They are either set or not, am I right? Also the
member is missing a comment. The most important thing to say -
these flags are only for the last row.

>  	/**
>  	 * Asynchronous write completion function.
>  	 */
> @@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
>  	entry->approx_len	= approx_len;
>  	entry->n_rows		= n_rows;
>  	entry->res		= -1;
> +	entry->opt_flags	= 0;

2. You could initialize it with IPROTO_FLAG_COMMIT right here and
drop (*row)->is_commit = true from wal_assign_lsn. But this one up
to you. Maybe it is not a good idea.

>  }
>  
>  /**
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 34af0bda6..4ec8034a3 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -962,14 +962,14 @@ out:
>   */
>  static void
>  wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
> -	       struct xrow_header **row,
> -	       struct xrow_header **end)
> +	       struct journal_entry *entry)

3. This part could be a separate commit, otherwise it is hard to
see the functional changes. Up to you if you want to split.

>  {
>  	int64_t tsn = 0;
> -	struct xrow_header **start = row;
> -	struct xrow_header **first_glob_row = row;
> +	struct xrow_header **start = entry->rows;
> +	struct xrow_header **end = entry->rows + entry->n_rows;
> +	struct xrow_header **first_glob_row = entry->rows;
>  	/** Assign LSN to all local rows. */
> -	for ( ; row < end; row++) {
> +	for (struct xrow_header **row = start; row < end; row++) {
>  		if ((*row)->replica_id == 0) {
>  			/*
>  			 * All rows representing local space data

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:19   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 16:18     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:19 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

I appreciate the work you did here!

See 2 comments below.

> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index cc8e43ed4..5d515ce92 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -884,28 +884,63 @@ xrow_encode_dml(const struct request *request, struct region *region,
>  	return iovcnt;
>  }
>  
> -void
> -xrow_encode_synchro(struct xrow_header *row,
> -		    struct synchro_body_bin *body,
> -		    const struct synchro_request *req)
> +static void
> +xrow_encode_synchro_body(struct synchro_body_bin *body,
> +		         const struct synchro_request *req)
>  {
>  	/*
> -	 * A map with two elements. We don't compress
> +	 * A map with two or three elements. We don't compress
>  	 * numbers to have this structure constant in size,
>  	 * which allows us to preallocate it on stack.
>  	 */
> -	body->m_body = 0x80 | 2;
> +	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
>  	body->k_replica_id = IPROTO_REPLICA_ID;
>  	body->m_replica_id = 0xce;
>  	body->v_replica_id = mp_bswap_u32(req->replica_id);
>  	body->k_lsn = IPROTO_LSN;
>  	body->m_lsn = 0xcf;
>  	body->v_lsn = mp_bswap_u64(req->lsn);
> +}
> +
> +void
> +xrow_encode_synchro(struct xrow_header *row,
> +		    struct synchro_body_bin *body,
> +		    const struct synchro_request *req)
> +{
> +	assert(req->type == IPROTO_CONFIRM || req->type == IPROTO_ROLLBACK);
> +
> +	xrow_encode_synchro_body(body, req);
>  
>  	memset(row, 0, sizeof(*row));
> +	row->type = req->type;
> +	row->body[0].iov_base = body;
> +	row->body[0].iov_len = sizeof(*body);
> +	row->bodycnt = 1;
> +}
> +
> +static inline void
> +xrow_encode_promote_body(struct promote_body_bin *body,
> +			 const struct synchro_request *req)

1. I would propose to inline it. It is used in a single place,
and now it looks like if we would have more than 1 place where
we would need the promote body.

But more interestingly, it looks you could keep it a single
function xrow_encode_synchro. Although we wouldn't be able to
have a PACKED struct with predefined fields. Not a big deal
anyway.

The reasoning is similar to xrow_encode_dml(). It also uses
a single struct request for all kinds of DML, and conditionally
encodes the non-zero fields. I think your case is the same. It
would simplify some code, and remove branching from other
places. For example, from txn_limbo_write_synchro(), where you
branch between PROMOTE and non-PROMOTE when decide what to encode.
We even had the same issue when tried to encode CONFIRM and
ROLLBACK via separate functions.

> +{
> +	xrow_encode_synchro_body(&body->base, req);
> +
> +	body->k_term = IPROTO_TERM;
> +	body->m_term = 0xcf;
> +	body->v_term = mp_bswap_u64(req->term);
> +}
> +
>  
> +void
> +xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
> +		    const struct synchro_request *req)
> +{
> +	assert(req->type == IPROTO_PROMOTE);
> +
> +	xrow_encode_promote_body(body, req);
> +
> +	memset(row, 0, sizeof(*row));
>  	row->type = req->type;
> -	row->body[0].iov_base = (void *)body;
> +	row->body[0].iov_base = body;

2. Unnecessary change. But I don't mind, up to you.

>  	row->body[0].iov_len = sizeof(*body);
>  	row->bodycnt = 1;
>  }

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:20   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16  9:28     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:20 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Thanks for working on this!

See 2 comments below.

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 70b325180..9adb6ba46 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
>  				 "new synchronous transactions appeared");
>  			rc = -1;
>  		} else {
> -			txn_limbo_force_empty(&txn_limbo, wait_lsn);
> +			/*
> +			 * Term parameter is unused now, We'll pass
> +			 * box_raft()->term there later.
> +			 */
> +			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
> +			struct synchro_request req = {
> +				.type = 0, /* unused */
> +				.replica_id = 0, /* unused */
> +				.origin_id = instance_id,
> +				.lsn = wait_lsn,
> +				.term = 0, /* unused */

1. Aren't the unused fields nullified anyway according to
the standard?

> +			};
> +			txn_limbo_read_promote(&txn_limbo, &req);
>  			assert(txn_limbo_is_empty(&txn_limbo));
>  		}
>  	}
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index d29722ef7..bfe0ad302 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -464,6 +470,32 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
>  		box_update_ro_summary();
>  }
>  
> +void
> +txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
> +{
> +	limbo->confirmed_lsn = lsn;
> +	/*
> +	 * We make sure that promote is only written once everything this
> +	 * instance has may be confirmed.
> +	 */
> +	struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
> +	assert(e == NULL || e->lsn <= lsn);
> +	(void) e;
> +	txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
> +	limbo->is_in_rollback = false;

2. How is it possible that there was a rollback in progress at
the same time?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:21   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16  9:33     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:21 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Thanks for the patch!

> @@ -1540,11 +1542,10 @@ box_clear_synchro_queue(bool try_wait)
>  	 * in the limbo must've come through the applier meaning they already
>  	 * have an lsn assigned, even if their WAL write hasn't finished yet.
>  	 */
> -	int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
> +	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>  	assert(wait_lsn > 0);
>  
> -	int quorum = replication_synchro_quorum;
> -	int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
> +	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
>  				 replication_synchro_timeout);

The second line became misaligned.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 14:16     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:27 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Good job on the patch!

Please, try to reduce length of the lines in the commit
message, or at least its title. It is suuuper long now.

See 10 comments below.

> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 4898f9f7b..3fb864686 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -790,6 +790,12 @@ apply_synchro_row_cb(struct journal_entry *entry)
>  		applier_rollback_by_wal_io();
>  	} else {
>  		txn_limbo_process(&txn_limbo, synchro_entry->req);
> +		if (iproto_type_is_promote_request(synchro_entry->req->type)) {
> +			raft_source_update_term(box_raft(),
> +						synchro_entry->req->origin_id,
> +						synchro_entry->req->term);

1. How about moving that to txn_limbo_read_promote()? What do you think? I see
you do it in 3 places where txn_limbo_process() or txn_limbo_read_promote()
are called on PROMOTE rows.

> +
> +		}
>  		trigger_run(&replicaset.applier.on_wal_write, NULL);
>  	}
>  	/* The fiber is the same on final join. */
> @@ -1027,6 +1033,28 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>  		}
>  	}
>  
> +	/*
> +	 * When elections are enabled we must filter out synchronous rows coming
> +	 * from an instance that fell behind the current leader. This includes
> +	 * both synchronous tx rows and rows for txs following unconfirmed
> +	 * synchronous transactions.
> +	 * The rows are replaced with NOPs to preserve the vclock consistency.
> +	 */
> +	struct applier_tx_row *item;
> +	if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&

2. The names are too long IMO. I would propose

	raft_is_node_outdated(raft, id)   // Check if behind
	raft_process_term(raft, id, term) // Set term for a node or skip if
						the same or older
	raft_node_term(raft, id)         // Get term

'source' is not really a perfect name, because raft nodes send
messages to each other. There are no one-directional channels
AFAIR like we have with upstream and downstream in the replication.

I used 'source' in raft_process_heartbeat() as like a source of the
heartbeat message. Note like the nodes are called sources everywhere.

Also I used 'process' for the new term, because we already have
raft_process_heartbeat() to handle info from a node with a given
ID, and I thought it makes sense to keep them similar.


3. Why does raft_is_source_allowed() still exist when we have this wonder?

> +	    (last_row->wait_sync ||
> +	     (iproto_type_is_synchro_request(first_row->type) &&
> +	     !iproto_type_is_promote_request(first_row->type)))) {
> +		stailq_foreach_entry(item, rows, next) {
> +			struct xrow_header *row = &item->row;
> +			row->type = IPROTO_NOP;
> +			/*
> +			 * Row body is saved to fiber's region and will be freed
> +			 * on next fiber_gc() call.
> +			 */
> +			row->bodycnt = 0;
> +		}
> +	}
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 722fc23b7..f44dd0e54 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
>  		return -1;
>  	}
>  	txn_limbo_process(&txn_limbo, &syn_req);
> +	if (syn_req.type == IPROTO_PROMOTE) {
> +			raft_source_update_term(box_raft(), syn_req.origin_id,
> +						syn_req.term);

4. Misaligned. Also see the first comment.

> @@ -1558,20 +1567,21 @@ box_clear_synchro_queue(bool try_wait)
>  			rc = -1;
>  		} else {
>  promote:
> -			/*
> -			 * Term parameter is unused now, We'll pass
> -			 * box_raft()->term there later.
> -			 */
> -			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
> +			/* We cannot possibly get here in a volatile state. */
> +			assert(box_raft()->volatile_term == box_raft()->term);
> +			txn_limbo_write_promote(&txn_limbo, wait_lsn,
> +						box_raft()->term);
>  			struct synchro_request req = {
>  				.type = 0, /* unused */
>  				.replica_id = 0, /* unused */
>  				.origin_id = instance_id,
>  				.lsn = wait_lsn,
> -				.term = 0, /* unused */
> +				.term = box_raft()->term,
>  			};
>  			txn_limbo_read_promote(&txn_limbo, &req);
>  			assert(txn_limbo_is_empty(&txn_limbo));
> +			raft_source_update_term(box_raft(), req.origin_id,
> +						req.term);

5. See the first comment.

>  		}
>  	}
> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
> index e447f6634..01f548fee 100644
> --- a/src/lib/raft/raft.h
> +++ b/src/lib/raft/raft.h
> @@ -207,6 +207,19 @@ struct raft {
>  	 * subsystems, such as Raft.
>  	 */
>  	const struct vclock *vclock;
> +	/**
> +	 * The biggest term seen by this instance and persisted in WAL as part
> +	 * of a PROMOTE request. May be smaller than @a term, while there are
> +	 * ongoing elections, or the leader is already known, but this instance
> +	 * hasn't read its PROMOTE request yet.
> +	 * During other times must be equal to @a term.
> +	 */
> +	uint64_t greatest_known_term;

6. Maybe omit 'known'. There can't be 'greatest_unknown_term'.

> +	/**
> +	 * Latest terms received with PROMOTE entries from remote instances.
> +	 * Raft uses them to determine data from which sources may be applied.
> +	 */
> +	struct vclock term_map;

7. I have a feeling it is similar to the limbo's LSN map. Like
they should be merged into something one. Can't formulate that
properly. I hope we will see it more clear when will move all that
to the WAL thread someday.

>  	/** State machine timed event trigger. */
>  	struct ev_timer timer;
>  	/** Configured election timeout in seconds. */
> @@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
>  	return !raft->is_enabled || raft->leader == source_id;
>  }
>  
> +/**
> + * Return the latest term as seen in PROMOTE requests from instance with id
> + * @a source_id.
> + */
> +static inline uint64_t
> +raft_source_term(const struct raft *raft, uint32_t source_id)
> +{
> +	assert(source_id != 0 && source_id < VCLOCK_MAX);
> +	return vclock_get(&raft->term_map, source_id);
> +}
> +
> +/**
> + * Check whether replica with id @a source_id is too old to apply synchronous
> + * data from it. The check is only valid when elections are enabled.
> + */
> +static inline bool
> +raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
> +{
> +	uint64_t source_term = vclock_get(&raft->term_map, source_id);
> +	return raft->is_enabled && source_term < raft->greatest_known_term;
> +}
> +
> +/** Remember the last term seen for replica  with id @a source_id. */
> +static inline void
> +raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term)
> +{
> +	if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)

8. Probably having the term as uint64_t was a mistake from the beginning.
Feel free to change it to int64_t if you want, in a separate commit.

> +		return;
> +	vclock_follow(&raft->term_map, source_id, term);
> +	if (term > raft->greatest_known_term)
> +		raft->greatest_known_term = term;
> +}

9. I see these are not used in the raft code at all. Did you think about
moving it all to box/raft.h and box/raft.c? Or about covering this all
with unit tests in unit/raft.c if you decide to keep it here?

> +
>  /** Check if Raft is enabled. */
>  static inline bool
>  raft_is_enabled(const struct raft *raft)
> diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
> new file mode 100644
> index 000000000..ff3104de5
> --- /dev/null
> +++ b/test/replication/gh-5445-leader-inconsistency.result
> @@ -0,0 +1,291 @@
> +-- test-run result file version 2
> +test_run = require("test_run").new()
> + | ---
> + | ...
> +
> +is_leader_cmd = "return box.info.election.state == 'leader'"
> + | ---
> + | ...
> +
> +-- Auxiliary.
> +test_run:cmd('setopt delimiter ";"')
> + | ---
> + | - true
> + | ...
> +function get_leader(nrs)
> +    local leader_nr = 0
> +    test_run:wait_cond(function()
> +        for nr, do_check in pairs(nrs) do
> +            if do_check then
> +                local is_leader = test_run:eval('election_replica'..nr,
> +                                                is_leader_cmd)[1]
> +                if is_leader then
> +                    leader_nr = nr
> +                    return true
> +                end
> +            end
> +        end
> +        return false
> +    end)
> +    assert(leader_nr ~= 0)
> +    return leader_nr
> +end;
> + | ---
> + | ...
> +
> +function name(id)
> +    return 'election_replica'..id
> +end;

10. You can move this function above get_leader() and use it
in there.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual"
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 14:18     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:27 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

I appreciate the work you did here!

Maybe change the commit title subsystem to 'election:'.
The same in the next commit.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:30   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 15:38     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:30 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Thanks for working on this!

See 5 comments below.

> diff --git a/src/box/box.cc b/src/box/box.cc
> index 3729ed997..6c7c8968a 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1525,12 +1526,74 @@ box_clear_synchro_queue(bool try_wait)
>  	if (!is_box_configured ||
>  	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
>  		return 0;
> +
> +	bool run_elections = false;
> +
> +	switch (box_election_mode) {
> +	case ELECTION_MODE_OFF:
> +		break;
> +	case ELECTION_MODE_VOTER:
> +		assert(box_raft()->state == RAFT_STATE_FOLLOWER);
> +		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
> +			 "manual elections");
> +		return -1;
> +	case ELECTION_MODE_MANUAL:
> +		assert(box_raft()->state != RAFT_STATE_CANDIDATE);
> +		if (box_raft()->state == RAFT_STATE_LEADER) {
> +			diag_set(ClientError, ER_ALREADY_LEADER);
> +			return -1;
> +		}
> +		run_elections = true;
> +		try_wait = false;
> +		break;
> +	case ELECTION_MODE_CANDIDATE:
> +		/*
> +		 * Leader elections are enabled, and this instance is allowed to
> +		 * promote only if it's already an elected leader. No manual
> +		 * elections.
> +		 */
> +		if (box_raft()->state != RAFT_STATE_LEADER) {

1. That is strange. Why do you allow to promote the node
if it is already the leader when mode is candidate, but do
not allow the same when the mode is manual?

Shouldn't we throw an error when the mode is candidate
regardless of the node role?

> +			diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
> +				 "'candidate'", "manual elections");
> +			return -1;
> +		}
> +		break;
> +	default:
> +		unreachable();
> +	}
> +
>  	uint32_t former_leader_id = txn_limbo.owner_id;
>  	int64_t wait_lsn = txn_limbo.confirmed_lsn;
>  	int rc = 0;
>  	int quorum = replication_synchro_quorum;
>  	in_clear_synchro_queue = true;
>  
> +	if (run_elections) {
> +		/*
> +		 * Make this instance a candidate and run until some leader, not
> +		 * necessarily this instance, emerges.
> +		 */
> +		raft_cfg_is_candidate(box_raft(), true, false);
> +		/*
> +		 * Trigger new elections without waiting for an old leader to
> +		 * disappear.
> +		 */
> +		raft_new_term(box_raft());
> +		box_raft_wait_leader_found();
> +		raft_cfg_is_candidate(box_raft(), false, false);

2. What if during box_raft_wait_leader_found() I made the node candidate
via box.cfg? Won't you then reset it back to non-candidate here?

It probably should reset the current box.cfg mode back. Not just
remove the candidate flag.

> +		if (!box_raft()->is_enabled) {
> +			diag_set(ClientError, ER_RAFT_DISABLED);
> +			in_clear_synchro_queue = false;
> +			return -1;
> +		}
> +		if (box_raft()->state != RAFT_STATE_LEADER) {
> +			diag_set(ClientError, ER_INTERFERING_PROMOTE,
> +				 box_raft()->leader);
> +			in_clear_synchro_queue = false;
> +			return -1;
> +		}
> +	}
> +
>  	if (txn_limbo_is_empty(&txn_limbo))
>  		goto promote;
>  
> diff --git a/src/box/raft.c b/src/box/raft.c
> index 285dbe4fd..c7dc79f9b 100644
> --- a/src/box/raft.c
> +++ b/src/box/raft.c
> @@ -336,6 +336,28 @@ fail:
>  	panic("Could not write a raft request to WAL\n");
>  }
>  
> +static int
> +box_raft_wait_leader_found_trig(struct trigger *trig, void *event)

3. I thought we usually call triggers with _f suffix, not _trig.

> +{
> +	struct raft *raft = (struct raft *)event;
> +	assert(raft == box_raft());
> +	struct fiber *waiter = (struct fiber *)trig->data;

4. No need to cast this and event - void * cast works naturally in C.

> +	if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
> +		fiber_wakeup(waiter);
> +	return 0;
> +}
> diff --git a/src/box/raft.h b/src/box/raft.h
> index 15f4e80d9..8fce423e1 100644
> --- a/src/box/raft.h
> +++ b/src/box/raft.h
> @@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
>  int
>  box_raft_process(struct raft_request *req, uint32_t source);
>  
> +void
> +box_raft_wait_leader_found();
> +
>  void
>  box_raft_init(void);
>  
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index e9ce8cade..7b77e05ea 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
>  }
>  
>  void
> -raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
> +raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)

5. I know it might lead to some code duplication, but probably
better move that to other functions. For example,

	raft_cfg_is_temporary_candidate()

or something like that. Otherwise it appears surprisingly hard
to follow these 2 flags together. Although I might be wrong and
it would look worse. Did you try?

Or another option:

	raft_cfg_is_candidate(box_raft(), true, false);
	raft_cfg_is_candidate(box_raft(), false, false);

turns into

	raft_start_candidate(box_raft())
	raft_stop_candidate(box_raft())

Also it would be good to have unit tests for the changes in raft.h
and raft.c.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote
  2021-04-14 14:18 ` [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-15 23:31   ` Vladislav Shpilevoy via Tarantool-patches
  2021-04-16 16:13     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-15 23:31 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

Thanks for the patch!

> diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
> index d5320b3ff..78c911245 100644
> --- a/test/replication/election_basic.result
> +++ b/test/replication/election_basic.result
> @@ -108,6 +108,31 @@ assert(box.info.election.leader == box.info.id)
>   | - true
>   | ...
>  
> +-- Manual election mode. A voter most of the time, a leader once
> +-- `box.ctl.promote()` is called.
> +box.cfg{election_mode = 'manual'}
> + | ---
> + | ...
> +
> +assert(box.info.election.state == 'follower')
> + | ---
> + | - true
> + | ...
> +term = box.info.election.term
> + | ---
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +assert(box.info.election.state == 'leader')
> + | ---
> + | - error: assertion failed!
> + | ...
> +assert(box.info.election.term > term)
> + | ---
> + | - error: assertion failed!

Hm. Shouldn't these assertions pass?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags
  2021-04-15 23:18   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16  7:08     ` Serge Petrenko via Tarantool-patches
  2021-04-16  7:11       ` Serge Petrenko via Tarantool-patches
  2021-04-16  8:57       ` Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16  7:08 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:18, Vladislav Shpilevoy пишет:
> Good job on the patch!
>
> See 3 comments below.

Hi! Thanks for the review!

>> diff --git a/src/box/journal.h b/src/box/journal.h
>> index 76c70c19f..3ce9c869e 100644
>> --- a/src/box/journal.h
>> +++ b/src/box/journal.h
>> @@ -63,6 +63,7 @@ struct journal_entry {
>>   	 * A journal entry completion callback argument.
>>   	 */
>>   	void *complete_data;
>> +	uint8_t opt_flags;
> 1. I propose to call them just flags. There is no a third value
> like 'no flag'. They are either set or not, am I right? Also the
> member is missing a comment. The most important thing to say -
> these flags are only for the last row.

Ok, fixed.

>
>>   	/**
>>   	 * Asynchronous write completion function.
>>   	 */
>> @@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
>>   	entry->approx_len	= approx_len;
>>   	entry->n_rows		= n_rows;
>>   	entry->res		= -1;
>> +	entry->opt_flags	= 0;
> 2. You could initialize it with IPROTO_FLAG_COMMIT right here and
> drop (*row)->is_commit = true from wal_assign_lsn. But this one up
> to you. Maybe it is not a good idea.

This would look better, indeed, but neither journal nor wal know
about iproto constants. And I don't think it's a good idea to
introduce such a dependency.

I can add entry->flags |= IPROTO_FLAG_COMMIT to
txn_journal_entry_new().
I actually like how this turned out. It's none of WAL's or journal's
business which row is commit and which isn't.

>
>>   }
>>   
>>   /**
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 34af0bda6..4ec8034a3 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -962,14 +962,14 @@ out:
>>    */
>>   static void
>>   wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>> -	       struct xrow_header **row,
>> -	       struct xrow_header **end)
>> +	       struct journal_entry *entry)
> 3. This part could be a separate commit, otherwise it is hard to
> see the functional changes. Up to you if you want to split.

Good idea, let's do that.

Incremental diff for this commit is below and the extracted commit
regarding wal_assign_lsn() refactoring is in reply to this email.

>>   {
>>   	int64_t tsn = 0;
>> -	struct xrow_header **start = row;
>> -	struct xrow_header **first_glob_row = row;
>> +	struct xrow_header **start = entry->rows;
>> +	struct xrow_header **end = entry->rows + entry->n_rows;
>> +	struct xrow_header **first_glob_row = entry->rows;
>>   	/** Assign LSN to all local rows. */
>> -	for ( ; row < end; row++) {
>> +	for (struct xrow_header **row = start; row < end; row++) {
>>   		if ((*row)->replica_id == 0) {
>>   			/*
>>   			 * All rows representing local space data

========================================================

diff --git a/src/box/journal.h b/src/box/journal.h
index 3ce9c869e..8f3d56a61 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -63,7 +63,8 @@ struct journal_entry {
       * A journal entry completion callback argument.
       */
      void *complete_data;
-    uint8_t opt_flags;
+    /** Flags that should be set for the last entry row. */
+    uint8_t flags;
      /**
       * Asynchronous write completion function.
       */
@@ -98,7 +99,7 @@ journal_entry_create(struct journal_entry *entry, 
size_t n_rows,
      entry->approx_len    = approx_len;
      entry->n_rows        = n_rows;
      entry->res        = -1;
-    entry->opt_flags    = 0;
+    entry->flags        = 0;
  }

  /**
diff --git a/src/box/txn.c b/src/box/txn.c
index e090d58fc..31f664aa0 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -76,7 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, 
struct request *request)
          row->lsn = 0;
          row->sync = 0;
          row->tm = 0;
-        row->opt_flags = 0;
+        row->flags = 0;
      }
      /*
       * Group ID should be set both for requests not having a
@@ -668,13 +668,16 @@ txn_journal_entry_new(struct txn *txn)
          --req->n_rows;
      }

-    static const uint8_t opt_flags_map[] = {
+    static const uint8_t flags_map[] = {
          [TXN_WAIT_SYNC] = IPROTO_FLAG_WAIT_SYNC,
          [TXN_WAIT_ACK] = IPROTO_FLAG_WAIT_ACK,
      };

-    req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_SYNC];
-    req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_ACK];
+    req->flags |= flags_map[txn->flags & TXN_WAIT_SYNC];
+    req->flags |= flags_map[txn->flags & TXN_WAIT_ACK];
+
+    /* is_commit is always set for the last tx row. */
+    req->flags |= IPROTO_FLAG_COMMIT;

      return req;
  }
diff --git a/src/box/wal.c b/src/box/wal.c
index 4ec8034a3..53d896972 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -997,10 +997,8 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
vclock *base,
              }
              (*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
              /* Tx meta is stored in the last tx row. */
-            if (row == end - 1) {
-                (*row)->opt_flags = entry->opt_flags;
-                (*row)->is_commit = true;
-            }
+            if (row == end - 1)
+                (*row)->flags = entry->flags;
          } else {
              int64_t diff = (*row)->lsn - vclock_get(base, 
(*row)->replica_id);
              if (diff <= vclock_get(vclock_diff,
diff --git a/src/box/xrow.c b/src/box/xrow.c
index ba121799b..35e1d1c20 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -183,7 +183,7 @@ error:
              break;
          case IPROTO_FLAGS:
              flags = mp_decode_uint(pos);
-            header->opt_flags = flags;
+            header->flags = flags;
              break;
          default:
              /* unknown header */
@@ -299,7 +299,7 @@ xrow_header_encode(const struct xrow_header *header, 
uint64_t sync,
       *   flag to find transaction boundary (last row in the
       *   transaction stream).
       */
-    uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
+    uint8_t flags_to_encode = header->flags & ~IPROTO_FLAG_COMMIT;
      if (header->tsn != 0) {
          if (header->tsn != header->lsn || !header->is_commit) {
              /*
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 0526e3cd9..5ea99e792 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -82,7 +82,7 @@ struct xrow_header {
      int64_t tsn;
      /** Transaction meta flags set only in the last transaction row. */
      union {
-        uint8_t opt_flags;
+        uint8_t flags;
          struct {
              /**
               * Is only encoded in the write ahead log for
diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index 3d7d8bee1..b6018eed9 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -302,7 +302,7 @@ test_request_str()
   * still we rely on it for convenience sake.
   */
  static void
-test_xrow_opt_field()
+test_xrow_fields()
  {
      plan(6);

@@ -311,24 +311,24 @@ test_xrow_opt_field()
      memset(&header, 0, sizeof(header));

      header.is_commit = true;
-    is(header.opt_flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
+    is(header.flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
      header.is_commit = false;

      header.wait_sync = true;
-    is(header.opt_flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> 
WAIT_SYNC");
+    is(header.flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> 
WAIT_SYNC");
      header.wait_sync = false;

      header.wait_ack = true;
-    is(header.opt_flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> 
WAIT_ACK");
+    is(header.flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> WAIT_ACK");
      header.wait_ack = false;

-    header.opt_flags = IPROTO_FLAG_COMMIT;
+    header.flags = IPROTO_FLAG_COMMIT;
      ok(header.is_commit && !header.wait_sync && !header.wait_ack, 
"COMMIT -> header.is_commit");

-    header.opt_flags = IPROTO_FLAG_WAIT_SYNC;
+    header.flags = IPROTO_FLAG_WAIT_SYNC;
      ok(!header.is_commit && header.wait_sync && !header.wait_ack, 
"WAIT_SYNC -> header.wait_sync");

-    header.opt_flags = IPROTO_FLAG_WAIT_ACK;
+    header.flags = IPROTO_FLAG_WAIT_ACK;
      ok(!header.is_commit && !header.wait_sync && header.wait_ack, 
"WAIT_ACK -> header.wait_ack");

      check_plan();
@@ -347,7 +347,7 @@ main(void)
      test_greeting();
      test_xrow_header_encode_decode();
      test_request_str();
-    test_xrow_opt_field();
+    test_xrow_fields();

      random_free();
      fiber_free();

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags
  2021-04-16  7:08     ` Serge Petrenko via Tarantool-patches
@ 2021-04-16  7:11       ` Serge Petrenko via Tarantool-patches
  2021-04-16  8:57       ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16  7:11 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 10:08, Serge Petrenko via Tarantool-patches пишет:
>
>
> 16.04.2021 02:18, Vladislav Shpilevoy пишет:
>
>>
>>>   }
>>>     /**
>>> diff --git a/src/box/wal.c b/src/box/wal.c
>>> index 34af0bda6..4ec8034a3 100644
>>> --- a/src/box/wal.c
>>> +++ b/src/box/wal.c
>>> @@ -962,14 +962,14 @@ out:
>>>    */
>>>   static void
>>>   wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>>> -           struct xrow_header **row,
>>> -           struct xrow_header **end)
>>> +           struct journal_entry *entry)
>> 3. This part could be a separate commit, otherwise it is hard to
>> see the functional changes. Up to you if you want to split.
>
> Good idea, let's do that.
>
> Incremental diff for this commit is below and the extracted commit
> regarding wal_assign_lsn() refactoring is in reply to this email.
>

================================================================

commit d5c2ea43f89afb2e9507f14d4ad651955cee2803
Author: Serge Petrenko <sergepetrenko@tarantool.org>
Date:   Fri Apr 16 09:58:09 2021 +0300

     wal: make wal_assign_lsn accept journal entry

     Refactor wal_assign_lsn() to accept a journal entry instead of a 
pair of
     pointers to the first and last entry rows.

     Journal entry will carry additional meta information for the last row
     soon, which will be needed in wal_assign_lsn().

     Prerequisite #5445

diff --git a/src/box/wal.c b/src/box/wal.c
index 34af0bda6..95ee8e200 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -962,14 +962,14 @@ out:
   */
  static void
  wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
-           struct xrow_header **row,
-           struct xrow_header **end)
+           struct journal_entry *entry)
  {
      int64_t tsn = 0;
-    struct xrow_header **start = row;
-    struct xrow_header **first_glob_row = row;
+    struct xrow_header **start = entry->rows;
+    struct xrow_header **end = entry->rows + entry->n_rows;
+    struct xrow_header **first_glob_row = entry->rows;
      /** Assign LSN to all local rows. */
-    for ( ; row < end; row++) {
+    for (struct xrow_header **row = start; row < end; row++) {
          if ((*row)->replica_id == 0) {
              /*
               * All rows representing local space data
@@ -1020,7 +1020,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
vclock *base,
       * the first global row. tsn was yet unknown when those
       * rows were processed.
       */
-    for (row = start; row < first_glob_row; row++)
+    for (struct xrow_header **row = start; row < first_glob_row; row++)
          (*row)->tsn = tsn;
  }

@@ -1098,8 +1098,7 @@ wal_write_to_disk(struct cmsg *msg)
      struct journal_entry *entry;
      struct stailq_entry *last_committed = NULL;
      stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
-        wal_assign_lsn(&vclock_diff, &writer->vclock,
-                   entry->rows, entry->rows + entry->n_rows);
+        wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
          entry->res = vclock_sum(&vclock_diff) +
                   vclock_sum(&writer->vclock);
          rc = xlog_write_entry(l, entry);
@@ -1319,8 +1318,7 @@ wal_write_none_async(struct journal *journal,
      struct vclock vclock_diff;

      vclock_create(&vclock_diff);
-    wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
-               entry->rows + entry->n_rows);
+    wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
      vclock_merge(&writer->vclock, &vclock_diff);
      vclock_copy(&replicaset.vclock, &writer->vclock);
      entry->res = vclock_sum(&writer->vclock);

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags
  2021-04-16  7:08     ` Serge Petrenko via Tarantool-patches
  2021-04-16  7:11       ` Serge Petrenko via Tarantool-patches
@ 2021-04-16  8:57       ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16  8:57 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 10:08, Serge Petrenko via Tarantool-patches пишет:
>
>
> 16.04.2021 02:18, Vladislav Shpilevoy пишет:
>> Good job on the patch!
>>
>> See 3 comments below.
>
> Hi! Thanks for the review!
>
>>> diff --git a/src/box/journal.h b/src/box/journal.h
>>> index 76c70c19f..3ce9c869e 100644
>>> --- a/src/box/journal.h
>>> +++ b/src/box/journal.h
>>> @@ -63,6 +63,7 @@ struct journal_entry {
>>>        * A journal entry completion callback argument.
>>>        */
>>>       void *complete_data;
>>> +    uint8_t opt_flags;
>> 1. I propose to call them just flags. There is no a third value
>> like 'no flag'. They are either set or not, am I right? Also the
>> member is missing a comment. The most important thing to say -
>> these flags are only for the last row.
>
> Ok, fixed.
>
>>
>>>       /**
>>>        * Asynchronous write completion function.
>>>        */
>>> @@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, 
>>> size_t n_rows,
>>>       entry->approx_len    = approx_len;
>>>       entry->n_rows        = n_rows;
>>>       entry->res        = -1;
>>> +    entry->opt_flags    = 0;
>> 2. You could initialize it with IPROTO_FLAG_COMMIT right here and
>> drop (*row)->is_commit = true from wal_assign_lsn. But this one up
>> to you. Maybe it is not a good idea.
>
> This would look better, indeed, but neither journal nor wal know
> about iproto constants. And I don't think it's a good idea to
> introduce such a dependency.
>
> I can add entry->flags |= IPROTO_FLAG_COMMIT to
> txn_journal_entry_new().
> I actually like how this turned out. It's none of WAL's or journal's
> business which row is commit and which isn't.

Forget about that. Let's leave (row)->is_commit = true in wal_assign_lsn().

Otherwise I need to set entry->flags for synchro entries and raft 
entries, which
aren't transactions, actually, so their journal entries do not go through
txn_journal_entry_new().

========================================

diff --git a/src/box/txn.c b/src/box/txn.c
index 31f664aa0..a71ccadd0 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -676,9 +676,6 @@ txn_journal_entry_new(struct txn *txn)
         req->flags |= flags_map[txn->flags & TXN_WAIT_SYNC];
         req->flags |= flags_map[txn->flags & TXN_WAIT_ACK];

-       /* is_commit is always set for the last tx row. */
-       req->flags |= IPROTO_FLAG_COMMIT;
-
         return req;
  }

diff --git a/src/box/wal.c b/src/box/wal.c
index 53d896972..5b6200b81 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -997,8 +997,10 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
vclock *base,
                         }
                         (*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
                         /* Tx meta is stored in the last tx row. */
-                       if (row == end - 1)
+                       if (row == end - 1) {
                                 (*row)->flags = entry->flags;
+                               (*row)->is_commit = true;
+                       }
                 } else {
                         int64_t diff = (*row)->lsn - vclock_get(base, 
(*row)->replica_id);
                         if (diff <= vclock_get(vclock_diff,
>
>>
>>>   }
>>>     /**
>>> diff --git a/src/box/wal.c b/src/box/wal.c
>>> index 34af0bda6..4ec8034a3 100644
>>> --- a/src/box/wal.c
>>> +++ b/src/box/wal.c
>>> @@ -962,14 +962,14 @@ out:
>>>    */
>>>   static void
>>>   wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>>> -           struct xrow_header **row,
>>> -           struct xrow_header **end)
>>> +           struct journal_entry *entry)
>> 3. This part could be a separate commit, otherwise it is hard to
>> see the functional changes. Up to you if you want to split.
>
> Good idea, let's do that.
>
> Incremental diff for this commit is below and the extracted commit
> regarding wal_assign_lsn() refactoring is in reply to this email.
>
>>>   {
>>>       int64_t tsn = 0;
>>> -    struct xrow_header **start = row;
>>> -    struct xrow_header **first_glob_row = row;
>>> +    struct xrow_header **start = entry->rows;
>>> +    struct xrow_header **end = entry->rows + entry->n_rows;
>>> +    struct xrow_header **first_glob_row = entry->rows;
>>>       /** Assign LSN to all local rows. */
>>> -    for ( ; row < end; row++) {
>>> +    for (struct xrow_header **row = start; row < end; row++) {
>>>           if ((*row)->replica_id == 0) {
>>>               /*
>>>                * All rows representing local space data
>
> ========================================================
>
> diff --git a/src/box/journal.h b/src/box/journal.h
> index 3ce9c869e..8f3d56a61 100644
> --- a/src/box/journal.h
> +++ b/src/box/journal.h
> @@ -63,7 +63,8 @@ struct journal_entry {
>       * A journal entry completion callback argument.
>       */
>      void *complete_data;
> -    uint8_t opt_flags;
> +    /** Flags that should be set for the last entry row. */
> +    uint8_t flags;
>      /**
>       * Asynchronous write completion function.
>       */
> @@ -98,7 +99,7 @@ journal_entry_create(struct journal_entry *entry, 
> size_t n_rows,
>      entry->approx_len    = approx_len;
>      entry->n_rows        = n_rows;
>      entry->res        = -1;
> -    entry->opt_flags    = 0;
> +    entry->flags        = 0;
>  }
>
>  /**
> diff --git a/src/box/txn.c b/src/box/txn.c
> index e090d58fc..31f664aa0 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -76,7 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, 
> struct request *request)
>          row->lsn = 0;
>          row->sync = 0;
>          row->tm = 0;
> -        row->opt_flags = 0;
> +        row->flags = 0;
>      }
>      /*
>       * Group ID should be set both for requests not having a
> @@ -668,13 +668,16 @@ txn_journal_entry_new(struct txn *txn)
>          --req->n_rows;
>      }
>
> -    static const uint8_t opt_flags_map[] = {
> +    static const uint8_t flags_map[] = {
>          [TXN_WAIT_SYNC] = IPROTO_FLAG_WAIT_SYNC,
>          [TXN_WAIT_ACK] = IPROTO_FLAG_WAIT_ACK,
>      };
>
> -    req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_SYNC];
> -    req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_ACK];
> +    req->flags |= flags_map[txn->flags & TXN_WAIT_SYNC];
> +    req->flags |= flags_map[txn->flags & TXN_WAIT_ACK];
> +
> +    /* is_commit is always set for the last tx row. */
> +    req->flags |= IPROTO_FLAG_COMMIT;
>
>      return req;
>  }
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 4ec8034a3..53d896972 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -997,10 +997,8 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
> vclock *base,
>              }
>              (*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
>              /* Tx meta is stored in the last tx row. */
> -            if (row == end - 1) {
> -                (*row)->opt_flags = entry->opt_flags;
> -                (*row)->is_commit = true;
> -            }
> +            if (row == end - 1)
> +                (*row)->flags = entry->flags;
>          } else {
>              int64_t diff = (*row)->lsn - vclock_get(base, 
> (*row)->replica_id);
>              if (diff <= vclock_get(vclock_diff,
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index ba121799b..35e1d1c20 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -183,7 +183,7 @@ error:
>              break;
>          case IPROTO_FLAGS:
>              flags = mp_decode_uint(pos);
> -            header->opt_flags = flags;
> +            header->flags = flags;
>              break;
>          default:
>              /* unknown header */
> @@ -299,7 +299,7 @@ xrow_header_encode(const struct xrow_header 
> *header, uint64_t sync,
>       *   flag to find transaction boundary (last row in the
>       *   transaction stream).
>       */
> -    uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
> +    uint8_t flags_to_encode = header->flags & ~IPROTO_FLAG_COMMIT;
>      if (header->tsn != 0) {
>          if (header->tsn != header->lsn || !header->is_commit) {
>              /*
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index 0526e3cd9..5ea99e792 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -82,7 +82,7 @@ struct xrow_header {
>      int64_t tsn;
>      /** Transaction meta flags set only in the last transaction row. */
>      union {
> -        uint8_t opt_flags;
> +        uint8_t flags;
>          struct {
>              /**
>               * Is only encoded in the write ahead log for
> diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
> index 3d7d8bee1..b6018eed9 100644
> --- a/test/unit/xrow.cc
> +++ b/test/unit/xrow.cc
> @@ -302,7 +302,7 @@ test_request_str()
>   * still we rely on it for convenience sake.
>   */
>  static void
> -test_xrow_opt_field()
> +test_xrow_fields()
>  {
>      plan(6);
>
> @@ -311,24 +311,24 @@ test_xrow_opt_field()
>      memset(&header, 0, sizeof(header));
>
>      header.is_commit = true;
> -    is(header.opt_flags, IPROTO_FLAG_COMMIT, "header.is_commit -> 
> COMMIT");
> +    is(header.flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
>      header.is_commit = false;
>
>      header.wait_sync = true;
> -    is(header.opt_flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> 
> WAIT_SYNC");
> +    is(header.flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> 
> WAIT_SYNC");
>      header.wait_sync = false;
>
>      header.wait_ack = true;
> -    is(header.opt_flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> 
> WAIT_ACK");
> +    is(header.flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> 
> WAIT_ACK");
>      header.wait_ack = false;
>
> -    header.opt_flags = IPROTO_FLAG_COMMIT;
> +    header.flags = IPROTO_FLAG_COMMIT;
>      ok(header.is_commit && !header.wait_sync && !header.wait_ack, 
> "COMMIT -> header.is_commit");
>
> -    header.opt_flags = IPROTO_FLAG_WAIT_SYNC;
> +    header.flags = IPROTO_FLAG_WAIT_SYNC;
>      ok(!header.is_commit && header.wait_sync && !header.wait_ack, 
> "WAIT_SYNC -> header.wait_sync");
>
> -    header.opt_flags = IPROTO_FLAG_WAIT_ACK;
> +    header.flags = IPROTO_FLAG_WAIT_ACK;
>      ok(!header.is_commit && !header.wait_sync && header.wait_ack, 
> "WAIT_ACK -> header.wait_ack");
>
>      check_plan();
> @@ -347,7 +347,7 @@ main(void)
>      test_greeting();
>      test_xrow_header_encode_decode();
>      test_request_str();
> -    test_xrow_opt_field();
> +    test_xrow_fields();
>
>      random_free();
>      fiber_free();
>

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-15 23:20   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16  9:28     ` Serge Petrenko via Tarantool-patches
  2021-04-16 14:03       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16  9:28 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:20, Vladislav Shpilevoy пишет:
> Thanks for working on this!
>
> See 2 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 70b325180..9adb6ba46 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
>>   				 "new synchronous transactions appeared");
>>   			rc = -1;
>>   		} else {
>> -			txn_limbo_force_empty(&txn_limbo, wait_lsn);
>> +			/*
>> +			 * Term parameter is unused now, We'll pass
>> +			 * box_raft()->term there later.
>> +			 */
>> +			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
>> +			struct synchro_request req = {
>> +				.type = 0, /* unused */
>> +				.replica_id = 0, /* unused */
>> +				.origin_id = instance_id,
>> +				.lsn = wait_lsn,
>> +				.term = 0, /* unused */
> 1. Aren't the unused fields nullified anyway according to
> the standard?

We had this conversation with Cyrill recently. I don't have a good 
explanation
for this anyway, but here's the one I have:

>> Is there some particular meaning of zeroifying designated assignments?
>> I mean why not simply
>>
>> 			struct synchro_request req = {
>> 				.origin_id	= instance_id,
>> 				.lsn		= wait_lsn,
>> 			};
>>
>> or you wanted to pay attention that the left of the fields are
>> unused? Just curious, I'm fine with current code.
> I went for your option at first, and it's the one I'd prefer.
> But with it I got failed builds in some CI jobs.
>
> It said something like "sorry, not yet implemented: struct partial
> initialization"
>


>
>> +			};
>> +			txn_limbo_read_promote(&txn_limbo, &req);
>>   			assert(txn_limbo_is_empty(&txn_limbo));
>>   		}
>>   	}
>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>> index d29722ef7..bfe0ad302 100644
>> --- a/src/box/txn_limbo.c
>> +++ b/src/box/txn_limbo.c
>> @@ -464,6 +470,32 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
>>   		box_update_ro_summary();
>>   }
>>   
>> +void
>> +txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
>> +{
>> +	limbo->confirmed_lsn = lsn;
>> +	/*
>> +	 * We make sure that promote is only written once everything this
>> +	 * instance has may be confirmed.
>> +	 */
>> +	struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
>> +	assert(e == NULL || e->lsn <= lsn);
>> +	(void) e;
>> +	txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
>> +	limbo->is_in_rollback = false;
> 2. How is it possible that there was a rollback in progress at
> the same time?

Sorry, I was trying to replicate txn_limbo_write_confirm/rollback behaviour,
but forgot to set limbo->is_in_rollback = true before the journal write:

=======================================

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index e6f644bc0..93c8994b7 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -474,6 +474,7 @@ void
  txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t 
term)
  {
         limbo->confirmed_lsn = lsn;
+       limbo->is_in_rollback = true;
         /*
          * We make sure that promote is only written once everything this
          * instance has may be confirmed.



-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo
  2021-04-15 23:21   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16  9:33     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16  9:33 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:21, Vladislav Shpilevoy пишет:
> Thanks for the patch!
>
>> @@ -1540,11 +1542,10 @@ box_clear_synchro_queue(bool try_wait)
>>   	 * in the limbo must've come through the applier meaning they already
>>   	 * have an lsn assigned, even if their WAL write hasn't finished yet.
>>   	 */
>> -	int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>> +	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
>>   	assert(wait_lsn > 0);
>>   
>> -	int quorum = replication_synchro_quorum;
>> -	int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
>> +	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
>>   				 replication_synchro_timeout);
> The second line became misaligned.

Thanks for noticing!
Fixed.

==========================

diff --git a/src/box/box.cc b/src/box/box.cc
index 123c611f5..be78d7096 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1546,7 +1546,7 @@ box_clear_synchro_queue(bool try_wait)
         assert(wait_lsn > 0);

         rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
-                                replication_synchro_timeout);
+                            replication_synchro_timeout);
         if (rc == 0) {
                 if (quorum < replication_synchro_quorum) {
                         diag_set(ClientError, ER_QUORUM_WAIT, quorum,

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-16  9:28     ` Serge Petrenko via Tarantool-patches
@ 2021-04-16 14:03       ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 14:03 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 12:28, Serge Petrenko via Tarantool-patches пишет:
>
>
> 16.04.2021 02:20, Vladislav Shpilevoy пишет:
>> Thanks for working on this!
>>
>> See 2 comments below.
>>
>>> diff --git a/src/box/box.cc b/src/box/box.cc
>>> index 70b325180..9adb6ba46 100644
>>> --- a/src/box/box.cc
>>> +++ b/src/box/box.cc
>>> @@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
>>>                    "new synchronous transactions appeared");
>>>               rc = -1;
>>>           } else {
>>> -            txn_limbo_force_empty(&txn_limbo, wait_lsn);
>>> +            /*
>>> +             * Term parameter is unused now, We'll pass
>>> +             * box_raft()->term there later.
>>> +             */
>>> +            txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
>>> +            struct synchro_request req = {
>>> +                .type = 0, /* unused */
>>> +                .replica_id = 0, /* unused */
>>> +                .origin_id = instance_id,
>>> +                .lsn = wait_lsn,
>>> +                .term = 0, /* unused */
>> 1. Aren't the unused fields nullified anyway according to
>> the standard?
>
> We had this conversation with Cyrill recently. I don't have a good 
> explanation
> for this anyway, but here's the one I have:
>
>>> Is there some particular meaning of zeroifying designated assignments?
>>> I mean why not simply
>>>
>>>             struct synchro_request req = {
>>>                 .origin_id    = instance_id,
>>>                 .lsn        = wait_lsn,
>>>             };
>>>
>>> or you wanted to pay attention that the left of the fields are
>>> unused? Just curious, I'm fine with current code.
>> I went for your option at first, and it's the one I'd prefer.
>> But with it I got failed builds in some CI jobs.
>>
>> It said something like "sorry, not yet implemented: struct partial
>> initialization"
>>
>
>
>>
>>> +            };
>>> +            txn_limbo_read_promote(&txn_limbo, &req);
>>>               assert(txn_limbo_is_empty(&txn_limbo));
>>>           }
>>>       }
>>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>>> index d29722ef7..bfe0ad302 100644
>>> --- a/src/box/txn_limbo.c
>>> +++ b/src/box/txn_limbo.c
>>> @@ -464,6 +470,32 @@ txn_limbo_read_rollback(struct txn_limbo 
>>> *limbo, int64_t lsn)
>>>           box_update_ro_summary();
>>>   }
>>>   +void
>>> +txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, 
>>> uint64_t term)
>>> +{
>>> +    limbo->confirmed_lsn = lsn;
>>> +    /*
>>> +     * We make sure that promote is only written once everything this
>>> +     * instance has may be confirmed.
>>> +     */
>>> +    struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
>>> +    assert(e == NULL || e->lsn <= lsn);
>>> +    (void) e;
>>> +    txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
>>> +    limbo->is_in_rollback = false;
>> 2. How is it possible that there was a rollback in progress at
>> the same time?
>
> Sorry, I was trying to replicate txn_limbo_write_confirm/rollback 
> behaviour,
> but forgot to set limbo->is_in_rollback = true before the journal write:
>
> =======================================
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index e6f644bc0..93c8994b7 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -474,6 +474,7 @@ void
>  txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, 
> uint64_t term)
>  {
>         limbo->confirmed_lsn = lsn;
> +       limbo->is_in_rollback = true;
>         /*
>          * We make sure that promote is only written once everything this
>          * instance has may be confirmed.
>
>
>
One more diff on top:

========================================

diff --git a/src/box/box.cc b/src/box/box.cc
index 73a9da56a..e0bed74f1 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1562,13 +1562,13 @@ box_clear_synchro_queue(bool try_wait)
                          */
                         txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
                         struct synchro_request req = {
-                               .type = 0, /* unused */
-                               .replica_id = 0, /* unused */
+                               .type = IPROTO_PROMOTE,
+                               .replica_id = former_leader_id,
                                 .origin_id = instance_id,
                                 .lsn = wait_lsn,
                                 .term = 0, /* unused */
                         };
-                       txn_limbo_read_promote(&txn_limbo, &req);
+                       txn_limbo_process(&txn_limbo, &req);
assert(txn_limbo_is_empty(&txn_limbo));
                 }
         }
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 93c8994b7..6657990e8 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -486,7 +486,11 @@ txn_limbo_write_promote(struct txn_limbo *limbo, 
int64_t lsn, uint64_t term)
         limbo->is_in_rollback = false;
  }

-void
+/**
+ * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and 
rollback all
+ * entries > @req.lsn.
+ */
+static void
  txn_limbo_read_promote(struct txn_limbo *limbo,
                        const struct synchro_request *req)
  {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 4a1c43856..f35771dc9 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -278,14 +278,6 @@ txn_limbo_wait_confirm(struct txn_limbo *limbo);
  void
  txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t 
term);

-/**
- * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and 
rollback all
- * entries > @req.lsn.
- */
-void
-txn_limbo_read_promote(struct txn_limbo *limbo,
-                      const struct synchro_request *req);
-
  /**
   * Update qsync parameters dynamically.
   */


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 14:16     ` Serge Petrenko via Tarantool-patches
  2021-04-16 22:13       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 14:16 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:27, Vladislav Shpilevoy пишет:
> Good job on the patch!

Thanks for the review!

>
> Please, try to reduce length of the lines in the commit
> message, or at least its title. It is suuuper long now.
Ok, fixed the title. Commit message looks ok, it's less than 72 chars:


     raft: filter rows based on known peer terms

     Start writing the actual leader term together with the PROMOTE request
     and process terms in PROMOTE requests on receiver side.

     Make applier only apply synchronous transactions from the instance 
which
     has the greatest term as received in PROMOTE requests.

     Closes #5445


>
> See 10 comments below.
>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 4898f9f7b..3fb864686 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -790,6 +790,12 @@ apply_synchro_row_cb(struct journal_entry *entry)
>>   		applier_rollback_by_wal_io();
>>   	} else {
>>   		txn_limbo_process(&txn_limbo, synchro_entry->req);
>> +		if (iproto_type_is_promote_request(synchro_entry->req->type)) {
>> +			raft_source_update_term(box_raft(),
>> +						synchro_entry->req->origin_id,
>> +						synchro_entry->req->term);
> 1. How about moving that to txn_limbo_read_promote()? What do you think? I see
> you do it in 3 places where txn_limbo_process() or txn_limbo_read_promote()
> are called on PROMOTE rows.

I didn't want to do that, because raft and limbo are separate entities.
But this would simplify the code quite a bit, so I'm ok with it.

Other option would be to introduce some new function to `box/raft.c`:
it would call txn_limbo_read_promote() and raft_update_term().
But then again we would have a separate handler for PROMOTE requests,
while CONFIRM and ROLLBACK could both be handled with txn_limbo_process.

Long story short, I'll add raft_update_term() to txn_limbo_process()
and remove txn_limbo_read_promote() from txn_limbo.h.

>
>> +
>> +		}
>>   		trigger_run(&replicaset.applier.on_wal_write, NULL);
>>   	}
>>   	/* The fiber is the same on final join. */
>> @@ -1027,6 +1033,28 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
>>   		}
>>   	}
>>   
>> +	/*
>> +	 * When elections are enabled we must filter out synchronous rows coming
>> +	 * from an instance that fell behind the current leader. This includes
>> +	 * both synchronous tx rows and rows for txs following unconfirmed
>> +	 * synchronous transactions.
>> +	 * The rows are replaced with NOPs to preserve the vclock consistency.
>> +	 */
>> +	struct applier_tx_row *item;
>> +	if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&
> 2. The names are too long IMO. I would propose
>
> 	raft_is_node_outdated(raft, id)   // Check if behind
> 	raft_process_term(raft, id, term) // Set term for a node or skip if
> 						the same or older
> 	raft_node_term(raft, id)         // Get term
>
> 'source' is not really a perfect name, because raft nodes send
> messages to each other. There are no one-directional channels
> AFAIR like we have with upstream and downstream in the replication.
>
> I used 'source' in raft_process_heartbeat() as like a source of the
> heartbeat message. Note like the nodes are called sources everywhere.
>
> Also I used 'process' for the new term, because we already have
> raft_process_heartbeat() to handle info from a node with a given
> ID, and I thought it makes sense to keep them similar.

Ok, I'm fine with new names.

>
>
> 3. Why does raft_is_source_allowed() still exist when we have this wonder?

Shouldn't we still ignore asynchronous rows from non-leaders?

"The wonder" only saves us from synchro transactions.

>
>> +	    (last_row->wait_sync ||
>> +	     (iproto_type_is_synchro_request(first_row->type) &&
>> +	     !iproto_type_is_promote_request(first_row->type)))) {
>> +		stailq_foreach_entry(item, rows, next) {
>> +			struct xrow_header *row = &item->row;
>> +			row->type = IPROTO_NOP;
>> +			/*
>> +			 * Row body is saved to fiber's region and will be freed
>> +			 * on next fiber_gc() call.
>> +			 */
>> +			row->bodycnt = 0;
>> +		}
>> +	}
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 722fc23b7..f44dd0e54 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
>>   		return -1;
>>   	}
>>   	txn_limbo_process(&txn_limbo, &syn_req);
>> +	if (syn_req.type == IPROTO_PROMOTE) {
>> +			raft_source_update_term(box_raft(), syn_req.origin_id,
>> +						syn_req.term);
> 4. Misaligned. Also see the first comment.

Thanks. Removed that altogether (moved to txn_limbo_read_promote())

>
>> @@ -1558,20 +1567,21 @@ box_clear_synchro_queue(bool try_wait)
>>   			rc = -1;
>>   		} else {
>>   promote:
>> -			/*
>> -			 * Term parameter is unused now, We'll pass
>> -			 * box_raft()->term there later.
>> -			 */
>> -			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
>> +			/* We cannot possibly get here in a volatile state. */
>> +			assert(box_raft()->volatile_term == box_raft()->term);
>> +			txn_limbo_write_promote(&txn_limbo, wait_lsn,
>> +						box_raft()->term);
>>   			struct synchro_request req = {
>>   				.type = 0, /* unused */
>>   				.replica_id = 0, /* unused */
>>   				.origin_id = instance_id,
>>   				.lsn = wait_lsn,
>> -				.term = 0, /* unused */
>> +				.term = box_raft()->term,
>>   			};
>>   			txn_limbo_read_promote(&txn_limbo, &req);
>>   			assert(txn_limbo_is_empty(&txn_limbo));
>> +			raft_source_update_term(box_raft(), req.origin_id,
>> +						req.term);
> 5. See the first comment.

Removed.

>
>>   		}
>>   	}
>> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
>> index e447f6634..01f548fee 100644
>> --- a/src/lib/raft/raft.h
>> +++ b/src/lib/raft/raft.h
>> @@ -207,6 +207,19 @@ struct raft {
>>   	 * subsystems, such as Raft.
>>   	 */
>>   	const struct vclock *vclock;
>> +	/**
>> +	 * The biggest term seen by this instance and persisted in WAL as part
>> +	 * of a PROMOTE request. May be smaller than @a term, while there are
>> +	 * ongoing elections, or the leader is already known, but this instance
>> +	 * hasn't read its PROMOTE request yet.
>> +	 * During other times must be equal to @a term.
>> +	 */
>> +	uint64_t greatest_known_term;
> 6. Maybe omit 'known'. There can't be 'greatest_unknown_term'.

Ok.

>
>> +	/**
>> +	 * Latest terms received with PROMOTE entries from remote instances.
>> +	 * Raft uses them to determine data from which sources may be applied.
>> +	 */
>> +	struct vclock term_map;
> 7. I have a feeling it is similar to the limbo's LSN map. Like
> they should be merged into something one. Can't formulate that
> properly. I hope we will see it more clear when will move all that
> to the WAL thread someday.

Yes, they're quite similar.

Do you mean we should create some new structure instead of using vclock for
these entities? Like something which would incorporate remote state:
a map of ids with their known terms and confirmed lsns?

>
>>   	/** State machine timed event trigger. */
>>   	struct ev_timer timer;
>>   	/** Configured election timeout in seconds. */
>> @@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
>>   	return !raft->is_enabled || raft->leader == source_id;
>>   }
>>   
>> +/**
>> + * Return the latest term as seen in PROMOTE requests from instance with id
>> + * @a source_id.
>> + */
>> +static inline uint64_t
>> +raft_source_term(const struct raft *raft, uint32_t source_id)
>> +{
>> +	assert(source_id != 0 && source_id < VCLOCK_MAX);
>> +	return vclock_get(&raft->term_map, source_id);
>> +}
>> +
>> +/**
>> + * Check whether replica with id @a source_id is too old to apply synchronous
>> + * data from it. The check is only valid when elections are enabled.
>> + */
>> +static inline bool
>> +raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
>> +{
>> +	uint64_t source_term = vclock_get(&raft->term_map, source_id);
>> +	return raft->is_enabled && source_term < raft->greatest_known_term;
>> +}
>> +
>> +/** Remember the last term seen for replica  with id @a source_id. */
>> +static inline void
>> +raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term)
>> +{
>> +	if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)
> 8. Probably having the term as uint64_t was a mistake from the beginning.
> Feel free to change it to int64_t if you want, in a separate commit.

I replaced this particular line with
`
raft_node_term() >= term
`
So it's not that ugly now.

Speaking of uint64_t -> int64_t, I don't think it's worth it.

>
>> +		return;
>> +	vclock_follow(&raft->term_map, source_id, term);
>> +	if (term > raft->greatest_known_term)
>> +		raft->greatest_known_term = term;
>> +}
> 9. I see these are not used in the raft code at all. Did you think about
> moving it all to box/raft.h and box/raft.c? Or about covering this all
> with unit tests in unit/raft.c if you decide to keep it here?

AFAIU box/raft.h is about interconnecting box and raft functionality.

These functions aren't used in lib/raft indeed, but they belong here,
  I think. Just like `raft_is_source_allowed()`.

I'll come up with some unit tests.

>
>> +
>>   /** Check if Raft is enabled. */
>>   static inline bool
>>   raft_is_enabled(const struct raft *raft)
>> diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
>> new file mode 100644
>> index 000000000..ff3104de5
>> --- /dev/null
>> +++ b/test/replication/gh-5445-leader-inconsistency.result
>> @@ -0,0 +1,291 @@
>> +-- test-run result file version 2
>> +test_run = require("test_run").new()
>> + | ---
>> + | ...
>> +
>> +is_leader_cmd = "return box.info.election.state == 'leader'"
>> + | ---
>> + | ...
>> +
>> +-- Auxiliary.
>> +test_run:cmd('setopt delimiter ";"')
>> + | ---
>> + | - true
>> + | ...
>> +function get_leader(nrs)
>> +    local leader_nr = 0
>> +    test_run:wait_cond(function()
>> +        for nr, do_check in pairs(nrs) do
>> +            if do_check then
>> +                local is_leader = test_run:eval('election_replica'..nr,
>> +                                                is_leader_cmd)[1]
>> +                if is_leader then
>> +                    leader_nr = nr
>> +                    return true
>> +                end
>> +            end
>> +        end
>> +        return false
>> +    end)
>> +    assert(leader_nr ~= 0)
>> +    return leader_nr
>> +end;
>> + | ---
>> + | ...
>> +
>> +function name(id)
>> +    return 'election_replica'..id
>> +end;
> 10. You can move this function above get_leader() and use it
> in there.

Yes, indeed, thanks!

Incremental diff:

===================================================

diff --git a/src/box/applier.cc b/src/box/applier.cc
index c3ee620a2..61d53fdec 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -790,12 +790,6 @@ apply_synchro_row_cb(struct journal_entry *entry)
          applier_rollback_by_wal_io();
      } else {
          txn_limbo_process(&txn_limbo, synchro_entry->req);
-        if (iproto_type_is_promote_request(synchro_entry->req->type)) {
-            raft_source_update_term(box_raft(),
-                        synchro_entry->req->origin_id,
-                        synchro_entry->req->term);
-
-        }
          trigger_run(&replicaset.applier.on_wal_write, NULL);
      }
      /* The fiber is the same on final join. */
@@ -1041,7 +1035,7 @@ applier_apply_tx(struct applier *applier, struct 
stailq *rows)
       * The rows are replaced with NOPs to preserve the vclock consistency.
       */
      struct applier_tx_row *item;
-    if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&
+    if (raft_is_node_outdated(box_raft(), applier->instance_id) &&
          (last_row->wait_sync ||
           (iproto_type_is_synchro_request(first_row->type) &&
           !iproto_type_is_promote_request(first_row->type)))) {
diff --git a/src/box/box.cc b/src/box/box.cc
index 9c7e92a0e..907bcca31 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -426,10 +426,6 @@ wal_stream_apply_synchro_row(struct wal_stream 
*stream, struct xrow_header *row)
          return -1;
      }
      txn_limbo_process(&txn_limbo, &syn_req);
-    if (syn_req.type == IPROTO_PROMOTE) {
-            raft_source_update_term(box_raft(), syn_req.origin_id,
-                        syn_req.term);
-    }
      return 0;
  }

@@ -1512,7 +1508,7 @@ box_clear_synchro_queue(bool try_wait)
       * written for this term.
       */
      if (!is_box_configured ||
-        raft_source_term(box_raft(), instance_id) == box_raft()->term)
+        raft_node_term(box_raft(), instance_id) == box_raft()->term)
          return 0;
      uint32_t former_leader_id = txn_limbo.owner_id;
      int64_t wait_lsn = txn_limbo.confirmed_lsn;
@@ -1580,8 +1576,6 @@ promote:
              };
              txn_limbo_read_promote(&txn_limbo, &req);
              assert(txn_limbo_is_empty(&txn_limbo));
-            raft_source_update_term(box_raft(), req.origin_id,
-                        req.term);
          }
      }
      in_clear_synchro_queue = false;
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 93c8994b7..d72a0573f 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -34,6 +34,7 @@
  #include "iproto_constants.h"
  #include "journal.h"
  #include "box.h"
+#include "raft.h"

  struct txn_limbo txn_limbo;

@@ -495,6 +496,7 @@ txn_limbo_read_promote(struct txn_limbo *limbo,
      assert(txn_limbo_is_empty(&txn_limbo));
      limbo->owner_id = req->origin_id;
      limbo->confirmed_lsn = 0;
+    raft_process_term(box_raft(), req->origin_id, req->term);
  }

  void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 4a1c43856..b90c50b33 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -280,7 +280,7 @@ txn_limbo_write_promote(struct txn_limbo *limbo, 
int64_t lsn, uint64_t term);

  /**
   * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and 
rollback all
- * entries > @req.lsn.
+ * entries > @req.lsn, and update known raft term for @req.origin_id.
   */
  void
  txn_limbo_read_promote(struct txn_limbo *limbo,
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 01f548fee..75512e38a 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -214,7 +214,7 @@ struct raft {
       * hasn't read its PROMOTE request yet.
       * During other times must be equal to @a term.
       */
-    uint64_t greatest_known_term;
+    uint64_t greatest_term;
      /**
       * Latest terms received with PROMOTE entries from remote instances.
       * Raft uses them to determine data from which sources may be applied.
@@ -261,7 +261,7 @@ raft_is_source_allowed(const struct raft *raft, 
uint32_t source_id)
   * @a source_id.
   */
  static inline uint64_t
-raft_source_term(const struct raft *raft, uint32_t source_id)
+raft_node_term(const struct raft *raft, uint32_t source_id)
  {
      assert(source_id != 0 && source_id < VCLOCK_MAX);
      return vclock_get(&raft->term_map, source_id);
@@ -272,21 +272,21 @@ raft_source_term(const struct raft *raft, uint32_t 
source_id)
   * data from it. The check is only valid when elections are enabled.
   */
  static inline bool
-raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
+raft_is_node_outdated(const struct raft *raft, uint32_t source_id)
  {
-    uint64_t source_term = vclock_get(&raft->term_map, source_id);
-    return raft->is_enabled && source_term < raft->greatest_known_term;
+    uint64_t source_term = raft_node_term(raft, source_id);
+    return raft->is_enabled && source_term < raft->greatest_term;
  }

  /** Remember the last term seen for replica  with id @a source_id. */
  static inline void
-raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t 
term)
+raft_process_term(struct raft *raft, uint32_t source_id, uint64_t term)
  {
-    if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)
+    if (raft_node_term(raft, source_id) >= term)
          return;
      vclock_follow(&raft->term_map, source_id, term);
-    if (term > raft->greatest_known_term)
-        raft->greatest_known_term = term;
+    if (term > raft->greatest_term)
+        raft->greatest_term = term;
  }

  /** Check if Raft is enabled. */
diff --git a/test/replication/gh-5445-leader-inconsistency.result 
b/test/replication/gh-5445-leader-inconsistency.result
index ff3104de5..5c6169f50 100644
--- a/test/replication/gh-5445-leader-inconsistency.result
+++ b/test/replication/gh-5445-leader-inconsistency.result
@@ -12,12 +12,18 @@ test_run:cmd('setopt delimiter ";"')
   | ---
   | - true
   | ...
+function name(id)
+    return 'election_replica'..id
+end;
+ | ---
+ | ...
+
  function get_leader(nrs)
      local leader_nr = 0
      test_run:wait_cond(function()
          for nr, do_check in pairs(nrs) do
              if do_check then
-                local is_leader = test_run:eval('election_replica'..nr,
+                local is_leader = test_run:eval(name(nr),
                                                  is_leader_cmd)[1]
                  if is_leader then
                      leader_nr = nr
@@ -33,11 +39,6 @@ end;
   | ---
   | ...

-function name(id)
-    return 'election_replica'..id
-end;
- | ---
- | ...
  test_run:cmd('setopt delimiter ""');
   | ---
   | - true
diff --git a/test/replication/gh-5445-leader-inconsistency.test.lua 
b/test/replication/gh-5445-leader-inconsistency.test.lua
index bf8b31886..e7952f5fa 100644
--- a/test/replication/gh-5445-leader-inconsistency.test.lua
+++ b/test/replication/gh-5445-leader-inconsistency.test.lua
@@ -4,12 +4,16 @@ is_leader_cmd = "return box.info.election.state == 
'leader'"

  -- Auxiliary.
  test_run:cmd('setopt delimiter ";"')
+function name(id)
+    return 'election_replica'..id
+end;
+
  function get_leader(nrs)
      local leader_nr = 0
      test_run:wait_cond(function()
          for nr, do_check in pairs(nrs) do
              if do_check then
-                local is_leader = test_run:eval('election_replica'..nr,
+                local is_leader = test_run:eval(name(nr),
                                                  is_leader_cmd)[1]
                  if is_leader then
                      leader_nr = nr
@@ -23,9 +27,6 @@ function get_leader(nrs)
      return leader_nr
  end;

-function name(id)
-    return 'election_replica'..id
-end;
  test_run:cmd('setopt delimiter ""');

  --
diff --git a/test/unit/raft.c b/test/unit/raft.c
index d0d13d8c7..0306cefcd 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1267,10 +1267,44 @@ raft_test_too_long_wal_write(void)
      raft_finish_test();
  }

+static void
+raft_test_term_filter(void)
+{
+    raft_start_test(9);
+    struct raft_node node;
+    raft_node_create(&node);
+
+    is(raft_node_term(&node.raft, 1), 0, "empty node term");
+    ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
+
+    raft_process_term(&node.raft, 1, 1);
+    is(raft_node_term(&node.raft, 1), 1, "node term updated");
+    ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
+
+    raft_process_term(&node.raft, 2, 100);
+    ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
+                         "have greater term");
+    ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
+                         "isn't outdated");
+
+    raft_process_term(&node.raft, 3, 100);
+    ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
+                         "others have the same term");
+
+    raft_process_term(&node.raft, 3, 99);
+    is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
+    ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
+                          "outdated");
+
+
+    raft_node_destroy(&node);
+    raft_finish_test();
+}
+
  static int
  main_f(va_list ap)
  {
-    raft_start_test(13);
+    raft_start_test(14);

      (void) ap;
      fakeev_init();
@@ -1288,6 +1322,7 @@ main_f(va_list ap)
      raft_test_death_timeout();
      raft_test_enable_disable();
      raft_test_too_long_wal_write();
+    raft_test_term_filter();

      fakeev_free();

diff --git a/test/unit/raft.result b/test/unit/raft.result
index 96bfc3b86..ecb962e42 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
      *** main_f ***
-1..13
+1..14
      *** raft_test_leader_election ***
      1..24
      ok 1 - 1 pending message at start
@@ -220,4 +220,17 @@ ok 12 - subtests
      ok 8 - became candidate
  ok 13 - subtests
      *** raft_test_too_long_wal_write: done ***
+    *** raft_test_term_filter ***
+    1..9
+    ok 1 - empty node term
+    ok 2 - not outdated initially
+    ok 3 - node term updated
+    ok 4 - other nodes are outdated
+    ok 5 - node outdated when others have greater term
+    ok 6 - node with greatest term isn't outdated
+    ok 7 - node not outdated when others have the same term
+    ok 8 - node term isn't decreased
+    ok 9 - node doesn't become outdated
+ok 14 - subtests
+    *** raft_test_term_filter: done ***
      *** main_f: done ***

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual"
  2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 14:18     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 14:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:27, Vladislav Shpilevoy пишет:
> I appreciate the work you did here!
>
> Maybe change the commit title subsystem to 'election:'.
> The same in the next commit.
Thanks for the review!

No problem, changed.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-15 23:30   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 15:38     ` Serge Petrenko via Tarantool-patches
  2021-04-16 15:40       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 15:38 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:30, Vladislav Shpilevoy пишет:
> Thanks for working on this!
>
> See 5 comments below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index 3729ed997..6c7c8968a 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1525,12 +1526,74 @@ box_clear_synchro_queue(bool try_wait)
>>   	if (!is_box_configured ||
>>   	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
>>   		return 0;
>> +
>> +	bool run_elections = false;
>> +
>> +	switch (box_election_mode) {
>> +	case ELECTION_MODE_OFF:
>> +		break;
>> +	case ELECTION_MODE_VOTER:
>> +		assert(box_raft()->state == RAFT_STATE_FOLLOWER);
>> +		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
>> +			 "manual elections");
>> +		return -1;
>> +	case ELECTION_MODE_MANUAL:
>> +		assert(box_raft()->state != RAFT_STATE_CANDIDATE);
>> +		if (box_raft()->state == RAFT_STATE_LEADER) {
>> +			diag_set(ClientError, ER_ALREADY_LEADER);
>> +			return -1;
>> +		}
>> +		run_elections = true;
>> +		try_wait = false;
>> +		break;
>> +	case ELECTION_MODE_CANDIDATE:
>> +		/*
>> +		 * Leader elections are enabled, and this instance is allowed to
>> +		 * promote only if it's already an elected leader. No manual
>> +		 * elections.
>> +		 */
>> +		if (box_raft()->state != RAFT_STATE_LEADER) {
> 1. That is strange. Why do you allow to promote the node
> if it is already the leader when mode is candidate, but do
> not allow the same when the mode is manual?

It's because box_clear_synchro_queue() may be called by raft worker,
once the node becomes leader. So I cannot forbid this.

Actually, there's a guard for manual `box.ctl.clear_synchro_queue()` above,
I just didn't make proper use of it. I mean these lines:

  	if (!is_box_configured ||
  	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
  		return 0;

So I don't need the ER_ALREADY_LEADER in ELECTION_MODE_MANUAL case.

Will fix.
Thanks for pointing this out!

>
> Shouldn't we throw an error when the mode is candidate
> regardless of the node role?
>
>> +			diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
>> +				 "'candidate'", "manual elections");
>> +			return -1;
>> +		}
>> +		break;
>> +	default:
>> +		unreachable();
>> +	}
>> +
>>   	uint32_t former_leader_id = txn_limbo.owner_id;
>>   	int64_t wait_lsn = txn_limbo.confirmed_lsn;
>>   	int rc = 0;
>>   	int quorum = replication_synchro_quorum;
>>   	in_clear_synchro_queue = true;
>>   
>> +	if (run_elections) {
>> +		/*
>> +		 * Make this instance a candidate and run until some leader, not
>> +		 * necessarily this instance, emerges.
>> +		 */
>> +		raft_cfg_is_candidate(box_raft(), true, false);
>> +		/*
>> +		 * Trigger new elections without waiting for an old leader to
>> +		 * disappear.
>> +		 */
>> +		raft_new_term(box_raft());
>> +		box_raft_wait_leader_found();
>> +		raft_cfg_is_candidate(box_raft(), false, false);
> 2. What if during box_raft_wait_leader_found() I made the node candidate
> via box.cfg? Won't you then reset it back to non-candidate here?
Yes, that's true.
>
> It probably should reset the current box.cfg mode back. Not just
> remove the candidate flag.

I think it's strange to reconfigure box automatically.
I suggest to reset node to non-candidate only if the mode
remains MANUAL after the election.
>
>> +		if (!box_raft()->is_enabled) {
>> +			diag_set(ClientError, ER_RAFT_DISABLED);
>> +			in_clear_synchro_queue = false;
>> +			return -1;
>> +		}
>> +		if (box_raft()->state != RAFT_STATE_LEADER) {
>> +			diag_set(ClientError, ER_INTERFERING_PROMOTE,
>> +				 box_raft()->leader);
>> +			in_clear_synchro_queue = false;
>> +			return -1;
>> +		}
>> +	}
>> +
>>   	if (txn_limbo_is_empty(&txn_limbo))
>>   		goto promote;
>>   
>> diff --git a/src/box/raft.c b/src/box/raft.c
>> index 285dbe4fd..c7dc79f9b 100644
>> --- a/src/box/raft.c
>> +++ b/src/box/raft.c
>> @@ -336,6 +336,28 @@ fail:
>>   	panic("Could not write a raft request to WAL\n");
>>   }
>>   
>> +static int
>> +box_raft_wait_leader_found_trig(struct trigger *trig, void *event)
> 3. I thought we usually call triggers with _f suffix, not _trig.

Sure. Fixed.
>
>> +{
>> +	struct raft *raft = (struct raft *)event;
>> +	assert(raft == box_raft());
>> +	struct fiber *waiter = (struct fiber *)trig->data;
> 4. No need to cast this and event - void * cast works naturally in C.

Ok.

>
>> +	if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
>> +		fiber_wakeup(waiter);
>> +	return 0;
>> +}
>> diff --git a/src/box/raft.h b/src/box/raft.h
>> index 15f4e80d9..8fce423e1 100644
>> --- a/src/box/raft.h
>> +++ b/src/box/raft.h
>> @@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
>>   int
>>   box_raft_process(struct raft_request *req, uint32_t source);
>>   
>> +void
>> +box_raft_wait_leader_found();
>> +
>>   void
>>   box_raft_init(void);
>>   
>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>> index e9ce8cade..7b77e05ea 100644
>> --- a/src/lib/raft/raft.c
>> +++ b/src/lib/raft/raft.c
>> @@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
>>   }
>>   
>>   void
>> -raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
>> +raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)
> 5. I know it might lead to some code duplication, but probably
> better move that to other functions. For example,
>
> 	raft_cfg_is_temporary_candidate()
>
> or something like that. Otherwise it appears surprisingly hard
> to follow these 2 flags together. Although I might be wrong and
> it would look worse. Did you try?
>
> Or another option:
>
> 	raft_cfg_is_candidate(box_raft(), true, false);
> 	raft_cfg_is_candidate(box_raft(), false, false);
>
> turns into
>
> 	raft_start_candidate(box_raft())
> 	raft_stop_candidate(box_raft())
>
> Also it would be good to have unit tests for the changes in raft.h
> and raft.c.

This variant sounds good. I'll implement in in a new commit.

Incremental diff:

==========================================

diff --git a/src/box/box.cc b/src/box/box.cc
index 37938df15..fcd812c09 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1157,8 +1157,7 @@ box_set_election_mode(void)
      if (mode == ELECTION_MODE_INVALID)
          return -1;
      box_election_mode = mode;
-    raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE,
-                  true);
+    raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
      raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
      return 0;
  }
@@ -1534,11 +1533,7 @@ box_clear_synchro_queue(bool try_wait)
               "manual elections");
          return -1;
      case ELECTION_MODE_MANUAL:
-        assert(box_raft()->state != RAFT_STATE_CANDIDATE);
-        if (box_raft()->state == RAFT_STATE_LEADER) {
-            diag_set(ClientError, ER_ALREADY_LEADER);
-            return -1;
-        }
+        assert(box_raft()->state == RAFT_STATE_FOLLOWER);
          run_elections = true;
          try_wait = false;
          break;
@@ -1569,14 +1564,19 @@ box_clear_synchro_queue(bool try_wait)
           * Make this instance a candidate and run until some leader, not
           * necessarily this instance, emerges.
           */
-        raft_cfg_is_candidate(box_raft(), true, false);
+        raft_start_candidate(box_raft());
          /*
           * Trigger new elections without waiting for an old leader to
           * disappear.
           */
          raft_new_term(box_raft());
          box_raft_wait_leader_found();
-        raft_cfg_is_candidate(box_raft(), false, false);
+        /*
+         * Do not reset raft mode if it was changed while running the
+         * elections.
+         */
+        if (box_election_mode == ELECTION_MODE_MANUAL)
+            raft_stop_candidate(box_raft(), false);
          if (!box_raft()->is_enabled) {
              diag_set(ClientError, ER_RAFT_DISABLED);
              in_clear_synchro_queue = false;
diff --git a/src/box/errcode.h b/src/box/errcode.h
index df36085db..d93820e96 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -277,7 +277,6 @@ struct errcode_record {
      /*222 */_(ER_QUORUM_WAIT,        "Couldn't wait for quorum %d: %s") \
      /*223 */_(ER_INTERFERING_PROMOTE,    "Instance with replica id %u 
was promoted first") \
      /*224 */_(ER_RAFT_DISABLED,        "Elections were turned off 
while running box.ctl.promote()")\
-    /*225 */_(ER_ALREADY_LEADER,        "Can't promote an existing 
leader")\

  /*
   * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/raft.c b/src/box/raft.c
index c7dc79f9b..425353207 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -337,11 +337,11 @@ fail:
  }

  static int
-box_raft_wait_leader_found_trig(struct trigger *trig, void *event)
+box_raft_wait_leader_found_f(struct trigger *trig, void *event)
  {
-    struct raft *raft = (struct raft *)event;
+    struct raft *raft = event;
      assert(raft == box_raft());
-    struct fiber *waiter = (struct fiber *)trig->data;
+    struct fiber *waiter = trig->data;
      if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
          fiber_wakeup(waiter);
      return 0;
@@ -351,7 +351,7 @@ void
  box_raft_wait_leader_found(void)
  {
      struct trigger trig;
-    trigger_create(&trig, box_raft_wait_leader_found_trig, fiber(), NULL);
+    trigger_create(&trig, box_raft_wait_leader_found_f, fiber(), NULL);
      raft_on_update(box_raft(), &trig);
      fiber_yield();
      assert(box_raft()->leader != REPLICA_ID_NIL || 
!box_raft()->is_enabled);
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index d557c907b..8deb06eb5 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
  }

  void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
  {
      raft->is_cfg_candidate = is_candidate;
      is_candidate = is_candidate && raft->is_enabled;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index b140ff3ba..69dec63c6 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -325,7 +325,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
   * the node still can vote, when Raft is enabled.
   */
  void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote);
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate);

  /**
   * Make the instance a candidate.
diff --git a/test/box/error.result b/test/box/error.result
index dad6a21d3..cc8cbaaa9 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -443,7 +443,6 @@ t;
   |   222: box.error.QUORUM_WAIT
   |   223: box.error.INTERFERING_PROMOTE
   |   224: box.error.RAFT_DISABLED
- |   225: box.error.ALREADY_LEADER
   | ...

  test_run:cmd("setopt delimiter ''");
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index a10ccae6a..b8735f373 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -360,7 +360,7 @@ raft_node_start(struct raft_node *node)
          raft_process_recovery(&node->raft, &node->journal.rows[i]);

      raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
-    raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate, true);
+    raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
      raft_cfg_election_timeout(&node->raft, node->cfg_election_timeout);
      raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
      raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
@@ -402,7 +402,7 @@ raft_node_cfg_is_candidate(struct raft_node *node, 
bool value)
  {
      node->cfg_is_candidate = value;
      if (raft_node_is_started(node)) {
-        raft_cfg_is_candidate(&node->raft, value, true);
+        raft_cfg_is_candidate(&node->raft, value);
          raft_run_async_work();
      }
  }

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-16 15:38     ` Serge Petrenko via Tarantool-patches
@ 2021-04-16 15:40       ` Serge Petrenko via Tarantool-patches
  2021-04-16 15:50         ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 15:40 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 18:38, Serge Petrenko via Tarantool-patches пишет:
>
>
> 16.04.2021 02:30, Vladislav Shpilevoy пишет:
>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>> index 15f4e80d9..8fce423e1 100644
>>> --- a/src/box/raft.h
>>> +++ b/src/box/raft.h
>>> @@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
>>>   int
>>>   box_raft_process(struct raft_request *req, uint32_t source);
>>>   +void
>>> +box_raft_wait_leader_found();
>>> +
>>>   void
>>>   box_raft_init(void);
>>>   diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>>> index e9ce8cade..7b77e05ea 100644
>>> --- a/src/lib/raft/raft.c
>>> +++ b/src/lib/raft/raft.c
>>> @@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool 
>>> is_enabled)
>>>   }
>>>     void
>>> -raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
>>> +raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool 
>>> demote)
>> 5. I know it might lead to some code duplication, but probably
>> better move that to other functions. For example,
>>
>>     raft_cfg_is_temporary_candidate()
>>
>> or something like that. Otherwise it appears surprisingly hard
>> to follow these 2 flags together. Although I might be wrong and
>> it would look worse. Did you try?
>>
>> Or another option:
>>
>>     raft_cfg_is_candidate(box_raft(), true, false);
>>     raft_cfg_is_candidate(box_raft(), false, false);
>>
>> turns into
>>
>>     raft_start_candidate(box_raft())
>>     raft_stop_candidate(box_raft())
>>
>> Also it would be good to have unit tests for the changes in raft.h
>> and raft.c.
>
> This variant sounds good. I'll implement in in a new commit.

The commit I was talking about:

=====================================================

commit 79940c7b20a4acefaa5984550fee2872a58fef0c
Author: Serge Petrenko <sergepetrenko@tarantool.org>
Date:   Fri Apr 16 18:22:28 2021 +0300

     raft: introduce raft_start/stop_candidate

     Extract raft_start_candidate and raft_stop_candidate functions from
     raft_cfg_is_candidate.

     These functions will be used in manual elections.

     Prerequisite #3055

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e9ce8cade..8deb06eb5 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -848,38 +848,59 @@ raft_cfg_is_enabled(struct raft *raft, bool 
is_enabled)
  void
  raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
  {
-    bool old_is_candidate = raft->is_candidate;
      raft->is_cfg_candidate = is_candidate;
-    raft->is_candidate = is_candidate && raft->is_enabled;
-    if (raft->is_candidate == old_is_candidate)
-        return;
+    is_candidate = is_candidate && raft->is_enabled;
+    if (is_candidate)
+        raft_start_candidate(raft);
+    else
+        raft_stop_candidate(raft, true);
+}

-    if (raft->is_candidate) {
-        assert(raft->state == RAFT_STATE_FOLLOWER);
-        if (raft->is_write_in_progress) {
-            /*
-             * If there is an on-going WAL write, it means there was
-             * some node who sent newer data to this node. So it is
-             * probably a better candidate. Anyway can't do anything
-             * until the new state is fully persisted.
-             */
-        } else if (raft->leader != 0) {
-            raft_sm_wait_leader_dead(raft);
-        } else {
-            raft_sm_wait_leader_found(raft);
-        }
+void
+raft_start_candidate(struct raft *raft)
+{
+    if (raft->is_candidate)
+        return;
+    raft->is_candidate = true;
+    assert(raft->state == RAFT_STATE_FOLLOWER);
+    if (raft->is_write_in_progress) {
+        /*
+         * If there is an on-going WAL write, it means there was
+         * some node who sent newer data to this node. So it is
+         * probably a better candidate. Anyway can't do anything
+         * until the new state is fully persisted.
+         */
+    } else if (raft->leader != 0) {
+        raft_sm_wait_leader_dead(raft);
      } else {
-        if (raft->state != RAFT_STATE_LEADER) {
-            /* Do not wait for anything while being a voter. */
-            raft_ev_timer_stop(raft_loop(), &raft->timer);
-        }
-        if (raft->state != RAFT_STATE_FOLLOWER) {
-            if (raft->state == RAFT_STATE_LEADER)
-                raft->leader = 0;
-            raft->state = RAFT_STATE_FOLLOWER;
-            /* State is visible and changed - broadcast. */
-            raft_schedule_broadcast(raft);
+        raft_sm_wait_leader_found(raft);
+    }
+}
+
+void
+raft_stop_candidate(struct raft *raft, bool demote)
+{
+    if (!raft->is_candidate)
+        return;
+    raft->is_candidate = false;
+    if (raft->state != RAFT_STATE_LEADER) {
+        /* Do not wait for anything while being a voter. */
+        raft_ev_timer_stop(raft_loop(), &raft->timer);
+    }
+    if (raft->state != RAFT_STATE_FOLLOWER) {
+        if (raft->state == RAFT_STATE_LEADER) {
+            if (!demote) {
+                /*
+                 * Remain leader until someone
+                 * triggers new elections.
+                 */
+                return;
+            }
+            raft->leader = 0;
          }
+        raft->state = RAFT_STATE_FOLLOWER;
+        /* State is visible and changed - broadcast. */
+        raft_schedule_broadcast(raft);
      }
  }

diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index a5f7e08d9..69dec63c6 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -327,6 +327,19 @@ raft_cfg_is_enabled(struct raft *raft, bool 
is_enabled);
  void
  raft_cfg_is_candidate(struct raft *raft, bool is_candidate);

+/**
+ * Make the instance a candidate.
+ */
+void
+raft_start_candidate(struct raft *raft);
+
+/**
+ * Make the instance stop taking part in new elections.
+ * @param demote whether to stop being a leader immediately or not.
+ */
+void
+raft_stop_candidate(struct  raft *raft, bool demote);
+
  /** Configure Raft leader election timeout. */
  void
  raft_cfg_election_timeout(struct raft *raft, double timeout);
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 0306cefcd..575886932 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1296,15 +1296,43 @@ raft_test_term_filter(void)
      ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
                            "outdated");

-
      raft_node_destroy(&node);
      raft_finish_test();
  }

+static void
+raft_test_start_stop_candidate(void)
+{
+    raft_start_test(4);
+    struct raft_node node;
+    raft_node_create(&node);
+
+    raft_node_cfg_is_candidate(&node, false);
+    raft_node_cfg_election_quorum(&node, 1);
+
+    raft_start_candidate(&node.raft);
+    raft_run_next_event();
+    is(node.raft.state, RAFT_STATE_LEADER, "became leader after "
+                           "start_candidate");
+    raft_stop_candidate(&node.raft, false);
+    raft_run_for(node.cfg_death_timeout);
+    is(node.raft.state, RAFT_STATE_LEADER, "remain leader after "
+                           "stop_candidate");
+
+    is(raft_node_send_vote_request(&node,
+        3 /* Term. */,
+        "{}" /* Vclock. */,
+        2 /* Source. */
+    ), 0, "vote request from 2");
+    is(node.raft.state, RAFT_STATE_FOLLOWER, "demote once new election "
+                         "starts");
+    raft_finish_test();
+}
+
  static int
  main_f(va_list ap)
  {
-    raft_start_test(14);
+    raft_start_test(15);

      (void) ap;
      fakeev_init();
@@ -1323,6 +1351,7 @@ main_f(va_list ap)
      raft_test_enable_disable();
      raft_test_too_long_wal_write();
      raft_test_term_filter();
+    raft_test_start_stop_candidate();

      fakeev_free();

diff --git a/test/unit/raft.result b/test/unit/raft.result
index ecb962e42..bb799936b 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
      *** main_f ***
-1..14
+1..15
      *** raft_test_leader_election ***
      1..24
      ok 1 - 1 pending message at start
@@ -233,4 +233,12 @@ ok 13 - subtests
      ok 9 - node doesn't become outdated
  ok 14 - subtests
      *** raft_test_term_filter: done ***
+    *** raft_test_start_stop_candidate ***
+    1..4
+    ok 1 - became leader after start_candidate
+    ok 2 - remain leader after stop_candidate
+    ok 3 - vote request from 2
+    ok 4 - demote once new election starts
+ok 15 - subtests
+    *** raft_test_start_stop_candidate: done ***
      *** main_f: done ***

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-16 15:40       ` Serge Petrenko via Tarantool-patches
@ 2021-04-16 15:50         ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 15:50 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 18:40, Serge Petrenko via Tarantool-patches пишет:
>
>
> 16.04.2021 18:38, Serge Petrenko via Tarantool-patches пишет:
>>
>>
>> 16.04.2021 02:30, Vladislav Shpilevoy пишет:
>>>> diff --git a/src/box/raft.h b/src/box/raft.h
>>>> index 15f4e80d9..8fce423e1 100644
>>>> --- a/src/box/raft.h
>>>> +++ b/src/box/raft.h
>>>> @@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request 
>>>> *req);
>>>>   int
>>>>   box_raft_process(struct raft_request *req, uint32_t source);
>>>>   +void
>>>> +box_raft_wait_leader_found();
>>>> +
>>>>   void
>>>>   box_raft_init(void);
>>>>   diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>>>> index e9ce8cade..7b77e05ea 100644
>>>> --- a/src/lib/raft/raft.c
>>>> +++ b/src/lib/raft/raft.c
>>>> @@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool 
>>>> is_enabled)
>>>>   }
>>>>     void
>>>> -raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
>>>> +raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool 
>>>> demote)
>>> 5. I know it might lead to some code duplication, but probably
>>> better move that to other functions. For example,
>>>
>>>     raft_cfg_is_temporary_candidate()
>>>
>>> or something like that. Otherwise it appears surprisingly hard
>>> to follow these 2 flags together. Although I might be wrong and
>>> it would look worse. Did you try?
>>>
>>> Or another option:
>>>
>>>     raft_cfg_is_candidate(box_raft(), true, false);
>>>     raft_cfg_is_candidate(box_raft(), false, false);
>>>
>>> turns into
>>>
>>>     raft_start_candidate(box_raft())
>>>     raft_stop_candidate(box_raft())
>>>
>>> Also it would be good to have unit tests for the changes in raft.h
>>> and raft.c.
>>
>> This variant sounds good. I'll implement in in a new commit.
>
> The commit I was talking about:

Incremental diff on top:

=====================================

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 8deb06eb5..b21693642 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -862,7 +862,13 @@ raft_start_candidate(struct raft *raft)
         if (raft->is_candidate)
                 return;
         raft->is_candidate = true;
-       assert(raft->state == RAFT_STATE_FOLLOWER);
+       assert(raft->state != RAFT_STATE_CANDIDATE);
+       /*
+        * May still be the leader after raft_stop_candidate
+        * with demote = false.
+        */
+       if (raft->state == RAFT_STATE_LEADER)
+               return;
         if (raft->is_write_in_progress) {
                 /*
                  * If there is an on-going WAL write, it means there was


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote
  2021-04-15 23:31   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 16:13     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:13 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:31, Vladislav Shpilevoy пишет:
> Thanks for the patch!
>
>> diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
>> index d5320b3ff..78c911245 100644
>> --- a/test/replication/election_basic.result
>> +++ b/test/replication/election_basic.result
>> @@ -108,6 +108,31 @@ assert(box.info.election.leader == box.info.id)
>>    | - true
>>    | ...
>>   
>> +-- Manual election mode. A voter most of the time, a leader once
>> +-- `box.ctl.promote()` is called.
>> +box.cfg{election_mode = 'manual'}
>> + | ---
>> + | ...
>> +
>> +assert(box.info.election.state == 'follower')
>> + | ---
>> + | - true
>> + | ...
>> +term = box.info.election.term
>> + | ---
>> + | ...
>> +box.ctl.promote()
>> + | ---
>> + | ...
>> +assert(box.info.election.state == 'leader')
>> + | ---
>> + | - error: assertion failed!
>> + | ...
>> +assert(box.info.election.term > term)
>> + | ---
>> + | - error: assertion failed!
> Hm. Shouldn't these assertions pass?

Ouch. Yes, they should. The assertions failed because
promote does nothing when node was a leader in this term.
So we can't test box.ctl.promote() with one instance.
The test will fail as long as the instance already was the
leader in current term. And this is the only case when
there is only one instance in test.

Even if I move election_mode='manual' above the
election_mode='candidate', the same test will fail on conf='vinyl'.

===============================

diff --git a/test/replication/election_basic.result 
b/test/replication/election_basic.result
index 78c911245..d5320b3ff 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -108,31 +108,6 @@ assert(box.info.election.leader == box.info.id)
   | - true
   | ...

--- Manual election mode. A voter most of the time, a leader once
--- `box.ctl.promote()` is called.
-box.cfg{election_mode = 'manual'}
- | ---
- | ...
-
-assert(box.info.election.state == 'follower')
- | ---
- | - true
- | ...
-term = box.info.election.term
- | ---
- | ...
-box.ctl.promote()
- | ---
- | ...
-assert(box.info.election.state == 'leader')
- | ---
- | - error: assertion failed!
- | ...
-assert(box.info.election.term > term)
- | ---
- | - error: assertion failed!
- | ...
-
  box.cfg{ \
      election_mode = 
'off',                                                      \
      election_timeout = 
old_election_timeout                                     \
diff --git a/test/replication/election_basic.test.lua 
b/test/replication/election_basic.test.lua
index 5fc398848..821f73cea 100644
--- a/test/replication/election_basic.test.lua
+++ b/test/replication/election_basic.test.lua
@@ -39,16 +39,6 @@ assert(box.info.election.term > term)
  assert(box.info.election.vote == box.info.id)
  assert(box.info.election.leader == box.info.id)

--- Manual election mode. A voter most of the time, a leader once
--- `box.ctl.promote()` is called.
-box.cfg{election_mode = 'manual'}
-
-assert(box.info.election.state == 'follower')
-term = box.info.election.term
-box.ctl.promote()
-assert(box.info.election.state == 'leader')
-assert(box.info.election.term > term)
-
  box.cfg{ \
      election_mode = 
'off',                                                      \
      election_timeout = 
old_election_timeout                                     \

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry
  2021-04-15 23:19   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 16:18     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:19, Vladislav Shpilevoy пишет:
> I appreciate the work you did here!
>
> See 2 comments below.
>
>> diff --git a/src/box/xrow.c b/src/box/xrow.c
>> index cc8e43ed4..5d515ce92 100644
>> --- a/src/box/xrow.c
>> +++ b/src/box/xrow.c
>> @@ -884,28 +884,63 @@ xrow_encode_dml(const struct request *request, struct region *region,
>>   	return iovcnt;
>>   }
>>   
>> -void
>> -xrow_encode_synchro(struct xrow_header *row,
>> -		    struct synchro_body_bin *body,
>> -		    const struct synchro_request *req)
>> +static void
>> +xrow_encode_synchro_body(struct synchro_body_bin *body,
>> +		         const struct synchro_request *req)
>>   {
>>   	/*
>> -	 * A map with two elements. We don't compress
>> +	 * A map with two or three elements. We don't compress
>>   	 * numbers to have this structure constant in size,
>>   	 * which allows us to preallocate it on stack.
>>   	 */
>> -	body->m_body = 0x80 | 2;
>> +	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
>>   	body->k_replica_id = IPROTO_REPLICA_ID;
>>   	body->m_replica_id = 0xce;
>>   	body->v_replica_id = mp_bswap_u32(req->replica_id);
>>   	body->k_lsn = IPROTO_LSN;
>>   	body->m_lsn = 0xcf;
>>   	body->v_lsn = mp_bswap_u64(req->lsn);
>> +}
>> +
>> +void
>> +xrow_encode_synchro(struct xrow_header *row,
>> +		    struct synchro_body_bin *body,
>> +		    const struct synchro_request *req)
>> +{
>> +	assert(req->type == IPROTO_CONFIRM || req->type == IPROTO_ROLLBACK);
>> +
>> +	xrow_encode_synchro_body(body, req);
>>   
>>   	memset(row, 0, sizeof(*row));
>> +	row->type = req->type;
>> +	row->body[0].iov_base = body;
>> +	row->body[0].iov_len = sizeof(*body);
>> +	row->bodycnt = 1;
>> +}
>> +
>> +static inline void
>> +xrow_encode_promote_body(struct promote_body_bin *body,
>> +			 const struct synchro_request *req)
> 1. I would propose to inline it. It is used in a single place,
> and now it looks like if we would have more than 1 place where
> we would need the promote body.
>
> But more interestingly, it looks you could keep it a single
> function xrow_encode_synchro. Although we wouldn't be able to
> have a PACKED struct with predefined fields. Not a big deal
> anyway.
>
> The reasoning is similar to xrow_encode_dml(). It also uses
> a single struct request for all kinds of DML, and conditionally
> encodes the non-zero fields. I think your case is the same. It
> would simplify some code, and remove branching from other
> places. For example, from txn_limbo_write_synchro(), where you
> branch between PROMOTE and non-PROMOTE when decide what to encode.
> We even had the same issue when tried to encode CONFIRM and
> ROLLBACK via separate functions.

Ok, let's encode the entries together.

Incremental diff is below.

>
>> +{
>> +	xrow_encode_synchro_body(&body->base, req);
>> +
>> +	body->k_term = IPROTO_TERM;
>> +	body->m_term = 0xcf;
>> +	body->v_term = mp_bswap_u64(req->term);
>> +}
>> +
>>   
>> +void
>> +xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
>> +		    const struct synchro_request *req)
>> +{
>> +	assert(req->type == IPROTO_PROMOTE);
>> +
>> +	xrow_encode_promote_body(body, req);
>> +
>> +	memset(row, 0, sizeof(*row));
>>   	row->type = req->type;
>> -	row->body[0].iov_base = (void *)body;
>> +	row->body[0].iov_base = body;
> 2. Unnecessary change. But I don't mind, up to you.

This was accidental, but let's keep the change.
I'm either removing the unnecessary cast from xrow_encode_synchro()
or adding it to xrow_encode_promote().

I'd better remove it rather than add another place where it's used.

>
>>   	row->body[0].iov_len = sizeof(*body);
>>   	row->bodycnt = 1;
>>   }


=======================================================

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index addcb0f97..c96e497c6 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -331,7 +331,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, 
uint16_t type, int64_t lsn)
       * This is a synchronous commit so we can
       * allocate everything on a stack.
       */
-    struct synchro_body_bin body;
+    char body[XROW_SYNCHRO_BODY_LEN_MAX];
      struct xrow_header row;
      char buf[sizeof(struct journal_entry) +
           sizeof(struct xrow_header *)];
@@ -339,7 +339,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, 
uint16_t type, int64_t lsn)
      struct journal_entry *entry = (struct journal_entry *)buf;
      entry->rows[0] = &row;

-    xrow_encode_synchro(&row, &body, &req);
+    xrow_encode_synchro(&row, body, &req);

      journal_entry_create(entry, 1, xrow_approx_len(&row),
                   txn_limbo_write_cb, fiber());
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 5b689a86a..2e364cea5 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -884,56 +884,34 @@ xrow_encode_dml(const struct request *request, 
struct region *region,
      return iovcnt;
  }

-static void
-xrow_encode_synchro_body(struct synchro_body_bin *body,
-                 const struct synchro_request *req)
-{
-    /*
-     * A map with two or three elements. We don't compress
-     * numbers to have this structure constant in size,
-     * which allows us to preallocate it on stack.
-     */
-    body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
-    body->k_replica_id = IPROTO_REPLICA_ID;
-    body->m_replica_id = 0xce;
-    body->v_replica_id = mp_bswap_u32(req->replica_id);
-    body->k_lsn = IPROTO_LSN;
-    body->m_lsn = 0xcf;
-    body->v_lsn = mp_bswap_u64(req->lsn);
-}
-
  void
-xrow_encode_synchro(struct xrow_header *row,
-            struct synchro_body_bin *body,
+xrow_encode_synchro(struct xrow_header *row, char *body,
              const struct synchro_request *req)
  {
-    assert(req->type == IPROTO_CONFIRM || req->type == IPROTO_ROLLBACK);
+    assert(iproto_type_is_synchro_request(req->type));

-    xrow_encode_synchro_body(body, req);
+    char *pos = body;

-    memset(row, 0, sizeof(*row));
-    row->type = req->type;
-    row->body[0].iov_base = body;
-    row->body[0].iov_len = sizeof(*body);
-    row->bodycnt = 1;
-}
+    pos = mp_encode_map(pos,
+                iproto_type_is_promote_request(req->type) ? 3 : 2);

-void
-xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
-            const struct synchro_request *req)
-{
-    assert(req->type == IPROTO_PROMOTE);
+    pos = mp_encode_uint(pos, IPROTO_REPLICA_ID);
+    pos = mp_encode_uint(pos, req->replica_id);

-    xrow_encode_synchro_body(&body->base, req);
+    pos = mp_encode_uint(pos, IPROTO_LSN);
+    pos = mp_encode_uint(pos, req->lsn);
+
+    if (iproto_type_is_promote_request(req->type)) {
+        pos = mp_encode_uint(pos, IPROTO_TERM);
+        pos = mp_encode_uint(pos, req->term);
+    }

-    body->k_term = IPROTO_TERM;
-    body->m_term = 0xcf;
-    body->v_term = mp_bswap_u64(req->term);
+    assert(pos - body < XROW_SYNCHRO_BODY_LEN_MAX);

      memset(row, 0, sizeof(*row));
      row->type = req->type;
      row->body[0].iov_base = body;
-    row->body[0].iov_len = sizeof(*body);
+    row->body[0].iov_len = pos - body;
      row->bodycnt = 1;
  }

diff --git a/src/box/xrow.h b/src/box/xrow.h
index 80a5cd909..b3c664be2 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -49,6 +49,7 @@ enum {
      XROW_IOVMAX = XROW_HEADER_IOVMAX + XROW_BODY_IOVMAX,
      XROW_HEADER_LEN_MAX = 52,
      XROW_BODY_LEN_MAX = 256,
+    XROW_SYNCHRO_BODY_LEN_MAX = 32,
      IPROTO_HEADER_LEN = 28,
      /** 7 = sizeof(iproto_body_bin). */
      IPROTO_SELECT_HEADER_LEN = IPROTO_HEADER_LEN + 7,
@@ -260,25 +261,6 @@ struct synchro_request {
      uint64_t term;
  };

-/** Synchro request xrow's body in MsgPack format. */
-struct PACKED synchro_body_bin {
-    uint8_t m_body;
-    uint8_t k_replica_id;
-    uint8_t m_replica_id;
-    uint32_t v_replica_id;
-    uint8_t k_lsn;
-    uint8_t m_lsn;
-    uint64_t v_lsn;
-};
-
-/** PROMOTE request's xrow body in MsgPack format. */
-struct PACKED promote_body_bin {
-    struct synchro_body_bin base;
-    uint8_t k_term;
-    uint8_t m_term;
-    uint64_t v_term;
-};
-
  /**
   * Encode synchronous replication request.
   * @param row xrow header.
@@ -286,18 +268,7 @@ struct PACKED promote_body_bin {
   * @param req Request parameters.
   */
  void
-xrow_encode_synchro(struct xrow_header *row,
-            struct synchro_body_bin *body,
-            const struct synchro_request *req);
-
-/**
- * Encode a promote request.
- * @param row xrow header.
- * @param body A place to encode promote body.
- * @param req Request parameters.
- */
-void
-xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+xrow_encode_synchro(struct xrow_header *row, char *body,
              const struct synchro_request *req);

  /**

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
  2021-04-15 23:16 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-16 16:35   ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:35 UTC (permalink / raw)
  To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches



16.04.2021 02:16, Vladislav Shpilevoy пишет:
> Hi! Thanks for the patchset!
>
> I see the new test does not pass in CI in one of the jobs:
> https://github.com/tarantool/tarantool/runs/2343885212
>
> Is it still flaky?
Hm, looks like so. It's much harder to reproduce now though.

I'll return with a fix later.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-16 14:16     ` Serge Petrenko via Tarantool-patches
@ 2021-04-16 22:13       ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 36+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-16 22:13 UTC (permalink / raw)
  To: Serge Petrenko, gorcunov; +Cc: tarantool-patches

>>> +    /**
>>> +     * Latest terms received with PROMOTE entries from remote instances.
>>> +     * Raft uses them to determine data from which sources may be applied.
>>> +     */
>>> +    struct vclock term_map;
>> 7. I have a feeling it is similar to the limbo's LSN map. Like
>> they should be merged into something one. Can't formulate that
>> properly. I hope we will see it more clear when will move all that
>> to the WAL thread someday.
> 
> Yes, they're quite similar.
> 
> Do you mean we should create some new structure instead of using vclock for
> these entities? Like something which would incorporate remote state:
> a map of ids with their known terms and confirmed lsns?

Yes, I think we might end up with that. We already have struct replica
doing similar, but they are not in WAL thread.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-04-16 22:13 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14 14:17 [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 01/10] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
2021-04-15 23:18   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16  7:08     ` Serge Petrenko via Tarantool-patches
2021-04-16  7:11       ` Serge Petrenko via Tarantool-patches
2021-04-16  8:57       ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 02/10] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
2021-04-15 23:19   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 16:18     ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 03/10] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 04/10] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
2021-04-15 23:20   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16  9:28     ` Serge Petrenko via Tarantool-patches
2021-04-16 14:03       ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 05/10] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
2021-04-15 23:21   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16  9:33     ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 06/10] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 14:16     ` Serge Petrenko via Tarantool-patches
2021-04-16 22:13       ` Vladislav Shpilevoy via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 07/10] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
2021-04-15 23:27   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 14:18     ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 08/10] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
2021-04-15 23:30   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 15:38     ` Serge Petrenko via Tarantool-patches
2021-04-16 15:40       ` Serge Petrenko via Tarantool-patches
2021-04-16 15:50         ` Serge Petrenko via Tarantool-patches
2021-04-14 14:17 ` [Tarantool-patches] [PATCH v3 09/10] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
2021-04-14 14:18 ` [Tarantool-patches] [PATCH v3 10/10] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
2021-04-15 23:31   ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 16:13     ` Serge Petrenko via Tarantool-patches
2021-04-14 18:21 ` [Tarantool-patches] [PATCH v3 00/10] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches
2021-04-15 23:16 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 16:35   ` Serge Petrenko via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox