Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
@ 2021-04-12 19:40 Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
                   ` (9 more replies)
  0 siblings, 10 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Changes in v2:
  - Added tests for patches 1, 6, 9
  - Minor typo fixes and bugfixes.

https://github.com/tarantool/tarantool/tree/sp/gh-5445-election-fixes
https://github.com/tarantool/tarantool/issues/5445
https://github.com/tarantool/tarantool/issues/3055

Serge Petrenko (9):
  wal: enrich row's meta information with sync replication flags
  xrow: introduce a PROMOTE entry
  box: actualise iproto_key_type array
  box: make clear_synchro_queue() write a PROMOTE entry instead of
    CONFIRM + ROLLBACK
  box: write PROMOTE even for empty limbo
  raft: keep track of greatest known term and filter replication sources
    based on that
  replication: introduce a new election mode: "manual"
  Support manual elections in `box.ctl.clear_synchro_queue()`
  box.ctl: rename clear_synchro_queue to promote

 changelogs/unreleased/box-ctl-promote.md      |   8 +
 ...very => qsync-multi-statement-recovery.md} |   0
 changelogs/unreleased/raft-promote.md         |   4 +
 src/box/applier.cc                            |  22 +-
 src/box/box.cc                                | 167 +++++++++---
 src/box/box.h                                 |   2 +-
 src/box/errcode.h                             |   3 +
 src/box/iproto_constants.c                    |  58 +++++
 src/box/iproto_constants.h                    |  27 +-
 src/box/journal.h                             |   2 +
 src/box/lua/ctl.c                             |   8 +-
 src/box/raft.c                                |  31 ++-
 src/box/raft.h                                |  20 ++
 src/box/txn.c                                 |   5 +
 src/box/txn_limbo.c                           |  79 +++---
 src/box/txn_limbo.h                           |  10 +-
 src/box/wal.c                                 |  26 +-
 src/box/xrow.c                                |  43 +++-
 src/box/xrow.h                                |  54 +++-
 src/lib/raft/raft.c                           |  13 +-
 src/lib/raft/raft.h                           |  48 +++-
 test/box/error.result                         |   3 +
 test/replication/election_basic.result        |  29 ++-
 test/replication/election_basic.test.lua      |  10 +
 .../gh-3055-election-promote.result           | 105 ++++++++
 .../gh-3055-election-promote.test.lua         |  43 ++++
 .../gh-5445-leader-inconsistency.result       | 238 ++++++++++++++++++
 .../gh-5445-leader-inconsistency.test.lua     | 108 ++++++++
 test/replication/suite.cfg                    |   2 +
 test/unit/raft_test_utils.c                   |   4 +-
 test/unit/xrow.cc                             |  64 +++--
 test/unit/xrow.result                         | 123 ++++++++-
 32 files changed, 1210 insertions(+), 149 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md
 create mode 100644 test/replication/gh-3055-election-promote.result
 create mode 100644 test/replication/gh-3055-election-promote.test.lua
 create mode 100644 test/replication/gh-5445-leader-inconsistency.result
 create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-13 11:50   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-13 13:09   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
These flags are set for rows belonging to synchronous transactions in
addition to `is_commit`.

The new flags help to define whether the rows belong to a synchronous
transaction or not without parsing them all and checking whether any of
the rows touches a synchronous space.

This will be used in applier once it is taught to filter synchronous
transactions based on whether they are coming from a raft leader or not.

P.S. These flags will also be useful once we allow to turn any transaction
synchronous. Once this is done, the flags in row header will be the only
source of information on whether the transaction is synchronous or not.

Prerequisite #5445

@TarantoolBot document
Title: new values for IPROTO_FLAGS field

IPROTO_FLAGS bitfield is enriched with two new constant:
IPROTO_FLAG_WAIT_SYNC = 0x02
IPROTO_FLAG_WAIT_ACK = 0x04

IPROTO_FLAG_WAIT_SYNC is set for the last synchronous transaction row.
IPROTO_FLAG_WAIT_ACK is set for the last row of a transaction which is
not applied because other synchronous transactions are in fly.
---
 src/box/iproto_constants.h |   5 ++
 src/box/journal.h          |   2 +
 src/box/txn.c              |   5 ++
 src/box/wal.c              |  26 ++++----
 src/box/xrow.c             |  13 ++--
 src/box/xrow.h             |  30 ++++++---
 test/unit/xrow.cc          |  64 ++++++++++++-------
 test/unit/xrow.result      | 123 ++++++++++++++++++++++++++++++++++---
 8 files changed, 213 insertions(+), 55 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index d3738c705..f7f46088f 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -49,9 +49,14 @@ enum {
 	XLOG_FIXHEADER_SIZE = 19
 };
 
+/** IPROTO_FLAGS bitfield constants. */
 enum {
 	/** Set for the last xrow in a transaction. */
 	IPROTO_FLAG_COMMIT = 0x01,
+	/** Set for the last row of a tx residing in limbo. */
+	IPROTO_FLAG_WAIT_SYNC = 0x02,
+	/** Set for the last row of a synchronous tx. */
+	IPROTO_FLAG_WAIT_ACK = 0x04,
 };
 
 enum iproto_key {
diff --git a/src/box/journal.h b/src/box/journal.h
index 76c70c19f..9ab8af2c1 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -63,6 +63,7 @@ struct journal_entry {
 	 * A journal entry completion callback argument.
 	 */
 	void *complete_data;
+	uint8_t opt_flags;
 	/**
 	 * Asynchronous write completion function.
 	 */
@@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
 	entry->approx_len	= approx_len;
 	entry->n_rows		= n_rows;
 	entry->res		= -1;
+	entry->opt_flags = 0;
 }
 
 /**
diff --git a/src/box/txn.c b/src/box/txn.c
index 40061ff09..d65315f58 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
 		row->lsn = 0;
 		row->sync = 0;
 		row->tm = 0;
+		row->opt_flags = 0;
 	}
 	/*
 	 * Group ID should be set both for requests not having a
@@ -681,6 +682,10 @@ txn_journal_entry_new(struct txn *txn)
 		--req->n_rows;
 	}
 
+	req->opt_flags |=
+		(txn_has_flag(txn, TXN_WAIT_SYNC) ? IPROTO_FLAG_WAIT_SYNC : 0) |
+		(txn_has_flag(txn, TXN_WAIT_ACK) ? IPROTO_FLAG_WAIT_ACK : 0);
+
 	return req;
 }
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 34af0bda6..00fcb21b4 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -962,14 +962,14 @@ out:
  */
 static void
 wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
-	       struct xrow_header **row,
-	       struct xrow_header **end)
+	       struct journal_entry *entry)
 {
 	int64_t tsn = 0;
-	struct xrow_header **start = row;
-	struct xrow_header **first_glob_row = row;
+	struct xrow_header **start = entry->rows;
+	struct xrow_header **end = entry->rows + entry->n_rows;
+	struct xrow_header **first_glob_row = start;
 	/** Assign LSN to all local rows. */
-	for ( ; row < end; row++) {
+	for (struct xrow_header **row = start; row < end; row++) {
 		if ((*row)->replica_id == 0) {
 			/*
 			 * All rows representing local space data
@@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 				first_glob_row = row;
 			}
 			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
-			(*row)->is_commit = row == end - 1;
+			if (row < end - 1)
+				continue;
+			/* Tx meta is stored in the last tx row. */
+			if (row == end - 1) {
+				(*row)->opt_flags = entry->opt_flags;
+				(*row)->is_commit = true;
+			}
 		} else {
 			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
 			if (diff <= vclock_get(vclock_diff,
@@ -1020,7 +1026,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	 * the first global row. tsn was yet unknown when those
 	 * rows were processed.
 	 */
-	for (row = start; row < first_glob_row; row++)
+	for (struct xrow_header **row = start; row < first_glob_row; row++)
 		(*row)->tsn = tsn;
 }
 
@@ -1098,8 +1104,7 @@ wal_write_to_disk(struct cmsg *msg)
 	struct journal_entry *entry;
 	struct stailq_entry *last_committed = NULL;
 	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
-		wal_assign_lsn(&vclock_diff, &writer->vclock,
-			       entry->rows, entry->rows + entry->n_rows);
+		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 		entry->res = vclock_sum(&vclock_diff) +
 			     vclock_sum(&writer->vclock);
 		rc = xlog_write_entry(l, entry);
@@ -1319,8 +1324,7 @@ wal_write_none_async(struct journal *journal,
 	struct vclock vclock_diff;
 
 	vclock_create(&vclock_diff);
-	wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
-		       entry->rows + entry->n_rows);
+	wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 	vclock_merge(&writer->vclock, &vclock_diff);
 	vclock_copy(&replicaset.vclock, &writer->vclock);
 	entry->res = vclock_sum(&writer->vclock);
diff --git a/src/box/xrow.c b/src/box/xrow.c
index bc06738ad..cc8e43ed4 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -183,7 +183,7 @@ error:
 			break;
 		case IPROTO_FLAGS:
 			flags = mp_decode_uint(pos);
-			header->is_commit = flags & IPROTO_FLAG_COMMIT;
+			header->opt_flags = flags;
 			break;
 		default:
 			/* unknown header */
@@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 	 *   flag to find transaction boundary (last row in the
 	 *   transaction stream).
 	 */
+	uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
 	if (header->tsn != 0) {
 		if (header->tsn != header->lsn || !header->is_commit) {
 			/*
@@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 			map_size++;
 		}
 		if (header->is_commit && header->tsn != header->lsn) {
-			/* Setup last row for multi row transaction. */
-			d = mp_encode_uint(d, IPROTO_FLAGS);
-			d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
-			map_size++;
+			flags_to_encode |= IPROTO_FLAG_COMMIT;
 		}
 	}
+	if (flags_to_encode != 0) {
+		d = mp_encode_uint(d, IPROTO_FLAGS);
+		d = mp_encode_uint(d, flags_to_encode);
+		map_size++;
+	}
 	assert(d <= data + XROW_HEADER_LEN_MAX);
 	mp_encode_map(data, map_size);
 	out->iov_len = d - (char *) out->iov_base;
diff --git a/src/box/xrow.h b/src/box/xrow.h
index fde8f9474..2a18733c0 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -80,14 +80,28 @@ struct xrow_header {
 	 * transaction.
 	 */
 	int64_t tsn;
-	/**
-	 * True for the last row in a multi-statement transaction,
-	 * or single-statement transaction. Is only encoded in the
-	 * write ahead log for multi-statement transactions.
-	 * Single-statement transactions do not encode
-	 * tsn and is_commit flag to save space.
-	 */
-	bool is_commit;
+	/** Transaction meta flags set only in the last transaction row. */
+	union {
+		uint8_t opt_flags;
+		struct {
+			/**
+			 * Is only encoded in the write ahead log for
+			 * multi-statement transactions. Single-statement
+			 * transactions do not encode tsn and is_commit flag to
+			 * save space.
+			 */
+			bool is_commit : 1;
+			/**
+			 * True for a synchronous transaction.
+			 */
+			bool wait_sync : 1;
+			/**
+			 * True for any transaction that would enter the limbo
+			 * (not necessarily a synchronous one).
+			 */
+			bool wait_ack  : 1;
+		};
+	};
 
 	int bodycnt;
 	uint32_t schema_version;
diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index 9fd154719..ea1ee1767 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -204,7 +204,9 @@ test_greeting()
 void
 test_xrow_header_encode_decode()
 {
-	plan(10);
+	/* Test all possible 3-bit combinations. */
+	const int bit_comb_count = 1 << 3;
+	plan(1 + bit_comb_count);
 	struct xrow_header header;
 	char buffer[2048];
 	char *pos = mp_encode_uint(buffer, 300);
@@ -217,27 +219,47 @@ test_xrow_header_encode_decode()
 	header.tm = 123.456;
 	header.bodycnt = 0;
 	header.tsn = header.lsn;
-	header.is_commit = true;
 	uint64_t sync = 100500;
-	struct iovec vec[1];
-	is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
-	int fixheader_len = 200;
-	pos = (char *)vec[0].iov_base + fixheader_len;
-	is(mp_decode_map((const char **)&pos), 5, "header map size");
-
-	struct xrow_header decoded_header;
-	const char *begin = (const char *)vec[0].iov_base;
-	begin += fixheader_len;
-	const char *end = (const char *)vec[0].iov_base;
-	end += vec[0].iov_len;
-	is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
-	   "header decode");
-	is(header.type, decoded_header.type, "decoded type");
-	is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
-	is(header.lsn, decoded_header.lsn, "decoded lsn");
-	is(header.tm, decoded_header.tm, "decoded tm");
-	is(decoded_header.sync, sync, "decoded sync");
-	is(decoded_header.bodycnt, 0, "decoded bodycnt");
+	for (int opt_idx = 0; opt_idx < bit_comb_count; opt_idx++) {
+		plan(12);
+		header.is_commit = opt_idx & 0x01;
+		header.wait_sync = opt_idx >> 1 & 0x01;
+		header.wait_ack = opt_idx >> 2 & 0x01;
+		struct iovec vec[1];
+		is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
+		int fixheader_len = 200;
+		pos = (char *)vec[0].iov_base + fixheader_len;
+		uint32_t exp_map_size = 5;
+		/*
+		 * header.is_commit flag isn't encoded, since this row looks
+		 * like a single-statement transaction.
+		 */
+		if (header.wait_sync || header.wait_ack)
+			exp_map_size += 1;
+		/* tsn is encoded explicitly in this case. */
+		if (!header.is_commit)
+			exp_map_size += 1;
+		uint32_t size = mp_decode_map((const char **)&pos);
+		is(size, exp_map_size, "header map size");
+
+		struct xrow_header decoded_header;
+		const char *begin = (const char *)vec[0].iov_base;
+		begin += fixheader_len;
+		const char *end = (const char *)vec[0].iov_base;
+		end += vec[0].iov_len;
+		is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
+		   "header decode");
+		is(header.is_commit, decoded_header.is_commit, "decoded is_commit");
+		is(header.wait_sync, decoded_header.wait_sync, "decoded wait_sync");
+		is(header.wait_ack, decoded_header.wait_ack, "decoded wait_ack");
+		is(header.type, decoded_header.type, "decoded type");
+		is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
+		is(header.lsn, decoded_header.lsn, "decoded lsn");
+		is(header.tm, decoded_header.tm, "decoded tm");
+		is(decoded_header.sync, sync, "decoded sync");
+		is(decoded_header.bodycnt, 0, "decoded bodycnt");
+		check_plan();
+	}
 
 	check_plan();
 }
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index 5ee92ad7b..e06ba5261 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -41,17 +41,120 @@
     ok 39 - invalid 10
     ok 40 - invalid 11
 ok 1 - subtests
-    1..10
+    1..9
     ok 1 - bad msgpack end
-    ok 2 - encode
-    ok 3 - header map size
-    ok 4 - header decode
-    ok 5 - decoded type
-    ok 6 - decoded replica_id
-    ok 7 - decoded lsn
-    ok 8 - decoded tm
-    ok 9 - decoded sync
-    ok 10 - decoded bodycnt
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 2 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 3 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 4 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 5 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 6 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 7 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 8 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 9 - subtests
 ok 2 - subtests
     1..1
     ok 1 - request_str
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-13 14:15   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A PROMOTE entry combines effect of CONFIRM, ROLLBACK and RAFT_TERM
entries with some additional semantics on top.

PROMOTE carries the following arguments:

1) former_leader_id - the id of previous limbo owner whose entries we
   want to confirm.
2) confirm_lsn - the lsn of the last former leader's transaction to be
   confirmed. In this sense PROMOTE(confirm_lsn) replaces
   CONFIRM(confirm_lsn) + ROLLBACK(confirm_lsn + 1).
3) replica_id - id of the instance issuing
   `box.ctl.clear_synchro_queue()`
4) term - the new term the instance issuing
   `box.ctl.clear_synchro_queue()` has just entered.

This entry will be written to WAL instead of the usual CONFIRM +
ROLLBACK pair on a successful `box.ctl.clear_synchro_queue()` call.

Note, the ususal CONFIRM and ROLLBACK occurrences (after a confirmed or
rolled back synchronous transaction) are here to stay.

Part of #5445
---
 src/box/iproto_constants.h | 17 ++++++++++++++++-
 src/box/xrow.c             | 30 +++++++++++++++++++++++++-----
 src/box/xrow.h             | 24 +++++++++++++++++++++++-
 3 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index f7f46088f..816a308d8 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -132,6 +132,18 @@ enum iproto_key {
 	IPROTO_REPLICA_ANON = 0x50,
 	IPROTO_ID_FILTER = 0x51,
 	IPROTO_ERROR = 0x52,
+	/**
+	 * Term. Has the same meaning as IPROTO_RAFT_TERM, but is an iproto
+	 * key, rather than a raft key. Used for PROMOTE request, which needs
+	 * both iproto (e.g. REPLICA_ID) and raft (RAFT_TERM) keys.
+	 */
+	IPROTO_TERM = 0x53,
+	/*
+	 * Be careful to not extend iproto_key values over 0x7f.
+	 * iproto_keys are encoded in msgpack as positive fixnum, which ends at
+	 * 0x7f, and we rely on this in some places by allocating a uint8_t to
+	 * hold a msgpack-encoded key value.
+	 */
 	IPROTO_KEY_MAX
 };
 
@@ -226,6 +238,8 @@ enum iproto_type {
 	IPROTO_TYPE_STAT_MAX,
 
 	IPROTO_RAFT = 30,
+	/** PROMOTE request. */
+	IPROTO_PROMOTE = 31,
 
 	/** A confirmation message for synchronous transactions. */
 	IPROTO_CONFIRM = 40,
@@ -344,7 +358,8 @@ dml_request_key_map(uint32_t type)
 static inline bool
 iproto_type_is_synchro_request(uint32_t type)
 {
-	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK;
+	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK ||
+	       type == IPROTO_PROMOTE;
 }
 
 static inline bool
diff --git a/src/box/xrow.c b/src/box/xrow.c
index cc8e43ed4..70ba075f8 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -890,11 +890,11 @@ xrow_encode_synchro(struct xrow_header *row,
 		    const struct synchro_request *req)
 {
 	/*
-	 * A map with two elements. We don't compress
+	 * A map with two or three elements. We don't compress
 	 * numbers to have this structure constant in size,
 	 * which allows us to preallocate it on stack.
 	 */
-	body->m_body = 0x80 | 2;
+	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
 	body->k_replica_id = IPROTO_REPLICA_ID;
 	body->m_replica_id = 0xce;
 	body->v_replica_id = mp_bswap_u32(req->replica_id);
@@ -903,10 +903,24 @@ xrow_encode_synchro(struct xrow_header *row,
 	body->v_lsn = mp_bswap_u64(req->lsn);
 
 	memset(row, 0, sizeof(*row));
-
 	row->type = req->type;
-	row->body[0].iov_base = (void *)body;
-	row->body[0].iov_len = sizeof(*body);
+
+	/* Promote body is longer. It has an additional IPROTO_TERM field. */
+	if (req->type == IPROTO_PROMOTE) {
+		struct promote_body_bin *promote_body =
+			(struct promote_body_bin *)body;
+
+		promote_body->k_term = IPROTO_TERM;
+		promote_body->m_term = 0xcf;
+		promote_body->v_term = mp_bswap_u64(req->term);
+
+		row->body[0].iov_base = (void *)promote_body;
+		row->body[0].iov_len = sizeof(*promote_body);
+	} else {
+		row->body[0].iov_base = (void *)body;
+		row->body[0].iov_len = sizeof(*body);
+	}
+
 	row->bodycnt = 1;
 }
 
@@ -952,11 +966,17 @@ xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req)
 		case IPROTO_LSN:
 			req->lsn = mp_decode_uint(&d);
 			break;
+		case IPROTO_TERM:
+			req->term = mp_decode_uint(&d);
+			break;
 		default:
 			mp_next(&d);
 		}
 	}
+
 	req->type = row->type;
+	req->origin_id = row->replica_id;
+
 	return 0;
 }
 
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 2a18733c0..af4ad0d12 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -226,7 +226,10 @@ xrow_encode_dml(const struct request *request, struct region *region,
  * pending synchronous transactions.
  */
 struct synchro_request {
-	/** Operation type - IPROTO_ROLLBACK or IPROTO_CONFIRM. */
+	/**
+	 * Operation type - either IPROTO_ROLLBACK or IPROTO_CONFIRM or
+	 * IPROTO_PROMOTE
+	 */
 	uint32_t type;
 	/**
 	 * ID of the instance owning the pending transactions.
@@ -236,14 +239,25 @@ struct synchro_request {
 	 * finish transactions of an old master.
 	 */
 	uint32_t replica_id;
+	/**
+	 * Id of the instance which has issued this request. Only filled on
+	 * decoding, and left blank when encoding a request.
+	 */
+	uint32_t origin_id;
 	/**
 	 * Operation LSN.
 	 * In case of CONFIRM it means 'confirm all
 	 * transactions with lsn <= this value'.
 	 * In case of ROLLBACK it means 'rollback all transactions
 	 * with lsn >= this value'.
+	 * In case of PROMOTE it means CONFIRM(lsn) + ROLLBACK(lsn+1)
 	 */
 	int64_t lsn;
+	/**
+	 * The new term the instance issuing this request is in. Only used for
+	 * PROMOTE request.
+	 */
+	uint64_t term;
 };
 
 /** Synchro request xrow's body in MsgPack format. */
@@ -257,6 +271,14 @@ struct PACKED synchro_body_bin {
 	uint64_t v_lsn;
 };
 
+/** PROMOTE request's xrow body in MsgPack format. */
+struct PACKED promote_body_bin {
+	struct synchro_body_bin base;
+	uint8_t k_term;
+	uint8_t m_term;
+	uint64_t v_term;
+};
+
 /**
  * Encode synchronous replication request.
  * @param row xrow header.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 3/9] box: actualise iproto_key_type array
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

iproto_key_type array is used while validating incoming requests, but it
was only half-filled. The last initialized field was 0x2b, while
IPROTO_KEY_MAX is currently 0x54.

We got away with it, since the array is only  used in xrow_header_decode(),
xrow_decode_dml() and xrow_decode_synchro(), and all the keys usually present
in these requests were present in the array. This is not true anymore,
so it's time to make array contents up to date with all the IPROTO_KEY_*
constants we have.

Part of #5445
---
 src/box/iproto_constants.c | 58 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 029d9888c..addda39dc 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -90,6 +90,64 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x2a */	MP_MAP, /* IPROTO_TUPLE_META */
 	/* 0x2b */	MP_MAP, /* IPROTO_OPTIONS */
 	/* }}} */
+
+	/* {{{ unused */
+	/* 0x2c */	MP_UINT,
+	/* 0x2d */	MP_UINT,
+	/* 0x2e */	MP_UINT,
+	/* 0x2f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- response keys */
+	/* 0x30 */	MP_ARRAY, /* IPROTO_DATA */
+	/* 0x31 */	MP_STR, /* IPROTO_ERROR_24 */
+	/* 0x32 */	MP_ARRAY, /* IPROTO_METADATA */
+	/* 0x33 */	MP_ARRAY, /* IPROTO_BIND_METADATA */
+	/* 0x34 */	MP_UINT, /* IIPROTO_BIND_COUNT */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x35 */	MP_UINT,
+	/* 0x36 */	MP_UINT,
+	/* 0x37 */	MP_UINT,
+	/* 0x38 */	MP_UINT,
+	/* 0x39 */	MP_UINT,
+	/* 0x3a */	MP_UINT,
+	/* 0x3b */	MP_UINT,
+	/* 0x3c */	MP_UINT,
+	/* 0x3d */	MP_UINT,
+	/* 0x3e */	MP_UINT,
+	/* 0x3f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- sql keys */
+	/* 0x40 */	MP_STR, /* IPROTO_SQL_TEXT */
+	/* 0x41 */	MP_ARRAY, /* IPROTO_SQL_BIND */
+	/* 0x42 */	MP_MAP, /* IPROTO_SQL_INFO */
+	/* 0x43 */	MP_UINT, /* IPROTO_STMT_ID */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x44 */	MP_UINT,
+	/* 0x45 */	MP_UINT,
+	/* 0x46 */	MP_UINT,
+	/* 0x47 */	MP_UINT,
+	/* 0x48 */	MP_UINT,
+	/* 0x49 */	MP_UINT,
+	/* 0x4a */	MP_UINT,
+	/* 0x4b */	MP_UINT,
+	/* 0x4c */	MP_UINT,
+	/* 0x4d */	MP_UINT,
+	/* 0x4e */	MP_UINT,
+	/* 0x4f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- additional request keys */
+	/* 0x50 */	MP_BOOL, /* IPROTO_REPLICA_ANON */
+	/* 0x51 */	MP_ARRAY, /* IPROTO_ID_FILTER */
+	/* 0x52 */	MP_MAP, /* IPROTO_ERROR */
+	/* 0x53 */	MP_UINT, /* IPROTO_TERM */
+	/* }}} */
 };
 
 const char *iproto_type_strs[] =
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-13 14:33   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A successful box_clear_synchro_queue() call results in writing
CONFIRM(N) ROLLBACK(N+1) pair, where N is  the confirmed lsn.

Let's write a single PROMOTE(N) entry instead. It'll have  the same
meaning as CONFIRM + ROLLBACK and it will give followers some additional
information regarding leader state change later.

Part of #5445
---
 src/box/applier.cc         |  4 +-
 src/box/box.cc             | 14 ++++++-
 src/box/iproto_constants.h |  5 +++
 src/box/txn_limbo.c        | 79 +++++++++++++++++++++-----------------
 src/box/txn_limbo.h        | 10 ++++-
 5 files changed, 72 insertions(+), 40 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 971b2e64c..e8cbbe27a 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -763,7 +763,7 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 
 struct synchro_entry {
 	/** Encoded form of a synchro record. */
-	struct synchro_body_bin	body_bin;
+	struct promote_body_bin body_bin;
 
 	/** xrow to write, used by the journal engine. */
 	struct xrow_header row;
@@ -822,7 +822,7 @@ synchro_entry_new(struct xrow_header *applier_row,
 	}
 
 	struct journal_entry *journal_entry = &entry->journal_entry;
-	struct synchro_body_bin *body_bin = &entry->body_bin;
+	struct synchro_body_bin *body_bin = &entry->body_bin.base;
 	struct xrow_header *row = &entry->row;
 
 	journal_entry->rows[0] = row;
diff --git a/src/box/box.cc b/src/box/box.cc
index b846ba8f5..8aba051a2 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
-			txn_limbo_force_empty(&txn_limbo, wait_lsn);
+			/*
+			 * Term parameter is unused now, We'll pass
+			 * box_raft()->term there later.
+			 */
+			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			struct synchro_request req = {
+				.type = 0, /* unused */
+				.replica_id = 0, /* unused */
+				.origin_id = instance_id,
+				.lsn = wait_lsn,
+				.term = 0, /* unused */
+			};
+			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
 		}
 	}
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index 816a308d8..da78ac4d4 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -362,6 +362,11 @@ iproto_type_is_synchro_request(uint32_t type)
 	       type == IPROTO_PROMOTE;
 }
 
+static inline bool
+iproto_type_is_promote_request(uint32_t type)
+{
+	return type == IPROTO_PROMOTE;
+}
 static inline bool
 iproto_type_is_raft_request(uint32_t type)
 {
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index a22e0861a..f119c35b6 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -317,21 +317,25 @@ txn_limbo_write_cb(struct journal_entry *entry)
 }
 
 static void
-txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
+txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn,
+			uint64_t term)
 {
-	assert(lsn > 0);
+	assert(lsn >= 0);
 
 	struct synchro_request req = {
 		.type		= type,
 		.replica_id	= limbo->owner_id,
 		.lsn		= lsn,
+		.term		= term,
 	};
 
 	/*
-	 * This is a synchronous commit so we can
-	 * allocate everything on a stack.
+	 * This is a synchronous commit so we can allocate everything on a
+	 * stack. Promote body includes synchro body.
 	 */
-	struct synchro_body_bin body;
+	struct promote_body_bin body;
+	struct synchro_body_bin *base = &body.base;
+
 	struct xrow_header row;
 	char buf[sizeof(struct journal_entry) +
 		 sizeof(struct xrow_header *)];
@@ -339,7 +343,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
 	struct journal_entry *entry = (struct journal_entry *)buf;
 	entry->rows[0] = &row;
 
-	xrow_encode_synchro(&row, &body, &req);
+	xrow_encode_synchro(&row, base, &req);
 
 	journal_entry_create(entry, 1, xrow_approx_len(&row),
 			     txn_limbo_write_cb, fiber());
@@ -371,14 +375,14 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->confirmed_lsn = lsn;
-	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn, 0);
 }
 
 /** Confirm all the entries <= @a lsn. */
 static void
 txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
@@ -434,7 +438,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->is_in_rollback = true;
-	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn, 0);
 	limbo->is_in_rollback = false;
 }
 
@@ -442,7 +446,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 static void
 txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	struct txn_limbo_entry *last_rollback = NULL;
@@ -490,6 +494,32 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 		box_update_ro_summary();
 }
 
+void
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
+{
+	limbo->confirmed_lsn = lsn;
+	/*
+	 * We make sure that promote is only written once everything this
+	 * instance has may be confirmed.
+	 */
+	struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
+	assert(e == NULL || e->lsn <= lsn);
+	(void) e;
+	txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
+	limbo->is_in_rollback = false;
+}
+
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req)
+{
+	txn_limbo_read_confirm(limbo, req->lsn);
+	txn_limbo_read_rollback(limbo, req->lsn + 1);
+	assert(txn_limbo_is_empty(&txn_limbo));
+	limbo->owner_id = req->origin_id;
+	limbo->confirmed_lsn = 0;
+}
+
 void
 txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 {
@@ -652,38 +682,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	case IPROTO_ROLLBACK:
 		txn_limbo_read_rollback(limbo, req->lsn);
 		break;
+	case IPROTO_PROMOTE:
+		txn_limbo_read_promote(limbo, req);
+		break;
 	default:
 		unreachable();
 	}
 	return;
 }
 
-void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t confirm_lsn)
-{
-	struct txn_limbo_entry *e, *last_quorum = NULL;
-	struct txn_limbo_entry *rollback = NULL;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
-		if (txn_has_flag(e->txn, TXN_WAIT_ACK)) {
-			if (e->lsn <= confirm_lsn) {
-				last_quorum = e;
-			} else {
-				rollback = e;
-				break;
-			}
-		}
-	}
-
-	if (last_quorum != NULL) {
-		txn_limbo_write_confirm(limbo, last_quorum->lsn);
-		txn_limbo_read_confirm(limbo, last_quorum->lsn);
-	}
-	if (rollback != NULL) {
-		txn_limbo_write_rollback(limbo, rollback->lsn);
-		txn_limbo_read_rollback(limbo, rollback->lsn);
-	}
-}
-
 void
 txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index f2a98c8bb..10db4fc2d 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -279,7 +279,15 @@ txn_limbo_wait_confirm(struct txn_limbo *limbo);
  * immediately.
  */
 void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t last_confirm);
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term);
+
+/**
+ * Process a PROMOTE request, i.e. confirm all entries <= @lsn and rollback all
+ * entries > @lsn.
+ */
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req);
 
 /**
  * Update qsync parameters dynamically.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 5/9] box: write PROMOTE even for empty limbo
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

PROMOTE entry will be used to mark limbo ownership transition besides
emptying the limbo. So it has to be written every time
`box.ctl.clear_synchro_queue()` succeeds. Even when the limbo was
already empty.

Part of #5445
---
 src/box/box.cc | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8aba051a2..b093341d3 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1502,19 +1502,18 @@ box_clear_synchro_queue(bool try_wait)
 			 "simultaneous invocations");
 		return -1;
 	}
-	/*
-	 * XXX: we may want to write confirm + rollback even when the limbo is
-	 * empty for the sake of limbo ownership transition.
-	 */
-	if (!is_box_configured || txn_limbo_is_empty(&txn_limbo))
+
+	if (!is_box_configured)
 		return 0;
 	uint32_t former_leader_id = txn_limbo.owner_id;
-	assert(former_leader_id != REPLICA_ID_NIL);
-	if (former_leader_id == instance_id)
-		return 0;
-
+	int64_t wait_lsn = txn_limbo.confirmed_lsn;
+	int rc = 0;
+	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (txn_limbo_is_empty(&txn_limbo))
+		goto promote;
+
 	if (try_wait) {
 		/* Wait until pending confirmations/rollbacks reach us. */
 		double timeout = 2 * replication_synchro_timeout;
@@ -1528,8 +1527,9 @@ box_clear_synchro_queue(bool try_wait)
 		 * Our mission was to clear the limbo from former leader's
 		 * transactions. Exit in case someone did that for us.
 		 */
-		if (txn_limbo_is_empty(&txn_limbo) ||
-		    former_leader_id != txn_limbo.owner_id) {
+		if (former_leader_id != txn_limbo.owner_id) {
+			//TODO: error once we see someone else became the leader
+			// already.
 			in_clear_synchro_queue = false;
 			return 0;
 		}
@@ -1540,11 +1540,10 @@ box_clear_synchro_queue(bool try_wait)
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
-	int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
+	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
 	assert(wait_lsn > 0);
 
-	int quorum = replication_synchro_quorum;
-	int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
+	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
 				 replication_synchro_timeout);
 	if (rc == 0) {
 		if (quorum < replication_synchro_quorum) {
@@ -1556,6 +1555,7 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
+promote:
 			/*
 			 * Term parameter is unused now, We'll pass
 			 * box_raft()->term there later.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 6/9] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Start writing the actual leader term together with the PROMOTE request
and process terms in PROMOTE requests on receiver side.

Make applier only apply synchronous transactions from the instance which
has the greatest term as received in PROMOTE requests.

Closes #5445
---
 ...very => qsync-multi-statement-recovery.md} |   0
 changelogs/unreleased/raft-promote.md         |   4 +
 src/box/applier.cc                            |  18 ++
 src/box/box.cc                                |  24 +-
 src/lib/raft/raft.c                           |   1 +
 src/lib/raft/raft.h                           |  46 ++++
 .../gh-5445-leader-inconsistency.result       | 238 ++++++++++++++++++
 .../gh-5445-leader-inconsistency.test.lua     | 108 ++++++++
 test/replication/suite.cfg                    |   1 +
 9 files changed, 433 insertions(+), 7 deletions(-)
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md
 create mode 100644 test/replication/gh-5445-leader-inconsistency.result
 create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua

diff --git a/changelogs/unreleased/qsync-multi-statement-recovery b/changelogs/unreleased/qsync-multi-statement-recovery.md
similarity index 100%
rename from changelogs/unreleased/qsync-multi-statement-recovery
rename to changelogs/unreleased/qsync-multi-statement-recovery.md
diff --git a/changelogs/unreleased/raft-promote.md b/changelogs/unreleased/raft-promote.md
new file mode 100644
index 000000000..e5dac599c
--- /dev/null
+++ b/changelogs/unreleased/raft-promote.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a bug in synchronous replication when rolled back transactions could
+  reappear once a sufficiently old instance reconnected (gh-5445).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index e8cbbe27a..926d2f7ea 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -849,6 +849,9 @@ apply_synchro_row(struct xrow_header *row)
 
 	txn_limbo_process(&txn_limbo, &req);
 
+	if (req.type == IPROTO_PROMOTE)
+		raft_source_update_term(box_raft(), req.origin_id, req.term);
+
 	struct synchro_entry *entry;
 	entry = synchro_entry_new(row, &req);
 	if (entry == NULL)
@@ -1044,6 +1047,21 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		}
 	}
 
+	/*
+	 * All the synchronous rows coming from outdated instances are ignored
+	 * and replaced with NOPs to save vclock consistency.
+	 */
+	struct applier_tx_row *item;
+	if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&
+	    (last_row->wait_sync ||
+	     (iproto_type_is_synchro_request(first_row->type) &&
+	     !iproto_type_is_promote_request(first_row->type)))) {
+		stailq_foreach_entry(item, rows, next) {
+			struct xrow_header *row = &item->row;
+			row->type = IPROTO_NOP;
+			row->bodycnt = 0;
+		}
+	}
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		/*
 		 * Synchro messages are not transactions, in terms
diff --git a/src/box/box.cc b/src/box/box.cc
index b093341d3..aae57ec29 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
 		return -1;
 	}
 	txn_limbo_process(&txn_limbo, &syn_req);
+	if (syn_req.type == IPROTO_PROMOTE) {
+			raft_source_update_term(box_raft(), syn_req.origin_id,
+						syn_req.term);
+	}
 	return 0;
 }
 
@@ -1503,7 +1507,12 @@ box_clear_synchro_queue(bool try_wait)
 		return -1;
 	}
 
-	if (!is_box_configured)
+	/*
+	 * Do nothing when box isn't configured and when PROMOTE was already
+	 * written for this term.
+	 */
+	if (!is_box_configured ||
+	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
 		return 0;
 	uint32_t former_leader_id = txn_limbo.owner_id;
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
@@ -1556,20 +1565,21 @@ box_clear_synchro_queue(bool try_wait)
 			rc = -1;
 		} else {
 promote:
-			/*
-			 * Term parameter is unused now, We'll pass
-			 * box_raft()->term there later.
-			 */
-			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			/* We cannot possibly get here in a volatile state. */
+			assert(box_raft()->volatile_term == box_raft()->term);
+			txn_limbo_write_promote(&txn_limbo, wait_lsn,
+						box_raft()->term);
 			struct synchro_request req = {
 				.type = 0, /* unused */
 				.replica_id = 0, /* unused */
 				.origin_id = instance_id,
 				.lsn = wait_lsn,
-				.term = 0, /* unused */
+				.term = box_raft()->term,
 			};
 			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
+			raft_source_update_term(box_raft(), req.origin_id,
+						req.term);
 		}
 	}
 	in_clear_synchro_queue = false;
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 4ea4fc3f8..e9ce8cade 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -985,6 +985,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
 		.death_timeout = 5,
 		.vtab = vtab,
 	};
+	vclock_create(&raft->term_map);
 	raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
 			   0, 0);
 	raft->timer.data = raft;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e447f6634..40c8630e9 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -207,6 +207,19 @@ struct raft {
 	 * subsystems, such as Raft.
 	 */
 	const struct vclock *vclock;
+	/**
+	 * The biggest term seen by this instance and persisted in WAL as part
+	 * of a PROMOTE request. May be smaller than @a term, while there are
+	 * ongoing elections, or the leader is already known, but this instance
+	 * hasn't read its PROMOTE request yet.
+	 * During other times must be equal to @a term.
+	 */
+	uint64_t greatest_known_term;
+	/**
+	 * Latest terms received with PROMOTE entries from remote instances.
+	 * Raft uses them to determine data from which sources may be applied.
+	 */
+	struct vclock term_map;
 	/** State machine timed event trigger. */
 	struct ev_timer timer;
 	/** Configured election timeout in seconds. */
@@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
 	return !raft->is_enabled || raft->leader == source_id;
 }
 
+/**
+ * Return the latest term as seen in PROMOTE requests from instance with id
+ * @a source_id.
+ */
+static inline uint64_t
+raft_source_term(const struct raft *raft, uint32_t source_id)
+{
+	assert(source_id != 0 && source_id < VCLOCK_MAX);
+	return vclock_get(&raft->term_map, source_id);
+}
+
+/**
+ * Check whether replica with id @a source_id is too old to apply synchronous
+ * data from it. The check remains valid  even when elections are disabled.
+ */
+static inline bool
+raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
+{
+	uint64_t source_term = vclock_get(&raft->term_map, source_id);
+	return source_term < raft->greatest_known_term;
+}
+
+/** Remember the last term seen for replica  with id @a source_id. */
+static inline void
+raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term)
+{
+	if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)
+		return;
+	vclock_follow(&raft->term_map, source_id, term);
+	if (term > raft->greatest_known_term)
+		raft->greatest_known_term = term;
+}
+
 /** Check if Raft is enabled. */
 static inline bool
 raft_is_enabled(const struct raft *raft)
diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
new file mode 100644
index 000000000..b1f8a4ed1
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.result
@@ -0,0 +1,238 @@
+-- test-run result file version 2
+test_run = require("test_run").new()
+ | ---
+ | ...
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+ | ---
+ | ...
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+ | ---
+ | - true
+ | ...
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+ | ---
+ | ...
+
+function name(id)
+    return 'election_replica'..id
+end;
+ | ---
+ | ...
+test_run:cmd('setopt delimiter ""');
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+ | ---
+ | ...
+leader_nr = get_leader(is_possible_leader)
+ | ---
+ | ...
+leader = name(leader_nr)
+ | ---
+ | ...
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+ | ---
+ | ...
+next_leader = name(next_leader_nr)
+ | ---
+ | ...
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+ | ---
+ | ...
+other = name(other_nr)
+ | ---
+ | ...
+
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+box.ctl.wait_rw()
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_sync=true})
+ | ---
+ | ...
+_ = box.space.test:create_index('pk')
+ | ---
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..next_leader)
+ | ---
+ | - true
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+ | ---
+ | ...
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+ | ---
+ | ...
+fib:status()
+ | ---
+ | - suspended
+ | ...
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+-- Old leader is gone.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..leader)
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = false
+ | ---
+ | ...
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+ | ---
+ | - true
+ | ...
+assert(get_leader(is_possible_leader) == next_leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- Make old leader win the elections.
+test_run:cmd('start server '..leader..' with args="1 0.4 candidate 1"')
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = true
+ | ---
+ | ...
+assert(get_leader(is_possible_leader) == leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+test_run:wait_upstream(1, {status='follow'})
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-5445-leader-inconsistency.test.lua b/test/replication/gh-5445-leader-inconsistency.test.lua
new file mode 100644
index 000000000..94beea966
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.test.lua
@@ -0,0 +1,108 @@
+test_run = require("test_run").new()
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+
+function name(id)
+    return 'election_replica'..id
+end;
+test_run:cmd('setopt delimiter ""');
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+test_run:wait_fullmesh(SERVERS)
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+leader_nr = get_leader(is_possible_leader)
+leader = name(leader_nr)
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+next_leader = name(next_leader_nr)
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+other = name(other_nr)
+
+test_run:switch(leader)
+box.ctl.wait_rw()
+_ = box.schema.space.create('test', {is_sync=true})
+_ = box.space.test:create_index('pk')
+box.space.test:insert{1}
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+test_run:cmd('stop server '..next_leader)
+test_run:switch(leader)
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+fib:status()
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+test_run:switch(other)
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.cfg{election_mode='voter'}
+-- Old leader is gone.
+test_run:switch('default')
+test_run:cmd('stop server '..leader)
+is_possible_leader[leader_nr] = false
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+assert(get_leader(is_possible_leader) == next_leader_nr)
+test_run:switch(other)
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+
+test_run:switch('default')
+test_run:switch(next_leader)
+box.space.test:select{} -- 1
+box.cfg{election_mode='voter'}
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+test_run:switch('default')
+-- Make old leader win the elections.
+test_run:cmd('start server '..leader..' with args="1 0.4 candidate 1"')
+is_possible_leader[leader_nr] = true
+assert(get_leader(is_possible_leader) == leader_nr)
+test_run:switch(next_leader)
+box.space.test:select{} -- 1
+test_run:wait_upstream(1, {status='follow'})
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index aff5fda26..8ae2fc14d 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -17,6 +17,7 @@
     "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
     "gh-5426-election-on-off.test.lua": {},
     "gh-5433-election-restart-recovery.test.lua": {},
+    "gh-5445-leader-inconsistency.test.lua": {},
     "gh-5506-election-on-off.test.lua": {},
     "once.test.lua": {},
     "on_replace.test.lua": {},
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 7/9] replication: introduce a new election mode: "manual"
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

When an instance is configured in "manual" election mode, it behaves as
a voter for most of the time, until `box.ctl.promote()` is called.

Once `box.ctl.promote()` is called the instance will behave as a
candidate for a full election round, e.g. until the leader is known. If
the instance wins the elections, it remains in `leader` state until the
next elections. Otherwise the instance returns to `voter` state.

Follow-up #5445
Part of #3055
---
 src/box/box.cc                         | 37 +++++++++++++++++---------
 src/box/raft.c                         |  2 ++
 src/box/raft.h                         | 17 ++++++++++++
 test/replication/election_basic.result |  4 +--
 4 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index aae57ec29..b77b0a08d 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -680,17 +680,27 @@ box_check_uri(const char *source, const char *option_name)
 	}
 }
 
-static const char *
+static enum election_mode
 box_check_election_mode(void)
 {
 	const char *mode = cfg_gets("election_mode");
-	if (mode == NULL || (strcmp(mode, "off") != 0 &&
-	    strcmp(mode, "voter") != 0 && strcmp(mode, "candidate") != 0)) {
-		diag_set(ClientError, ER_CFG, "election_mode", "the value must "
-			 "be a string 'off' or 'voter' or 'candidate'");
-		return NULL;
-	}
-	return mode;
+	if (mode == NULL)
+		goto error;
+
+	if (strcmp(mode, "off") == 0)
+		return ELECTION_MODE_OFF;
+	else if (strcmp(mode, "voter") == 0)
+		return ELECTION_MODE_VOTER;
+	else if (strcmp(mode, "manual") == 0)
+		return ELECTION_MODE_MANUAL;
+	else if (strcmp(mode, "candidate") == 0)
+		return ELECTION_MODE_CANDIDATE;
+
+error:
+	diag_set(ClientError, ER_CFG, "election_mode",
+		"the value must be one of the following strings: "
+		"'off', 'voter', 'candidate', 'manual'");
+	return ELECTION_MODE_INVALID;
 }
 
 static double
@@ -1113,7 +1123,7 @@ box_check_config(void)
 	box_check_uri(cfg_gets("listen"), "listen");
 	box_check_instance_uuid(&uuid);
 	box_check_replicaset_uuid(&uuid);
-	if (box_check_election_mode() == NULL)
+	if (box_check_election_mode() == ELECTION_MODE_INVALID)
 		diag_raise();
 	if (box_check_election_timeout() < 0)
 		diag_raise();
@@ -1147,11 +1157,12 @@ box_check_config(void)
 int
 box_set_election_mode(void)
 {
-	const char *mode = box_check_election_mode();
-	if (mode == NULL)
+	enum election_mode mode = box_check_election_mode();
+	if (mode == ELECTION_MODE_INVALID)
 		return -1;
-	raft_cfg_is_candidate(box_raft(), strcmp(mode, "candidate") == 0);
-	raft_cfg_is_enabled(box_raft(), strcmp(mode, "off") != 0);
+	box_election_mode = mode;
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
 
diff --git a/src/box/raft.c b/src/box/raft.c
index cfd898db0..285dbe4fd 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -44,6 +44,8 @@ struct raft box_raft_global = {
 	.state = 0,
 };
 
+enum election_mode box_election_mode = ELECTION_MODE_INVALID;
+
 /**
  * A trigger executed each time the Raft state machine updates any
  * of its visible attributes.
diff --git a/src/box/raft.h b/src/box/raft.h
index 1c59f17e6..15f4e80d9 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,8 +35,25 @@
 extern "C" {
 #endif
 
+enum election_mode {
+	ELECTION_MODE_INVALID = -1,
+	ELECTION_MODE_OFF = 0,
+	ELECTION_MODE_VOTER = 1,
+	ELECTION_MODE_MANUAL = 2,
+	ELECTION_MODE_CANDIDATE = 3,
+};
+
 struct raft_request;
 
+/**
+ * box_election_mode - current mode of operation for raft. Some modes correspond
+ * to RAFT operation modes directly, like CANDIDATE, VOTER and OFF.
+ * There's a mode which does not map to raft operation mode directly:
+ * MANUAL. In this mode RAFT usually operates as a voter, but it may become a
+ * candidate for some period of time when user calls `box.ctl.promote()`
+ */
+extern enum election_mode box_election_mode;
+
 /** Raft state of this instance. */
 static inline struct raft *
 box_raft(void)
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index 4d7d33f2b..d5320b3ff 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -22,8 +22,8 @@ box.cfg{election_mode = 100}
  | ...
 box.cfg{election_mode = '100'}
  | ---
- | - error: 'Incorrect value for option ''election_mode'': the value must be a string
- |     ''off'' or ''voter'' or ''candidate'''
+ | - error: 'Incorrect value for option ''election_mode'': the value must be one of the
+ |     following strings: ''off'', ''voter'', ''candidate'', ''manual'''
  | ...
 box.cfg{election_timeout = -1}
  | ---
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 8/9] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (6 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
  2021-04-13 14:42 ` [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

This patch adds support for manual elections from
`box.ctl.clear_synchro_queue()`. When an instance is in
`election_mode='manual'`, calling `clear_synchro_queue()` will make it
start a new election round.

Follow-up #5445
Part of #3055

@TarantoolBot document
Title: describe election_mode='manual'

Manual election mode is introduced. It may be used when the user wants to
control which instance is the leader explicitly instead of relying on
Raft election algorithm.

When an instance is configured with `election_mode='manual'`, it behaves
as follows:
 1) By default, the instance acts like a voter: it is read-only and may
    vote for other instances that are candidates.
 2) Once `box.ctl.clear_synchro_queue()` is called, the instance becomes a
    candidate and starts a new election round. If the instance wins the
    elections, it remains leader, but won't participate in any new elections.
---
 src/box/box.cc              | 74 +++++++++++++++++++++++++++++++++++--
 src/box/errcode.h           |  3 ++
 src/box/raft.c              | 25 ++++++++++++-
 src/box/raft.h              |  3 ++
 src/lib/raft/raft.c         | 12 +++++-
 src/lib/raft/raft.h         |  2 +-
 test/box/error.result       |  3 ++
 test/unit/raft_test_utils.c |  4 +-
 8 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index b77b0a08d..dc7f434e4 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1161,7 +1161,8 @@ box_set_election_mode(void)
 	if (mode == ELECTION_MODE_INVALID)
 		return -1;
 	box_election_mode = mode;
-	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE,
+			      true);
 	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
@@ -1525,12 +1526,77 @@ box_clear_synchro_queue(bool try_wait)
 	if (!is_box_configured ||
 	    raft_source_term(box_raft(), instance_id) == box_raft()->term)
 		return 0;
+
+	bool run_elections = false;
+
+	switch (box_election_mode) {
+	case ELECTION_MODE_OFF:
+		break;
+	case ELECTION_MODE_VOTER:
+		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
+			 "manual elections");
+		return -1;
+	case ELECTION_MODE_MANUAL:
+		/*
+		 * Even if this instance is already a leader, allow to bump term
+		 * manually once again.
+		 */
+		assert(box_raft()->state != RAFT_STATE_CANDIDATE);
+		if (box_raft()->state == RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_ALREADY_LEADER);
+			return -1;
+		}
+		run_elections = true;
+		try_wait = false;
+		break;
+	case ELECTION_MODE_CANDIDATE:
+		/*
+		 * Leader elections are enabled, and this instance is allowed to
+		 * promote only if it's already an elected leader. No manual
+		 * elections.
+		 */
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
+				 "'candidate'", "manual elections");
+			return -1;
+		}
+		break;
+	default:
+		unreachable();
+	}
+
 	uint32_t former_leader_id = txn_limbo.owner_id;
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (run_elections) {
+		/*
+		 * Make this instance a candidate and run until some leader, not
+		 * necessarily this instance, emerges.
+		 */
+		raft_cfg_is_candidate(box_raft(), true, false);
+		/*
+		 * Trigger new elections without waiting for an old leader to
+		 * disappear.
+		 */
+		raft_new_term(box_raft());
+		box_raft_wait_leader_found();
+		raft_cfg_is_candidate(box_raft(), false, false);
+		if (!box_raft()->is_enabled) {
+			diag_set(ClientError, ER_RAFT_DISABLED);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 box_raft()->leader);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+	}
+
 	if (txn_limbo_is_empty(&txn_limbo))
 		goto promote;
 
@@ -1548,10 +1614,10 @@ box_clear_synchro_queue(bool try_wait)
 		 * transactions. Exit in case someone did that for us.
 		 */
 		if (former_leader_id != txn_limbo.owner_id) {
-			//TODO: error once we see someone else became the leader
-			// already.
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 txn_limbo.owner_id);
 			in_clear_synchro_queue = false;
-			return 0;
+			return -1;
 		}
 	}
 
diff --git a/src/box/errcode.h b/src/box/errcode.h
index 56573688e..e5c9f3b09 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -275,6 +275,9 @@ struct errcode_record {
 	/*220 */_(ER_TOO_EARLY_SUBSCRIBE,	"Can't subscribe non-anonymous replica %s until join is done") \
 	/*221 */_(ER_SQL_CANT_ADD_AUTOINC,	"Can't add AUTOINCREMENT: space %s can't feature more than one AUTOINCREMENT field") \
 	/*222 */_(ER_QUORUM_WAIT,		"Couldn't wait for quorum %d: %s") \
+	/*223 */_(ER_INTERFERING_PROMOTE,	"Instance with replica id %u was promoted first") \
+	/*224 */_(ER_RAFT_DISABLED,		"Elections were turned off while running box.ctl.promote()")\
+	/*225 */_(ER_ALREADY_LEADER,		"Can't promote an existing leader")\
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/raft.c b/src/box/raft.c
index 285dbe4fd..47d4fd56d 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -95,7 +95,8 @@ box_raft_update_synchro_queue(struct raft *raft)
 	 * manually. In this case the call below will exit immediately and we'll
 	 * simply log a warning.
 	 */
-	if (raft->state == RAFT_STATE_LEADER) {
+	if (raft->state == RAFT_STATE_LEADER &&
+	    box_election_mode != ELECTION_MODE_MANUAL) {
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
@@ -336,6 +337,28 @@ fail:
 	panic("Could not write a raft request to WAL\n");
 }
 
+static int
+box_raft_wait_leader_found_trig(struct trigger *trig, void *event)
+{
+	struct raft *raft = (struct raft *)event;
+	assert(raft == box_raft());
+	struct fiber *waiter = (struct fiber *)trig->data;
+	if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
+		fiber_wakeup(waiter);
+	return 0;
+}
+
+void
+box_raft_wait_leader_found(void)
+{
+	struct trigger trig;
+	trigger_create(&trig, box_raft_wait_leader_found_trig, fiber(), NULL);
+	raft_on_update(box_raft(), &trig);
+	fiber_yield();
+	assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
+	trigger_clear(&trig);
+}
+
 void
 box_raft_init(void)
 {
diff --git a/src/box/raft.h b/src/box/raft.h
index 15f4e80d9..8fce423e1 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
 int
 box_raft_process(struct raft_request *req, uint32_t source);
 
+void
+box_raft_wait_leader_found();
+
 void
 box_raft_init(void);
 
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e9ce8cade..7b77e05ea 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
 }
 
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)
 {
 	bool old_is_candidate = raft->is_candidate;
 	raft->is_cfg_candidate = is_candidate;
@@ -874,8 +874,16 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
 			raft_ev_timer_stop(raft_loop(), &raft->timer);
 		}
 		if (raft->state != RAFT_STATE_FOLLOWER) {
-			if (raft->state == RAFT_STATE_LEADER)
+			if (raft->state == RAFT_STATE_LEADER) {
+				if (!demote) {
+					/*
+					 * Remain leader until someone
+					 * triggers new elections.
+					 */
+					return;
+				}
 				raft->leader = 0;
+			}
 			raft->state = RAFT_STATE_FOLLOWER;
 			/* State is visible and changed - broadcast. */
 			raft_schedule_broadcast(raft);
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 40c8630e9..3526460af 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -325,7 +325,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
  * the node still can vote, when Raft is enabled.
  */
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote);
 
 /** Configure Raft leader election timeout. */
 void
diff --git a/test/box/error.result b/test/box/error.result
index 7761c6949..dad6a21d3 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -441,6 +441,9 @@ t;
  |   220: box.error.TOO_EARLY_SUBSCRIBE
  |   221: box.error.SQL_CANT_ADD_AUTOINC
  |   222: box.error.QUORUM_WAIT
+ |   223: box.error.INTERFERING_PROMOTE
+ |   224: box.error.RAFT_DISABLED
+ |   225: box.error.ALREADY_LEADER
  | ...
 
 test_run:cmd("setopt delimiter ''");
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b8735f373..a10ccae6a 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -360,7 +360,7 @@ raft_node_start(struct raft_node *node)
 		raft_process_recovery(&node->raft, &node->journal.rows[i]);
 
 	raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
-	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
+	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate, true);
 	raft_cfg_election_timeout(&node->raft, node->cfg_election_timeout);
 	raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
 	raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
@@ -402,7 +402,7 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
 {
 	node->cfg_is_candidate = value;
 	if (raft_node_is_started(node)) {
-		raft_cfg_is_candidate(&node->raft, value);
+		raft_cfg_is_candidate(&node->raft, value, true);
 		raft_run_async_work();
 	}
 }
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Tarantool-patches] [PATCH v2 9/9] box.ctl: rename clear_synchro_queue to promote
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (7 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:40 ` Serge Petrenko via Tarantool-patches
  2021-04-13 14:42 ` [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches
  9 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:40 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

New function name will be `box.ctl.promote()`. It's much shorter and
closer to the function's now enriched functionality.

Old name `box.ctl.clear_synchro_queue()` remains in Lua for the sake of
backward compatibility.

Follow-up #5445
Closes #3055

@TarantoolBot document
Title: deprecate `box.ctl.clear_synchro_queue()` in favor of `box.ctl.promote()`

Replace all the mentions of `box.ctl.clear_synchro_queue()` with
`box.ctl.promote()` and add a note that `box.ctl.clear_synchro_queue()`
is a deprecated alias to `box.ctl.promote()`
---
 changelogs/unreleased/box-ctl-promote.md      |   8 ++
 src/box/box.cc                                |  20 ++--
 src/box/box.h                                 |   2 +-
 src/box/lua/ctl.c                             |   8 +-
 src/box/raft.c                                |   4 +-
 test/replication/election_basic.result        |  25 +++++
 test/replication/election_basic.test.lua      |  10 ++
 .../gh-3055-election-promote.result           | 105 ++++++++++++++++++
 .../gh-3055-election-promote.test.lua         |  43 +++++++
 test/replication/suite.cfg                    |   1 +
 10 files changed, 210 insertions(+), 16 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md
 create mode 100644 test/replication/gh-3055-election-promote.result
 create mode 100644 test/replication/gh-3055-election-promote.test.lua

diff --git a/changelogs/unreleased/box-ctl-promote.md b/changelogs/unreleased/box-ctl-promote.md
new file mode 100644
index 000000000..15f6fb206
--- /dev/null
+++ b/changelogs/unreleased/box-ctl-promote.md
@@ -0,0 +1,8 @@
+## feature/replication
+
+* Introduce `box.ctl.promote()` and the concept of manual elections (enabled
+  with `election_mode='manual'`). Once the instance is in `manual` election
+  mode, it acts like a `voter` most of the time, but may trigger elections and
+  become a leader, once `box.ctl.promote()` is called.
+  When `election_mode ~= 'manual'`, `box.ctl.promote()` replaces
+  `box.ctl.clear_synchro_queue()`, which is now deprecated (gh-3055).
diff --git a/src/box/box.cc b/src/box/box.cc
index dc7f434e4..ca5d2dd50 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1509,12 +1509,12 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 }
 
 int
-box_clear_synchro_queue(bool try_wait)
+box_promote(bool try_wait)
 {
 	/* A guard to block multiple simultaneous function invocations. */
-	static bool in_clear_synchro_queue = false;
-	if (in_clear_synchro_queue) {
-		diag_set(ClientError, ER_UNSUPPORTED, "clear_synchro_queue",
+	static bool in_promote = false;
+	if (in_promote) {
+		diag_set(ClientError, ER_UNSUPPORTED, "promote",
 			 "simultaneous invocations");
 		return -1;
 	}
@@ -1569,7 +1569,7 @@ box_clear_synchro_queue(bool try_wait)
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
-	in_clear_synchro_queue = true;
+	in_promote = true;
 
 	if (run_elections) {
 		/*
@@ -1586,13 +1586,13 @@ box_clear_synchro_queue(bool try_wait)
 		raft_cfg_is_candidate(box_raft(), false, false);
 		if (!box_raft()->is_enabled) {
 			diag_set(ClientError, ER_RAFT_DISABLED);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 		if (box_raft()->state != RAFT_STATE_LEADER) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 box_raft()->leader);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
@@ -1616,13 +1616,13 @@ box_clear_synchro_queue(bool try_wait)
 		if (former_leader_id != txn_limbo.owner_id) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 txn_limbo.owner_id);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
 
 	/*
-	 * clear_synchro_queue() is a no-op on the limbo owner, so all the rows
+	 * promote() is a no-op on the limbo owner, so all the rows
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
@@ -1659,7 +1659,7 @@ promote:
 						req.term);
 		}
 	}
-	in_clear_synchro_queue = false;
+	in_promote = false;
 	return rc;
 }
 
diff --git a/src/box/box.h b/src/box/box.h
index e2321b9b0..89c6fe1a1 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
 typedef struct tuple box_tuple_t;
 
 int
-box_clear_synchro_queue(bool try_wait);
+box_promote(bool try_wait);
 
 /* box_select is private and used only by FFI */
 API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index d039a059f..f06af8588 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -82,9 +82,9 @@ lbox_ctl_on_schema_init(struct lua_State *L)
 }
 
 static int
-lbox_ctl_clear_synchro_queue(struct lua_State *L)
+lbox_ctl_promote(struct lua_State *L)
 {
-	if (box_clear_synchro_queue(true) != 0)
+	if (box_promote(true) != 0)
 		return luaT_error(L);
 	return 0;
 }
@@ -124,7 +124,9 @@ static const struct luaL_Reg lbox_ctl_lib[] = {
 	{"wait_rw", lbox_ctl_wait_rw},
 	{"on_shutdown", lbox_ctl_on_shutdown},
 	{"on_schema_init", lbox_ctl_on_schema_init},
-	{"clear_synchro_queue", lbox_ctl_clear_synchro_queue},
+	{"promote", lbox_ctl_promote},
+	/* An old alias. */
+	{"clear_synchro_queue", lbox_ctl_promote},
 	{"is_recovery_finished", lbox_ctl_is_recovery_finished},
 	{"set_on_shutdown_timeout", lbox_ctl_set_on_shutdown_timeout},
 	{NULL, NULL}
diff --git a/src/box/raft.c b/src/box/raft.c
index 47d4fd56d..45baf5dd8 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -91,7 +91,7 @@ box_raft_update_synchro_queue(struct raft *raft)
 	 * If the node became a leader, it means it will ignore all records from
 	 * all the other nodes, and won't get late CONFIRM messages anyway. Can
 	 * clear the queue without waiting for confirmations.
-	 * It's alright that the user may have called clear_synchro_queue
+	 * It's alright that the user may have called promote
 	 * manually. In this case the call below will exit immediately and we'll
 	 * simply log a warning.
 	 */
@@ -100,7 +100,7 @@ box_raft_update_synchro_queue(struct raft *raft)
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
-			rc = box_clear_synchro_queue(false);
+			rc = box_promote(false);
 			if (rc != 0) {
 				struct error *err = diag_last_error(diag_get());
 				errcode = box_error_code(err);
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index d5320b3ff..78c911245 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -108,6 +108,31 @@ assert(box.info.election.leader == box.info.id)
  | - true
  | ...
 
+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+ | ---
+ | ...
+
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - error: assertion failed!
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - error: assertion failed!
+ | ...
+
 box.cfg{                                                                        \
     election_mode = 'off',                                                      \
     election_timeout = old_election_timeout                                     \
diff --git a/test/replication/election_basic.test.lua b/test/replication/election_basic.test.lua
index 821f73cea..5fc398848 100644
--- a/test/replication/election_basic.test.lua
+++ b/test/replication/election_basic.test.lua
@@ -39,6 +39,16 @@ assert(box.info.election.term > term)
 assert(box.info.election.vote == box.info.id)
 assert(box.info.election.leader == box.info.id)
 
+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(box.info.election.term > term)
+
 box.cfg{                                                                        \
     election_mode = 'off',                                                      \
     election_timeout = old_election_timeout                                     \
diff --git a/test/replication/gh-3055-election-promote.result b/test/replication/gh-3055-election-promote.result
new file mode 100644
index 000000000..6f5af13bc
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.result
@@ -0,0 +1,105 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+ | ---
+ | ...
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+ | ---
+ | ...
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+ | ---
+ | ...
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+ | ---
+ | - true
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+assert(box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.leader ~= 0)
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-3055-election-promote.test.lua b/test/replication/gh-3055-election-promote.test.lua
new file mode 100644
index 000000000..cbc3ed206
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.test.lua
@@ -0,0 +1,43 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+test_run:wait_fullmesh(SERVERS)
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+term = box.info.election.term
+assert(box.info.election.state == 'follower')
+assert(box.info.ro)
+assert(box.info.election.leader ~= 0)
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8ae2fc14d..00118e9f6 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -2,6 +2,7 @@
     "anon.test.lua": {},
     "anon_register_gap.test.lua": {},
     "gh-2991-misc-asserts-on-update.test.lua": {},
+    "gh-3055-election-promote.test.lua": {},
     "gh-3111-misc-rebootstrap-from-ro-master.test.lua": {},
     "gh-3160-misc-heartbeats-on-master-changes.test.lua": {},
     "gh-3247-misc-iproto-sequence-value-not-replicated.test.lua": {},
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-13 11:50   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-13 13:51     ` Serge Petrenko via Tarantool-patches
  2021-04-13 13:09   ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 11:50 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Mon, Apr 12, 2021 at 10:40:14PM +0300, Serge Petrenko wrote:
> Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
> These flags are set for rows belonging to synchronous transactions in
> addition to `is_commit`.
> 
> The new flags help to define whether the rows belong to a synchronous
> transaction or not without parsing them all and checking whether any of
> the rows touches a synchronous space.
> 
> This will be used in applier once it is taught to filter synchronous
> transactions based on whether they are coming from a raft leader or not.
> 
> P.S. These flags will also be useful once we allow to turn any transaction
> synchronous. Once this is done, the flags in row header will be the only
> source of information on whether the transaction is synchronous or not.
> 
> Prerequisite #5445

Serge, here is a one addition: lets verify bitfields order. While their
use indeed is suitable we should eliminate ourself from unpredicted results.
The test is for c++ and probably we need one for plain c compiler as well?

The patch is on top of your series.
---
From 8de9e03d5bf7ed5fd88d6d903bd7abef7a32286a Mon Sep 17 00:00:00 2001
From: Cyrill Gorcunov <gorcunov@gmail.com>
Date: Tue, 13 Apr 2021 14:47:40 +0300
Subject: [PATCH] test/unit: xrow -- verify bitfields order

Part-of #5445

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/unit/xrow.cc     | 31 ++++++++++++++++++++++++++++++-
 test/unit/xrow.result |  7 ++++++-
 2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index ea1ee1767..047f3330d 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -297,12 +297,40 @@ test_request_str()
 	check_plan();
 }
 
+/**
+ * The compiler doesn't have to preserve bitfields order,
+ * still we rely on it for convenience sake.
+ */
+static void
+test_xrow_opt_field()
+{
+	plan(3);
+
+	struct xrow_header header;
+
+	memset(&header, 0, sizeof(header));
+
+	header.is_commit = 1;
+	ok(header.opt_flags == IPROTO_FLAG_COMMIT, "header.is_commit");
+	header.is_commit = 0;
+
+	header.wait_sync = 1;
+	ok(header.opt_flags == IPROTO_FLAG_WAIT_SYNC, "header.wait_sync");
+	header.wait_sync = 0;
+
+	header.wait_ack = 1;
+	ok(header.opt_flags == IPROTO_FLAG_WAIT_ACK, "header.wait_ack");
+	header.wait_ack = 0;
+
+	check_plan();
+}
+
 int
 main(void)
 {
 	memory_init();
 	fiber_init(fiber_c_invoke);
-	plan(3);
+	plan(4);
 
 	random_init();
 
@@ -310,6 +338,7 @@ main(void)
 	test_greeting();
 	test_xrow_header_encode_decode();
 	test_request_str();
+	test_xrow_opt_field();
 
 	random_free();
 	fiber_free();
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index e06ba5261..bd8a475fc 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -1,4 +1,4 @@
-1..3
+1..4
     1..40
     ok 1 - round trip
     ok 2 - roundtrip.version_id
@@ -159,3 +159,8 @@ ok 2 - subtests
     1..1
     ok 1 - request_str
 ok 3 - subtests
+    1..3
+    ok 1 - header.is_commit
+    ok 2 - header.wait_sync
+    ok 3 - header.wait_ack
+ok 4 - subtests
-- 
2.30.2


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
  2021-04-13 11:50   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-13 13:09   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-13 13:29     ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 13:09 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Mon, Apr 12, 2021 at 10:40:14PM +0300, Serge Petrenko wrote:
> @@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>  				first_glob_row = row;
>  			}
>  			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
> -			(*row)->is_commit = row == end - 1;
> +			if (row < end - 1)
> +				continue;

These two lines ^^
Serge, why we need `continue` here at all? Why can't we simply drop the
above `if` saving a branch?

> +			/* Tx meta is stored in the last tx row. */
> +			if (row == end - 1) {
> +				(*row)->opt_flags = entry->opt_flags;
> +				(*row)->is_commit = true;
> +			}

IOW, I mean

		(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
		(*row)->is_commit = row == end - 1;
		if (row == end - 1) {
			// Save meta
		}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-13 13:09   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-13 13:29     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-13 13:29 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



13.04.2021 16:09, Cyrill Gorcunov пишет:
> On Mon, Apr 12, 2021 at 10:40:14PM +0300, Serge Petrenko wrote:
>> @@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>>   				first_glob_row = row;
>>   			}
>>   			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
>> -			(*row)->is_commit = row == end - 1;
>> +			if (row < end - 1)
>> +				continue;
> These two lines ^^
> Serge, why we need `continue` here at all? Why can't we simply drop the
> above `if` saving a branch?
>
>> +			/* Tx meta is stored in the last tx row. */
>> +			if (row == end - 1) {
>> +				(*row)->opt_flags = entry->opt_flags;
>> +				(*row)->is_commit = true;
>> +			}
> IOW, I mean
>
> 		(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
> 		(*row)->is_commit = row == end - 1;
> 		if (row == end - 1) {
> 			// Save meta
> 		}

Thanks for noticing! That's strange, indeed. Removed:

diff --git a/src/box/wal.c b/src/box/wal.c
index b7c69fc59..4ec8034a3 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -996,8 +996,6 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
vclock *base,
                                 first_glob_row = row;
                         }
                         (*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
-                       if (row < end - 1)
-                               continue;
                         /* Tx meta is stored in the last tx row. */
                         if (row == end - 1) {
                                 (*row)->opt_flags = entry->opt_flags;

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-13 11:50   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-13 13:51     ` Serge Petrenko via Tarantool-patches
  2021-04-13 14:16       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-13 13:51 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



13.04.2021 14:50, Cyrill Gorcunov пишет:
> On Mon, Apr 12, 2021 at 10:40:14PM +0300, Serge Petrenko wrote:
>> Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
>> These flags are set for rows belonging to synchronous transactions in
>> addition to `is_commit`.
>>
>> The new flags help to define whether the rows belong to a synchronous
>> transaction or not without parsing them all and checking whether any of
>> the rows touches a synchronous space.
>>
>> This will be used in applier once it is taught to filter synchronous
>> transactions based on whether they are coming from a raft leader or not.
>>
>> P.S. These flags will also be useful once we allow to turn any transaction
>> synchronous. Once this is done, the flags in row header will be the only
>> source of information on whether the transaction is synchronous or not.
>>
>> Prerequisite #5445
> Serge, here is a one addition: lets verify bitfields order. While their
> use indeed is suitable we should eliminate ourself from unpredicted results.
> The test is for c++ and probably we need one for plain c compiler as well?

Thanks for the help! I've added your tests with some changes.
What do you think?

===========================
diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index ea1ee1767..3d7d8bee1 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -297,12 +297,49 @@ test_request_str()
      check_plan();
  }

+/**
+ * The compiler doesn't have to preserve bitfields order,
+ * still we rely on it for convenience sake.
+ */
+static void
+test_xrow_opt_field()
+{
+    plan(6);
+
+    struct xrow_header header;
+
+    memset(&header, 0, sizeof(header));
+
+    header.is_commit = true;
+    is(header.opt_flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
+    header.is_commit = false;
+
+    header.wait_sync = true;
+    is(header.opt_flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> 
WAIT_SYNC");
+    header.wait_sync = false;
+
+    header.wait_ack = true;
+    is(header.opt_flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> 
WAIT_ACK");
+    header.wait_ack = false;
+
+    header.opt_flags = IPROTO_FLAG_COMMIT;
+    ok(header.is_commit && !header.wait_sync && !header.wait_ack, 
"COMMIT -> header.is_commit");
+
+    header.opt_flags = IPROTO_FLAG_WAIT_SYNC;
+    ok(!header.is_commit && header.wait_sync && !header.wait_ack, 
"WAIT_SYNC -> header.wait_sync");
+
+    header.opt_flags = IPROTO_FLAG_WAIT_ACK;
+    ok(!header.is_commit && !header.wait_sync && header.wait_ack, 
"WAIT_ACK -> header.wait_ack");
+
+    check_plan();
+}
+
  int
  main(void)
  {
      memory_init();
      fiber_init(fiber_c_invoke);
-    plan(3);
+    plan(4);

      random_init();

@@ -310,6 +347,7 @@ main(void)
      test_greeting();
      test_xrow_header_encode_decode();
      test_request_str();
+    test_xrow_opt_field();

      random_free();
      fiber_free();
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index e06ba5261..3b705d5ba 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -1,4 +1,4 @@
-1..3
+1..4
      1..40
      ok 1 - round trip
      ok 2 - roundtrip.version_id
@@ -159,3 +159,11 @@ ok 2 - subtests
      1..1
      ok 1 - request_str
  ok 3 - subtests
+    1..6
+    ok 1 - header.is_commit -> COMMIT
+    ok 2 - header.wait_sync -> WAIT_SYNC
+    ok 3 - header.wait_ack -> WAIT_ACK
+    ok 4 - COMMIT -> header.is_commit
+    ok 5 - WAIT_SYNC -> header.wait_sync
+    ok 6 - WAIT_ACK -> header.wait_ack
+ok 4 - subtests

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-13 14:15   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 14:15 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Mon, Apr 12, 2021 at 10:40:15PM +0300, Serge Petrenko wrote:
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index cc8e43ed4..70ba075f8 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -890,11 +890,11 @@ xrow_encode_synchro(struct xrow_header *row,
>  		    const struct synchro_request *req)
>  {
>  	/*
> -	 * A map with two elements. We don't compress
> +	 * A map with two or three elements. We don't compress
>  	 * numbers to have this structure constant in size,
>  	 * which allows us to preallocate it on stack.
>  	 */
> -	body->m_body = 0x80 | 2;
> +	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
>  	body->k_replica_id = IPROTO_REPLICA_ID;
>  	body->m_replica_id = 0xce;
>  	body->v_replica_id = mp_bswap_u32(req->replica_id);
> @@ -903,10 +903,24 @@ xrow_encode_synchro(struct xrow_header *row,
>  	body->v_lsn = mp_bswap_u64(req->lsn);
>  
>  	memset(row, 0, sizeof(*row));
> -
>  	row->type = req->type;
> -	row->body[0].iov_base = (void *)body;
> -	row->body[0].iov_len = sizeof(*body);
> +
> +	/* Promote body is longer. It has an additional IPROTO_TERM field. */
> +	if (req->type == IPROTO_PROMOTE) {
> +		struct promote_body_bin *promote_body =
> +			(struct promote_body_bin *)body;
> +
> +		promote_body->k_term = IPROTO_TERM;
> +		promote_body->m_term = 0xcf;
> +		promote_body->v_term = mp_bswap_u64(req->term);
> +
> +		row->body[0].iov_base = (void *)promote_body;
> +		row->body[0].iov_len = sizeof(*promote_body);
> +	} else {
> +		row->body[0].iov_base = (void *)body;
> +		row->body[0].iov_len = sizeof(*body);
> +	}
> +
>  	row->bodycnt = 1;
>  }

You know, while I understand that we're trying to reuse code flow
here but I really don't like that this function unaware of type passed.
IOW the function may easily overwire caller's stack if you occasionally
pass synchro_body_bin instead of promote request.

Actually I've sevaral options:

1) make the caller to provide a size and use assert() inside
   this encoder to make sure the caller passer proper amount
   of data from stack;
2) provide own helper for promote packet encoding (see below).

Still both approaches somehow *ugly* I think. Since there only a
few use of this encodings it is easy to remember where and what
and don't make a mistake.

So lets leave it as it now.
---
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index f119c35b6..87fa220ea 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -333,8 +333,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn,
 	 * This is a synchronous commit so we can allocate everything on a
 	 * stack. Promote body includes synchro body.
 	 */
-	struct promote_body_bin body;
-	struct synchro_body_bin *base = &body.base;
+	struct promote_body_bin body_bin;
 
 	struct xrow_header row;
 	char buf[sizeof(struct journal_entry) +
@@ -343,7 +342,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn,
 	struct journal_entry *entry = (struct journal_entry *)buf;
 	entry->rows[0] = &row;
 
-	xrow_encode_synchro(&row, base, &req);
+	xrow_encode_promote(&row, &body_bin, &req);
 
 	journal_entry_create(entry, 1, xrow_approx_len(&row),
 			     txn_limbo_write_cb, fiber());
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 70ba075f8..df76b104b 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -884,16 +884,11 @@ xrow_encode_dml(const struct request *request, struct region *region,
 	return iovcnt;
 }
 
-void
-xrow_encode_synchro(struct xrow_header *row,
-		    struct synchro_body_bin *body,
-		    const struct synchro_request *req)
+static void
+xrow_encode_synchro_common(struct xrow_header *row,
+			   struct synchro_body_bin *body,
+			   const struct synchro_request *req)
 {
-	/*
-	 * A map with two or three elements. We don't compress
-	 * numbers to have this structure constant in size,
-	 * which allows us to preallocate it on stack.
-	 */
 	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
 	body->k_replica_id = IPROTO_REPLICA_ID;
 	body->m_replica_id = 0xce;
@@ -904,24 +899,32 @@ xrow_encode_synchro(struct xrow_header *row,
 
 	memset(row, 0, sizeof(*row));
 	row->type = req->type;
+	row->bodycnt = 1;
+}
 
-	/* Promote body is longer. It has an additional IPROTO_TERM field. */
-	if (req->type == IPROTO_PROMOTE) {
-		struct promote_body_bin *promote_body =
-			(struct promote_body_bin *)body;
+void
+xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+		    const struct synchro_request *req)
+{
+	xrow_encode_synchro_common(row, body, req);
 
-		promote_body->k_term = IPROTO_TERM;
-		promote_body->m_term = 0xcf;
-		promote_body->v_term = mp_bswap_u64(req->term);
+	promote_body->k_term = IPROTO_TERM;
+	promote_body->m_term = 0xcf;
+	promote_body->v_term = mp_bswap_u64(req->term);
 
-		row->body[0].iov_base = (void *)promote_body;
-		row->body[0].iov_len = sizeof(*promote_body);
-	} else {
-		row->body[0].iov_base = (void *)body;
-		row->body[0].iov_len = sizeof(*body);
-	}
+	row->body[0].iov_base = body;
+	row->body[0].iov_len = sizeof(*body);
+}
+
+void
+xrow_encode_synchro(struct xrow_header *row, struct synchro_body_bin *body,
+		    const struct synchro_request *req)
+{
+	xrow_encode_synchro_common(row, body, req);
+
+	row->body[0].iov_base = body;
+	row->body[0].iov_len = sizeof(*body);
 
-	row->bodycnt = 1;
 }
 
 int

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-13 13:51     ` Serge Petrenko via Tarantool-patches
@ 2021-04-13 14:16       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 14:16 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Tue, Apr 13, 2021 at 04:51:52PM +0300, Serge Petrenko wrote:
> 
> 
> 13.04.2021 14:50, Cyrill Gorcunov пишет:
> > On Mon, Apr 12, 2021 at 10:40:14PM +0300, Serge Petrenko wrote:
> > > Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
> > > These flags are set for rows belonging to synchronous transactions in
> > > addition to `is_commit`.
> > > 
> > > The new flags help to define whether the rows belong to a synchronous
> > > transaction or not without parsing them all and checking whether any of
> > > the rows touches a synchronous space.
> > > 
> > > This will be used in applier once it is taught to filter synchronous
> > > transactions based on whether they are coming from a raft leader or not.
> > > 
> > > P.S. These flags will also be useful once we allow to turn any transaction
> > > synchronous. Once this is done, the flags in row header will be the only
> > > source of information on whether the transaction is synchronous or not.
> > > 
> > > Prerequisite #5445
> > Serge, here is a one addition: lets verify bitfields order. While their
> > use indeed is suitable we should eliminate ourself from unpredicted results.
> > The test is for c++ and probably we need one for plain c compiler as well?
> 
> Thanks for the help! I've added your tests with some changes.
> What do you think?
> 

Ack

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-13 14:33   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-14  8:23     ` Serge Petrenko via Tarantool-patches
  2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 14:33 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Mon, Apr 12, 2021 at 10:40:17PM +0300, Serge Petrenko wrote:
>  		} else {
> -			txn_limbo_force_empty(&txn_limbo, wait_lsn);
> +			/*
> +			 * Term parameter is unused now, We'll pass
> +			 * box_raft()->term there later.
> +			 */
> +			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
> +			struct synchro_request req = {
> +				.type = 0, /* unused */
> +				.replica_id = 0, /* unused */
> +				.origin_id = instance_id,
> +				.lsn = wait_lsn,
> +				.term = 0, /* unused */
> +			};

Is there some particular meaning of zeroifying designated assignments?
I mean why not simply

			struct synchro_request req = {
				.origin_id	= instance_id,
				.lsn		= wait_lsn,
			};

or you wanted to pay attention that the left of the fields are
unused? Just curious, I'm fine with current code.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
  2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (8 preceding siblings ...)
  2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-13 14:42 ` Cyrill Gorcunov via Tarantool-patches
  9 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-13 14:42 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Mon, Apr 12, 2021 at 10:40:13PM +0300, Serge Petrenko wrote:
> Changes in v2:
>   - Added tests for patches 1, 6, 9
>   - Minor typo fixes and bugfixes.
> 
> https://github.com/tarantool/tarantool/tree/sp/gh-5445-election-fixes
> https://github.com/tarantool/tarantool/issues/5445
> https://github.com/tarantool/tarantool/issues/3055
> 

The series looks ok for me, but I must confess I still
lack some of internals of our raft implementation. So
better to wait for Vlad's comments.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-13 14:33   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-14  8:23     ` Serge Petrenko via Tarantool-patches
  2021-04-14  8:34       ` Cyrill Gorcunov via Tarantool-patches
  2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14  8:23 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



13.04.2021 17:33, Cyrill Gorcunov пишет:
> On Mon, Apr 12, 2021 at 10:40:17PM +0300, Serge Petrenko wrote:
>>   		} else {
>> -			txn_limbo_force_empty(&txn_limbo, wait_lsn);
>> +			/*
>> +			 * Term parameter is unused now, We'll pass
>> +			 * box_raft()->term there later.
>> +			 */
>> +			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
>> +			struct synchro_request req = {
>> +				.type = 0, /* unused */
>> +				.replica_id = 0, /* unused */
>> +				.origin_id = instance_id,
>> +				.lsn = wait_lsn,
>> +				.term = 0, /* unused */
>> +			};
> Is there some particular meaning of zeroifying designated assignments?
> I mean why not simply
>
> 			struct synchro_request req = {
> 				.origin_id	= instance_id,
> 				.lsn		= wait_lsn,
> 			};
>
> or you wanted to pay attention that the left of the fields are
> unused? Just curious, I'm fine with current code.

I went for your option at first, and it's the one I'd prefer.
But with it I got failed builds in some CI jobs.

It said something like "sorry, not yet implemented: struct partial
initialization"

Maybe the error was there because I initialize not the first
struct members. It's strange, because we use such "partial initializations"
in some other places IIRC.

Anyway, zeroifying unused fields explicitly helped.

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-14  8:23     ` Serge Petrenko via Tarantool-patches
@ 2021-04-14  8:34       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-14  8:34 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Wed, Apr 14, 2021 at 11:23:23AM +0300, Serge Petrenko wrote:
> > Is there some particular meaning of zeroifying designated assignments?
> > I mean why not simply
> > 
> > 			struct synchro_request req = {
> > 				.origin_id	= instance_id,
> > 				.lsn		= wait_lsn,
> > 			};
> > 
> > or you wanted to pay attention that the left of the fields are
> > unused? Just curious, I'm fine with current code.
> 
> I went for your option at first, and it's the one I'd prefer.
> But with it I got failed builds in some CI jobs.
> 
> It said something like "sorry, not yet implemented: struct partial
> initialization"

Ah, Serge, it is because of C++ compiler, not C. Sorry ;)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry
  2021-04-13 14:15   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
  2021-04-14 10:00       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14  9:12 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



13.04.2021 17:15, Cyrill Gorcunov пишет:
> On Mon, Apr 12, 2021 at 10:40:15PM +0300, Serge Petrenko wrote:
>> diff --git a/src/box/xrow.c b/src/box/xrow.c
>> index cc8e43ed4..70ba075f8 100644
>> --- a/src/box/xrow.c
>> +++ b/src/box/xrow.c
>> @@ -890,11 +890,11 @@ xrow_encode_synchro(struct xrow_header *row,
>>   		    const struct synchro_request *req)
>>   {
>>   	/*
>> -	 * A map with two elements. We don't compress
>> +	 * A map with two or three elements. We don't compress
>>   	 * numbers to have this structure constant in size,
>>   	 * which allows us to preallocate it on stack.
>>   	 */
>> -	body->m_body = 0x80 | 2;
>> +	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
>>   	body->k_replica_id = IPROTO_REPLICA_ID;
>>   	body->m_replica_id = 0xce;
>>   	body->v_replica_id = mp_bswap_u32(req->replica_id);
>> @@ -903,10 +903,24 @@ xrow_encode_synchro(struct xrow_header *row,
>>   	body->v_lsn = mp_bswap_u64(req->lsn);
>>   
>>   	memset(row, 0, sizeof(*row));
>> -
>>   	row->type = req->type;
>> -	row->body[0].iov_base = (void *)body;
>> -	row->body[0].iov_len = sizeof(*body);
>> +
>> +	/* Promote body is longer. It has an additional IPROTO_TERM field. */
>> +	if (req->type == IPROTO_PROMOTE) {
>> +		struct promote_body_bin *promote_body =
>> +			(struct promote_body_bin *)body;
>> +
>> +		promote_body->k_term = IPROTO_TERM;
>> +		promote_body->m_term = 0xcf;
>> +		promote_body->v_term = mp_bswap_u64(req->term);
>> +
>> +		row->body[0].iov_base = (void *)promote_body;
>> +		row->body[0].iov_len = sizeof(*promote_body);
>> +	} else {
>> +		row->body[0].iov_base = (void *)body;
>> +		row->body[0].iov_len = sizeof(*body);
>> +	}
>> +
>>   	row->bodycnt = 1;
>>   }
> You know, while I understand that we're trying to reuse code flow
> here but I really don't like that this function unaware of type passed.
> IOW the function may easily overwire caller's stack if you occasionally
> pass synchro_body_bin instead of promote request.

That's true and I've even caught that bug during the patch development.

>
> Actually I've sevaral options:
>
> 1) make the caller to provide a size and use assert() inside
>     this encoder to make sure the caller passer proper amount
>     of data from stack;
> 2) provide own helper for promote packet encoding (see below).
>
> Still both approaches somehow *ugly* I think. Since there only a
> few use of this encodings it is easy to remember where and what
> and don't make a mistake.

I think your second option looks better than what we have now.
So, thanks for the suggestion! I've taken your diff with a couple
of changes, please, take a look.

The rest of the diff's in the reply for the 4th patch, because the
changes to txn_limbo.c belong there.

===============================================
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 70ba075f8..5d515ce92 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -884,10 +884,9 @@ xrow_encode_dml(const struct request *request, 
struct region *region,
      return iovcnt;
  }

-void
-xrow_encode_synchro(struct xrow_header *row,
-            struct synchro_body_bin *body,
-            const struct synchro_request *req)
+static void
+xrow_encode_synchro_body(struct synchro_body_bin *body,
+                 const struct synchro_request *req)
  {
      /*
       * A map with two or three elements. We don't compress
@@ -901,26 +900,48 @@ xrow_encode_synchro(struct xrow_header *row,
      body->k_lsn = IPROTO_LSN;
      body->m_lsn = 0xcf;
      body->v_lsn = mp_bswap_u64(req->lsn);
+}
+
+void
+xrow_encode_synchro(struct xrow_header *row,
+            struct synchro_body_bin *body,
+            const struct synchro_request *req)
+{
+    assert(req->type == IPROTO_CONFIRM || req->type == IPROTO_ROLLBACK);
+
+    xrow_encode_synchro_body(body, req);

      memset(row, 0, sizeof(*row));
      row->type = req->type;
+    row->body[0].iov_base = body;
+    row->body[0].iov_len = sizeof(*body);
+    row->bodycnt = 1;
+}

-    /* Promote body is longer. It has an additional IPROTO_TERM field. */
-    if (req->type == IPROTO_PROMOTE) {
-        struct promote_body_bin *promote_body =
-            (struct promote_body_bin *)body;
+static inline void
+xrow_encode_promote_body(struct promote_body_bin *body,
+             const struct synchro_request *req)
+{
+    xrow_encode_synchro_body(&body->base, req);

-        promote_body->k_term = IPROTO_TERM;
-        promote_body->m_term = 0xcf;
-        promote_body->v_term = mp_bswap_u64(req->term);
+    body->k_term = IPROTO_TERM;
+    body->m_term = 0xcf;
+    body->v_term = mp_bswap_u64(req->term);
+}

-        row->body[0].iov_base = (void *)promote_body;
-        row->body[0].iov_len = sizeof(*promote_body);
-    } else {
-        row->body[0].iov_base = (void *)body;
-        row->body[0].iov_len = sizeof(*body);
-    }

+void
+xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+            const struct synchro_request *req)
+{
+    assert(req->type == IPROTO_PROMOTE);
+
+    xrow_encode_promote_body(body, req);
+
+    memset(row, 0, sizeof(*row));
+    row->type = req->type;
+    row->body[0].iov_base = body;
+    row->body[0].iov_len = sizeof(*body);
      row->bodycnt = 1;
  }

diff --git a/src/box/xrow.h b/src/box/xrow.h
index af4ad0d12..d75ab0cc9 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -290,6 +290,16 @@ xrow_encode_synchro(struct xrow_header *row,
              struct synchro_body_bin *body,
              const struct synchro_request *req);

+/**
+ * Encode a promote request.
+ * @param row xrow header.
+ * @param body A place to encode promote body.
+ * @param req Request parameters.
+ */
+void
+xrow_encode_promote(struct xrow_header *row, struct promote_body_bin *body,
+            const struct synchro_request *req);
+
  /**
   * Decode synchronous replication request.
   * @param row xrow header.


-- 

Serge Petrenko


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-13 14:33   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-14  8:23     ` Serge Petrenko via Tarantool-patches
@ 2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 23+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-14  9:12 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



13.04.2021 17:33, Cyrill Gorcunov пишет:
> On Mon, Apr 12, 2021 at 10:40:17PM +0300, Serge Petrenko wrote:
>>   		} else {
>> -			txn_limbo_force_empty(&txn_limbo, wait_lsn);
>> +			/*
>> +			 * Term parameter is unused now, We'll pass
>> +			 * box_raft()->term there later.
>> +			 */
>> +			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
>> +			struct synchro_request req = {
>> +				.type = 0, /* unused */
>> +				.replica_id = 0, /* unused */
>> +				.origin_id = instance_id,
>> +				.lsn = wait_lsn,
>> +				.term = 0, /* unused */
>> +			};
> Is there some particular meaning of zeroifying designated assignments?
> I mean why not simply
>
> 			struct synchro_request req = {
> 				.origin_id	= instance_id,
> 				.lsn		= wait_lsn,
> 			};
>
> or you wanted to pay attention that the left of the fields are
> unused? Just curious, I'm fine with current code.

A diff as per changes requested in patch 2.


======================================================

diff --git a/src/box/applier.cc b/src/box/applier.cc
index e8cbbe27a..65a749b62 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -822,12 +822,15 @@ synchro_entry_new(struct xrow_header *applier_row,
         }

         struct journal_entry *journal_entry = &entry->journal_entry;
-       struct synchro_body_bin *body_bin = &entry->body_bin.base;
         struct xrow_header *row = &entry->row;

         journal_entry->rows[0] = row;

-       xrow_encode_synchro(row, body_bin, req);
+       assert(iproto_type_is_synchro_request(req->type));
+       if (iproto_type_is_promote_request(req->type))
+               xrow_encode_promote(row, &entry->body_bin, req);
+       else
+               xrow_encode_synchro(row, &entry->body_bin.base, req);

         row->lsn = applier_row->lsn;
         row->replica_id = applier_row->replica_id;
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index f119c35b6..36c233348 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -333,8 +333,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, 
uint32_t type, int64_t lsn,
          * This is a synchronous commit so we can allocate everything on a
          * stack. Promote body includes synchro body.
          */
-       struct promote_body_bin body;
-       struct synchro_body_bin *base = &body.base;
+       struct promote_body_bin body_bin;

         struct xrow_header row;
         char buf[sizeof(struct journal_entry) +
@@ -343,7 +342,10 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, 
uint32_t type, int64_t lsn,
         struct journal_entry *entry = (struct journal_entry *)buf;
         entry->rows[0] = &row;

-       xrow_encode_synchro(&row, base, &req);
+       if (type == IPROTO_PROMOTE)
+               xrow_encode_promote(&row, &body_bin, &req);
+       else
+               xrow_encode_synchro(&row, &body_bin.base, &req);

         journal_entry_create(entry, 1, xrow_approx_len(&row),
                              txn_limbo_write_cb, fiber());


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry
  2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
@ 2021-04-14 10:00       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 23+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-14 10:00 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Wed, Apr 14, 2021 at 12:12:51PM +0300, Serge Petrenko wrote:
> > You know, while I understand that we're trying to reuse code flow
> > here but I really don't like that this function unaware of type passed.
> > IOW the function may easily overwire caller's stack if you occasionally
> > pass synchro_body_bin instead of promote request.
> 
> That's true and I've even caught that bug during the patch development.
> 
> > 
> > Actually I've sevaral options:
> > 
> > 1) make the caller to provide a size and use assert() inside
> >     this encoder to make sure the caller passer proper amount
> >     of data from stack;
> > 2) provide own helper for promote packet encoding (see below).
> > 
> > Still both approaches somehow *ugly* I think. Since there only a
> > few use of this encodings it is easy to remember where and what
> > and don't make a mistake.
> 
> I think your second option looks better than what we have now.
> So, thanks for the suggestion! I've taken your diff with a couple
> of changes, please, take a look.
> 
> The rest of the diff's in the reply for the 4th patch, because the
> changes to txn_limbo.c belong there.

Ack. Thanks!

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-04-14 10:00 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12 19:40 [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
2021-04-13 11:50   ` Cyrill Gorcunov via Tarantool-patches
2021-04-13 13:51     ` Serge Petrenko via Tarantool-patches
2021-04-13 14:16       ` Cyrill Gorcunov via Tarantool-patches
2021-04-13 13:09   ` Cyrill Gorcunov via Tarantool-patches
2021-04-13 13:29     ` Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
2021-04-13 14:15   ` Cyrill Gorcunov via Tarantool-patches
2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
2021-04-14 10:00       ` Cyrill Gorcunov via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
2021-04-13 14:33   ` Cyrill Gorcunov via Tarantool-patches
2021-04-14  8:23     ` Serge Petrenko via Tarantool-patches
2021-04-14  8:34       ` Cyrill Gorcunov via Tarantool-patches
2021-04-14  9:12     ` Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
2021-04-12 19:40 ` [Tarantool-patches] [PATCH v2 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
2021-04-13 14:42 ` [Tarantool-patches] [PATCH v2 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Cyrill Gorcunov via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox