Tarantool development patches archive
 help / color / mirror / Atom feed
* [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
@ 2021-04-11 17:55 Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:55 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Hey, guys, take a look please.
The patchset does not contain the tests yet, I've only tested it manually.
I will make a follow up with tests soon.

Cyrill, I remember our conversation regarding the first patch and bitfields. I
just haven't come up with a better alternative yet.

https://github.com/tarantool/tarantool/tree/sp/gh-5445-election-fixes
https://github.com/tarantool/tarantool/issues/5445
https://github.com/tarantool/tarantool/issues/3055

Serge Petrenko (9):
  wal: enrich row's meta information with sync replication flags
  xrow: introduce a PROMOTE entry
  box: actualise iproto_key_type array
  box: make clear_synchro_queue() write a PROMOTE entry instead of
    CONFIRM + ROLLBACK
  box: write PROMOTE even for empty limbo
  raft: keep track of greatest known term and filter replication sources
    based on that
  replication: introduce a new election mode: "manual"
  Support manual elections in `box.ctl.clear_synchro_queue()`
  box.ctl: rename clear_synchro_queue to promote

 changelogs/unreleased/box-ctl-promote.md      |   8 +
 ...very => qsync-multi-statement-recovery.md} |   0
 changelogs/unreleased/raft-promote.md         |   4 +
 src/box/applier.cc                            |  22 ++-
 src/box/box.cc                                | 166 ++++++++++++++----
 src/box/box.h                                 |   2 +-
 src/box/errcode.h                             |   3 +
 src/box/iproto_constants.c                    |  58 ++++++
 src/box/iproto_constants.h                    |  27 ++-
 src/box/journal.h                             |   2 +
 src/box/lua/ctl.c                             |   8 +-
 src/box/raft.c                                |  31 +++-
 src/box/raft.h                                |  20 +++
 src/box/txn.c                                 |   5 +
 src/box/txn_limbo.c                           |  79 +++++----
 src/box/txn_limbo.h                           |  10 +-
 src/box/wal.c                                 |  26 +--
 src/box/xrow.c                                |  43 +++--
 src/box/xrow.h                                |  54 +++++-
 src/lib/raft/raft.c                           |  13 +-
 src/lib/raft/raft.h                           |  32 +++-
 test/replication/election_basic.result        |   4 +-
 test/unit/raft_test_utils.c                   |   4 +-
 23 files changed, 501 insertions(+), 120 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:55 ` Serge Petrenko via Tarantool-patches
  2021-04-12 13:06   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-12 19:21   ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:55 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
These flags are set for rows belonging to synchronous transactions in
addition to `is_commit`.

The new flags help to define whether the rows belong to a synchronous
transaction or not without parsing them all and checking whether any of
the rows touches a synchronous space.

This will be used in applier once it is taught to filter synchronous
transactions based on whether they are coming from a raft leader or not.

P.S. These flags will also be useful once we allow to turn any transaction
synchronous. Once this is done, the flags in row header will be the only
source of information on whether the transaction is synchronous or not.

Prerequisite #5445

@TarantoolBot document
Title: new values for IPROTO_FLAGS field

IPROTO_FLAGS bitfield is enriched with two new constant:
IPROTO_FLAG_WAIT_SYNC = 0x02
IPROTO_FLAG_WAIT_ACK = 0x04

IPROTO_FLAG_WAIT_SYNC is set for the last synchronous transaction row.
IPROTO_FLAG_WAIT_ACK is set for the last row of a transaction which is
not applied because other synchronous transactions are in fly.
---
 src/box/iproto_constants.h |  5 +++++
 src/box/journal.h          |  2 ++
 src/box/txn.c              |  5 +++++
 src/box/wal.c              | 26 +++++++++++++++-----------
 src/box/xrow.c             | 13 ++++++++-----
 src/box/xrow.h             | 30 ++++++++++++++++++++++--------
 6 files changed, 57 insertions(+), 24 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index d3738c705..f7f46088f 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -49,9 +49,14 @@ enum {
 	XLOG_FIXHEADER_SIZE = 19
 };
 
+/** IPROTO_FLAGS bitfield constants. */
 enum {
 	/** Set for the last xrow in a transaction. */
 	IPROTO_FLAG_COMMIT = 0x01,
+	/** Set for the last row of a tx residing in limbo. */
+	IPROTO_FLAG_WAIT_SYNC = 0x02,
+	/** Set for the last row of a synchronous tx. */
+	IPROTO_FLAG_WAIT_ACK = 0x04,
 };
 
 enum iproto_key {
diff --git a/src/box/journal.h b/src/box/journal.h
index 76c70c19f..9ab8af2c1 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -63,6 +63,7 @@ struct journal_entry {
 	 * A journal entry completion callback argument.
 	 */
 	void *complete_data;
+	uint8_t opt_flags;
 	/**
 	 * Asynchronous write completion function.
 	 */
@@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
 	entry->approx_len	= approx_len;
 	entry->n_rows		= n_rows;
 	entry->res		= -1;
+	entry->opt_flags = 0;
 }
 
 /**
diff --git a/src/box/txn.c b/src/box/txn.c
index 40061ff09..d65315f58 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
 		row->lsn = 0;
 		row->sync = 0;
 		row->tm = 0;
+		row->opt_flags = 0;
 	}
 	/*
 	 * Group ID should be set both for requests not having a
@@ -681,6 +682,10 @@ txn_journal_entry_new(struct txn *txn)
 		--req->n_rows;
 	}
 
+	req->opt_flags |=
+		(txn_has_flag(txn, TXN_WAIT_SYNC) ? IPROTO_FLAG_WAIT_SYNC : 0) |
+		(txn_has_flag(txn, TXN_WAIT_ACK) ? IPROTO_FLAG_WAIT_ACK : 0);
+
 	return req;
 }
 
diff --git a/src/box/wal.c b/src/box/wal.c
index 34af0bda6..00fcb21b4 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -962,14 +962,14 @@ out:
  */
 static void
 wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
-	       struct xrow_header **row,
-	       struct xrow_header **end)
+	       struct journal_entry *entry)
 {
 	int64_t tsn = 0;
-	struct xrow_header **start = row;
-	struct xrow_header **first_glob_row = row;
+	struct xrow_header **start = entry->rows;
+	struct xrow_header **end = entry->rows + entry->n_rows;
+	struct xrow_header **first_glob_row = start;
 	/** Assign LSN to all local rows. */
-	for ( ; row < end; row++) {
+	for (struct xrow_header **row = start; row < end; row++) {
 		if ((*row)->replica_id == 0) {
 			/*
 			 * All rows representing local space data
@@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 				first_glob_row = row;
 			}
 			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
-			(*row)->is_commit = row == end - 1;
+			if (row < end - 1)
+				continue;
+			/* Tx meta is stored in the last tx row. */
+			if (row == end - 1) {
+				(*row)->opt_flags = entry->opt_flags;
+				(*row)->is_commit = true;
+			}
 		} else {
 			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
 			if (diff <= vclock_get(vclock_diff,
@@ -1020,7 +1026,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
 	 * the first global row. tsn was yet unknown when those
 	 * rows were processed.
 	 */
-	for (row = start; row < first_glob_row; row++)
+	for (struct xrow_header **row = start; row < first_glob_row; row++)
 		(*row)->tsn = tsn;
 }
 
@@ -1098,8 +1104,7 @@ wal_write_to_disk(struct cmsg *msg)
 	struct journal_entry *entry;
 	struct stailq_entry *last_committed = NULL;
 	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
-		wal_assign_lsn(&vclock_diff, &writer->vclock,
-			       entry->rows, entry->rows + entry->n_rows);
+		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 		entry->res = vclock_sum(&vclock_diff) +
 			     vclock_sum(&writer->vclock);
 		rc = xlog_write_entry(l, entry);
@@ -1319,8 +1324,7 @@ wal_write_none_async(struct journal *journal,
 	struct vclock vclock_diff;
 
 	vclock_create(&vclock_diff);
-	wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
-		       entry->rows + entry->n_rows);
+	wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 	vclock_merge(&writer->vclock, &vclock_diff);
 	vclock_copy(&replicaset.vclock, &writer->vclock);
 	entry->res = vclock_sum(&writer->vclock);
diff --git a/src/box/xrow.c b/src/box/xrow.c
index bc06738ad..cc8e43ed4 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -183,7 +183,7 @@ error:
 			break;
 		case IPROTO_FLAGS:
 			flags = mp_decode_uint(pos);
-			header->is_commit = flags & IPROTO_FLAG_COMMIT;
+			header->opt_flags = flags;
 			break;
 		default:
 			/* unknown header */
@@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 	 *   flag to find transaction boundary (last row in the
 	 *   transaction stream).
 	 */
+	uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
 	if (header->tsn != 0) {
 		if (header->tsn != header->lsn || !header->is_commit) {
 			/*
@@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
 			map_size++;
 		}
 		if (header->is_commit && header->tsn != header->lsn) {
-			/* Setup last row for multi row transaction. */
-			d = mp_encode_uint(d, IPROTO_FLAGS);
-			d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
-			map_size++;
+			flags_to_encode |= IPROTO_FLAG_COMMIT;
 		}
 	}
+	if (flags_to_encode != 0) {
+		d = mp_encode_uint(d, IPROTO_FLAGS);
+		d = mp_encode_uint(d, flags_to_encode);
+		map_size++;
+	}
 	assert(d <= data + XROW_HEADER_LEN_MAX);
 	mp_encode_map(data, map_size);
 	out->iov_len = d - (char *) out->iov_base;
diff --git a/src/box/xrow.h b/src/box/xrow.h
index fde8f9474..2a18733c0 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -80,14 +80,28 @@ struct xrow_header {
 	 * transaction.
 	 */
 	int64_t tsn;
-	/**
-	 * True for the last row in a multi-statement transaction,
-	 * or single-statement transaction. Is only encoded in the
-	 * write ahead log for multi-statement transactions.
-	 * Single-statement transactions do not encode
-	 * tsn and is_commit flag to save space.
-	 */
-	bool is_commit;
+	/** Transaction meta flags set only in the last transaction row. */
+	union {
+		uint8_t opt_flags;
+		struct {
+			/**
+			 * Is only encoded in the write ahead log for
+			 * multi-statement transactions. Single-statement
+			 * transactions do not encode tsn and is_commit flag to
+			 * save space.
+			 */
+			bool is_commit : 1;
+			/**
+			 * True for a synchronous transaction.
+			 */
+			bool wait_sync : 1;
+			/**
+			 * True for any transaction that would enter the limbo
+			 * (not necessarily a synchronous one).
+			 */
+			bool wait_ack  : 1;
+		};
+	};
 
 	int bodycnt;
 	uint32_t schema_version;
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 2/9] xrow: introduce a PROMOTE entry
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:55 ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:55 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A PROMOTE entry combines effect of CONFIRM, ROLLBACK and RAFT_TERM
entries with some additional semantics on top.

PROMOTE carries the following arguments:

1) former_leader_id - the id of previous limbo owner whose entries we
   want to confirm.
2) confirm_lsn - the lsn of the last former leader's transaction to be
   confirmed. In this sense PROMOTE(confirm_lsn) replaces
   CONFIRM(confirm_lsn) + ROLLBACK(confirm_lsn + 1).
3) replica_id - id of the instance issuing
   `box.ctl.clear_synchro_queue()`
4) term - the new term the instance issuing
   `box.ctl.clear_synchro_queue()` has just entered.

This entry will be written to WAL instead of the usual CONFIRM +
ROLLBACK pair on a successful `box.ctl.clear_synchro_queue()` call.

Note, the ususal CONFIRM and ROLLBACK occurrences (after a confirmed or
rolled back synchronous transaction) are here to stay.

Part of #5445
---
 src/box/iproto_constants.h | 17 ++++++++++++++++-
 src/box/xrow.c             | 30 +++++++++++++++++++++++++-----
 src/box/xrow.h             | 24 +++++++++++++++++++++++-
 3 files changed, 64 insertions(+), 7 deletions(-)

diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index f7f46088f..816a308d8 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -132,6 +132,18 @@ enum iproto_key {
 	IPROTO_REPLICA_ANON = 0x50,
 	IPROTO_ID_FILTER = 0x51,
 	IPROTO_ERROR = 0x52,
+	/**
+	 * Term. Has the same meaning as IPROTO_RAFT_TERM, but is an iproto
+	 * key, rather than a raft key. Used for PROMOTE request, which needs
+	 * both iproto (e.g. REPLICA_ID) and raft (RAFT_TERM) keys.
+	 */
+	IPROTO_TERM = 0x53,
+	/*
+	 * Be careful to not extend iproto_key values over 0x7f.
+	 * iproto_keys are encoded in msgpack as positive fixnum, which ends at
+	 * 0x7f, and we rely on this in some places by allocating a uint8_t to
+	 * hold a msgpack-encoded key value.
+	 */
 	IPROTO_KEY_MAX
 };
 
@@ -226,6 +238,8 @@ enum iproto_type {
 	IPROTO_TYPE_STAT_MAX,
 
 	IPROTO_RAFT = 30,
+	/** PROMOTE request. */
+	IPROTO_PROMOTE = 31,
 
 	/** A confirmation message for synchronous transactions. */
 	IPROTO_CONFIRM = 40,
@@ -344,7 +358,8 @@ dml_request_key_map(uint32_t type)
 static inline bool
 iproto_type_is_synchro_request(uint32_t type)
 {
-	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK;
+	return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK ||
+	       type == IPROTO_PROMOTE;
 }
 
 static inline bool
diff --git a/src/box/xrow.c b/src/box/xrow.c
index cc8e43ed4..70ba075f8 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -890,11 +890,11 @@ xrow_encode_synchro(struct xrow_header *row,
 		    const struct synchro_request *req)
 {
 	/*
-	 * A map with two elements. We don't compress
+	 * A map with two or three elements. We don't compress
 	 * numbers to have this structure constant in size,
 	 * which allows us to preallocate it on stack.
 	 */
-	body->m_body = 0x80 | 2;
+	body->m_body = 0x80 | (req->type == IPROTO_PROMOTE ? 3 : 2);
 	body->k_replica_id = IPROTO_REPLICA_ID;
 	body->m_replica_id = 0xce;
 	body->v_replica_id = mp_bswap_u32(req->replica_id);
@@ -903,10 +903,24 @@ xrow_encode_synchro(struct xrow_header *row,
 	body->v_lsn = mp_bswap_u64(req->lsn);
 
 	memset(row, 0, sizeof(*row));
-
 	row->type = req->type;
-	row->body[0].iov_base = (void *)body;
-	row->body[0].iov_len = sizeof(*body);
+
+	/* Promote body is longer. It has an additional IPROTO_TERM field. */
+	if (req->type == IPROTO_PROMOTE) {
+		struct promote_body_bin *promote_body =
+			(struct promote_body_bin *)body;
+
+		promote_body->k_term = IPROTO_TERM;
+		promote_body->m_term = 0xcf;
+		promote_body->v_term = mp_bswap_u64(req->term);
+
+		row->body[0].iov_base = (void *)promote_body;
+		row->body[0].iov_len = sizeof(*promote_body);
+	} else {
+		row->body[0].iov_base = (void *)body;
+		row->body[0].iov_len = sizeof(*body);
+	}
+
 	row->bodycnt = 1;
 }
 
@@ -952,11 +966,17 @@ xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req)
 		case IPROTO_LSN:
 			req->lsn = mp_decode_uint(&d);
 			break;
+		case IPROTO_TERM:
+			req->term = mp_decode_uint(&d);
+			break;
 		default:
 			mp_next(&d);
 		}
 	}
+
 	req->type = row->type;
+	req->origin_id = row->replica_id;
+
 	return 0;
 }
 
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 2a18733c0..af4ad0d12 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -226,7 +226,10 @@ xrow_encode_dml(const struct request *request, struct region *region,
  * pending synchronous transactions.
  */
 struct synchro_request {
-	/** Operation type - IPROTO_ROLLBACK or IPROTO_CONFIRM. */
+	/**
+	 * Operation type - either IPROTO_ROLLBACK or IPROTO_CONFIRM or
+	 * IPROTO_PROMOTE
+	 */
 	uint32_t type;
 	/**
 	 * ID of the instance owning the pending transactions.
@@ -236,14 +239,25 @@ struct synchro_request {
 	 * finish transactions of an old master.
 	 */
 	uint32_t replica_id;
+	/**
+	 * Id of the instance which has issued this request. Only filled on
+	 * decoding, and left blank when encoding a request.
+	 */
+	uint32_t origin_id;
 	/**
 	 * Operation LSN.
 	 * In case of CONFIRM it means 'confirm all
 	 * transactions with lsn <= this value'.
 	 * In case of ROLLBACK it means 'rollback all transactions
 	 * with lsn >= this value'.
+	 * In case of PROMOTE it means CONFIRM(lsn) + ROLLBACK(lsn+1)
 	 */
 	int64_t lsn;
+	/**
+	 * The new term the instance issuing this request is in. Only used for
+	 * PROMOTE request.
+	 */
+	uint64_t term;
 };
 
 /** Synchro request xrow's body in MsgPack format. */
@@ -257,6 +271,14 @@ struct PACKED synchro_body_bin {
 	uint64_t v_lsn;
 };
 
+/** PROMOTE request's xrow body in MsgPack format. */
+struct PACKED promote_body_bin {
+	struct synchro_body_bin base;
+	uint8_t k_term;
+	uint8_t m_term;
+	uint64_t v_term;
+};
+
 /**
  * Encode synchronous replication request.
  * @param row xrow header.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 3/9] box: actualise iproto_key_type array
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:55 ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:55 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

iproto_key_type array is used while validating incoming requests, but it
was only half-filled. The last initialized field was 0x2b, while
IPROTO_KEY_MAX is currently 0x54.

We got away with it, since the array is only  used in xrow_header_decode(),
xrow_decode_dml() and xrow_decode_synchro(), and all the keys usually present
in these requests were present in the array. This is not true anymore,
so it's time to make array contents up to date with all the IPROTO_KEY_*
constants we have.

Part of #5445
---
 src/box/iproto_constants.c | 58 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 029d9888c..addda39dc 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -90,6 +90,64 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
 	/* 0x2a */	MP_MAP, /* IPROTO_TUPLE_META */
 	/* 0x2b */	MP_MAP, /* IPROTO_OPTIONS */
 	/* }}} */
+
+	/* {{{ unused */
+	/* 0x2c */	MP_UINT,
+	/* 0x2d */	MP_UINT,
+	/* 0x2e */	MP_UINT,
+	/* 0x2f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- response keys */
+	/* 0x30 */	MP_ARRAY, /* IPROTO_DATA */
+	/* 0x31 */	MP_STR, /* IPROTO_ERROR_24 */
+	/* 0x32 */	MP_ARRAY, /* IPROTO_METADATA */
+	/* 0x33 */	MP_ARRAY, /* IPROTO_BIND_METADATA */
+	/* 0x34 */	MP_UINT, /* IIPROTO_BIND_COUNT */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x35 */	MP_UINT,
+	/* 0x36 */	MP_UINT,
+	/* 0x37 */	MP_UINT,
+	/* 0x38 */	MP_UINT,
+	/* 0x39 */	MP_UINT,
+	/* 0x3a */	MP_UINT,
+	/* 0x3b */	MP_UINT,
+	/* 0x3c */	MP_UINT,
+	/* 0x3d */	MP_UINT,
+	/* 0x3e */	MP_UINT,
+	/* 0x3f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- sql keys */
+	/* 0x40 */	MP_STR, /* IPROTO_SQL_TEXT */
+	/* 0x41 */	MP_ARRAY, /* IPROTO_SQL_BIND */
+	/* 0x42 */	MP_MAP, /* IPROTO_SQL_INFO */
+	/* 0x43 */	MP_UINT, /* IPROTO_STMT_ID */
+	/* }}} */
+
+	/* {{{ unused */
+	/* 0x44 */	MP_UINT,
+	/* 0x45 */	MP_UINT,
+	/* 0x46 */	MP_UINT,
+	/* 0x47 */	MP_UINT,
+	/* 0x48 */	MP_UINT,
+	/* 0x49 */	MP_UINT,
+	/* 0x4a */	MP_UINT,
+	/* 0x4b */	MP_UINT,
+	/* 0x4c */	MP_UINT,
+	/* 0x4d */	MP_UINT,
+	/* 0x4e */	MP_UINT,
+	/* 0x4f */	MP_UINT,
+	/* }}} */
+
+	/* {{{ body -- additional request keys */
+	/* 0x50 */	MP_BOOL, /* IPROTO_REPLICA_ANON */
+	/* 0x51 */	MP_ARRAY, /* IPROTO_ID_FILTER */
+	/* 0x52 */	MP_MAP, /* IPROTO_ERROR */
+	/* 0x53 */	MP_UINT, /* IPROTO_TERM */
+	/* }}} */
 };
 
 const char *iproto_type_strs[] =
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:55 ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:55 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

A successful box_clear_synchro_queue() call results in writing
CONFIRM(N) ROLLBACK(N+1) pair, where N is  the confirmed lsn.

Let's write a single PROMOTE(N) entry instead. It'll have  the same
meaning as CONFIRM + ROLLBACK and it will give followers some additional
information regarding leader state change later.

Part of #5445
---
 src/box/applier.cc         |  4 +-
 src/box/box.cc             | 14 ++++++-
 src/box/iproto_constants.h |  5 +++
 src/box/txn_limbo.c        | 79 +++++++++++++++++++++-----------------
 src/box/txn_limbo.h        | 10 ++++-
 5 files changed, 72 insertions(+), 40 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 971b2e64c..e8cbbe27a 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -763,7 +763,7 @@ applier_txn_wal_write_cb(struct trigger *trigger, void *event)
 
 struct synchro_entry {
 	/** Encoded form of a synchro record. */
-	struct synchro_body_bin	body_bin;
+	struct promote_body_bin body_bin;
 
 	/** xrow to write, used by the journal engine. */
 	struct xrow_header row;
@@ -822,7 +822,7 @@ synchro_entry_new(struct xrow_header *applier_row,
 	}
 
 	struct journal_entry *journal_entry = &entry->journal_entry;
-	struct synchro_body_bin *body_bin = &entry->body_bin;
+	struct synchro_body_bin *body_bin = &entry->body_bin.base;
 	struct xrow_header *row = &entry->row;
 
 	journal_entry->rows[0] = row;
diff --git a/src/box/box.cc b/src/box/box.cc
index b846ba8f5..8aba051a2 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
-			txn_limbo_force_empty(&txn_limbo, wait_lsn);
+			/*
+			 * Term parameter is unused now, We'll pass
+			 * box_raft()->term there later.
+			 */
+			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			struct synchro_request req = {
+				.type = 0, /* unused */
+				.replica_id = 0, /* unused */
+				.origin_id = instance_id,
+				.lsn = wait_lsn,
+				.term = 0, /* unused */
+			};
+			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
 		}
 	}
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index 816a308d8..da78ac4d4 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -362,6 +362,11 @@ iproto_type_is_synchro_request(uint32_t type)
 	       type == IPROTO_PROMOTE;
 }
 
+static inline bool
+iproto_type_is_promote_request(uint32_t type)
+{
+	return type == IPROTO_PROMOTE;
+}
 static inline bool
 iproto_type_is_raft_request(uint32_t type)
 {
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index cf0ad9350..664f9d369 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -314,21 +314,25 @@ txn_limbo_write_cb(struct journal_entry *entry)
 }
 
 static void
-txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
+txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn,
+			uint64_t term)
 {
-	assert(lsn > 0);
+	assert(lsn >= 0);
 
 	struct synchro_request req = {
 		.type		= type,
 		.replica_id	= limbo->owner_id,
 		.lsn		= lsn,
+		.term		= term,
 	};
 
 	/*
-	 * This is a synchronous commit so we can
-	 * allocate everything on a stack.
+	 * This is a synchronous commit so we can allocate everything on a
+	 * stack. Promote body includes synchro body.
 	 */
-	struct synchro_body_bin body;
+	struct promote_body_bin body;
+	struct synchro_body_bin *base = &body.base;
+
 	struct xrow_header row;
 	char buf[sizeof(struct journal_entry) +
 		 sizeof(struct xrow_header *)];
@@ -336,7 +340,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint32_t type, int64_t lsn)
 	struct journal_entry *entry = (struct journal_entry *)buf;
 	entry->rows[0] = &row;
 
-	xrow_encode_synchro(&row, &body, &req);
+	xrow_encode_synchro(&row, base, &req);
 
 	journal_entry_create(entry, 1, xrow_approx_len(&row),
 			     txn_limbo_write_cb, fiber());
@@ -368,14 +372,14 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->confirmed_lsn = lsn;
-	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn, 0);
 }
 
 /** Confirm all the entries <= @a lsn. */
 static void
 txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
@@ -431,7 +435,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 	assert(lsn > limbo->confirmed_lsn);
 	assert(!limbo->is_in_rollback);
 	limbo->is_in_rollback = true;
-	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn);
+	txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn, 0);
 	limbo->is_in_rollback = false;
 }
 
@@ -439,7 +443,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
 static void
 txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 {
-	assert(limbo->owner_id != REPLICA_ID_NIL);
+	assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
 	assert(limbo == &txn_limbo);
 	struct txn_limbo_entry *e, *tmp;
 	struct txn_limbo_entry *last_rollback = NULL;
@@ -487,6 +491,32 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
 		box_update_ro_summary();
 }
 
+void
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
+{
+	limbo->confirmed_lsn = lsn;
+	/*
+	 * We make sure that promote is only written once everything this
+	 * instance has may be confirmed.
+	 */
+	struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
+	assert(e == NULL || e->lsn <= lsn);
+	(void) e;
+	txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
+	limbo->is_in_rollback = false;
+}
+
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req)
+{
+	txn_limbo_read_confirm(limbo, req->lsn);
+	txn_limbo_read_rollback(limbo, req->lsn + 1);
+	assert(txn_limbo_is_empty(&txn_limbo));
+	limbo->owner_id = req->origin_id;
+	limbo->confirmed_lsn = 0;
+}
+
 void
 txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
 {
@@ -649,38 +679,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	case IPROTO_ROLLBACK:
 		txn_limbo_read_rollback(limbo, req->lsn);
 		break;
+	case IPROTO_PROMOTE:
+		txn_limbo_read_promote(limbo, req);
+		break;
 	default:
 		unreachable();
 	}
 	return;
 }
 
-void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t confirm_lsn)
-{
-	struct txn_limbo_entry *e, *last_quorum = NULL;
-	struct txn_limbo_entry *rollback = NULL;
-	rlist_foreach_entry(e, &limbo->queue, in_queue) {
-		if (txn_has_flag(e->txn, TXN_WAIT_ACK)) {
-			if (e->lsn <= confirm_lsn) {
-				last_quorum = e;
-			} else {
-				rollback = e;
-				break;
-			}
-		}
-	}
-
-	if (last_quorum != NULL) {
-		txn_limbo_write_confirm(limbo, last_quorum->lsn);
-		txn_limbo_read_confirm(limbo, last_quorum->lsn);
-	}
-	if (rollback != NULL) {
-		txn_limbo_write_rollback(limbo, rollback->lsn);
-		txn_limbo_read_rollback(limbo, rollback->lsn);
-	}
-}
-
 void
 txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index af0addf8d..ec317ed20 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -275,7 +275,15 @@ txn_limbo_wait_confirm(struct txn_limbo *limbo);
  * immediately.
  */
 void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t last_confirm);
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term);
+
+/**
+ * Process a PROMOTE request, i.e. confirm all entries <= @lsn and rollback all
+ * entries > @lsn.
+ */
+void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+		       const struct synchro_request *req);
 
 /**
  * Update qsync parameters dynamically.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 5/9] box: write PROMOTE even for empty limbo
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:56 ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:56 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

PROMOTE entry will be used to mark limbo ownership transition besides
emptying the limbo. So it has to be written every time
`box.ctl.clear_synchro_queue()` succeeds. Even when the limbo was
already empty.

Part of #5445
---
 src/box/box.cc | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 8aba051a2..b093341d3 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1502,19 +1502,18 @@ box_clear_synchro_queue(bool try_wait)
 			 "simultaneous invocations");
 		return -1;
 	}
-	/*
-	 * XXX: we may want to write confirm + rollback even when the limbo is
-	 * empty for the sake of limbo ownership transition.
-	 */
-	if (!is_box_configured || txn_limbo_is_empty(&txn_limbo))
+
+	if (!is_box_configured)
 		return 0;
 	uint32_t former_leader_id = txn_limbo.owner_id;
-	assert(former_leader_id != REPLICA_ID_NIL);
-	if (former_leader_id == instance_id)
-		return 0;
-
+	int64_t wait_lsn = txn_limbo.confirmed_lsn;
+	int rc = 0;
+	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (txn_limbo_is_empty(&txn_limbo))
+		goto promote;
+
 	if (try_wait) {
 		/* Wait until pending confirmations/rollbacks reach us. */
 		double timeout = 2 * replication_synchro_timeout;
@@ -1528,8 +1527,9 @@ box_clear_synchro_queue(bool try_wait)
 		 * Our mission was to clear the limbo from former leader's
 		 * transactions. Exit in case someone did that for us.
 		 */
-		if (txn_limbo_is_empty(&txn_limbo) ||
-		    former_leader_id != txn_limbo.owner_id) {
+		if (former_leader_id != txn_limbo.owner_id) {
+			//TODO: error once we see someone else became the leader
+			// already.
 			in_clear_synchro_queue = false;
 			return 0;
 		}
@@ -1540,11 +1540,10 @@ box_clear_synchro_queue(bool try_wait)
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
-	int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
+	wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
 	assert(wait_lsn > 0);
 
-	int quorum = replication_synchro_quorum;
-	int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
+	rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
 				 replication_synchro_timeout);
 	if (rc == 0) {
 		if (quorum < replication_synchro_quorum) {
@@ -1556,6 +1555,7 @@ box_clear_synchro_queue(bool try_wait)
 				 "new synchronous transactions appeared");
 			rc = -1;
 		} else {
+promote:
 			/*
 			 * Term parameter is unused now, We'll pass
 			 * box_raft()->term there later.
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:56 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:56 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

Start writing the actual leader term together with the PROMOTE request
and process terms in PROMOTE requests on receiver side.

Make applier only apply synchronous transactions from the instance which
has the greatest term as received in PROMOTE requests.

Closes #5445
---
 ...very => qsync-multi-statement-recovery.md} |  0
 changelogs/unreleased/raft-promote.md         |  4 +++
 src/box/applier.cc                            | 18 +++++++++++
 src/box/box.cc                                | 15 ++++++----
 src/lib/raft/raft.c                           |  1 +
 src/lib/raft/raft.h                           | 30 +++++++++++++++++++
 6 files changed, 63 insertions(+), 5 deletions(-)
 rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
 create mode 100644 changelogs/unreleased/raft-promote.md

diff --git a/changelogs/unreleased/qsync-multi-statement-recovery b/changelogs/unreleased/qsync-multi-statement-recovery.md
similarity index 100%
rename from changelogs/unreleased/qsync-multi-statement-recovery
rename to changelogs/unreleased/qsync-multi-statement-recovery.md
diff --git a/changelogs/unreleased/raft-promote.md b/changelogs/unreleased/raft-promote.md
new file mode 100644
index 000000000..e5dac599c
--- /dev/null
+++ b/changelogs/unreleased/raft-promote.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a bug in synchronous replication when rolled back transactions could
+  reappear once a sufficiently old instance reconnected (gh-5445).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index e8cbbe27a..926d2f7ea 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -849,6 +849,9 @@ apply_synchro_row(struct xrow_header *row)
 
 	txn_limbo_process(&txn_limbo, &req);
 
+	if (req.type == IPROTO_PROMOTE)
+		raft_source_update_term(box_raft(), req.origin_id, req.term);
+
 	struct synchro_entry *entry;
 	entry = synchro_entry_new(row, &req);
 	if (entry == NULL)
@@ -1044,6 +1047,21 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
 		}
 	}
 
+	/*
+	 * All the synchronous rows coming from outdated instances are ignored
+	 * and replaced with NOPs to save vclock consistency.
+	 */
+	struct applier_tx_row *item;
+	if (raft_source_has_outdated_term(box_raft(), applier->instance_id) &&
+	    (last_row->wait_sync ||
+	     (iproto_type_is_synchro_request(first_row->type) &&
+	     !iproto_type_is_promote_request(first_row->type)))) {
+		stailq_foreach_entry(item, rows, next) {
+			struct xrow_header *row = &item->row;
+			row->type = IPROTO_NOP;
+			row->bodycnt = 0;
+		}
+	}
 	if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
 		/*
 		 * Synchro messages are not transactions, in terms
diff --git a/src/box/box.cc b/src/box/box.cc
index b093341d3..9b6323b3f 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -426,6 +426,10 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
 		return -1;
 	}
 	txn_limbo_process(&txn_limbo, &syn_req);
+	if (syn_req.type == IPROTO_PROMOTE) {
+			raft_source_update_term(box_raft(), syn_req.origin_id,
+						syn_req.term);
+	}
 	return 0;
 }
 
@@ -1556,11 +1560,10 @@ box_clear_synchro_queue(bool try_wait)
 			rc = -1;
 		} else {
 promote:
-			/*
-			 * Term parameter is unused now, We'll pass
-			 * box_raft()->term there later.
-			 */
-			txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+			/* We cannot possibly get here in a volatile state. */
+			assert(box_raft()->volatile_term == box_raft()->term);
+			txn_limbo_write_promote(&txn_limbo, wait_lsn,
+						box_raft()->term);
 			struct synchro_request req = {
 				.type = 0, /* unused */
 				.replica_id = 0, /* unused */
@@ -1570,6 +1573,8 @@ promote:
 			};
 			txn_limbo_read_promote(&txn_limbo, &req);
 			assert(txn_limbo_is_empty(&txn_limbo));
+			raft_source_update_term(box_raft(), req.origin_id,
+						req.lsn);
 		}
 	}
 	in_clear_synchro_queue = false;
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 4ea4fc3f8..e9ce8cade 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -985,6 +985,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
 		.death_timeout = 5,
 		.vtab = vtab,
 	};
+	vclock_create(&raft->term_map);
 	raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
 			   0, 0);
 	raft->timer.data = raft;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e447f6634..cba45a67d 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -207,6 +207,19 @@ struct raft {
 	 * subsystems, such as Raft.
 	 */
 	const struct vclock *vclock;
+	/**
+	 * The biggest term seen by this instance and persisted in WAL as part
+	 * of a PROMOTE request. May be smaller than @a term, while there are
+	 * ongoing elections, or the leader is already known, but this instance
+	 * hasn't read its PROMOTE request yet.
+	 * During other times must be equal to @a term.
+	 */
+	uint64_t greatest_known_term;
+	/**
+	 * Latest terms received with PROMOTE entries from remote instances.
+	 * Raft uses them to determine data from which sources may be applied.
+	 */
+	struct vclock term_map;
 	/** State machine timed event trigger. */
 	struct ev_timer timer;
 	/** Configured election timeout in seconds. */
@@ -243,6 +256,13 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
 	return !raft->is_enabled || raft->leader == source_id;
 }
 
+static inline bool
+raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
+{
+	uint64_t source_term = vclock_get(&raft->term_map, source_id);
+	return raft->is_enabled && source_term < raft->greatest_known_term;
+}
+
 /** Check if Raft is enabled. */
 static inline bool
 raft_is_enabled(const struct raft *raft)
@@ -250,6 +270,16 @@ raft_is_enabled(const struct raft *raft)
 	return raft->is_enabled;
 }
 
+static inline void
+raft_source_update_term(struct raft *raft, uint32_t source_id, uint64_t term)
+{
+	if ((uint64_t) vclock_get(&raft->term_map, source_id) >= term)
+		return;
+	vclock_follow(&raft->term_map, source_id, term);
+	if (term > raft->greatest_known_term)
+		raft->greatest_known_term = term;
+}
+
 /** Process a raft entry stored in WAL/snapshot. */
 void
 raft_process_recovery(struct raft *raft, const struct raft_msg *req);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 7/9] replication: introduce a new election mode: "manual"
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:56 ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
  8 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:56 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

When an instance is configured in "manual" election mode, it behaves as
a voter for most of the time, until `box.ctl.promote()` is called.

Once `box.ctl.promote()` is called the instance will behave as a
candidate for a full election round, e.g. until the leader is known. If
the instance wins the elections, it remains in `leader` state until the
next elections. Otherwise the instance returns to `voter` state.

Follow-up #5445
Part of #3055
---
 src/box/box.cc                         | 37 +++++++++++++++++---------
 src/box/raft.c                         |  2 ++
 src/box/raft.h                         | 17 ++++++++++++
 test/replication/election_basic.result |  4 +--
 4 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 9b6323b3f..10e8351b0 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -680,17 +680,27 @@ box_check_uri(const char *source, const char *option_name)
 	}
 }
 
-static const char *
+static enum election_mode
 box_check_election_mode(void)
 {
 	const char *mode = cfg_gets("election_mode");
-	if (mode == NULL || (strcmp(mode, "off") != 0 &&
-	    strcmp(mode, "voter") != 0 && strcmp(mode, "candidate") != 0)) {
-		diag_set(ClientError, ER_CFG, "election_mode", "the value must "
-			 "be a string 'off' or 'voter' or 'candidate'");
-		return NULL;
-	}
-	return mode;
+	if (mode == NULL)
+		goto error;
+
+	if (strcmp(mode, "off") == 0)
+		return ELECTION_MODE_OFF;
+	else if (strcmp(mode, "voter") == 0)
+		return ELECTION_MODE_VOTER;
+	else if (strcmp(mode, "manual") == 0)
+		return ELECTION_MODE_MANUAL;
+	else if (strcmp(mode, "candidate") == 0)
+		return ELECTION_MODE_CANDIDATE;
+
+error:
+	diag_set(ClientError, ER_CFG, "election_mode",
+		"the value must be one of the following strings: "
+		"'off', 'voter', 'candidate', 'manual'");
+	return ELECTION_MODE_INVALID;
 }
 
 static double
@@ -1113,7 +1123,7 @@ box_check_config(void)
 	box_check_uri(cfg_gets("listen"), "listen");
 	box_check_instance_uuid(&uuid);
 	box_check_replicaset_uuid(&uuid);
-	if (box_check_election_mode() == NULL)
+	if (box_check_election_mode() == ELECTION_MODE_INVALID)
 		diag_raise();
 	if (box_check_election_timeout() < 0)
 		diag_raise();
@@ -1147,11 +1157,12 @@ box_check_config(void)
 int
 box_set_election_mode(void)
 {
-	const char *mode = box_check_election_mode();
-	if (mode == NULL)
+	enum election_mode mode = box_check_election_mode();
+	if (mode == ELECTION_MODE_INVALID)
 		return -1;
-	raft_cfg_is_candidate(box_raft(), strcmp(mode, "candidate") == 0);
-	raft_cfg_is_enabled(box_raft(), strcmp(mode, "off") != 0);
+	box_election_mode = mode;
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
 
diff --git a/src/box/raft.c b/src/box/raft.c
index cfd898db0..285dbe4fd 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -44,6 +44,8 @@ struct raft box_raft_global = {
 	.state = 0,
 };
 
+enum election_mode box_election_mode = ELECTION_MODE_INVALID;
+
 /**
  * A trigger executed each time the Raft state machine updates any
  * of its visible attributes.
diff --git a/src/box/raft.h b/src/box/raft.h
index 1c59f17e6..15f4e80d9 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,8 +35,25 @@
 extern "C" {
 #endif
 
+enum election_mode {
+	ELECTION_MODE_INVALID = -1,
+	ELECTION_MODE_OFF = 0,
+	ELECTION_MODE_VOTER = 1,
+	ELECTION_MODE_MANUAL = 2,
+	ELECTION_MODE_CANDIDATE = 3,
+};
+
 struct raft_request;
 
+/**
+ * box_election_mode - current mode of operation for raft. Some modes correspond
+ * to RAFT operation modes directly, like CANDIDATE, VOTER and OFF.
+ * There's a mode which does not map to raft operation mode directly:
+ * MANUAL. In this mode RAFT usually operates as a voter, but it may become a
+ * candidate for some period of time when user calls `box.ctl.promote()`
+ */
+extern enum election_mode box_election_mode;
+
 /** Raft state of this instance. */
 static inline struct raft *
 box_raft(void)
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index 4d7d33f2b..d5320b3ff 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -22,8 +22,8 @@ box.cfg{election_mode = 100}
  | ...
 box.cfg{election_mode = '100'}
  | ---
- | - error: 'Incorrect value for option ''election_mode'': the value must be a string
- |     ''off'' or ''voter'' or ''candidate'''
+ | - error: 'Incorrect value for option ''election_mode'': the value must be one of the
+ |     following strings: ''off'', ''voter'', ''candidate'', ''manual'''
  | ...
 box.cfg{election_timeout = -1}
  | ---
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (6 preceding siblings ...)
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:56 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
  8 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:56 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

This patch adds support for manual elections from
`box.ctl.clear_synchro_queue()`. When an instance is in
`election_mode='manual'`, calling `clear_synchro_queue()` will make it
start a new election round.

Follow-up #5445
Part of #3055

@TarantoolBot document
Title: describe election_mode='manual'

Manual election mode is introduced. It may be used when the user wants to
control which instance is the leader explicitly instead of relying on
Raft election algorithm.

When an instance is configured with `election_mode='manual'`, it behaves
as follows:
 1) By default, the instance acts like a voter: it is read-only and may
    vote for other instances that are candidates.
 2) Once `box.ctl.clear_synchro_queue()` is called, the instance becomes a
    candidate and starts a new election round. If the instance wins the
    elections, it remains leader, but won't participate in any new elections.
---
 src/box/box.cc              | 74 +++++++++++++++++++++++++++++++++++--
 src/box/errcode.h           |  3 ++
 src/box/raft.c              | 25 ++++++++++++-
 src/box/raft.h              |  3 ++
 src/lib/raft/raft.c         | 12 +++++-
 src/lib/raft/raft.h         |  2 +-
 test/unit/raft_test_utils.c |  4 +-
 7 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/src/box/box.cc b/src/box/box.cc
index 10e8351b0..41d2ff0f8 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1161,7 +1161,8 @@ box_set_election_mode(void)
 	if (mode == ELECTION_MODE_INVALID)
 		return -1;
 	box_election_mode = mode;
-	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+	raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE,
+			      true);
 	raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
 	return 0;
 }
@@ -1520,12 +1521,77 @@ box_clear_synchro_queue(bool try_wait)
 
 	if (!is_box_configured)
 		return 0;
+
+	bool run_elections = false;
+
+	switch (box_election_mode) {
+	case ELECTION_MODE_OFF:
+		break;
+	case ELECTION_MODE_VOTER:
+		diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
+			 "manual elections");
+		return -1;
+	case ELECTION_MODE_MANUAL:
+		/*
+		 * Even if this instance is already a leader, allow to bump term
+		 * manually once again.
+		 */
+		assert(box_raft()->state != RAFT_STATE_CANDIDATE);
+		if (box_raft()->state == RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_ALREADY_LEADER);
+			return -1;
+		}
+		run_elections = true;
+		try_wait = false;
+		break;
+	case ELECTION_MODE_CANDIDATE:
+		/*
+		 * Leader elections are enabled, and this instance is allowed to
+		 * promote only if it's already an elected leader. No manual
+		 * elections.
+		 */
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
+				 "'candidate'", "manual elections");
+			return -1;
+		}
+		break;
+	default:
+		unreachable();
+	}
+
 	uint32_t former_leader_id = txn_limbo.owner_id;
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
 	in_clear_synchro_queue = true;
 
+	if (run_elections) {
+		/*
+		 * Make this instance a candidate and run until some leader, not
+		 * necessarily this instance, emerges.
+		 */
+		raft_cfg_is_candidate(box_raft(), true, false);
+		/*
+		 * Trigger new elections without waiting for an old leader to
+		 * disappear.
+		 */
+		raft_new_term(box_raft());
+		box_raft_wait_leader_found();
+		raft_cfg_is_candidate(box_raft(), false, false);
+		if (!box_raft()->is_enabled) {
+			diag_set(ClientError, ER_RAFT_DISABLED);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+		if (box_raft()->state != RAFT_STATE_LEADER) {
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 box_raft()->leader);
+			in_clear_synchro_queue = false;
+			return -1;
+		}
+	}
+
 	if (txn_limbo_is_empty(&txn_limbo))
 		goto promote;
 
@@ -1543,10 +1609,10 @@ box_clear_synchro_queue(bool try_wait)
 		 * transactions. Exit in case someone did that for us.
 		 */
 		if (former_leader_id != txn_limbo.owner_id) {
-			//TODO: error once we see someone else became the leader
-			// already.
+			diag_set(ClientError, ER_INTERFERING_PROMOTE,
+				 txn_limbo.owner_id);
 			in_clear_synchro_queue = false;
-			return 0;
+			return -1;
 		}
 	}
 
diff --git a/src/box/errcode.h b/src/box/errcode.h
index 56573688e..e5c9f3b09 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -275,6 +275,9 @@ struct errcode_record {
 	/*220 */_(ER_TOO_EARLY_SUBSCRIBE,	"Can't subscribe non-anonymous replica %s until join is done") \
 	/*221 */_(ER_SQL_CANT_ADD_AUTOINC,	"Can't add AUTOINCREMENT: space %s can't feature more than one AUTOINCREMENT field") \
 	/*222 */_(ER_QUORUM_WAIT,		"Couldn't wait for quorum %d: %s") \
+	/*223 */_(ER_INTERFERING_PROMOTE,	"Instance with replica id %u was promoted first") \
+	/*224 */_(ER_RAFT_DISABLED,		"Elections were turned off while running box.ctl.promote()")\
+	/*225 */_(ER_ALREADY_LEADER,		"Can't promote an existing leader")\
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/raft.c b/src/box/raft.c
index 285dbe4fd..47d4fd56d 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -95,7 +95,8 @@ box_raft_update_synchro_queue(struct raft *raft)
 	 * manually. In this case the call below will exit immediately and we'll
 	 * simply log a warning.
 	 */
-	if (raft->state == RAFT_STATE_LEADER) {
+	if (raft->state == RAFT_STATE_LEADER &&
+	    box_election_mode != ELECTION_MODE_MANUAL) {
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
@@ -336,6 +337,28 @@ fail:
 	panic("Could not write a raft request to WAL\n");
 }
 
+static int
+box_raft_wait_leader_found_trig(struct trigger *trig, void *event)
+{
+	struct raft *raft = (struct raft *)event;
+	assert(raft == box_raft());
+	struct fiber *waiter = (struct fiber *)trig->data;
+	if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
+		fiber_wakeup(waiter);
+	return 0;
+}
+
+void
+box_raft_wait_leader_found(void)
+{
+	struct trigger trig;
+	trigger_create(&trig, box_raft_wait_leader_found_trig, fiber(), NULL);
+	raft_on_update(box_raft(), &trig);
+	fiber_yield();
+	assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
+	trigger_clear(&trig);
+}
+
 void
 box_raft_init(void)
 {
diff --git a/src/box/raft.h b/src/box/raft.h
index 15f4e80d9..8fce423e1 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
 int
 box_raft_process(struct raft_request *req, uint32_t source);
 
+void
+box_raft_wait_leader_found();
+
 void
 box_raft_init(void);
 
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e9ce8cade..7b77e05ea 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -846,7 +846,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
 }
 
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote)
 {
 	bool old_is_candidate = raft->is_candidate;
 	raft->is_cfg_candidate = is_candidate;
@@ -874,8 +874,16 @@ raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
 			raft_ev_timer_stop(raft_loop(), &raft->timer);
 		}
 		if (raft->state != RAFT_STATE_FOLLOWER) {
-			if (raft->state == RAFT_STATE_LEADER)
+			if (raft->state == RAFT_STATE_LEADER) {
+				if (!demote) {
+					/*
+					 * Remain leader until someone
+					 * triggers new elections.
+					 */
+					return;
+				}
 				raft->leader = 0;
+			}
 			raft->state = RAFT_STATE_FOLLOWER;
 			/* State is visible and changed - broadcast. */
 			raft_schedule_broadcast(raft);
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index cba45a67d..d31471802 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -309,7 +309,7 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
  * the node still can vote, when Raft is enabled.
  */
 void
-raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
+raft_cfg_is_candidate(struct raft *raft, bool is_candidate, bool demote);
 
 /** Configure Raft leader election timeout. */
 void
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b8735f373..a10ccae6a 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -360,7 +360,7 @@ raft_node_start(struct raft_node *node)
 		raft_process_recovery(&node->raft, &node->journal.rows[i]);
 
 	raft_cfg_is_enabled(&node->raft, node->cfg_is_enabled);
-	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate);
+	raft_cfg_is_candidate(&node->raft, node->cfg_is_candidate, true);
 	raft_cfg_election_timeout(&node->raft, node->cfg_election_timeout);
 	raft_cfg_election_quorum(&node->raft, node->cfg_election_quorum);
 	raft_cfg_death_timeout(&node->raft, node->cfg_death_timeout);
@@ -402,7 +402,7 @@ raft_node_cfg_is_candidate(struct raft_node *node, bool value)
 {
 	node->cfg_is_candidate = value;
 	if (raft_node_is_started(node)) {
-		raft_cfg_is_candidate(&node->raft, value);
+		raft_cfg_is_candidate(&node->raft, value, true);
 		raft_run_async_work();
 	}
 }
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote
  2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
                   ` (7 preceding siblings ...)
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
@ 2021-04-11 17:56 ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:24   ` Serge Petrenko via Tarantool-patches
  8 siblings, 1 reply; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-11 17:56 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches

New function name will be `box.ctl.promote()`. It's much shorter and
closer to the function's now enriched functionality.

Old name `box.ctl.clear_synchro_queue()` remains in Lua for the sake of
backward compatibility.

Follow-up #5445
Closes #3055

@TarantoolBot document
Title: deprecate `box.ctl.clear_synchro_queue()` in favor of `box.ctl.promote()`

Replace all the mentions of `box.ctl.clear_synchro_queue()` with
`box.ctl.promote()` and add a note that `box.ctl.clear_synchro_queue()`
is a deprecated alias to `box.ctl.promote()`
---
 changelogs/unreleased/box-ctl-promote.md |  8 ++++++++
 src/box/box.cc                           | 20 ++++++++++----------
 src/box/box.h                            |  2 +-
 src/box/lua/ctl.c                        |  8 +++++---
 src/box/raft.c                           |  4 ++--
 5 files changed, 26 insertions(+), 16 deletions(-)
 create mode 100644 changelogs/unreleased/box-ctl-promote.md

diff --git a/changelogs/unreleased/box-ctl-promote.md b/changelogs/unreleased/box-ctl-promote.md
new file mode 100644
index 000000000..15f6fb206
--- /dev/null
+++ b/changelogs/unreleased/box-ctl-promote.md
@@ -0,0 +1,8 @@
+## feature/replication
+
+* Introduce `box.ctl.promote()` and the concept of manual elections (enabled
+  with `election_mode='manual'`). Once the instance is in `manual` election
+  mode, it acts like a `voter` most of the time, but may trigger elections and
+  become a leader, once `box.ctl.promote()` is called.
+  When `election_mode ~= 'manual'`, `box.ctl.promote()` replaces
+  `box.ctl.clear_synchro_queue()`, which is now deprecated (gh-3055).
diff --git a/src/box/box.cc b/src/box/box.cc
index 41d2ff0f8..40467805b 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1509,12 +1509,12 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
 }
 
 int
-box_clear_synchro_queue(bool try_wait)
+box_promote(bool try_wait)
 {
 	/* A guard to block multiple simultaneous function invocations. */
-	static bool in_clear_synchro_queue = false;
-	if (in_clear_synchro_queue) {
-		diag_set(ClientError, ER_UNSUPPORTED, "clear_synchro_queue",
+	static bool in_promote = false;
+	if (in_promote) {
+		diag_set(ClientError, ER_UNSUPPORTED, "promote",
 			 "simultaneous invocations");
 		return -1;
 	}
@@ -1564,7 +1564,7 @@ box_clear_synchro_queue(bool try_wait)
 	int64_t wait_lsn = txn_limbo.confirmed_lsn;
 	int rc = 0;
 	int quorum = replication_synchro_quorum;
-	in_clear_synchro_queue = true;
+	in_promote = true;
 
 	if (run_elections) {
 		/*
@@ -1581,13 +1581,13 @@ box_clear_synchro_queue(bool try_wait)
 		raft_cfg_is_candidate(box_raft(), false, false);
 		if (!box_raft()->is_enabled) {
 			diag_set(ClientError, ER_RAFT_DISABLED);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 		if (box_raft()->state != RAFT_STATE_LEADER) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 box_raft()->leader);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
@@ -1611,13 +1611,13 @@ box_clear_synchro_queue(bool try_wait)
 		if (former_leader_id != txn_limbo.owner_id) {
 			diag_set(ClientError, ER_INTERFERING_PROMOTE,
 				 txn_limbo.owner_id);
-			in_clear_synchro_queue = false;
+			in_promote = false;
 			return -1;
 		}
 	}
 
 	/*
-	 * clear_synchro_queue() is a no-op on the limbo owner, so all the rows
+	 * promote() is a no-op on the limbo owner, so all the rows
 	 * in the limbo must've come through the applier meaning they already
 	 * have an lsn assigned, even if their WAL write hasn't finished yet.
 	 */
@@ -1654,7 +1654,7 @@ promote:
 						req.lsn);
 		}
 	}
-	in_clear_synchro_queue = false;
+	in_promote = false;
 	return rc;
 }
 
diff --git a/src/box/box.h b/src/box/box.h
index e2321b9b0..89c6fe1a1 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
 typedef struct tuple box_tuple_t;
 
 int
-box_clear_synchro_queue(bool try_wait);
+box_promote(bool try_wait);
 
 /* box_select is private and used only by FFI */
 API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index d039a059f..f06af8588 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -82,9 +82,9 @@ lbox_ctl_on_schema_init(struct lua_State *L)
 }
 
 static int
-lbox_ctl_clear_synchro_queue(struct lua_State *L)
+lbox_ctl_promote(struct lua_State *L)
 {
-	if (box_clear_synchro_queue(true) != 0)
+	if (box_promote(true) != 0)
 		return luaT_error(L);
 	return 0;
 }
@@ -124,7 +124,9 @@ static const struct luaL_Reg lbox_ctl_lib[] = {
 	{"wait_rw", lbox_ctl_wait_rw},
 	{"on_shutdown", lbox_ctl_on_shutdown},
 	{"on_schema_init", lbox_ctl_on_schema_init},
-	{"clear_synchro_queue", lbox_ctl_clear_synchro_queue},
+	{"promote", lbox_ctl_promote},
+	/* An old alias. */
+	{"clear_synchro_queue", lbox_ctl_promote},
 	{"is_recovery_finished", lbox_ctl_is_recovery_finished},
 	{"set_on_shutdown_timeout", lbox_ctl_set_on_shutdown_timeout},
 	{NULL, NULL}
diff --git a/src/box/raft.c b/src/box/raft.c
index 47d4fd56d..45baf5dd8 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -91,7 +91,7 @@ box_raft_update_synchro_queue(struct raft *raft)
 	 * If the node became a leader, it means it will ignore all records from
 	 * all the other nodes, and won't get late CONFIRM messages anyway. Can
 	 * clear the queue without waiting for confirmations.
-	 * It's alright that the user may have called clear_synchro_queue
+	 * It's alright that the user may have called promote
 	 * manually. In this case the call below will exit immediately and we'll
 	 * simply log a warning.
 	 */
@@ -100,7 +100,7 @@ box_raft_update_synchro_queue(struct raft *raft)
 		int rc = 0;
 		uint32_t errcode = 0;
 		do {
-			rc = box_clear_synchro_queue(false);
+			rc = box_promote(false);
 			if (rc != 0) {
 				struct error *err = diag_last_error(diag_get());
 				errcode = box_error_code(err);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-12 13:06   ` Cyrill Gorcunov via Tarantool-patches
  2021-04-13 13:26     ` Serge Petrenko via Tarantool-patches
  2021-04-12 19:21   ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 16+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-04-12 13:06 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: v.shpilevoy, tarantool-patches

On Sun, Apr 11, 2021 at 08:55:56PM +0300, Serge Petrenko wrote:
> diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
> index d3738c705..f7f46088f 100644
> --- a/src/box/iproto_constants.h
> +++ b/src/box/iproto_constants.h
> @@ -49,9 +49,14 @@ enum {
>  	XLOG_FIXHEADER_SIZE = 19
>  };
>  
> +/** IPROTO_FLAGS bitfield constants. */
>  enum {
>  	/** Set for the last xrow in a transaction. */
>  	IPROTO_FLAG_COMMIT = 0x01,
> +	/** Set for the last row of a tx residing in limbo. */
> +	IPROTO_FLAG_WAIT_SYNC = 0x02,
> +	/** Set for the last row of a synchronous tx. */
> +	IPROTO_FLAG_WAIT_ACK = 0x04,
>  };

Serge, is there some particular reason why cant we make them 1:1
mapping to txn flags? Ie

	IPROTO_FLAG_WAIT_SYNC	= 0x10,
	IPROTO_FLAG_WAIT_ACK	= 0x20,

this would allow us to eliminate branching in code. While txn flags
are not part of API and iproto flags _are_ I don't expect the former
to be changed anyhow soon?

Not insisting though, just share the idea.

>  
>  enum iproto_key {
> diff --git a/src/box/journal.h b/src/box/journal.h
> index 76c70c19f..9ab8af2c1 100644
> --- a/src/box/journal.h
> +++ b/src/box/journal.h
> @@ -63,6 +63,7 @@ struct journal_entry {
>  	 * A journal entry completion callback argument.
>  	 */
>  	void *complete_data;
> +	uint8_t opt_flags;
>  	/**
>  	 * Asynchronous write completion function.
>  	 */
> @@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
>  	entry->approx_len	= approx_len;
>  	entry->n_rows		= n_rows;
>  	entry->res		= -1;
> +	entry->opt_flags = 0;
>  }

Please don't ruine alignment here. I know that Vlad prefer dense style
(which is actually hard to parse by eyes at least for me) but here we
have a block which ether should be whole formatted to dense or kept
aligned.

>  
>  /**
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 40061ff09..d65315f58 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
>  		row->lsn = 0;
>  		row->sync = 0;
>  		row->tm = 0;
> +		row->opt_flags = 0;
>  	}
>  	/*
>  	 * Group ID should be set both for requests not having a
> @@ -681,6 +682,10 @@ txn_journal_entry_new(struct txn *txn)
>  		--req->n_rows;
>  	}
>  
> +	req->opt_flags |=
> +		(txn_has_flag(txn, TXN_WAIT_SYNC) ? IPROTO_FLAG_WAIT_SYNC : 0) |
> +		(txn_has_flag(txn, TXN_WAIT_ACK) ? IPROTO_FLAG_WAIT_ACK : 0);
> +

When mentioning branching I meant this code. We could use 1:1 mapping
here if we define iproto flags to be the same as txn flags.

Or say some explicit map could be provided (for readability sake)

	static const uint8_t opt_flags_map[] = {
		[TXN_WAIT_SYNC]		= IPROTO_FLAG_WAIT_SYNC,
		[TXN_WAIT_ACK]		= IPROTO_FLAG_WAIT_ACK,
	};

	req->opt_flags |= opt_flags_map[txn->flags & IPROTO_FLAG_WAIT_SYNC];
	req->opt_flags |= opt_flags_map[txn->flags & IPROTO_FLAG_WAIT_ACK];

Such technique is widely used when need some mapping between flags and
instead of number of `if` one simply use read-only memory.

On the other hands I'm fine with branching here because this is not
that hot code and even misprediction should not harm much. So just
to share the ideas.

>  	return req;
>  }
>  
> diff --git a/src/box/wal.c b/src/box/wal.c
> index 34af0bda6..00fcb21b4 100644
> --- a/src/box/wal.c
> +++ b/src/box/wal.c
> @@ -962,14 +962,14 @@ out:
>   */
>  static void
>  wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
> -	       struct xrow_header **row,
> -	       struct xrow_header **end)
> +	       struct journal_entry *entry)
>  {
>  	int64_t tsn = 0;
> -	struct xrow_header **start = row;
> -	struct xrow_header **first_glob_row = row;
> +	struct xrow_header **start = entry->rows;
> +	struct xrow_header **end = entry->rows + entry->n_rows;
> +	struct xrow_header **first_glob_row = start;

There is no need to define another dependency, actually I think
the compiler will figure out that @first_glob_row and @start
are initialized from same @entry->rows still previously this
was more clear, iow I propose to leave `first_glob_row = entry->rows;`

>  	/** Assign LSN to all local rows. */
> -	for ( ; row < end; row++) {
> +	for (struct xrow_header **row = start; row < end; row++) {
>  		if ((*row)->replica_id == 0) {
>  			/*
>  			 * All rows representing local space data
> @@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>  				first_glob_row = row;
>  			}
>  			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
> -			(*row)->is_commit = row == end - 1;
> +			if (row < end - 1)
> +				continue;
> +			/* Tx meta is stored in the last tx row. */
> +			if (row == end - 1) {
> +				(*row)->opt_flags = entry->opt_flags;
> +				(*row)->is_commit = true;

Why can't we use
				(*row)->opt_flags = entry->opt_flags | IPROTO_FLAG_COMMIT;
instead?

> +			}
>  		} else {
>  			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
>  			if (diff <= vclock_get(vclock_diff,
> @@ -1020,7 +1026,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>  	 * the first global row. tsn was yet unknown when those
>  	 * rows were processed.
>  	 */
> -	for (row = start; row < first_glob_row; row++)
> +	for (struct xrow_header **row = start; row < first_glob_row; row++)
>  		(*row)->tsn = tsn;
>  }
>  
> @@ -1098,8 +1104,7 @@ wal_write_to_disk(struct cmsg *msg)
>  	struct journal_entry *entry;
>  	struct stailq_entry *last_committed = NULL;
>  	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
> -		wal_assign_lsn(&vclock_diff, &writer->vclock,
> -			       entry->rows, entry->rows + entry->n_rows);
> +		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
>  		entry->res = vclock_sum(&vclock_diff) +
>  			     vclock_sum(&writer->vclock);
>  		rc = xlog_write_entry(l, entry);
> @@ -1319,8 +1324,7 @@ wal_write_none_async(struct journal *journal,
>  	struct vclock vclock_diff;
>  
>  	vclock_create(&vclock_diff);
> -	wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
> -		       entry->rows + entry->n_rows);
> +	wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
>  	vclock_merge(&writer->vclock, &vclock_diff);
>  	vclock_copy(&replicaset.vclock, &writer->vclock);
>  	entry->res = vclock_sum(&writer->vclock);
> diff --git a/src/box/xrow.c b/src/box/xrow.c
> index bc06738ad..cc8e43ed4 100644
> --- a/src/box/xrow.c
> +++ b/src/box/xrow.c
> @@ -183,7 +183,7 @@ error:
>  			break;
>  		case IPROTO_FLAGS:
>  			flags = mp_decode_uint(pos);
> -			header->is_commit = flags & IPROTO_FLAG_COMMIT;
> +			header->opt_flags = flags;
>  			break;
>  		default:
>  			/* unknown header */
> @@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
>  	 *   flag to find transaction boundary (last row in the
>  	 *   transaction stream).
>  	 */
> +	uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
>  	if (header->tsn != 0) {
>  		if (header->tsn != header->lsn || !header->is_commit) {
>  			/*
> @@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
>  			map_size++;
>  		}
>  		if (header->is_commit && header->tsn != header->lsn) {
> -			/* Setup last row for multi row transaction. */
> -			d = mp_encode_uint(d, IPROTO_FLAGS);
> -			d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
> -			map_size++;
> +			flags_to_encode |= IPROTO_FLAG_COMMIT;
>  		}
>  	}
> +	if (flags_to_encode != 0) {
> +		d = mp_encode_uint(d, IPROTO_FLAGS);
> +		d = mp_encode_uint(d, flags_to_encode);
> +		map_size++;
> +	}
>  	assert(d <= data + XROW_HEADER_LEN_MAX);
>  	mp_encode_map(data, map_size);
>  	out->iov_len = d - (char *) out->iov_base;
> diff --git a/src/box/xrow.h b/src/box/xrow.h
> index fde8f9474..2a18733c0 100644
> --- a/src/box/xrow.h
> +++ b/src/box/xrow.h
> @@ -80,14 +80,28 @@ struct xrow_header {
>  	 * transaction.
>  	 */
>  	int64_t tsn;
> -	/**
> -	 * True for the last row in a multi-statement transaction,
> -	 * or single-statement transaction. Is only encoded in the
> -	 * write ahead log for multi-statement transactions.
> -	 * Single-statement transactions do not encode
> -	 * tsn and is_commit flag to save space.
> -	 */
> -	bool is_commit;
> +	/** Transaction meta flags set only in the last transaction row. */
> +	union {
> +		uint8_t opt_flags;
> +		struct {
> +			/**
> +			 * Is only encoded in the write ahead log for
> +			 * multi-statement transactions. Single-statement
> +			 * transactions do not encode tsn and is_commit flag to
> +			 * save space.
> +			 */
> +			bool is_commit : 1;
> +			/**
> +			 * True for a synchronous transaction.
> +			 */
> +			bool wait_sync : 1;
> +			/**
> +			 * True for any transaction that would enter the limbo
> +			 * (not necessarily a synchronous one).
> +			 */
> +			bool wait_ack  : 1;
> +		};
> +	};

Serge, I know that you mention bitfields in the mail in context of our
discussion and I assume you'll rework this otherwise this will be
just a hidden bug which might work for sometime but one day it will
trigger and we won't even be able to figure out why.

>  
>  	int bodycnt;
>  	uint32_t schema_version;
> -- 
> 2.24.3 (Apple Git-128)
> 

	Cyrill

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
  2021-04-12 13:06   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-12 19:21   ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:21 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches



11.04.2021 20:55, Serge Petrenko пишет:
> Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
> These flags are set for rows belonging to synchronous transactions in
> addition to `is_commit`.
>
> The new flags help to define whether the rows belong to a synchronous
> transaction or not without parsing them all and checking whether any of
> the rows touches a synchronous space.
>
> This will be used in applier once it is taught to filter synchronous
> transactions based on whether they are coming from a raft leader or not.
>
> P.S. These flags will also be useful once we allow to turn any transaction
> synchronous. Once this is done, the flags in row header will be the only
> source of information on whether the transaction is synchronous or not.
>
> Prerequisite #5445

Force-pushed a test:


=====================================


diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index 9fd154719..ea1ee1767 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -204,7 +204,9 @@ test_greeting()
  void
  test_xrow_header_encode_decode()
  {
-    plan(10);
+    /* Test all possible 3-bit combinations. */
+    const int bit_comb_count = 1 << 3;
+    plan(1 + bit_comb_count);
      struct xrow_header header;
      char buffer[2048];
      char *pos = mp_encode_uint(buffer, 300);
@@ -217,27 +219,47 @@ test_xrow_header_encode_decode()
      header.tm = 123.456;
      header.bodycnt = 0;
      header.tsn = header.lsn;
-    header.is_commit = true;
      uint64_t sync = 100500;
-    struct iovec vec[1];
-    is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
-    int fixheader_len = 200;
-    pos = (char *)vec[0].iov_base + fixheader_len;
-    is(mp_decode_map((const char **)&pos), 5, "header map size");
-
-    struct xrow_header decoded_header;
-    const char *begin = (const char *)vec[0].iov_base;
-    begin += fixheader_len;
-    const char *end = (const char *)vec[0].iov_base;
-    end += vec[0].iov_len;
-    is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
-       "header decode");
-    is(header.type, decoded_header.type, "decoded type");
-    is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
-    is(header.lsn, decoded_header.lsn, "decoded lsn");
-    is(header.tm, decoded_header.tm, "decoded tm");
-    is(decoded_header.sync, sync, "decoded sync");
-    is(decoded_header.bodycnt, 0, "decoded bodycnt");
+    for (int opt_idx = 0; opt_idx < bit_comb_count; opt_idx++) {
+        plan(12);
+        header.is_commit = opt_idx & 0x01;
+        header.wait_sync = opt_idx >> 1 & 0x01;
+        header.wait_ack = opt_idx >> 2 & 0x01;
+        struct iovec vec[1];
+        is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
+        int fixheader_len = 200;
+        pos = (char *)vec[0].iov_base + fixheader_len;
+        uint32_t exp_map_size = 5;
+        /*
+         * header.is_commit flag isn't encoded, since this row looks
+         * like a single-statement transaction.
+         */
+        if (header.wait_sync || header.wait_ack)
+            exp_map_size += 1;
+        /* tsn is encoded explicitly in this case. */
+        if (!header.is_commit)
+            exp_map_size += 1;
+        uint32_t size = mp_decode_map((const char **)&pos);
+        is(size, exp_map_size, "header map size");
+
+        struct xrow_header decoded_header;
+        const char *begin = (const char *)vec[0].iov_base;
+        begin += fixheader_len;
+        const char *end = (const char *)vec[0].iov_base;
+        end += vec[0].iov_len;
+        is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
+           "header decode");
+        is(header.is_commit, decoded_header.is_commit, "decoded 
is_commit");
+        is(header.wait_sync, decoded_header.wait_sync, "decoded 
wait_sync");
+        is(header.wait_ack, decoded_header.wait_ack, "decoded wait_ack");
+        is(header.type, decoded_header.type, "decoded type");
+        is(header.replica_id, decoded_header.replica_id, "decoded 
replica_id");
+        is(header.lsn, decoded_header.lsn, "decoded lsn");
+        is(header.tm, decoded_header.tm, "decoded tm");
+        is(decoded_header.sync, sync, "decoded sync");
+        is(decoded_header.bodycnt, 0, "decoded bodycnt");
+        check_plan();
+    }

      check_plan();
  }
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index 5ee92ad7b..e06ba5261 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -41,17 +41,120 @@
      ok 39 - invalid 10
      ok 40 - invalid 11
  ok 1 - subtests
-    1..10
+    1..9
      ok 1 - bad msgpack end
-    ok 2 - encode
-    ok 3 - header map size
-    ok 4 - header decode
-    ok 5 - decoded type
-    ok 6 - decoded replica_id
-    ok 7 - decoded lsn
-    ok 8 - decoded tm
-    ok 9 - decoded sync
-    ok 10 - decoded bodycnt
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 2 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 3 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 4 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 5 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 6 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 7 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 8 - subtests
+        1..12
+        ok 1 - encode
+        ok 2 - header map size
+        ok 3 - header decode
+        ok 4 - decoded is_commit
+        ok 5 - decoded wait_sync
+        ok 6 - decoded wait_ack
+        ok 7 - decoded type
+        ok 8 - decoded replica_id
+        ok 9 - decoded lsn
+        ok 10 - decoded tm
+        ok 11 - decoded sync
+        ok 12 - decoded bodycnt
+    ok 9 - subtests
  ok 2 - subtests
      1..1
      ok 1 - request_str


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:23 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches



11.04.2021 20:56, Serge Petrenko пишет:
> Start writing the actual leader term together with the PROMOTE request
> and process terms in PROMOTE requests on receiver side.
>
> Make applier only apply synchronous transactions from the instance which
> has the greatest term as received in PROMOTE requests.
>
> Closes #5445
> ---

Force-pushed a test and a couple of fixes. Please find an incremental 
diff below and
the full patch in v2.

==========================
diff --git a/src/box/box.cc b/src/box/box.cc
index 9b6323b3f..aae57ec29 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1507,7 +1507,12 @@ box_clear_synchro_queue(bool try_wait)
          return -1;
      }

-    if (!is_box_configured)
+    /*
+     * Do nothing when box isn't configured and when PROMOTE was already
+     * written for this term.
+     */
+    if (!is_box_configured ||
+        raft_source_term(box_raft(), instance_id) == box_raft()->term)
          return 0;
      uint32_t former_leader_id = txn_limbo.owner_id;
      int64_t wait_lsn = txn_limbo.confirmed_lsn;
@@ -1569,12 +1574,12 @@ promote:
                  .replica_id = 0, /* unused */
                  .origin_id = instance_id,
                  .lsn = wait_lsn,
-                .term = 0, /* unused */
+                .term = box_raft()->term,
              };
              txn_limbo_read_promote(&txn_limbo, &req);
              assert(txn_limbo_is_empty(&txn_limbo));
              raft_source_update_term(box_raft(), req.origin_id,
-                        req.lsn);
+                        req.term);
          }
      }
      in_clear_synchro_queue = false;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index cba45a67d..40c8630e9 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -256,20 +256,29 @@ raft_is_source_allowed(const struct raft *raft, 
uint32_t source_id)
      return !raft->is_enabled || raft->leader == source_id;
  }

-static inline bool
-raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
+/**
+ * Return the latest term as seen in PROMOTE requests from instance with id
+ * @a source_id.
+ */
+static inline uint64_t
+raft_source_term(const struct raft *raft, uint32_t source_id)
  {
-    uint64_t source_term = vclock_get(&raft->term_map, source_id);
-    return raft->is_enabled && source_term < raft->greatest_known_term;
+    assert(source_id != 0 && source_id < VCLOCK_MAX);
+    return vclock_get(&raft->term_map, source_id);
  }

-/** Check if Raft is enabled. */
+/**
+ * Check whether replica with id @a source_id is too old to apply 
synchronous
+ * data from it. The check remains valid  even when elections are disabled.
+ */
  static inline bool
-raft_is_enabled(const struct raft *raft)
+raft_source_has_outdated_term(const struct raft *raft, uint32_t source_id)
  {
-    return raft->is_enabled;
+    uint64_t source_term = vclock_get(&raft->term_map, source_id);
+    return source_term < raft->greatest_known_term;
  }

+/** Remember the last term seen for replica  with id @a source_id. */
  static inline void
  raft_source_update_term(struct raft *raft, uint32_t source_id, 
uint64_t term)
  {
@@ -280,6 +289,13 @@ raft_source_update_term(struct raft *raft, uint32_t 
source_id, uint64_t term)
          raft->greatest_known_term = term;
  }

+/** Check if Raft is enabled. */
+static inline bool
+raft_is_enabled(const struct raft *raft)
+{
+    return raft->is_enabled;
+}
+
  /** Process a raft entry stored in WAL/snapshot. */
  void
  raft_process_recovery(struct raft *raft, const struct raft_msg *req);
diff --git a/test/replication/gh-5445-leader-incosistency.result 
b/test/replication/gh-5445-leader-incosistency.result
new file mode 100644
index 000000000..b1f8a4ed1
--- /dev/null
+++ b/test/replication/gh-5445-leader-incosistency.result
@@ -0,0 +1,238 @@
+-- test-run result file version 2
+test_run = require("test_run").new()
+ | ---
+ | ...
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+ | ---
+ | ...
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+ | ---
+ | - true
+ | ...
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+ | ---
+ | ...
+
+function name(id)
+    return 'election_replica'..id
+end;
+ | ---
+ | ...
+test_run:cmd('setopt delimiter ""');
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader 
returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+ | ---
+ | ...
+leader_nr = get_leader(is_possible_leader)
+ | ---
+ | ...
+leader = name(leader_nr)
+ | ---
+ | ...
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 
3, 1}
+ | ---
+ | ...
+next_leader = name(next_leader_nr)
+ | ---
+ | ...
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+ | ---
+ | ...
+other = name(other_nr)
+ | ---
+ | ...
+
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+box.ctl.wait_rw()
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_sync=true})
+ | ---
+ | ...
+_ = box.space.test:create_index('pk')
+ | ---
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways 
and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..next_leader)
+ | ---
+ | - true
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+ | ---
+ | ...
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+ | ---
+ | ...
+fib:status()
+ | ---
+ | - suspended
+ | ...
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, 
hence the
+-- double switches.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+-- Old leader is gone.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..leader)
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = false
+ | ---
+ | ...
+
+-- Emulate a situation when next_leader wins the elections. It can't do 
that in
+-- this configuration, obviously, because it's behind the 'other' node, 
so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due 
to the
+-- nature of the test and may be ignored (we restart the instance to 
simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 
1"')
+ | ---
+ | - true
+ | ...
+assert(get_leader(is_possible_leader) == next_leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- New leader didn't know about the unconfirmed rows but still rolled 
them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- Make old leader win the elections.
+test_run:cmd('start server '..leader..' with args="1 0.4 candidate 1"')
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = true
+ | ---
+ | ...
+assert(get_leader(is_possible_leader) == leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+test_run:wait_upstream(1, {status='follow'})
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-5445-leader-incosistency.test.lua 
b/test/replication/gh-5445-leader-incosistency.test.lua
new file mode 100644
index 000000000..94beea966
--- /dev/null
+++ b/test/replication/gh-5445-leader-incosistency.test.lua
@@ -0,0 +1,108 @@
+test_run = require("test_run").new()
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+function get_leader(nrs)
+    local leader_nr = 0
+    test_run:wait_cond(function()
+        for nr, do_check in pairs(nrs) do
+            if do_check then
+                local is_leader = test_run:eval('election_replica'..nr,
+                                                is_leader_cmd)[1]
+                if is_leader then
+                    leader_nr = nr
+                    return true
+                end
+            end
+        end
+        return false
+    end)
+    assert(leader_nr ~= 0)
+    return leader_nr
+end;
+
+function name(id)
+    return 'election_replica'..id
+end;
+test_run:cmd('setopt delimiter ""');
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader 
returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+test_run:wait_fullmesh(SERVERS)
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+leader_nr = get_leader(is_possible_leader)
+leader = name(leader_nr)
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 
3, 1}
+next_leader = name(next_leader_nr)
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+other = name(other_nr)
+
+test_run:switch(leader)
+box.ctl.wait_rw()
+_ = box.schema.space.create('test', {is_sync=true})
+_ = box.space.test:create_index('pk')
+box.space.test:insert{1}
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways 
and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+test_run:cmd('stop server '..next_leader)
+test_run:switch(leader)
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+fib:status()
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, 
hence the
+-- double switches.
+test_run:switch('default')
+test_run:switch(other)
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+box.cfg{election_mode='voter'}
+-- Old leader is gone.
+test_run:switch('default')
+test_run:cmd('stop server '..leader)
+is_possible_leader[leader_nr] = false
+
+-- Emulate a situation when next_leader wins the elections. It can't do 
that in
+-- this configuration, obviously, because it's behind the 'other' node, 
so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due 
to the
+-- nature of the test and may be ignored (we restart the instance to 
simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 
1"')
+assert(get_leader(is_possible_leader) == next_leader_nr)
+test_run:switch(other)
+-- New leader didn't know about the unconfirmed rows but still rolled 
them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+
+test_run:switch('default')
+test_run:switch(next_leader)
+box.space.test:select{} -- 1
+box.cfg{election_mode='voter'}
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+test_run:switch('default')
+-- Make old leader win the elections.
+test_run:cmd('start server '..leader..' with args="1 0.4 candidate 1"')
+is_possible_leader[leader_nr] = true
+assert(get_leader(is_possible_leader) == leader_nr)
+test_run:switch(next_leader)
+box.space.test:select{} -- 1
+test_run:wait_upstream(1, {status='follow'})
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index aff5fda26..0a270d3d6 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -17,6 +17,7 @@
      "gh-4424-misc-orphan-on-reconfiguration-error.test.lua": {},
      "gh-5426-election-on-off.test.lua": {},
      "gh-5433-election-restart-recovery.test.lua": {},
+    "gh-5445-leader-incosistency.test.lua": {},
      "gh-5506-election-on-off.test.lua": {},
      "once.test.lua": {},
      "on_replace.test.lua": {},

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()`
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:23 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches



11.04.2021 20:56, Serge Petrenko пишет:
> This patch adds support for manual elections from
> `box.ctl.clear_synchro_queue()`. When an instance is in
> `election_mode='manual'`, calling `clear_synchro_queue()` will make it
> start a new election round.
>
> Follow-up #5445
> Part of #3055
>
> @TarantoolBot document
> Title: describe election_mode='manual'
>
> Manual election mode is introduced. It may be used when the user wants to
> control which instance is the leader explicitly instead of relying on
> Raft election algorithm.
>
> When an instance is configured with `election_mode='manual'`, it behaves
> as follows:
>   1) By default, the instance acts like a voter: it is read-only and may
>      vote for other instances that are candidates.
>   2) Once `box.ctl.clear_synchro_queue()` is called, the instance becomes a
>      candidate and starts a new election round. If the instance wins the
>      elections, it remains leader, but won't participate in any new elections.
> ---
>

Force-pushed a fix for test:

==============
diff --git a/test/box/error.result b/test/box/error.result
index 7761c6949..dad6a21d3 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -441,6 +441,9 @@ t;
   |   220: box.error.TOO_EARLY_SUBSCRIBE
   |   221: box.error.SQL_CANT_ADD_AUTOINC
   |   222: box.error.QUORUM_WAIT
+ |   223: box.error.INTERFERING_PROMOTE
+ |   224: box.error.RAFT_DISABLED
+ |   225: box.error.ALREADY_LEADER
   | ...

  test_run:cmd("setopt delimiter ''");

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote
  2021-04-11 17:56 ` [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-12 19:24   ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-12 19:24 UTC (permalink / raw)
  To: v.shpilevoy, gorcunov; +Cc: tarantool-patches



11.04.2021 20:56, Serge Petrenko пишет:
> New function name will be `box.ctl.promote()`. It's much shorter and
> closer to the function's now enriched functionality.
>
> Old name `box.ctl.clear_synchro_queue()` remains in Lua for the sake of
> backward compatibility.
>
> Follow-up #5445
> Closes #3055
>
> @TarantoolBot document
> Title: deprecate `box.ctl.clear_synchro_queue()` in favor of `box.ctl.promote()`
>
> Replace all the mentions of `box.ctl.clear_synchro_queue()` with
> `box.ctl.promote()` and add a note that `box.ctl.clear_synchro_queue()`
> is a deprecated alias to `box.ctl.promote()`
>

Force-pushed a test.
Please find an incremental diff below and the full patch in v2.

======================================
diff --git a/test/replication/election_basic.result 
b/test/replication/election_basic.result
index d5320b3ff..78c911245 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -108,6 +108,31 @@ assert(box.info.election.leader == box.info.id)
   | - true
   | ...

+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+ | ---
+ | ...
+
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - error: assertion failed!
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - error: assertion failed!
+ | ...
+
  box.cfg{ \
      election_mode = 
'off',                                                      \
      election_timeout = 
old_election_timeout                                     \
diff --git a/test/replication/election_basic.test.lua 
b/test/replication/election_basic.test.lua
index 821f73cea..5fc398848 100644
--- a/test/replication/election_basic.test.lua
+++ b/test/replication/election_basic.test.lua
@@ -39,6 +39,16 @@ assert(box.info.election.term > term)
  assert(box.info.election.vote == box.info.id)
  assert(box.info.election.leader == box.info.id)

+-- Manual election mode. A voter most of the time, a leader once
+-- `box.ctl.promote()` is called.
+box.cfg{election_mode = 'manual'}
+
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(box.info.election.term > term)
+
  box.cfg{ \
      election_mode = 
'off',                                                      \
      election_timeout = 
old_election_timeout                                     \
diff --git a/test/replication/gh-3055-election-promote.result 
b/test/replication/gh-3055-election-promote.result
new file mode 100644
index 000000000..6f5af13bc
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.result
@@ -0,0 +1,105 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+ | ---
+ | ...
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+ | ---
+ | ...
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+ | ---
+ | ...
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+ | ---
+ | - true
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+assert(box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.leader ~= 0)
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-3055-election-promote.test.lua 
b/test/replication/gh-3055-election-promote.test.lua
new file mode 100644
index 000000000..cbc3ed206
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.test.lua
@@ -0,0 +1,43 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+test_run:wait_fullmesh(SERVERS)
+
+cfg_set_manual =\
+    "box.cfg{election_mode='manual'} "..\
+    "assert(box.info.election.state == 'follower') "..\
+    "assert(box.info.ro)"
+
+for _, server in pairs(SERVERS) do\
+    ok, res = test_run:eval(server, cfg_set_manual)\
+    assert(ok)\
+end
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+term = box.info.election.term
+assert(box.info.election.state == 'follower')
+assert(box.info.ro)
+assert(box.info.election.leader ~= 0)
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 0a270d3d6..2fa3c5660 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -2,6 +2,7 @@
      "anon.test.lua": {},
      "anon_register_gap.test.lua": {},
      "gh-2991-misc-asserts-on-update.test.lua": {},
+    "gh-3055-election-promote.test.lua": {},
      "gh-3111-misc-rebootstrap-from-ro-master.test.lua": {},
      "gh-3160-misc-heartbeats-on-master-changes.test.lua": {},
      "gh-3247-misc-iproto-sequence-value-not-replicated.test.lua": {},

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags
  2021-04-12 13:06   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-04-13 13:26     ` Serge Petrenko via Tarantool-patches
  0 siblings, 0 replies; 16+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-13 13:26 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: v.shpilevoy, tarantool-patches



12.04.2021 16:06, Cyrill Gorcunov пишет:
> On Sun, Apr 11, 2021 at 08:55:56PM +0300, Serge Petrenko wrote:
>> diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
>> index d3738c705..f7f46088f 100644
>> --- a/src/box/iproto_constants.h
>> +++ b/src/box/iproto_constants.h
>> @@ -49,9 +49,14 @@ enum {
>>   	XLOG_FIXHEADER_SIZE = 19
>>   };
>>   
>> +/** IPROTO_FLAGS bitfield constants. */
>>   enum {
>>   	/** Set for the last xrow in a transaction. */
>>   	IPROTO_FLAG_COMMIT = 0x01,
>> +	/** Set for the last row of a tx residing in limbo. */
>> +	IPROTO_FLAG_WAIT_SYNC = 0x02,
>> +	/** Set for the last row of a synchronous tx. */
>> +	IPROTO_FLAG_WAIT_ACK = 0x04,
>>   };
> Serge, is there some particular reason why cant we make them 1:1
> mapping to txn flags? Ie
>
> 	IPROTO_FLAG_WAIT_SYNC	= 0x10,
> 	IPROTO_FLAG_WAIT_ACK	= 0x20,
>
> this would allow us to eliminate branching in code. While txn flags
> are not part of API and iproto flags _are_ I don't expect the former
> to be changed anyhow soon?
>
> Not insisting though, just share the idea.

Thanks for the review!
Please find my answers inline and the incremental diff below.

Actually, this can be done. But it would require us to change the 
bitfield to something like
struct {
         bool is_commit : 1;
         int unused : 3;
         bool wait_sync : 1;
         bool wait_ack: 1;
};

Otherwise the mapping between opt_flags and the bit fields would be lost.

I'd rather stick with your other idea (map between iproto_flags and 
txn_flags).
>
>>   
>>   enum iproto_key {
>> diff --git a/src/box/journal.h b/src/box/journal.h
>> index 76c70c19f..9ab8af2c1 100644
>> --- a/src/box/journal.h
>> +++ b/src/box/journal.h
>> @@ -63,6 +63,7 @@ struct journal_entry {
>>   	 * A journal entry completion callback argument.
>>   	 */
>>   	void *complete_data;
>> +	uint8_t opt_flags;
>>   	/**
>>   	 * Asynchronous write completion function.
>>   	 */
>> @@ -97,6 +98,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
>>   	entry->approx_len	= approx_len;
>>   	entry->n_rows		= n_rows;
>>   	entry->res		= -1;
>> +	entry->opt_flags = 0;
>>   }
> Please don't ruine alignment here. I know that Vlad prefer dense style
> (which is actually hard to parse by eyes at least for me) but here we
> have a block which ether should be whole formatted to dense or kept
> aligned.

Sure, thanks for noticing!

>
>>   
>>   /**
>> diff --git a/src/box/txn.c b/src/box/txn.c
>> index 40061ff09..d65315f58 100644
>> --- a/src/box/txn.c
>> +++ b/src/box/txn.c
>> @@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
>>   		row->lsn = 0;
>>   		row->sync = 0;
>>   		row->tm = 0;
>> +		row->opt_flags = 0;
>>   	}
>>   	/*
>>   	 * Group ID should be set both for requests not having a
>> @@ -681,6 +682,10 @@ txn_journal_entry_new(struct txn *txn)
>>   		--req->n_rows;
>>   	}
>>   
>> +	req->opt_flags |=
>> +		(txn_has_flag(txn, TXN_WAIT_SYNC) ? IPROTO_FLAG_WAIT_SYNC : 0) |
>> +		(txn_has_flag(txn, TXN_WAIT_ACK) ? IPROTO_FLAG_WAIT_ACK : 0);
>> +
> When mentioning branching I meant this code. We could use 1:1 mapping
> here if we define iproto flags to be the same as txn flags.
>
> Or say some explicit map could be provided (for readability sake)
>
> 	static const uint8_t opt_flags_map[] = {
> 		[TXN_WAIT_SYNC]		= IPROTO_FLAG_WAIT_SYNC,
> 		[TXN_WAIT_ACK]		= IPROTO_FLAG_WAIT_ACK,
> 	};
>
> 	req->opt_flags |= opt_flags_map[txn->flags & IPROTO_FLAG_WAIT_SYNC];
> 	req->opt_flags |= opt_flags_map[txn->flags & IPROTO_FLAG_WAIT_ACK];
>
> Such technique is widely used when need some mapping between flags and
> instead of number of `if` one simply use read-only memory.

I like this variant, applied.

>
> On the other hands I'm fine with branching here because this is not
> that hot code and even misprediction should not harm much. So just
> to share the ideas.
>
>>   	return req;
>>   }
>>   
>> diff --git a/src/box/wal.c b/src/box/wal.c
>> index 34af0bda6..00fcb21b4 100644
>> --- a/src/box/wal.c
>> +++ b/src/box/wal.c
>> @@ -962,14 +962,14 @@ out:
>>    */
>>   static void
>>   wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>> -	       struct xrow_header **row,
>> -	       struct xrow_header **end)
>> +	       struct journal_entry *entry)
>>   {
>>   	int64_t tsn = 0;
>> -	struct xrow_header **start = row;
>> -	struct xrow_header **first_glob_row = row;
>> +	struct xrow_header **start = entry->rows;
>> +	struct xrow_header **end = entry->rows + entry->n_rows;
>> +	struct xrow_header **first_glob_row = start;
> There is no need to define another dependency, actually I think
> the compiler will figure out that @first_glob_row and @start
> are initialized from same @entry->rows still previously this
> was more clear, iow I propose to leave `first_glob_row = entry->rows;`

Ok, no problem.

>
>>   	/** Assign LSN to all local rows. */
>> -	for ( ; row < end; row++) {
>> +	for (struct xrow_header **row = start; row < end; row++) {
>>   		if ((*row)->replica_id == 0) {
>>   			/*
>>   			 * All rows representing local space data
>> @@ -996,7 +996,13 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>>   				first_glob_row = row;
>>   			}
>>   			(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
>> -			(*row)->is_commit = row == end - 1;
>> +			if (row < end - 1)
>> +				continue;
>> +			/* Tx meta is stored in the last tx row. */
>> +			if (row == end - 1) {
>> +				(*row)->opt_flags = entry->opt_flags;
>> +				(*row)->is_commit = true;
> Why can't we use
> 				(*row)->opt_flags = entry->opt_flags | IPROTO_FLAG_COMMIT;
> instead?

WAL doesn't depend on IPROTO_CONSTANTS and I don't want to introduce 
such a dependency.
OTOH, I don't want it to know of `wait_ack` and `wait_sync` properties 
of transactions/rows,
that's why I have a generic opt_flags field, which WAL blindly assigns 
to the row, and it's
txn_journal_entry_new()'s responsibility to set wait_ack and wait_sync bits.

>
>> +			}
>>   		} else {
>>   			int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
>>   			if (diff <= vclock_get(vclock_diff,
>> @@ -1020,7 +1026,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
>>   	 * the first global row. tsn was yet unknown when those
>>   	 * rows were processed.
>>   	 */
>> -	for (row = start; row < first_glob_row; row++)
>> +	for (struct xrow_header **row = start; row < first_glob_row; row++)
>>   		(*row)->tsn = tsn;
>>   }
>>   
>> @@ -1098,8 +1104,7 @@ wal_write_to_disk(struct cmsg *msg)
>>   	struct journal_entry *entry;
>>   	struct stailq_entry *last_committed = NULL;
>>   	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
>> -		wal_assign_lsn(&vclock_diff, &writer->vclock,
>> -			       entry->rows, entry->rows + entry->n_rows);
>> +		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
>>   		entry->res = vclock_sum(&vclock_diff) +
>>   			     vclock_sum(&writer->vclock);
>>   		rc = xlog_write_entry(l, entry);
>> @@ -1319,8 +1324,7 @@ wal_write_none_async(struct journal *journal,
>>   	struct vclock vclock_diff;
>>   
>>   	vclock_create(&vclock_diff);
>> -	wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
>> -		       entry->rows + entry->n_rows);
>> +	wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
>>   	vclock_merge(&writer->vclock, &vclock_diff);
>>   	vclock_copy(&replicaset.vclock, &writer->vclock);
>>   	entry->res = vclock_sum(&writer->vclock);
>> diff --git a/src/box/xrow.c b/src/box/xrow.c
>> index bc06738ad..cc8e43ed4 100644
>> --- a/src/box/xrow.c
>> +++ b/src/box/xrow.c
>> @@ -183,7 +183,7 @@ error:
>>   			break;
>>   		case IPROTO_FLAGS:
>>   			flags = mp_decode_uint(pos);
>> -			header->is_commit = flags & IPROTO_FLAG_COMMIT;
>> +			header->opt_flags = flags;
>>   			break;
>>   		default:
>>   			/* unknown header */
>> @@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
>>   	 *   flag to find transaction boundary (last row in the
>>   	 *   transaction stream).
>>   	 */
>> +	uint8_t flags_to_encode = header->opt_flags & ~IPROTO_FLAG_COMMIT;
>>   	if (header->tsn != 0) {
>>   		if (header->tsn != header->lsn || !header->is_commit) {
>>   			/*
>> @@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
>>   			map_size++;
>>   		}
>>   		if (header->is_commit && header->tsn != header->lsn) {
>> -			/* Setup last row for multi row transaction. */
>> -			d = mp_encode_uint(d, IPROTO_FLAGS);
>> -			d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
>> -			map_size++;
>> +			flags_to_encode |= IPROTO_FLAG_COMMIT;
>>   		}
>>   	}
>> +	if (flags_to_encode != 0) {
>> +		d = mp_encode_uint(d, IPROTO_FLAGS);
>> +		d = mp_encode_uint(d, flags_to_encode);
>> +		map_size++;
>> +	}
>>   	assert(d <= data + XROW_HEADER_LEN_MAX);
>>   	mp_encode_map(data, map_size);
>>   	out->iov_len = d - (char *) out->iov_base;
>> diff --git a/src/box/xrow.h b/src/box/xrow.h
>> index fde8f9474..2a18733c0 100644
>> --- a/src/box/xrow.h
>> +++ b/src/box/xrow.h
>> @@ -80,14 +80,28 @@ struct xrow_header {
>>   	 * transaction.
>>   	 */
>>   	int64_t tsn;
>> -	/**
>> -	 * True for the last row in a multi-statement transaction,
>> -	 * or single-statement transaction. Is only encoded in the
>> -	 * write ahead log for multi-statement transactions.
>> -	 * Single-statement transactions do not encode
>> -	 * tsn and is_commit flag to save space.
>> -	 */
>> -	bool is_commit;
>> +	/** Transaction meta flags set only in the last transaction row. */
>> +	union {
>> +		uint8_t opt_flags;
>> +		struct {
>> +			/**
>> +			 * Is only encoded in the write ahead log for
>> +			 * multi-statement transactions. Single-statement
>> +			 * transactions do not encode tsn and is_commit flag to
>> +			 * save space.
>> +			 */
>> +			bool is_commit : 1;
>> +			/**
>> +			 * True for a synchronous transaction.
>> +			 */
>> +			bool wait_sync : 1;
>> +			/**
>> +			 * True for any transaction that would enter the limbo
>> +			 * (not necessarily a synchronous one).
>> +			 */
>> +			bool wait_ack  : 1;
>> +		};
>> +	};
> Serge, I know that you mention bitfields in the mail in context of our
> discussion and I assume you'll rework this otherwise this will be
> just a hidden bug which might work for sometime but one day it will
> trigger and we won't even be able to figure out why.

As discussed verbally, let's leave this bitfield in place, and introduce 
some
unit test which would check the assumptions I make about bit field order.

Will the test help, though? What if a person builds tarantool and doesn't
run the tests? Would a Cmake check be enough?

>
>>   
>>   	int bodycnt;
>>   	uint32_t schema_version;
>> -- 
>> 2.24.3 (Apple Git-128)
>>
> 	Cyrill
Incremental diff:
===============================
diff --git a/src/box/journal.h b/src/box/journal.h
index 9ab8af2c1..3ce9c869e 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -98,7 +98,7 @@ journal_entry_create(struct journal_entry *entry, 
size_t n_rows,
         entry->approx_len       = approx_len;
         entry->n_rows           = n_rows;
         entry->res              = -1;
-       entry->opt_flags = 0;
+       entry->opt_flags        = 0;
  }

  /**
diff --git a/src/box/txn.c b/src/box/txn.c
index d65315f58..34e8ce026 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -682,9 +682,13 @@ txn_journal_entry_new(struct txn *txn)
                 --req->n_rows;
         }

-       req->opt_flags |=
-               (txn_has_flag(txn, TXN_WAIT_SYNC) ? 
IPROTO_FLAG_WAIT_SYNC : 0) |
-               (txn_has_flag(txn, TXN_WAIT_ACK) ? IPROTO_FLAG_WAIT_ACK 
: 0);
+       static const uint8_t opt_flags_map[] = {
+               [TXN_WAIT_SYNC] = IPROTO_FLAG_WAIT_SYNC,
+               [TXN_WAIT_ACK] = IPROTO_FLAG_WAIT_ACK,
+       };
+
+       req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_SYNC];
+       req->opt_flags |= opt_flags_map[txn->flags & TXN_WAIT_ACK];

         return req;
  }
diff --git a/src/box/wal.c b/src/box/wal.c
index 00fcb21b4..b7c69fc59 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -967,7 +967,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct 
vclock *base,
         int64_t tsn = 0;
         struct xrow_header **start = entry->rows;
         struct xrow_header **end = entry->rows + entry->n_rows;
-       struct xrow_header **first_glob_row = start;
+       struct xrow_header **first_glob_row = entry->rows;
         /** Assign LSN to all local rows. */
         for (struct xrow_header **row = start; row < end; row++) {
                 if ((*row)->replica_id == 0) {

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-04-13 13:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-11 17:55 [Tarantool-patches] [PATCH 0/9] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-11 17:55 ` [Tarantool-patches] [PATCH 1/9] wal: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
2021-04-12 13:06   ` Cyrill Gorcunov via Tarantool-patches
2021-04-13 13:26     ` Serge Petrenko via Tarantool-patches
2021-04-12 19:21   ` Serge Petrenko via Tarantool-patches
2021-04-11 17:55 ` [Tarantool-patches] [PATCH 2/9] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
2021-04-11 17:55 ` [Tarantool-patches] [PATCH 3/9] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
2021-04-11 17:55 ` [Tarantool-patches] [PATCH 4/9] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
2021-04-11 17:56 ` [Tarantool-patches] [PATCH 5/9] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
2021-04-11 17:56 ` [Tarantool-patches] [PATCH 6/9] raft: keep track of greatest known term and filter replication sources based on that Serge Petrenko via Tarantool-patches
2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
2021-04-11 17:56 ` [Tarantool-patches] [PATCH 7/9] replication: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
2021-04-11 17:56 ` [Tarantool-patches] [PATCH 8/9] Support manual elections in `box.ctl.clear_synchro_queue()` Serge Petrenko via Tarantool-patches
2021-04-12 19:23   ` Serge Petrenko via Tarantool-patches
2021-04-11 17:56 ` [Tarantool-patches] [PATCH 9/9] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
2021-04-12 19:24   ` Serge Petrenko via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox