* [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
@ 2021-04-16 16:25 Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 01/12] wal: make wal_assign_lsn accept journal entry Serge Petrenko via Tarantool-patches
` (15 more replies)
0 siblings, 16 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
Changes in v4:
- review fixes as per review from Vlad
Changes in v3:
- fix gh-5445-leader-inconsistency.test.lua flakiness
- fixes as per review from Cyrill Gorcunov
- minor fixes and rewordings
- rebased on top of current master
- added patch 9/10 (remove parameter from clear_synchro_queue)
Changes in v2:
- Added tests for patches 1, 6, 9
- Minor typo fixes and bugfixes.
https://github.com/tarantool/tarantool/tree/sp/gh-5445-election-fixes
https://github.com/tarantool/tarantool/issues/5445
https://github.com/tarantool/tarantool/issues/3055
Serge Petrenko (12):
wal: make wal_assign_lsn accept journal entry
xrow: enrich row's meta information with sync replication flags
xrow: introduce a PROMOTE entry
box: actualise iproto_key_type array
box: make clear_synchro_queue() write a PROMOTE entry instead of
CONFIRM + ROLLBACK
box: write PROMOTE even for empty limbo
raft: filter rows based on known peer terms
election: introduce a new election mode: "manual"
raft: introduce raft_start/stop_candidate
election: support manual elections in clear_synchro_queue()
box: remove parameter from clear_synchro_queue
box.ctl: rename clear_synchro_queue to promote
changelogs/unreleased/box-ctl-promote.md | 8 +
...very => qsync-multi-statement-recovery.md} | 0
changelogs/unreleased/raft-promote.md | 4 +
src/box/applier.cc | 22 ++
src/box/box.cc | 161 +++++++---
src/box/box.h | 2 +-
src/box/errcode.h | 2 +
src/box/iproto_constants.c | 58 ++++
src/box/iproto_constants.h | 31 +-
src/box/journal.h | 3 +
src/box/lua/ctl.c | 8 +-
src/box/raft.c | 37 ++-
src/box/raft.h | 20 ++
src/box/txn.c | 9 +
src/box/txn_limbo.c | 85 ++---
src/box/txn_limbo.h | 9 +-
src/box/wal.c | 24 +-
src/box/xrow.c | 58 ++--
src/box/xrow.h | 61 ++--
src/lib/raft/raft.c | 84 +++--
src/lib/raft/raft.h | 59 ++++
test/box/error.result | 2 +
test/replication/election_basic.result | 4 +-
.../gh-3055-election-promote.result | 105 +++++++
.../gh-3055-election-promote.test.lua | 43 +++
.../gh-5445-leader-inconsistency.result | 292 ++++++++++++++++++
.../gh-5445-leader-inconsistency.test.lua | 129 ++++++++
test/replication/suite.cfg | 2 +
test/unit/raft.c | 66 +++-
test/unit/raft.result | 23 +-
test/unit/xrow.cc | 104 +++++--
test/unit/xrow.result | 133 +++++++-
32 files changed, 1437 insertions(+), 211 deletions(-)
create mode 100644 changelogs/unreleased/box-ctl-promote.md
rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
create mode 100644 changelogs/unreleased/raft-promote.md
create mode 100644 test/replication/gh-3055-election-promote.result
create mode 100644 test/replication/gh-3055-election-promote.test.lua
create mode 100644 test/replication/gh-5445-leader-inconsistency.result
create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 01/12] wal: make wal_assign_lsn accept journal entry
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 02/12] xrow: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
` (14 subsequent siblings)
15 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
Refactor wal_assign_lsn() to accept a journal entry instead of a pair of
pointers delimiting the entry's rows.
The journal entry will soon carry additional meta information for the
last row, and wal_assign_lsn() will need access to it.
Prerequisite #5445
---
src/box/wal.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
diff --git a/src/box/wal.c b/src/box/wal.c
index 34af0bda6..95ee8e200 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -962,14 +962,14 @@ out:
*/
static void
wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
- struct xrow_header **row,
- struct xrow_header **end)
+ struct journal_entry *entry)
{
int64_t tsn = 0;
- struct xrow_header **start = row;
- struct xrow_header **first_glob_row = row;
+ struct xrow_header **start = entry->rows;
+ struct xrow_header **end = entry->rows + entry->n_rows;
+ struct xrow_header **first_glob_row = entry->rows;
/** Assign LSN to all local rows. */
- for ( ; row < end; row++) {
+ for (struct xrow_header **row = start; row < end; row++) {
if ((*row)->replica_id == 0) {
/*
* All rows representing local space data
@@ -1020,7 +1020,7 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
* the first global row. tsn was yet unknown when those
* rows were processed.
*/
- for (row = start; row < first_glob_row; row++)
+ for (struct xrow_header **row = start; row < first_glob_row; row++)
(*row)->tsn = tsn;
}
@@ -1098,8 +1098,7 @@ wal_write_to_disk(struct cmsg *msg)
struct journal_entry *entry;
struct stailq_entry *last_committed = NULL;
stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
- wal_assign_lsn(&vclock_diff, &writer->vclock,
- entry->rows, entry->rows + entry->n_rows);
+ wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
entry->res = vclock_sum(&vclock_diff) +
vclock_sum(&writer->vclock);
rc = xlog_write_entry(l, entry);
@@ -1319,8 +1318,7 @@ wal_write_none_async(struct journal *journal,
struct vclock vclock_diff;
vclock_create(&vclock_diff);
- wal_assign_lsn(&vclock_diff, &writer->vclock, entry->rows,
- entry->rows + entry->n_rows);
+ wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
vclock_merge(&writer->vclock, &vclock_diff);
vclock_copy(&replicaset.vclock, &writer->vclock);
entry->res = vclock_sum(&writer->vclock);
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 02/12] xrow: enrich row's meta information with sync replication flags
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 01/12] wal: make wal_assign_lsn accept journal entry Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 03/12] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
` (13 subsequent siblings)
15 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
Introduce two new flags to xrow_header: `wait_ack` and `wait_sync`.
These flags are set for rows belonging to synchronous transactions in
addition to `is_commit`.
The new flags make it possible to tell whether a set of rows belongs to
a synchronous transaction without parsing every row and checking whether
any of them touches a synchronous space.
This will be used in the applier once it is taught to filter synchronous
transactions based on whether they come from a raft leader or not.
P.S. These flags will also be useful once we allow making any transaction
synchronous. Once this is done, the flags in the row header will be the
only source of information on whether the transaction is synchronous or not.
Prerequisite #5445
@TarantoolBot document
Title: new values for IPROTO_FLAGS field
IPROTO_FLAGS bitfield is enriched with two new constants:
IPROTO_FLAG_WAIT_SYNC = 0x02
IPROTO_FLAG_WAIT_ACK = 0x04
IPROTO_FLAG_WAIT_SYNC is set for the last row of a transaction which
cannot be committed immediately: either because it is synchronous or
because it waits for other synchronous transactions to complete.
IPROTO_FLAG_WAIT_ACK is set for the last row of a synchronous transaction.
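To illustrate the description above, here is a minimal, standalone sketch of
how a reader of the last transaction row might interpret the flags. The flag
values are the ones introduced by this patch; the helper names
(flags_tx_waits_sync, flags_tx_is_synchronous, flags_for_last_row) are
hypothetical and exist only for this example. It works on the in-memory flag
byte and says nothing about when the byte is actually encoded on the wire.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as introduced by this patch. */
enum {
	IPROTO_FLAG_COMMIT = 0x01,
	IPROTO_FLAG_WAIT_SYNC = 0x02,
	IPROTO_FLAG_WAIT_ACK = 0x04,
};

/* Hypothetical helper: the transaction cannot be committed immediately. */
static bool
flags_tx_waits_sync(uint8_t flags)
{
	return (flags & IPROTO_FLAG_WAIT_SYNC) != 0;
}

/* Hypothetical helper: the transaction itself is synchronous. */
static bool
flags_tx_is_synchronous(uint8_t flags)
{
	return (flags & IPROTO_FLAG_WAIT_ACK) != 0;
}

/*
 * Hypothetical helper: flags for the last row of a transaction.
 * A synchronous transaction gets both WAIT_SYNC and WAIT_ACK; an
 * asynchronous one queued behind synchronous transactions gets
 * WAIT_SYNC only.
 */
static uint8_t
flags_for_last_row(bool is_sync, bool waits_for_sync_txns)
{
	uint8_t flags = IPROTO_FLAG_COMMIT;
	if (is_sync)
		flags |= IPROTO_FLAG_WAIT_SYNC | IPROTO_FLAG_WAIT_ACK;
	else if (waits_for_sync_txns)
		flags |= IPROTO_FLAG_WAIT_SYNC;
	return flags;
}
```

With these helpers a receiver can classify a transaction from the last row
alone, which is the point of the new flags.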
---
src/box/iproto_constants.h | 5 ++
src/box/journal.h | 3 +
src/box/txn.c | 9 +++
src/box/wal.c | 6 +-
src/box/xrow.c | 13 ++--
src/box/xrow.h | 30 ++++++---
test/unit/xrow.cc | 104 +++++++++++++++++++++++------
test/unit/xrow.result | 133 ++++++++++++++++++++++++++++++++++---
8 files changed, 256 insertions(+), 47 deletions(-)
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index b07a73b20..e9d1ef5d6 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -49,9 +49,14 @@ enum {
XLOG_FIXHEADER_SIZE = 19
};
+/** IPROTO_FLAGS bitfield constants. */
enum {
/** Set for the last xrow in a transaction. */
IPROTO_FLAG_COMMIT = 0x01,
+ /** Set for the last row of a tx residing in limbo. */
+ IPROTO_FLAG_WAIT_SYNC = 0x02,
+ /** Set for the last row of a synchronous tx. */
+ IPROTO_FLAG_WAIT_ACK = 0x04,
};
enum iproto_key {
diff --git a/src/box/journal.h b/src/box/journal.h
index 76c70c19f..8f3d56a61 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -63,6 +63,8 @@ struct journal_entry {
* A journal entry completion callback argument.
*/
void *complete_data;
+ /** Flags that should be set for the last entry row. */
+ uint8_t flags;
/**
* Asynchronous write completion function.
*/
@@ -97,6 +99,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
entry->approx_len = approx_len;
entry->n_rows = n_rows;
entry->res = -1;
+ entry->flags = 0;
}
/**
diff --git a/src/box/txn.c b/src/box/txn.c
index c56725cea..a71ccadd0 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -76,6 +76,7 @@ txn_add_redo(struct txn *txn, struct txn_stmt *stmt, struct request *request)
row->lsn = 0;
row->sync = 0;
row->tm = 0;
+ row->flags = 0;
}
/*
* Group ID should be set both for requests not having a
@@ -667,6 +668,14 @@ txn_journal_entry_new(struct txn *txn)
--req->n_rows;
}
+ static const uint8_t flags_map[] = {
+ [TXN_WAIT_SYNC] = IPROTO_FLAG_WAIT_SYNC,
+ [TXN_WAIT_ACK] = IPROTO_FLAG_WAIT_ACK,
+ };
+
+ req->flags |= flags_map[txn->flags & TXN_WAIT_SYNC];
+ req->flags |= flags_map[txn->flags & TXN_WAIT_ACK];
+
return req;
}
diff --git a/src/box/wal.c b/src/box/wal.c
index 95ee8e200..5b6200b81 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -996,7 +996,11 @@ wal_assign_lsn(struct vclock *vclock_diff, struct vclock *base,
first_glob_row = row;
}
(*row)->tsn = tsn == 0 ? (*start)->lsn : tsn;
- (*row)->is_commit = row == end - 1;
+ /* Tx meta is stored in the last tx row. */
+ if (row == end - 1) {
+ (*row)->flags = entry->flags;
+ (*row)->is_commit = true;
+ }
} else {
int64_t diff = (*row)->lsn - vclock_get(base, (*row)->replica_id);
if (diff <= vclock_get(vclock_diff,
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 7368eccff..35e1d1c20 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -183,7 +183,7 @@ error:
break;
case IPROTO_FLAGS:
flags = mp_decode_uint(pos);
- header->is_commit = flags & IPROTO_FLAG_COMMIT;
+ header->flags = flags;
break;
default:
/* unknown header */
@@ -299,6 +299,7 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
* flag to find transaction boundary (last row in the
* transaction stream).
*/
+ uint8_t flags_to_encode = header->flags & ~IPROTO_FLAG_COMMIT;
if (header->tsn != 0) {
if (header->tsn != header->lsn || !header->is_commit) {
/*
@@ -314,12 +315,14 @@ xrow_header_encode(const struct xrow_header *header, uint64_t sync,
map_size++;
}
if (header->is_commit && header->tsn != header->lsn) {
- /* Setup last row for multi row transaction. */
- d = mp_encode_uint(d, IPROTO_FLAGS);
- d = mp_encode_uint(d, IPROTO_FLAG_COMMIT);
- map_size++;
+ flags_to_encode |= IPROTO_FLAG_COMMIT;
}
}
+ if (flags_to_encode != 0) {
+ d = mp_encode_uint(d, IPROTO_FLAGS);
+ d = mp_encode_uint(d, flags_to_encode);
+ map_size++;
+ }
assert(d <= data + XROW_HEADER_LEN_MAX);
mp_encode_map(data, map_size);
out->iov_len = d - (char *) out->iov_base;
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 69337a226..5ea99e792 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -80,14 +80,28 @@ struct xrow_header {
* transaction.
*/
int64_t tsn;
- /**
- * True for the last row in a multi-statement transaction,
- * or single-statement transaction. Is only encoded in the
- * write ahead log for multi-statement transactions.
- * Single-statement transactions do not encode
- * tsn and is_commit flag to save space.
- */
- bool is_commit;
+ /** Transaction meta flags set only in the last transaction row. */
+ union {
+ uint8_t flags;
+ struct {
+ /**
+ * Is only encoded in the write ahead log for
+ * multi-statement transactions. Single-statement
+ * transactions do not encode tsn and is_commit flag to
+ * save space.
+ */
+ bool is_commit : 1;
+ /**
+ * True for any transaction that would enter the limbo
+ * (not necessarily a synchronous one).
+ */
+ bool wait_sync : 1;
+ /**
+ * True for a synchronous transaction.
+ */
+ bool wait_ack : 1;
+ };
+ };
int bodycnt;
uint32_t schema_version;
diff --git a/test/unit/xrow.cc b/test/unit/xrow.cc
index 9fd154719..b6018eed9 100644
--- a/test/unit/xrow.cc
+++ b/test/unit/xrow.cc
@@ -204,7 +204,9 @@ test_greeting()
void
test_xrow_header_encode_decode()
{
- plan(10);
+ /* Test all possible 3-bit combinations. */
+ const int bit_comb_count = 1 << 3;
+ plan(1 + bit_comb_count);
struct xrow_header header;
char buffer[2048];
char *pos = mp_encode_uint(buffer, 300);
@@ -217,27 +219,47 @@ test_xrow_header_encode_decode()
header.tm = 123.456;
header.bodycnt = 0;
header.tsn = header.lsn;
- header.is_commit = true;
uint64_t sync = 100500;
- struct iovec vec[1];
- is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
- int fixheader_len = 200;
- pos = (char *)vec[0].iov_base + fixheader_len;
- is(mp_decode_map((const char **)&pos), 5, "header map size");
-
- struct xrow_header decoded_header;
- const char *begin = (const char *)vec[0].iov_base;
- begin += fixheader_len;
- const char *end = (const char *)vec[0].iov_base;
- end += vec[0].iov_len;
- is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
- "header decode");
- is(header.type, decoded_header.type, "decoded type");
- is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
- is(header.lsn, decoded_header.lsn, "decoded lsn");
- is(header.tm, decoded_header.tm, "decoded tm");
- is(decoded_header.sync, sync, "decoded sync");
- is(decoded_header.bodycnt, 0, "decoded bodycnt");
+ for (int opt_idx = 0; opt_idx < bit_comb_count; opt_idx++) {
+ plan(12);
+ header.is_commit = opt_idx & 0x01;
+ header.wait_sync = opt_idx >> 1 & 0x01;
+ header.wait_ack = opt_idx >> 2 & 0x01;
+ struct iovec vec[1];
+ is(1, xrow_header_encode(&header, sync, vec, 200), "encode");
+ int fixheader_len = 200;
+ pos = (char *)vec[0].iov_base + fixheader_len;
+ uint32_t exp_map_size = 5;
+ /*
+ * header.is_commit flag isn't encoded, since this row looks
+ * like a single-statement transaction.
+ */
+ if (header.wait_sync || header.wait_ack)
+ exp_map_size += 1;
+ /* tsn is encoded explicitly in this case. */
+ if (!header.is_commit)
+ exp_map_size += 1;
+ uint32_t size = mp_decode_map((const char **)&pos);
+ is(size, exp_map_size, "header map size");
+
+ struct xrow_header decoded_header;
+ const char *begin = (const char *)vec[0].iov_base;
+ begin += fixheader_len;
+ const char *end = (const char *)vec[0].iov_base;
+ end += vec[0].iov_len;
+ is(xrow_header_decode(&decoded_header, &begin, end, true), 0,
+ "header decode");
+ is(header.is_commit, decoded_header.is_commit, "decoded is_commit");
+ is(header.wait_sync, decoded_header.wait_sync, "decoded wait_sync");
+ is(header.wait_ack, decoded_header.wait_ack, "decoded wait_ack");
+ is(header.type, decoded_header.type, "decoded type");
+ is(header.replica_id, decoded_header.replica_id, "decoded replica_id");
+ is(header.lsn, decoded_header.lsn, "decoded lsn");
+ is(header.tm, decoded_header.tm, "decoded tm");
+ is(decoded_header.sync, sync, "decoded sync");
+ is(decoded_header.bodycnt, 0, "decoded bodycnt");
+ check_plan();
+ }
check_plan();
}
@@ -275,12 +297,49 @@ test_request_str()
check_plan();
}
+/**
+ * The compiler doesn't have to preserve bitfields order,
+ * still we rely on it for convenience sake.
+ */
+static void
+test_xrow_fields()
+{
+ plan(6);
+
+ struct xrow_header header;
+
+ memset(&header, 0, sizeof(header));
+
+ header.is_commit = true;
+ is(header.flags, IPROTO_FLAG_COMMIT, "header.is_commit -> COMMIT");
+ header.is_commit = false;
+
+ header.wait_sync = true;
+ is(header.flags, IPROTO_FLAG_WAIT_SYNC, "header.wait_sync -> WAIT_SYNC");
+ header.wait_sync = false;
+
+ header.wait_ack = true;
+ is(header.flags, IPROTO_FLAG_WAIT_ACK, "header.wait_ack -> WAIT_ACK");
+ header.wait_ack = false;
+
+ header.flags = IPROTO_FLAG_COMMIT;
+ ok(header.is_commit && !header.wait_sync && !header.wait_ack, "COMMIT -> header.is_commit");
+
+ header.flags = IPROTO_FLAG_WAIT_SYNC;
+ ok(!header.is_commit && header.wait_sync && !header.wait_ack, "WAIT_SYNC -> header.wait_sync");
+
+ header.flags = IPROTO_FLAG_WAIT_ACK;
+ ok(!header.is_commit && !header.wait_sync && header.wait_ack, "WAIT_ACK -> header.wait_ack");
+
+ check_plan();
+}
+
int
main(void)
{
memory_init();
fiber_init(fiber_c_invoke);
- plan(3);
+ plan(4);
random_init();
@@ -288,6 +347,7 @@ main(void)
test_greeting();
test_xrow_header_encode_decode();
test_request_str();
+ test_xrow_fields();
random_free();
fiber_free();
diff --git a/test/unit/xrow.result b/test/unit/xrow.result
index 5ee92ad7b..3b705d5ba 100644
--- a/test/unit/xrow.result
+++ b/test/unit/xrow.result
@@ -1,4 +1,4 @@
-1..3
+1..4
1..40
ok 1 - round trip
ok 2 - roundtrip.version_id
@@ -41,18 +41,129 @@
ok 39 - invalid 10
ok 40 - invalid 11
ok 1 - subtests
- 1..10
+ 1..9
ok 1 - bad msgpack end
- ok 2 - encode
- ok 3 - header map size
- ok 4 - header decode
- ok 5 - decoded type
- ok 6 - decoded replica_id
- ok 7 - decoded lsn
- ok 8 - decoded tm
- ok 9 - decoded sync
- ok 10 - decoded bodycnt
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 2 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 3 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 4 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 5 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 6 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 7 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 8 - subtests
+ 1..12
+ ok 1 - encode
+ ok 2 - header map size
+ ok 3 - header decode
+ ok 4 - decoded is_commit
+ ok 5 - decoded wait_sync
+ ok 6 - decoded wait_ack
+ ok 7 - decoded type
+ ok 8 - decoded replica_id
+ ok 9 - decoded lsn
+ ok 10 - decoded tm
+ ok 11 - decoded sync
+ ok 12 - decoded bodycnt
+ ok 9 - subtests
ok 2 - subtests
1..1
ok 1 - request_str
ok 3 - subtests
+ 1..6
+ ok 1 - header.is_commit -> COMMIT
+ ok 2 - header.wait_sync -> WAIT_SYNC
+ ok 3 - header.wait_ack -> WAIT_ACK
+ ok 4 - COMMIT -> header.is_commit
+ ok 5 - WAIT_SYNC -> header.wait_sync
+ ok 6 - WAIT_ACK -> header.wait_ack
+ok 4 - subtests
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 03/12] xrow: introduce a PROMOTE entry
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 01/12] wal: make wal_assign_lsn accept journal entry Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 02/12] xrow: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 04/12] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
` (12 subsequent siblings)
15 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
A PROMOTE entry combines the effects of CONFIRM, ROLLBACK and RAFT_TERM
entries with some additional semantics on top.
PROMOTE carries the following arguments:
1) former_leader_id - the id of the previous limbo owner whose entries we
want to confirm.
2) confirm_lsn - the lsn of the former leader's last transaction to be
confirmed. In this sense PROMOTE(confirm_lsn) replaces
CONFIRM(confirm_lsn) + ROLLBACK(confirm_lsn + 1).
3) replica_id - the id of the instance issuing
`box.ctl.clear_synchro_queue()`.
4) term - the new term the instance issuing
`box.ctl.clear_synchro_queue()` has just entered.
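The equivalence in item 2 can be sketched as follows. This is only an
illustration, assuming (as the synchro_request comments in xrow.h put it)
that CONFIRM(lsn) covers all transactions with lsn <= value and
ROLLBACK(lsn) covers all transactions with lsn >= value. The names
txn_fate, promote_txn_fate and promote_count_rollbacks are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of a pending transaction of the former leader. */
enum txn_fate {
	TXN_FATE_CONFIRM,
	TXN_FATE_ROLLBACK,
};

/*
 * PROMOTE(confirm_lsn) acts like CONFIRM(confirm_lsn) followed by
 * ROLLBACK(confirm_lsn + 1): everything up to and including
 * confirm_lsn is confirmed, everything after it is rolled back.
 */
static enum txn_fate
promote_txn_fate(int64_t txn_lsn, int64_t confirm_lsn)
{
	return txn_lsn <= confirm_lsn ? TXN_FATE_CONFIRM : TXN_FATE_ROLLBACK;
}

/* Count how many pending transactions a PROMOTE would roll back. */
static size_t
promote_count_rollbacks(const int64_t *pending_lsns, size_t count,
			int64_t confirm_lsn)
{
	size_t rolled_back = 0;
	for (size_t i = 0; i < count; i++) {
		if (promote_txn_fate(pending_lsns[i], confirm_lsn) ==
		    TXN_FATE_ROLLBACK)
			rolled_back++;
	}
	return rolled_back;
}
```

For pending lsns 10, 11, 12, 13 and confirm_lsn = 11, the first two are
confirmed and the last two are rolled back.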
This entry will be written to WAL instead of the usual CONFIRM +
ROLLBACK pair on a successful `box.ctl.clear_synchro_queue()` call.
Note, the usual CONFIRM and ROLLBACK occurrences (after a confirmed or
rolled back synchronous transaction) are here to stay.
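Since the synchro body is no longer a fixed-size packed struct, it may help
to check that the new XROW_SYNCHRO_BODY_LEN_MAX = 32 leaves room for the
largest PROMOTE body. The sketch below is a back-of-the-envelope calculation,
not code from the patch: msgpack_sizeof_uint and promote_body_len are
hand-written, hypothetical helpers that follow the MessagePack size rules for
unsigned integers, and every key is assumed to fit into a one-byte positive
fixnum (keys are below 0x7f).

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of a MessagePack-encoded unsigned integer. */
static size_t
msgpack_sizeof_uint(uint64_t value)
{
	if (value <= 0x7f)
		return 1; /* positive fixint */
	if (value <= UINT8_MAX)
		return 2; /* uint 8 */
	if (value <= UINT16_MAX)
		return 3; /* uint 16 */
	if (value <= UINT32_MAX)
		return 5; /* uint 32 */
	return 9; /* uint 64 */
}

/*
 * Encoded length of a PROMOTE body: a 3-element map with the
 * replica id, lsn and term. Each key takes one byte (assumption:
 * all keys are positive fixnums).
 */
static size_t
promote_body_len(uint32_t replica_id, uint64_t lsn, uint64_t term)
{
	const size_t key_len = 1;
	size_t len = 1; /* fixmap header, valid for up to 15 pairs */
	len += key_len + msgpack_sizeof_uint(replica_id);
	len += key_len + msgpack_sizeof_uint(lsn);
	len += key_len + msgpack_sizeof_uint(term);
	return len;
}
```

The worst case is 1 + (1 + 5) + (1 + 9) + (1 + 9) = 27 bytes, which is
below 32.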
Part of #5445
---
src/box/iproto_constants.h | 26 ++++++++++++++++++++--
src/box/txn_limbo.c | 4 ++--
src/box/xrow.c | 45 ++++++++++++++++++++++++--------------
src/box/xrow.h | 31 ++++++++++++++------------
4 files changed, 71 insertions(+), 35 deletions(-)
diff --git a/src/box/iproto_constants.h b/src/box/iproto_constants.h
index e9d1ef5d6..99c8ca184 100644
--- a/src/box/iproto_constants.h
+++ b/src/box/iproto_constants.h
@@ -132,6 +132,18 @@ enum iproto_key {
IPROTO_REPLICA_ANON = 0x50,
IPROTO_ID_FILTER = 0x51,
IPROTO_ERROR = 0x52,
+ /**
+ * Term. Has the same meaning as IPROTO_RAFT_TERM, but is an iproto
+ * key, rather than a raft key. Used for PROMOTE request, which needs
+ * both iproto (e.g. REPLICA_ID) and raft (RAFT_TERM) keys.
+ */
+ IPROTO_TERM = 0x53,
+ /*
+ * Be careful to not extend iproto_key values over 0x7f.
+ * iproto_keys are encoded in msgpack as positive fixnum, which ends at
+ * 0x7f, and we rely on this in some places by allocating a uint8_t to
+ * hold a msgpack-encoded key value.
+ */
IPROTO_KEY_MAX
};
@@ -226,6 +238,8 @@ enum iproto_type {
IPROTO_TYPE_STAT_MAX,
IPROTO_RAFT = 30,
+ /** PROMOTE request. */
+ IPROTO_PROMOTE = 31,
/** A confirmation message for synchronous transactions. */
IPROTO_CONFIRM = 40,
@@ -340,11 +354,19 @@ dml_request_key_map(uint16_t type)
return iproto_body_key_map[type];
}
-/** CONFIRM/ROLLBACK entries for synchronous replication. */
+/** Synchronous replication entries: CONFIRM/ROLLBACK/PROMOTE. */
static inline bool
iproto_type_is_synchro_request(uint16_t type)
{
- return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK;
+ return type == IPROTO_CONFIRM || type == IPROTO_ROLLBACK ||
+ type == IPROTO_PROMOTE;
+}
+
+/** PROMOTE entry (synchronous replication and leader elections). */
+static inline bool
+iproto_type_is_promote_request(uint32_t type)
+{
+ return type == IPROTO_PROMOTE;
}
static inline bool
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index addcb0f97..c96e497c6 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -331,7 +331,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn)
* This is a synchronous commit so we can
* allocate everything on a stack.
*/
- struct synchro_body_bin body;
+ char body[XROW_SYNCHRO_BODY_LEN_MAX];
struct xrow_header row;
char buf[sizeof(struct journal_entry) +
sizeof(struct xrow_header *)];
@@ -339,7 +339,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn)
struct journal_entry *entry = (struct journal_entry *)buf;
entry->rows[0] = &row;
- xrow_encode_synchro(&row, &body, &req);
+ xrow_encode_synchro(&row, body, &req);
journal_entry_create(entry, 1, xrow_approx_len(&row),
txn_limbo_write_cb, fiber());
diff --git a/src/box/xrow.c b/src/box/xrow.c
index 35e1d1c20..2e364cea5 100644
--- a/src/box/xrow.c
+++ b/src/box/xrow.c
@@ -885,28 +885,33 @@ xrow_encode_dml(const struct request *request, struct region *region,
}
void
-xrow_encode_synchro(struct xrow_header *row,
- struct synchro_body_bin *body,
+xrow_encode_synchro(struct xrow_header *row, char *body,
const struct synchro_request *req)
{
- /*
- * A map with two elements. We don't compress
- * numbers to have this structure constant in size,
- * which allows us to preallocate it on stack.
- */
- body->m_body = 0x80 | 2;
- body->k_replica_id = IPROTO_REPLICA_ID;
- body->m_replica_id = 0xce;
- body->v_replica_id = mp_bswap_u32(req->replica_id);
- body->k_lsn = IPROTO_LSN;
- body->m_lsn = 0xcf;
- body->v_lsn = mp_bswap_u64(req->lsn);
+ assert(iproto_type_is_synchro_request(req->type));
- memset(row, 0, sizeof(*row));
+ char *pos = body;
+
+ pos = mp_encode_map(pos,
+ iproto_type_is_promote_request(req->type) ? 3 : 2);
+ pos = mp_encode_uint(pos, IPROTO_REPLICA_ID);
+ pos = mp_encode_uint(pos, req->replica_id);
+
+ pos = mp_encode_uint(pos, IPROTO_LSN);
+ pos = mp_encode_uint(pos, req->lsn);
+
+ if (iproto_type_is_promote_request(req->type)) {
+ pos = mp_encode_uint(pos, IPROTO_TERM);
+ pos = mp_encode_uint(pos, req->term);
+ }
+
+ assert(pos - body < XROW_SYNCHRO_BODY_LEN_MAX);
+
+ memset(row, 0, sizeof(*row));
row->type = req->type;
- row->body[0].iov_base = (void *)body;
- row->body[0].iov_len = sizeof(*body);
+ row->body[0].iov_base = body;
+ row->body[0].iov_len = pos - body;
row->bodycnt = 1;
}
@@ -952,11 +957,17 @@ xrow_decode_synchro(const struct xrow_header *row, struct synchro_request *req)
case IPROTO_LSN:
req->lsn = mp_decode_uint(&d);
break;
+ case IPROTO_TERM:
+ req->term = mp_decode_uint(&d);
+ break;
default:
mp_next(&d);
}
}
+
req->type = row->type;
+ req->origin_id = row->replica_id;
+
return 0;
}
diff --git a/src/box/xrow.h b/src/box/xrow.h
index 5ea99e792..b3c664be2 100644
--- a/src/box/xrow.h
+++ b/src/box/xrow.h
@@ -49,6 +49,7 @@ enum {
XROW_IOVMAX = XROW_HEADER_IOVMAX + XROW_BODY_IOVMAX,
XROW_HEADER_LEN_MAX = 52,
XROW_BODY_LEN_MAX = 256,
+ XROW_SYNCHRO_BODY_LEN_MAX = 32,
IPROTO_HEADER_LEN = 28,
/** 7 = sizeof(iproto_body_bin). */
IPROTO_SELECT_HEADER_LEN = IPROTO_HEADER_LEN + 7,
@@ -226,7 +227,10 @@ xrow_encode_dml(const struct request *request, struct region *region,
* pending synchronous transactions.
*/
struct synchro_request {
- /** Operation type - IPROTO_ROLLBACK or IPROTO_CONFIRM. */
+ /**
+ * Operation type - either IPROTO_ROLLBACK or IPROTO_CONFIRM or
+ * IPROTO_PROMOTE
+ */
uint16_t type;
/**
* ID of the instance owning the pending transactions.
@@ -236,25 +240,25 @@ struct synchro_request {
* finish transactions of an old master.
*/
uint32_t replica_id;
+ /**
+ * Id of the instance which has issued this request. Only filled on
+ * decoding, and left blank when encoding a request.
+ */
+ uint32_t origin_id;
/**
* Operation LSN.
* In case of CONFIRM it means 'confirm all
* transactions with lsn <= this value'.
* In case of ROLLBACK it means 'rollback all transactions
* with lsn >= this value'.
+ * In case of PROMOTE it means CONFIRM(lsn) + ROLLBACK(lsn+1)
*/
int64_t lsn;
-};
-
-/** Synchro request xrow's body in MsgPack format. */
-struct PACKED synchro_body_bin {
- uint8_t m_body;
- uint8_t k_replica_id;
- uint8_t m_replica_id;
- uint32_t v_replica_id;
- uint8_t k_lsn;
- uint8_t m_lsn;
- uint64_t v_lsn;
+ /**
+ * The new term the instance issuing this request is in. Only used for
+ * PROMOTE request.
+ */
+ uint64_t term;
};
/**
@@ -264,8 +268,7 @@ struct PACKED synchro_body_bin {
* @param req Request parameters.
*/
void
-xrow_encode_synchro(struct xrow_header *row,
- struct synchro_body_bin *body,
+xrow_encode_synchro(struct xrow_header *row, char *body,
const struct synchro_request *req);
/**
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 04/12] box: actualise iproto_key_type array
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (2 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 03/12] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
` (11 subsequent siblings)
15 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
The iproto_key_type array is used while validating incoming requests, but
it was only half-filled. The last initialized element was 0x2b, while
IPROTO_KEY_MAX is currently 0x54.
We got away with it, since the array is only used in xrow_header_decode(),
xrow_decode_dml() and xrow_decode_synchro(), and all the keys usually present
in these requests were present in the array. This is not true anymore,
so it's time to bring the array contents up to date with all the
IPROTO_KEY_* constants we have.
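The problem with a half-filled table can be shown with a small standalone
sketch. Everything in it is hypothetical (the value_type enum, the KEY_*
constants and key_type_is_valid); it only relies on the C rule that array
elements without an initializer are zero-initialized, and it assumes that
zero stands for a "nil" type that no real key is declared with and that
out-of-range keys are rejected.

```c
#include <stdbool.h>

/* Hypothetical value types; zero plays the role of "nil". */
enum value_type {
	TYPE_NIL = 0,
	TYPE_UINT,
	TYPE_STR,
};

/* Hypothetical keys; KEY_NEW is added long after the first two. */
enum {
	KEY_A = 0x00,
	KEY_B = 0x01,
	KEY_NEW = 0x05,
	KEY_MAX
};

/* Only the first two elements are initialized, the rest are zero. */
static const unsigned char half_filled[KEY_MAX] = {
	/* 0x00 */ TYPE_UINT,
	/* 0x01 */ TYPE_STR,
};

/* Every element up to KEY_MAX has an explicit type. */
static const unsigned char complete[KEY_MAX] = {
	/* 0x00 */ TYPE_UINT,
	/* 0x01 */ TYPE_STR,
	/* 0x02 */ TYPE_UINT, /* unused */
	/* 0x03 */ TYPE_UINT, /* unused */
	/* 0x04 */ TYPE_UINT, /* unused */
	/* 0x05 */ TYPE_UINT, /* KEY_NEW */
};

/* Check that the decoded value type matches the one the table expects. */
static bool
key_type_is_valid(const unsigned char *table, unsigned key,
		  enum value_type actual)
{
	return key < KEY_MAX && table[key] == actual;
}
```

With the half-filled table a well-formed unsigned value for KEY_NEW is
rejected, because the expected type silently defaults to zero; with the
complete table it passes.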
Part of #5445
---
src/box/iproto_constants.c | 58 ++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)
diff --git a/src/box/iproto_constants.c b/src/box/iproto_constants.c
index 029d9888c..addda39dc 100644
--- a/src/box/iproto_constants.c
+++ b/src/box/iproto_constants.c
@@ -90,6 +90,64 @@ const unsigned char iproto_key_type[IPROTO_KEY_MAX] =
/* 0x2a */ MP_MAP, /* IPROTO_TUPLE_META */
/* 0x2b */ MP_MAP, /* IPROTO_OPTIONS */
/* }}} */
+
+ /* {{{ unused */
+ /* 0x2c */ MP_UINT,
+ /* 0x2d */ MP_UINT,
+ /* 0x2e */ MP_UINT,
+ /* 0x2f */ MP_UINT,
+ /* }}} */
+
+ /* {{{ body -- response keys */
+ /* 0x30 */ MP_ARRAY, /* IPROTO_DATA */
+ /* 0x31 */ MP_STR, /* IPROTO_ERROR_24 */
+ /* 0x32 */ MP_ARRAY, /* IPROTO_METADATA */
+ /* 0x33 */ MP_ARRAY, /* IPROTO_BIND_METADATA */
+ /* 0x34 */ MP_UINT, /* IPROTO_BIND_COUNT */
+ /* }}} */
+
+ /* {{{ unused */
+ /* 0x35 */ MP_UINT,
+ /* 0x36 */ MP_UINT,
+ /* 0x37 */ MP_UINT,
+ /* 0x38 */ MP_UINT,
+ /* 0x39 */ MP_UINT,
+ /* 0x3a */ MP_UINT,
+ /* 0x3b */ MP_UINT,
+ /* 0x3c */ MP_UINT,
+ /* 0x3d */ MP_UINT,
+ /* 0x3e */ MP_UINT,
+ /* 0x3f */ MP_UINT,
+ /* }}} */
+
+ /* {{{ body -- sql keys */
+ /* 0x40 */ MP_STR, /* IPROTO_SQL_TEXT */
+ /* 0x41 */ MP_ARRAY, /* IPROTO_SQL_BIND */
+ /* 0x42 */ MP_MAP, /* IPROTO_SQL_INFO */
+ /* 0x43 */ MP_UINT, /* IPROTO_STMT_ID */
+ /* }}} */
+
+ /* {{{ unused */
+ /* 0x44 */ MP_UINT,
+ /* 0x45 */ MP_UINT,
+ /* 0x46 */ MP_UINT,
+ /* 0x47 */ MP_UINT,
+ /* 0x48 */ MP_UINT,
+ /* 0x49 */ MP_UINT,
+ /* 0x4a */ MP_UINT,
+ /* 0x4b */ MP_UINT,
+ /* 0x4c */ MP_UINT,
+ /* 0x4d */ MP_UINT,
+ /* 0x4e */ MP_UINT,
+ /* 0x4f */ MP_UINT,
+ /* }}} */
+
+ /* {{{ body -- additional request keys */
+ /* 0x50 */ MP_BOOL, /* IPROTO_REPLICA_ANON */
+ /* 0x51 */ MP_ARRAY, /* IPROTO_ID_FILTER */
+ /* 0x52 */ MP_MAP, /* IPROTO_ERROR */
+ /* 0x53 */ MP_UINT, /* IPROTO_TERM */
+ /* }}} */
};
const char *iproto_type_strs[] =
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (3 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 04/12] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 22:12 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
` (10 subsequent siblings)
15 siblings, 2 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
A successful box_clear_synchro_queue() call results in writing a
CONFIRM(N) + ROLLBACK(N+1) pair, where N is the confirmed lsn.
Let's write a single PROMOTE(N) entry instead. It has the same
meaning as CONFIRM + ROLLBACK, and later it will give followers some
additional information about the leader state change.
Part of #5445
---
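A note for readers, not part of the commit: the equivalence described above
can be sketched on a toy queue. This is a simplified model, assuming a flat
array of entries ordered by lsn; the types and functions (sketch_entry,
sketch_confirm, sketch_rollback, sketch_promote) are hypothetical and do not
mirror the real txn_limbo structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome of a queued synchronous transaction. */
enum sketch_state { SK_PENDING, SK_CONFIRMED, SK_ROLLED_BACK };

struct sketch_entry {
	int64_t lsn;
	enum sketch_state state;
};

/* CONFIRM(lsn): commit every pending entry with entry.lsn <= lsn. */
static void
sketch_confirm(struct sketch_entry *queue, size_t count, int64_t lsn)
{
	for (size_t i = 0; i < count; i++) {
		if (queue[i].state == SK_PENDING && queue[i].lsn <= lsn)
			queue[i].state = SK_CONFIRMED;
	}
}

/* ROLLBACK(lsn): abort every pending entry with entry.lsn >= lsn. */
static void
sketch_rollback(struct sketch_entry *queue, size_t count, int64_t lsn)
{
	for (size_t i = 0; i < count; i++) {
		if (queue[i].state == SK_PENDING && queue[i].lsn >= lsn)
			queue[i].state = SK_ROLLED_BACK;
	}
}

/* PROMOTE(lsn) modelled as CONFIRM(lsn) followed by ROLLBACK(lsn + 1). */
static void
sketch_promote(struct sketch_entry *queue, size_t count, int64_t lsn)
{
	assert(lsn >= 0);
	sketch_confirm(queue, count, lsn);
	sketch_rollback(queue, count, lsn + 1);
}
```

After sketch_promote() no entry is left pending, which corresponds to the
assert(txn_limbo_is_empty(&txn_limbo)) in the patch.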
src/box/box.cc | 14 +++++++-
src/box/txn_limbo.c | 78 +++++++++++++++++++++++++--------------------
src/box/txn_limbo.h | 9 ++----
3 files changed, 60 insertions(+), 41 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 2d2ae233c..e0bed74f1 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1556,7 +1556,19 @@ box_clear_synchro_queue(bool try_wait)
"new synchronous transactions appeared");
rc = -1;
} else {
- txn_limbo_force_empty(&txn_limbo, wait_lsn);
+ /*
+ * Term parameter is unused now, We'll pass
+ * box_raft()->term there later.
+ */
+ txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+ struct synchro_request req = {
+ .type = IPROTO_PROMOTE,
+ .replica_id = former_leader_id,
+ .origin_id = instance_id,
+ .lsn = wait_lsn,
+ .term = 0, /* unused */
+ };
+ txn_limbo_process(&txn_limbo, &req);
assert(txn_limbo_is_empty(&txn_limbo));
}
}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index c96e497c6..2346331c7 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -317,19 +317,21 @@ txn_limbo_write_cb(struct journal_entry *entry)
}
static void
-txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn)
+txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
+ uint64_t term)
{
- assert(lsn > 0);
+ assert(lsn >= 0);
struct synchro_request req = {
.type = type,
.replica_id = limbo->owner_id,
.lsn = lsn,
+ .term = term,
};
/*
- * This is a synchronous commit so we can
- * allocate everything on a stack.
+ * This is a synchronous commit so we can allocate everything on a
+ * stack. Note, that promote body includes synchro body.
*/
char body[XROW_SYNCHRO_BODY_LEN_MAX];
struct xrow_header row;
@@ -371,14 +373,14 @@ txn_limbo_write_confirm(struct txn_limbo *limbo, int64_t lsn)
assert(lsn > limbo->confirmed_lsn);
assert(!limbo->is_in_rollback);
limbo->confirmed_lsn = lsn;
- txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn);
+ txn_limbo_write_synchro(limbo, IPROTO_CONFIRM, lsn, 0);
}
/** Confirm all the entries <= @a lsn. */
static void
txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
{
- assert(limbo->owner_id != REPLICA_ID_NIL);
+ assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
assert(limbo == &txn_limbo);
struct txn_limbo_entry *e, *tmp;
rlist_foreach_entry_safe(e, &limbo->queue, in_queue, tmp) {
@@ -425,7 +427,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
assert(lsn > limbo->confirmed_lsn);
assert(!limbo->is_in_rollback);
limbo->is_in_rollback = true;
- txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn);
+ txn_limbo_write_synchro(limbo, IPROTO_ROLLBACK, lsn, 0);
limbo->is_in_rollback = false;
}
@@ -433,7 +435,7 @@ txn_limbo_write_rollback(struct txn_limbo *limbo, int64_t lsn)
static void
txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
{
- assert(limbo->owner_id != REPLICA_ID_NIL);
+ assert(limbo->owner_id != REPLICA_ID_NIL || txn_limbo_is_empty(limbo));
assert(limbo == &txn_limbo);
struct txn_limbo_entry *e, *tmp;
struct txn_limbo_entry *last_rollback = NULL;
@@ -464,6 +466,37 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
box_update_ro_summary();
}
+void
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
+{
+ limbo->confirmed_lsn = lsn;
+ limbo->is_in_rollback = true;
+ /*
+ * We make sure that promote is only written once everything this
+ * instance has may be confirmed.
+ */
+ struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
+ assert(e == NULL || e->lsn <= lsn);
+ (void) e;
+ txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
+ limbo->is_in_rollback = false;
+}
+
+/**
+ * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and rollback all
+ * entries > @req.lsn.
+ */
+static void
+txn_limbo_read_promote(struct txn_limbo *limbo,
+ const struct synchro_request *req)
+{
+ txn_limbo_read_confirm(limbo, req->lsn);
+ txn_limbo_read_rollback(limbo, req->lsn + 1);
+ assert(txn_limbo_is_empty(&txn_limbo));
+ limbo->owner_id = req->origin_id;
+ limbo->confirmed_lsn = 0;
+}
+
void
txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn)
{
@@ -626,38 +659,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
case IPROTO_ROLLBACK:
txn_limbo_read_rollback(limbo, req->lsn);
break;
+ case IPROTO_PROMOTE:
+ txn_limbo_read_promote(limbo, req);
+ break;
default:
unreachable();
}
return;
}
-void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t confirm_lsn)
-{
- struct txn_limbo_entry *e, *last_quorum = NULL;
- struct txn_limbo_entry *rollback = NULL;
- rlist_foreach_entry(e, &limbo->queue, in_queue) {
- if (txn_has_flag(e->txn, TXN_WAIT_ACK)) {
- if (e->lsn <= confirm_lsn) {
- last_quorum = e;
- } else {
- rollback = e;
- break;
- }
- }
- }
-
- if (last_quorum != NULL) {
- txn_limbo_write_confirm(limbo, last_quorum->lsn);
- txn_limbo_read_confirm(limbo, last_quorum->lsn);
- }
- if (rollback != NULL) {
- txn_limbo_write_rollback(limbo, rollback->lsn);
- txn_limbo_read_rollback(limbo, rollback->lsn);
- }
-}
-
void
txn_limbo_on_parameters_change(struct txn_limbo *limbo)
{
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index f2a98c8bb..f35771dc9 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -272,14 +272,11 @@ int
txn_limbo_wait_confirm(struct txn_limbo *limbo);
/**
- * Make txn_limbo confirm all the entries with lsn less than or
- * equal to the given one, and rollback all the following entries.
- * The function makes txn_limbo write CONFIRM and ROLLBACK
- * messages for appropriate lsns, and then process the messages
- * immediately.
+ * Write a PROMOTE request, which has the same effect as CONFIRM(@a lsn) and
+ * ROLLBACK(@a lsn + 1) combined.
*/
void
-txn_limbo_force_empty(struct txn_limbo *limbo, int64_t last_confirm);
+txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term);
/**
* Update qsync parameters dynamically.
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (4 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-19 13:39 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
` (9 subsequent siblings)
15 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
The PROMOTE entry will be used to mark limbo ownership transition in
addition to emptying the limbo. So it has to be written every time
`box.ctl.clear_synchro_queue()` succeeds, even when the limbo was
already empty.
Part of #5445
---
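A note for readers, not part of the commit: a minimal sketch of how the lsn
carried by PROMOTE is chosen once the empty-limbo case is handled. It assumes
the pending lsns are kept in ascending order and ignores the quorum wait that
the real function performs for a non-empty limbo; sketch_promote_lsn and its
parameters are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick the lsn for PROMOTE: the last pending synchronous entry if there
 * is one, otherwise the already confirmed lsn.
 */
static int64_t
sketch_promote_lsn(const int64_t *pending_lsns, size_t count,
		   int64_t confirmed_lsn)
{
	assert(confirmed_lsn >= 0);
	if (count == 0)
		return confirmed_lsn;
	return pending_lsns[count - 1];
}
```

The empty case is why the lsn assertion in txn_limbo_write_synchro() was
relaxed to lsn >= 0 in the previous patch: a fresh limbo has nothing
confirmed yet.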
src/box/box.cc | 32 +++++++++++++++++---------------
1 file changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index e0bed74f1..9d45e211e 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1502,19 +1502,18 @@ box_clear_synchro_queue(bool try_wait)
"simultaneous invocations");
return -1;
}
- /*
- * XXX: we may want to write confirm + rollback even when the limbo is
- * empty for the sake of limbo ownership transition.
- */
- if (!is_box_configured || txn_limbo_is_empty(&txn_limbo))
+
+ if (!is_box_configured)
return 0;
uint32_t former_leader_id = txn_limbo.owner_id;
- assert(former_leader_id != REPLICA_ID_NIL);
- if (former_leader_id == instance_id)
- return 0;
-
+ int64_t wait_lsn = txn_limbo.confirmed_lsn;
+ int rc = 0;
+ int quorum = replication_synchro_quorum;
in_clear_synchro_queue = true;
+ if (txn_limbo_is_empty(&txn_limbo))
+ goto promote;
+
if (try_wait) {
/* Wait until pending confirmations/rollbacks reach us. */
double timeout = 2 * replication_synchro_timeout;
@@ -1528,8 +1527,11 @@ box_clear_synchro_queue(bool try_wait)
* Our mission was to clear the limbo from former leader's
* transactions. Exit in case someone did that for us.
*/
- if (txn_limbo_is_empty(&txn_limbo) ||
- former_leader_id != txn_limbo.owner_id) {
+ if (former_leader_id != txn_limbo.owner_id) {
+ /*
+ * TODO: error once we see someone else has become the
+ * leader already.
+ */
in_clear_synchro_queue = false;
return 0;
}
@@ -1540,12 +1542,11 @@ box_clear_synchro_queue(bool try_wait)
* in the limbo must've come through the applier meaning they already
* have an lsn assigned, even if their WAL write hasn't finished yet.
*/
- int64_t wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
+ wait_lsn = txn_limbo_last_synchro_entry(&txn_limbo)->lsn;
assert(wait_lsn > 0);
- int quorum = replication_synchro_quorum;
- int rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
- replication_synchro_timeout);
+ rc = box_wait_quorum(former_leader_id, wait_lsn, quorum,
+ replication_synchro_timeout);
if (rc == 0) {
if (quorum < replication_synchro_quorum) {
diag_set(ClientError, ER_QUORUM_WAIT, quorum,
@@ -1556,6 +1557,7 @@ box_clear_synchro_queue(bool try_wait)
"new synchronous transactions appeared");
rc = -1;
} else {
+promote:
/*
* Term parameter is unused now, We'll pass
* box_raft()->term there later.
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (5 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
` (2 more replies)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
` (8 subsequent siblings)
15 siblings, 3 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
Start writing the actual leader term together with the PROMOTE request
and process terms in PROMOTE requests on the receiver side.
Make the applier only apply synchronous transactions from the instance which
has the greatest term as received in PROMOTE requests.
Closes #5445
---
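A note for readers, not part of the commit: the term bookkeeping can be
sketched with a plain array in place of the vclock used in the patch. This is
a simplified model that leaves out the "elections enabled" check; the names
(sketch_terms, sketch_process_term, sketch_is_outdated, SKETCH_MAX_NODES) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SKETCH_MAX_NODES = 32 };

struct sketch_terms {
	/* Latest term seen in a PROMOTE from each instance id. */
	uint64_t node_term[SKETCH_MAX_NODES];
	/* Greatest term seen in any PROMOTE. */
	uint64_t greatest_term;
};

/* Remember @a term for instance @a id; terms never decrease. */
static void
sketch_process_term(struct sketch_terms *t, uint32_t id, uint64_t term)
{
	assert(id < SKETCH_MAX_NODES);
	if (t->node_term[id] >= term)
		return;
	t->node_term[id] = term;
	if (term > t->greatest_term)
		t->greatest_term = term;
}

/* An instance is outdated if its last known term is behind the greatest. */
static bool
sketch_is_outdated(const struct sketch_terms *t, uint32_t id)
{
	assert(id < SKETCH_MAX_NODES);
	return t->node_term[id] < t->greatest_term;
}
```

Synchronous rows coming from an instance for which sketch_is_outdated() holds
are the ones the applier replaces with NOPs in this patch.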
...very => qsync-multi-statement-recovery.md} | 0
changelogs/unreleased/raft-promote.md | 4 +
src/box/applier.cc | 22 ++
src/box/box.cc | 18 +-
src/box/txn_limbo.c | 3 +
src/lib/raft/raft.c | 1 +
src/lib/raft/raft.h | 46 +++
.../gh-5445-leader-inconsistency.result | 292 ++++++++++++++++++
.../gh-5445-leader-inconsistency.test.lua | 129 ++++++++
test/replication/suite.cfg | 1 +
test/unit/raft.c | 37 ++-
test/unit/raft.result | 15 +-
12 files changed, 559 insertions(+), 9 deletions(-)
rename changelogs/unreleased/{qsync-multi-statement-recovery => qsync-multi-statement-recovery.md} (100%)
create mode 100644 changelogs/unreleased/raft-promote.md
create mode 100644 test/replication/gh-5445-leader-inconsistency.result
create mode 100644 test/replication/gh-5445-leader-inconsistency.test.lua
diff --git a/changelogs/unreleased/qsync-multi-statement-recovery b/changelogs/unreleased/qsync-multi-statement-recovery.md
similarity index 100%
rename from changelogs/unreleased/qsync-multi-statement-recovery
rename to changelogs/unreleased/qsync-multi-statement-recovery.md
diff --git a/changelogs/unreleased/raft-promote.md b/changelogs/unreleased/raft-promote.md
new file mode 100644
index 000000000..e5dac599c
--- /dev/null
+++ b/changelogs/unreleased/raft-promote.md
@@ -0,0 +1,4 @@
+## bugfix/replication
+
+* Fix a bug in synchronous replication when rolled back transactions could
+ reappear once a sufficiently old instance reconnected (gh-5445).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 40fc5ce86..61d53fdec 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -1027,6 +1027,28 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
}
}
+ /*
+ * When elections are enabled we must filter out synchronous rows coming
+ * from an instance that fell behind the current leader. This includes
+ * both synchronous tx rows and rows for txs following unconfirmed
+ * synchronous transactions.
+ * The rows are replaced with NOPs to preserve the vclock consistency.
+ */
+ struct applier_tx_row *item;
+ if (raft_is_node_outdated(box_raft(), applier->instance_id) &&
+ (last_row->wait_sync ||
+ (iproto_type_is_synchro_request(first_row->type) &&
+ !iproto_type_is_promote_request(first_row->type)))) {
+ stailq_foreach_entry(item, rows, next) {
+ struct xrow_header *row = &item->row;
+ row->type = IPROTO_NOP;
+ /*
+ * Row body is saved to fiber's region and will be freed
+ * on next fiber_gc() call.
+ */
+ row->bodycnt = 0;
+ }
+ }
if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
/*
* Synchro messages are not transactions, in terms
diff --git a/src/box/box.cc b/src/box/box.cc
index 9d45e211e..19f1528ca 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1503,7 +1503,12 @@ box_clear_synchro_queue(bool try_wait)
return -1;
}
- if (!is_box_configured)
+ /*
+ * Do nothing when box isn't configured and when PROMOTE was already
+ * written for this term.
+ */
+ if (!is_box_configured ||
+ raft_node_term(box_raft(), instance_id) == box_raft()->term)
return 0;
uint32_t former_leader_id = txn_limbo.owner_id;
int64_t wait_lsn = txn_limbo.confirmed_lsn;
@@ -1558,17 +1563,16 @@ box_clear_synchro_queue(bool try_wait)
rc = -1;
} else {
promote:
- /*
- * Term parameter is unused now, We'll pass
- * box_raft()->term there later.
- */
- txn_limbo_write_promote(&txn_limbo, wait_lsn, 0);
+ /* We cannot possibly get here in a volatile state. */
+ assert(box_raft()->volatile_term == box_raft()->term);
+ txn_limbo_write_promote(&txn_limbo, wait_lsn,
+ box_raft()->term);
struct synchro_request req = {
.type = IPROTO_PROMOTE,
.replica_id = former_leader_id,
.origin_id = instance_id,
.lsn = wait_lsn,
- .term = 0, /* unused */
+ .term = box_raft()->term,
};
txn_limbo_process(&txn_limbo, &req);
assert(txn_limbo_is_empty(&txn_limbo));
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 2346331c7..0726b5a04 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -34,6 +34,7 @@
#include "iproto_constants.h"
#include "journal.h"
#include "box.h"
+#include "raft.h"
struct txn_limbo txn_limbo;
@@ -643,6 +644,8 @@ complete:
void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
+ /* It's ok to process an empty term. It'll just get ignored. */
+ raft_process_term(box_raft(), req->origin_id, req->term);
if (req->replica_id != limbo->owner_id) {
/*
* Ignore CONFIRM/ROLLBACK messages for a foreign master.
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 4ea4fc3f8..e9ce8cade 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -985,6 +985,7 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
.death_timeout = 5,
.vtab = vtab,
};
+ vclock_create(&raft->term_map);
raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
0, 0);
raft->timer.data = raft;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index e447f6634..a5f7e08d9 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -207,6 +207,19 @@ struct raft {
* subsystems, such as Raft.
*/
const struct vclock *vclock;
+ /**
+ * The biggest term seen by this instance and persisted in WAL as part
+ * of a PROMOTE request. May be smaller than @a term, while there are
+ * ongoing elections, or the leader is already known, but this instance
+ * hasn't read its PROMOTE request yet.
+ * During other times must be equal to @a term.
+ */
+ uint64_t greatest_term;
+ /**
+ * Latest terms received with PROMOTE entries from remote instances.
+ * Raft uses them to determine data from which sources may be applied.
+ */
+ struct vclock term_map;
/** State machine timed event trigger. */
struct ev_timer timer;
/** Configured election timeout in seconds. */
@@ -243,6 +256,39 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
return !raft->is_enabled || raft->leader == source_id;
}
+/**
+ * Return the latest term as seen in PROMOTE requests from instance with id
+ * @a source_id.
+ */
+static inline uint64_t
+raft_node_term(const struct raft *raft, uint32_t source_id)
+{
+ assert(source_id < VCLOCK_MAX);
+ return vclock_get(&raft->term_map, source_id);
+}
+
+/**
+ * Check whether replica with id @a source_id is too old to apply synchronous
+ * data from it. The check is only valid when elections are enabled.
+ */
+static inline bool
+raft_is_node_outdated(const struct raft *raft, uint32_t source_id)
+{
+ uint64_t source_term = raft_node_term(raft, source_id);
+ return raft->is_enabled && source_term < raft->greatest_term;
+}
+
+/** Remember the last term seen for replica with id @a source_id. */
+static inline void
+raft_process_term(struct raft *raft, uint32_t source_id, uint64_t term)
+{
+ if (raft_node_term(raft, source_id) >= term)
+ return;
+ vclock_follow(&raft->term_map, source_id, term);
+ if (term > raft->greatest_term)
+ raft->greatest_term = term;
+}
+
/** Check if Raft is enabled. */
static inline bool
raft_is_enabled(const struct raft *raft)
diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
new file mode 100644
index 000000000..5c6169f50
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.result
@@ -0,0 +1,292 @@
+-- test-run result file version 2
+test_run = require("test_run").new()
+ | ---
+ | ...
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+ | ---
+ | ...
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+ | ---
+ | - true
+ | ...
+function name(id)
+ return 'election_replica'..id
+end;
+ | ---
+ | ...
+
+function get_leader(nrs)
+ local leader_nr = 0
+ test_run:wait_cond(function()
+ for nr, do_check in pairs(nrs) do
+ if do_check then
+ local is_leader = test_run:eval(name(nr),
+ is_leader_cmd)[1]
+ if is_leader then
+ leader_nr = nr
+ return true
+ end
+ end
+ end
+ return false
+ end)
+ assert(leader_nr ~= 0)
+ return leader_nr
+end;
+ | ---
+ | ...
+
+test_run:cmd('setopt delimiter ""');
+ | ---
+ | - true
+ | ...
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+ | ---
+ | ...
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+ | ---
+ | ...
+leader_nr = get_leader(is_possible_leader)
+ | ---
+ | ...
+leader = name(leader_nr)
+ | ---
+ | ...
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+ | ---
+ | ...
+next_leader = name(next_leader_nr)
+ | ---
+ | ...
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+ | ---
+ | ...
+other = name(other_nr)
+ | ---
+ | ...
+
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+box.ctl.wait_rw()
+ | ---
+ | ...
+_ = box.schema.space.create('test', {is_sync=true})
+ | ---
+ | ...
+_ = box.space.test:create_index('pk')
+ | ---
+ | ...
+box.space.test:insert{1}
+ | ---
+ | - [1]
+ | ...
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..next_leader)
+ | ---
+ | - true
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+ | ---
+ | ...
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+ | ---
+ | ...
+fib:status()
+ | ---
+ | - suspended
+ | ...
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+ | ---
+ | - true
+ | ...
+-- Old leader is gone.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server '..leader)
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = false
+ | ---
+ | ...
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+ | ---
+ | - true
+ | ...
+assert(get_leader(is_possible_leader) == next_leader_nr)
+ | ---
+ | - true
+ | ...
+test_run:switch(other)
+ | ---
+ | - true
+ | ...
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+-- No signs of the unconfirmed transaction.
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+-- Note, it wins the elections fairly.
+test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
+ | ---
+ | - true
+ | ...
+test_run:wait_lsn(leader, next_leader)
+ | ---
+ | ...
+test_run:switch(leader)
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+ | ---
+ | - true
+ | ...
+box.cfg{election_mode='candidate'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+-- Resign to make old leader win the elections.
+box.cfg{election_mode='voter'}
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+is_possible_leader[leader_nr] = true
+ | ---
+ | ...
+assert(get_leader(is_possible_leader) == leader_nr)
+ | ---
+ | - true
+ | ...
+
+test_run:switch(next_leader)
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status='follow'})
+ | ---
+ | - true
+ | ...
+box.space.test:select{} -- 1
+ | ---
+ | - - [1]
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-5445-leader-inconsistency.test.lua b/test/replication/gh-5445-leader-inconsistency.test.lua
new file mode 100644
index 000000000..e7952f5fa
--- /dev/null
+++ b/test/replication/gh-5445-leader-inconsistency.test.lua
@@ -0,0 +1,129 @@
+test_run = require("test_run").new()
+
+is_leader_cmd = "return box.info.election.state == 'leader'"
+
+-- Auxiliary.
+test_run:cmd('setopt delimiter ";"')
+function name(id)
+ return 'election_replica'..id
+end;
+
+function get_leader(nrs)
+ local leader_nr = 0
+ test_run:wait_cond(function()
+ for nr, do_check in pairs(nrs) do
+ if do_check then
+ local is_leader = test_run:eval(name(nr),
+ is_leader_cmd)[1]
+ if is_leader then
+ leader_nr = nr
+ return true
+ end
+ end
+ end
+ return false
+ end)
+ assert(leader_nr ~= 0)
+ return leader_nr
+end;
+
+test_run:cmd('setopt delimiter ""');
+
+--
+-- gh-5445: make sure rolled back rows do not reappear once old leader returns
+-- to cluster.
+--
+SERVERS = {'election_replica1', 'election_replica2' ,'election_replica3'}
+test_run:create_cluster(SERVERS, "replication", {args='2 0.4'})
+test_run:wait_fullmesh(SERVERS)
+
+-- Any of the three instances may bootstrap the cluster and become leader.
+is_possible_leader = {true, true, true}
+leader_nr = get_leader(is_possible_leader)
+leader = name(leader_nr)
+next_leader_nr = ((leader_nr - 1) % 3 + 1) % 3 + 1 -- {1, 2, 3} -> {2, 3, 1}
+next_leader = name(next_leader_nr)
+other_nr = ((leader_nr - 1) % 3 + 2) % 3 + 1 -- {1, 2, 3} -> {3, 1, 2}
+other = name(other_nr)
+
+test_run:switch(other)
+box.cfg{election_mode='voter'}
+test_run:switch('default')
+
+test_run:switch(next_leader)
+box.cfg{election_mode='voter'}
+test_run:switch('default')
+
+test_run:switch(leader)
+box.ctl.wait_rw()
+_ = box.schema.space.create('test', {is_sync=true})
+_ = box.space.test:create_index('pk')
+box.space.test:insert{1}
+
+-- Simulate a situation when the instance which will become the next leader
+-- doesn't know of unconfirmed rows. It should roll them back anyways and do not
+-- accept them once they actually appear from the old leader.
+-- So, stop the instance which'll be the next leader.
+test_run:switch('default')
+test_run:cmd('stop server '..next_leader)
+test_run:switch(leader)
+-- Insert some unconfirmed data.
+box.cfg{replication_synchro_quorum=3, replication_synchro_timeout=1000}
+fib = require('fiber').create(box.space.test.insert, box.space.test, {2})
+fib:status()
+
+-- 'other', 'leader', 'next_leader' are defined on 'default' node, hence the
+-- double switches.
+test_run:switch('default')
+test_run:switch(other)
+-- Wait until the rows are replicated to the other instance.
+test_run:wait_cond(function() return box.space.test:get{2} ~= nil end)
+-- Old leader is gone.
+test_run:switch('default')
+test_run:cmd('stop server '..leader)
+is_possible_leader[leader_nr] = false
+
+-- Emulate a situation when next_leader wins the elections. It can't do that in
+-- this configuration, obviously, because it's behind the 'other' node, so set
+-- quorum to 1 and imagine there are 2 more servers which would vote for
+-- next_leader.
+-- Also, make the instance ignore synchronization with other replicas.
+-- Otherwise it would stall for replication_sync_timeout. This is due to the
+-- nature of the test and may be ignored (we restart the instance to simulate
+-- a situation when some rows from the old leader were not received).
+test_run:cmd('start server '..next_leader..' with args="1 0.4 candidate 1"')
+assert(get_leader(is_possible_leader) == next_leader_nr)
+test_run:switch(other)
+-- New leader didn't know about the unconfirmed rows but still rolled them back.
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+
+test_run:switch('default')
+test_run:switch(next_leader)
+-- No signs of the unconfirmed transaction.
+box.space.test:select{} -- 1
+
+test_run:switch('default')
+-- Old leader returns and old unconfirmed rows from it must be ignored.
+-- Note, it wins the elections fairly.
+test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
+test_run:wait_lsn(leader, next_leader)
+test_run:switch(leader)
+test_run:wait_cond(function() return box.space.test:get{2} == nil end)
+box.cfg{election_mode='candidate'}
+
+test_run:switch('default')
+test_run:switch(next_leader)
+-- Resign to make old leader win the elections.
+box.cfg{election_mode='voter'}
+
+test_run:switch('default')
+is_possible_leader[leader_nr] = true
+assert(get_leader(is_possible_leader) == leader_nr)
+
+test_run:switch(next_leader)
+test_run:wait_upstream(1, {status='follow'})
+box.space.test:select{} -- 1
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 4a9ca0a46..8b185ce7e 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -19,6 +19,7 @@
"gh-5213-qsync-applier-order-3.test.lua": {},
"gh-5426-election-on-off.test.lua": {},
"gh-5433-election-restart-recovery.test.lua": {},
+ "gh-5445-leader-inconsistency.test.lua": {},
"gh-5506-election-on-off.test.lua": {},
"once.test.lua": {},
"on_replace.test.lua": {},
diff --git a/test/unit/raft.c b/test/unit/raft.c
index d0d13d8c7..0306cefcd 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1267,10 +1267,44 @@ raft_test_too_long_wal_write(void)
raft_finish_test();
}
+static void
+raft_test_term_filter(void)
+{
+ raft_start_test(9);
+ struct raft_node node;
+ raft_node_create(&node);
+
+ is(raft_node_term(&node.raft, 1), 0, "empty node term");
+ ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
+
+ raft_process_term(&node.raft, 1, 1);
+ is(raft_node_term(&node.raft, 1), 1, "node term updated");
+ ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
+
+ raft_process_term(&node.raft, 2, 100);
+ ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
+ "have greater term");
+ ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
+ "isn't outdated");
+
+ raft_process_term(&node.raft, 3, 100);
+ ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
+ "others have the same term");
+
+ raft_process_term(&node.raft, 3, 99);
+ is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
+ ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
+ "outdated");
+
+
+ raft_node_destroy(&node);
+ raft_finish_test();
+}
+
static int
main_f(va_list ap)
{
- raft_start_test(13);
+ raft_start_test(14);
(void) ap;
fakeev_init();
@@ -1288,6 +1322,7 @@ main_f(va_list ap)
raft_test_death_timeout();
raft_test_enable_disable();
raft_test_too_long_wal_write();
+ raft_test_term_filter();
fakeev_free();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index 96bfc3b86..ecb962e42 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..13
+1..14
*** raft_test_leader_election ***
1..24
ok 1 - 1 pending message at start
@@ -220,4 +220,17 @@ ok 12 - subtests
ok 8 - became candidate
ok 13 - subtests
*** raft_test_too_long_wal_write: done ***
+ *** raft_test_term_filter ***
+ 1..9
+ ok 1 - empty node term
+ ok 2 - not outdated initially
+ ok 3 - node term updated
+ ok 4 - other nodes are outdated
+ ok 5 - node outdated when others have greater term
+ ok 6 - node with greatest term isn't outdated
+ ok 7 - node not outdated when others have the same term
+ ok 8 - node term isn't decreased
+ ok 9 - node doesn't become outdated
+ok 14 - subtests
+ *** raft_test_term_filter: done ***
*** main_f: done ***
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (6 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:34 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate Serge Petrenko via Tarantool-patches
` (7 subsequent siblings)
15 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
When an instance is configured in "manual" election mode, it behaves as
a voter most of the time, until `box.ctl.promote()` is called.
Once `box.ctl.promote()` is called, the instance behaves as a
candidate for a full election round, i.e. until a leader is known. If
the instance wins the election, it remains in the `leader` state until the
next election. Otherwise the instance returns to the `voter` state.
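The lifecycle described above can be pictured with a small toy model. This is only a sketch under stated assumptions: the `toy_*` names are hypothetical and are not part of Tarantool, and the election outcome is passed in as a flag instead of being computed by Raft.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not Tarantool code: every toy_* name is hypothetical. */
enum toy_state { TOY_FOLLOWER, TOY_CANDIDATE, TOY_LEADER };

struct toy_instance {
	enum toy_state state;
};

/*
 * promote() in manual mode: act as a candidate for exactly one
 * election round. The outcome is an input here, not a real election.
 */
static void
toy_promote(struct toy_instance *inst, bool wins_election)
{
	inst->state = TOY_CANDIDATE;
	/* The round ends as soon as some leader is known. */
	inst->state = wins_election ? TOY_LEADER : TOY_FOLLOWER;
}

/* Someone else starts a new election: a manual leader steps down. */
static void
toy_on_new_election(struct toy_instance *inst)
{
	inst->state = TOY_FOLLOWER;
}
```

In this model an instance that lost the round is back to being a follower right away, while a winner stays leader only until `toy_on_new_election()` fires.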
Follow-up #5445
Part of #3055
---
src/box/box.cc | 37 +++++++++++++++++---------
src/box/raft.c | 2 ++
src/box/raft.h | 17 ++++++++++++
test/replication/election_basic.result | 4 +--
4 files changed, 45 insertions(+), 15 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index 19f1528ca..d5a55a30a 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -676,17 +676,27 @@ box_check_uri(const char *source, const char *option_name)
}
}
-static const char *
+static enum election_mode
box_check_election_mode(void)
{
const char *mode = cfg_gets("election_mode");
- if (mode == NULL || (strcmp(mode, "off") != 0 &&
- strcmp(mode, "voter") != 0 && strcmp(mode, "candidate") != 0)) {
- diag_set(ClientError, ER_CFG, "election_mode", "the value must "
- "be a string 'off' or 'voter' or 'candidate'");
- return NULL;
- }
- return mode;
+ if (mode == NULL)
+ goto error;
+
+ if (strcmp(mode, "off") == 0)
+ return ELECTION_MODE_OFF;
+ else if (strcmp(mode, "voter") == 0)
+ return ELECTION_MODE_VOTER;
+ else if (strcmp(mode, "manual") == 0)
+ return ELECTION_MODE_MANUAL;
+ else if (strcmp(mode, "candidate") == 0)
+ return ELECTION_MODE_CANDIDATE;
+
+error:
+ diag_set(ClientError, ER_CFG, "election_mode",
+ "the value must be one of the following strings: "
+ "'off', 'voter', 'candidate', 'manual'");
+ return ELECTION_MODE_INVALID;
}
static double
@@ -1109,7 +1119,7 @@ box_check_config(void)
box_check_uri(cfg_gets("listen"), "listen");
box_check_instance_uuid(&uuid);
box_check_replicaset_uuid(&uuid);
- if (box_check_election_mode() == NULL)
+ if (box_check_election_mode() == ELECTION_MODE_INVALID)
diag_raise();
if (box_check_election_timeout() < 0)
diag_raise();
@@ -1143,11 +1153,12 @@ box_check_config(void)
int
box_set_election_mode(void)
{
- const char *mode = box_check_election_mode();
- if (mode == NULL)
+ enum election_mode mode = box_check_election_mode();
+ if (mode == ELECTION_MODE_INVALID)
return -1;
- raft_cfg_is_candidate(box_raft(), strcmp(mode, "candidate") == 0);
- raft_cfg_is_enabled(box_raft(), strcmp(mode, "off") != 0);
+ box_election_mode = mode;
+ raft_cfg_is_candidate(box_raft(), mode == ELECTION_MODE_CANDIDATE);
+ raft_cfg_is_enabled(box_raft(), mode != ELECTION_MODE_OFF);
return 0;
}
diff --git a/src/box/raft.c b/src/box/raft.c
index cfd898db0..285dbe4fd 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -44,6 +44,8 @@ struct raft box_raft_global = {
.state = 0,
};
+enum election_mode box_election_mode = ELECTION_MODE_INVALID;
+
/**
* A trigger executed each time the Raft state machine updates any
* of its visible attributes.
diff --git a/src/box/raft.h b/src/box/raft.h
index 1c59f17e6..15f4e80d9 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -35,8 +35,25 @@
extern "C" {
#endif
+enum election_mode {
+ ELECTION_MODE_INVALID = -1,
+ ELECTION_MODE_OFF = 0,
+ ELECTION_MODE_VOTER = 1,
+ ELECTION_MODE_MANUAL = 2,
+ ELECTION_MODE_CANDIDATE = 3,
+};
+
struct raft_request;
+/**
+ * box_election_mode - current mode of operation for Raft. Some modes correspond
+ * to Raft operation modes directly, like CANDIDATE, VOTER and OFF.
+ * There's a mode which does not map to a Raft operation mode directly:
+ * MANUAL. In this mode Raft usually operates as a voter, but it may become a
+ * candidate for some period of time when the user calls `box.ctl.promote()`.
+ */
+extern enum election_mode box_election_mode;
+
/** Raft state of this instance. */
static inline struct raft *
box_raft(void)
diff --git a/test/replication/election_basic.result b/test/replication/election_basic.result
index 4d7d33f2b..d5320b3ff 100644
--- a/test/replication/election_basic.result
+++ b/test/replication/election_basic.result
@@ -22,8 +22,8 @@ box.cfg{election_mode = 100}
| ...
box.cfg{election_mode = '100'}
| ---
- | - error: 'Incorrect value for option ''election_mode'': the value must be a string
- | ''off'' or ''voter'' or ''candidate'''
+ | - error: 'Incorrect value for option ''election_mode'': the value must be one of the
+ | following strings: ''off'', ''voter'', ''candidate'', ''manual'''
| ...
box.cfg{election_timeout = -1}
| ---
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (7 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 22:23 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 12:52 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue() Serge Petrenko via Tarantool-patches
` (6 subsequent siblings)
15 siblings, 2 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
Extract raft_start_candidate and raft_stop_candidate functions from
raft_cfg_is_candidate.
These functions will be used in manual elections.
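To illustrate the intended split, here is a reduced sketch of the two functions with timers, WAL writes and broadcasts left out. The `toy_*` names are hypothetical; only the `demote` semantics follow the patch below.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced model of the start/stop pair; toy_* names are hypothetical. */
enum toy_raft_state { TOY_FOLLOWER, TOY_CANDIDATE, TOY_LEADER };

struct toy_raft {
	enum toy_raft_state state;
	bool is_candidate;
};

static void
toy_start_candidate(struct toy_raft *raft)
{
	if (raft->is_candidate)
		return;
	raft->is_candidate = true;
	/*
	 * A leader kept alive by an earlier stop with demote = false
	 * simply stays the leader; nothing else to do in this model.
	 */
}

static void
toy_stop_candidate(struct toy_raft *raft, bool demote)
{
	if (!raft->is_candidate)
		return;
	raft->is_candidate = false;
	/* Without demote a leader keeps its role until new elections. */
	if (raft->state == TOY_LEADER && !demote)
		return;
	raft->state = TOY_FOLLOWER;
}
```

Both functions are idempotent with respect to `is_candidate`, which is what lets raft_cfg_is_candidate call them unconditionally.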
Prerequisite #3055
---
src/lib/raft/raft.c | 83 ++++++++++++++++++++++++++++---------------
src/lib/raft/raft.h | 13 +++++++
test/unit/raft.c | 33 +++++++++++++++--
test/unit/raft.result | 10 +++++-
4 files changed, 108 insertions(+), 31 deletions(-)
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e9ce8cade..b21693642 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -848,38 +848,65 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
void
raft_cfg_is_candidate(struct raft *raft, bool is_candidate)
{
- bool old_is_candidate = raft->is_candidate;
raft->is_cfg_candidate = is_candidate;
- raft->is_candidate = is_candidate && raft->is_enabled;
- if (raft->is_candidate == old_is_candidate)
- return;
+ is_candidate = is_candidate && raft->is_enabled;
+ if (is_candidate)
+ raft_start_candidate(raft);
+ else
+ raft_stop_candidate(raft, true);
+}
- if (raft->is_candidate) {
- assert(raft->state == RAFT_STATE_FOLLOWER);
- if (raft->is_write_in_progress) {
- /*
- * If there is an on-going WAL write, it means there was
- * some node who sent newer data to this node. So it is
- * probably a better candidate. Anyway can't do anything
- * until the new state is fully persisted.
- */
- } else if (raft->leader != 0) {
- raft_sm_wait_leader_dead(raft);
- } else {
- raft_sm_wait_leader_found(raft);
- }
+void
+raft_start_candidate(struct raft *raft)
+{
+ if (raft->is_candidate)
+ return;
+ raft->is_candidate = true;
+ assert(raft->state != RAFT_STATE_CANDIDATE);
+ /*
+ * May still be the leader after raft_stop_candidate
+ * with demote = false.
+ */
+ if (raft->state == RAFT_STATE_LEADER)
+ return;
+ if (raft->is_write_in_progress) {
+ /*
+ * If there is an on-going WAL write, it means there was
+ * some node who sent newer data to this node. So it is
+ * probably a better candidate. Anyway can't do anything
+ * until the new state is fully persisted.
+ */
+ } else if (raft->leader != 0) {
+ raft_sm_wait_leader_dead(raft);
} else {
- if (raft->state != RAFT_STATE_LEADER) {
- /* Do not wait for anything while being a voter. */
- raft_ev_timer_stop(raft_loop(), &raft->timer);
- }
- if (raft->state != RAFT_STATE_FOLLOWER) {
- if (raft->state == RAFT_STATE_LEADER)
- raft->leader = 0;
- raft->state = RAFT_STATE_FOLLOWER;
- /* State is visible and changed - broadcast. */
- raft_schedule_broadcast(raft);
+ raft_sm_wait_leader_found(raft);
+ }
+}
+
+void
+raft_stop_candidate(struct raft *raft, bool demote)
+{
+ if (!raft->is_candidate)
+ return;
+ raft->is_candidate = false;
+ if (raft->state != RAFT_STATE_LEADER) {
+ /* Do not wait for anything while being a voter. */
+ raft_ev_timer_stop(raft_loop(), &raft->timer);
+ }
+ if (raft->state != RAFT_STATE_FOLLOWER) {
+ if (raft->state == RAFT_STATE_LEADER) {
+ if (!demote) {
+ /*
+ * Remain leader until someone
+ * triggers new elections.
+ */
+ return;
+ }
+ raft->leader = 0;
}
+ raft->state = RAFT_STATE_FOLLOWER;
+ /* State is visible and changed - broadcast. */
+ raft_schedule_broadcast(raft);
}
}
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index a5f7e08d9..69dec63c6 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -327,6 +327,19 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
void
raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
+/**
+ * Make the instance a candidate.
+ */
+void
+raft_start_candidate(struct raft *raft);
+
+/**
+ * Make the instance stop taking part in new elections.
+ * @param demote whether to stop being a leader immediately or not.
+ */
+void
+raft_stop_candidate(struct raft *raft, bool demote);
+
/** Configure Raft leader election timeout. */
void
raft_cfg_election_timeout(struct raft *raft, double timeout);
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 0306cefcd..575886932 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1296,15 +1296,43 @@ raft_test_term_filter(void)
ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
"outdated");
-
raft_node_destroy(&node);
raft_finish_test();
}
+static void
+raft_test_start_stop_candidate(void)
+{
+ raft_start_test(4);
+ struct raft_node node;
+ raft_node_create(&node);
+
+ raft_node_cfg_is_candidate(&node, false);
+ raft_node_cfg_election_quorum(&node, 1);
+
+ raft_start_candidate(&node.raft);
+ raft_run_next_event();
+ is(node.raft.state, RAFT_STATE_LEADER, "became leader after "
+ "start_candidate");
+ raft_stop_candidate(&node.raft, false);
+ raft_run_for(node.cfg_death_timeout);
+ is(node.raft.state, RAFT_STATE_LEADER, "remain leader after "
+ "stop_candidate");
+
+ is(raft_node_send_vote_request(&node,
+ 3 /* Term. */,
+ "{}" /* Vclock. */,
+ 2 /* Source. */
+ ), 0, "vote request from 2");
+ is(node.raft.state, RAFT_STATE_FOLLOWER, "demote once new election "
+ "starts");
+ raft_finish_test();
+}
+
static int
main_f(va_list ap)
{
- raft_start_test(14);
+ raft_start_test(15);
(void) ap;
fakeev_init();
@@ -1323,6 +1351,7 @@ main_f(va_list ap)
raft_test_enable_disable();
raft_test_too_long_wal_write();
raft_test_term_filter();
+ raft_test_start_stop_candidate();
fakeev_free();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index ecb962e42..bb799936b 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..14
+1..15
*** raft_test_leader_election ***
1..24
ok 1 - 1 pending message at start
@@ -233,4 +233,12 @@ ok 13 - subtests
ok 9 - node doesn't become outdated
ok 14 - subtests
*** raft_test_term_filter: done ***
+ *** raft_test_start_stop_candidate ***
+ 1..4
+ ok 1 - became leader after start_candidate
+ ok 2 - remain leader after stop_candidate
+ ok 3 - vote request from 2
+ ok 4 - demote once new election starts
+ok 15 - subtests
+ *** raft_test_start_stop_candidate: done ***
*** main_f: done ***
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (8 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 22:24 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 12:47 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 11/12] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
` (5 subsequent siblings)
15 siblings, 2 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
This patch adds support for manual elections from
`box.ctl.clear_synchro_queue()`. When an instance is in
`election_mode='manual'`, calling `clear_synchro_queue()` will make it
start a new election round.
Follow-up #5445
Part of #3055
@TarantoolBot document
Title: describe election_mode='manual'
Manual election mode is introduced. It may be used when the user wants to
control explicitly which instance is the leader instead of relying on
the Raft election algorithm.
When an instance is configured with `election_mode='manual'`, it behaves
as follows:
1) By default, the instance acts like a voter: it is read-only and may
vote for other instances that are candidates.
2) Once `box.ctl.clear_synchro_queue()` is called, the instance becomes a
candidate and starts a new election round. If the instance wins the
elections, it remains leader, but won't participate in any new elections.
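For reference, the per-mode decision taken by this patch can be summarised as a pure function. This is a sketch only: the `toy_*` names and the three-way result are hypothetical simplifications, and in manual mode the real code goes on to clear the queue after winning the elections.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the mode switch; not Tarantool code. */
enum toy_mode {
	TOY_MODE_OFF,
	TOY_MODE_VOTER,
	TOY_MODE_MANUAL,
	TOY_MODE_CANDIDATE,
};

enum toy_action {
	TOY_REJECT,		/* Return an "unsupported" error. */
	TOY_CLEAR_QUEUE,	/* Proceed without running elections. */
	TOY_RUN_ELECTIONS,	/* Start a new election round first. */
};

static enum toy_action
toy_promote_action(enum toy_mode mode, bool is_leader)
{
	switch (mode) {
	case TOY_MODE_OFF:
		return TOY_CLEAR_QUEUE;
	case TOY_MODE_VOTER:
		return TOY_REJECT;
	case TOY_MODE_MANUAL:
		return TOY_RUN_ELECTIONS;
	case TOY_MODE_CANDIDATE:
		/* Allowed only for an already elected leader. */
		return is_leader ? TOY_CLEAR_QUEUE : TOY_REJECT;
	}
	return TOY_REJECT;
}
```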
---
src/box/box.cc | 71 ++++++++++++++++++++++++++++++++++++++++---
src/box/errcode.h | 2 ++
src/box/raft.c | 30 +++++++++++++++---
src/box/raft.h | 3 ++
test/box/error.result | 2 ++
5 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index d5a55a30a..fcd812c09 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1521,12 +1521,75 @@ box_clear_synchro_queue(bool try_wait)
if (!is_box_configured ||
raft_node_term(box_raft(), instance_id) == box_raft()->term)
return 0;
+
+ bool run_elections = false;
+
+ switch (box_election_mode) {
+ case ELECTION_MODE_OFF:
+ break;
+ case ELECTION_MODE_VOTER:
+ assert(box_raft()->state == RAFT_STATE_FOLLOWER);
+ diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
+ "manual elections");
+ return -1;
+ case ELECTION_MODE_MANUAL:
+ assert(box_raft()->state == RAFT_STATE_FOLLOWER);
+ run_elections = true;
+ try_wait = false;
+ break;
+ case ELECTION_MODE_CANDIDATE:
+ /*
+ * Leader elections are enabled, and this instance is allowed to
+ * promote only if it's already an elected leader. No manual
+ * elections.
+ */
+ if (box_raft()->state != RAFT_STATE_LEADER) {
+ diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
+ "'candidate'", "manual elections");
+ return -1;
+ }
+ break;
+ default:
+ unreachable();
+ }
+
uint32_t former_leader_id = txn_limbo.owner_id;
int64_t wait_lsn = txn_limbo.confirmed_lsn;
int rc = 0;
int quorum = replication_synchro_quorum;
in_clear_synchro_queue = true;
+ if (run_elections) {
+ /*
+ * Make this instance a candidate and run until some leader, not
+ * necessarily this instance, emerges.
+ */
+ raft_start_candidate(box_raft());
+ /*
+ * Trigger new elections without waiting for an old leader to
+ * disappear.
+ */
+ raft_new_term(box_raft());
+ box_raft_wait_leader_found();
+ /*
+ * Do not reset raft mode if it was changed while running the
+ * elections.
+ */
+ if (box_election_mode == ELECTION_MODE_MANUAL)
+ raft_stop_candidate(box_raft(), false);
+ if (!box_raft()->is_enabled) {
+ diag_set(ClientError, ER_RAFT_DISABLED);
+ in_clear_synchro_queue = false;
+ return -1;
+ }
+ if (box_raft()->state != RAFT_STATE_LEADER) {
+ diag_set(ClientError, ER_INTERFERING_PROMOTE,
+ box_raft()->leader);
+ in_clear_synchro_queue = false;
+ return -1;
+ }
+ }
+
if (txn_limbo_is_empty(&txn_limbo))
goto promote;
@@ -1544,12 +1607,10 @@ box_clear_synchro_queue(bool try_wait)
* transactions. Exit in case someone did that for us.
*/
if (former_leader_id != txn_limbo.owner_id) {
- /*
- * TODO: error once we see someone else has become the
- * leader already.
- */
+ diag_set(ClientError, ER_INTERFERING_PROMOTE,
+ txn_limbo.owner_id);
in_clear_synchro_queue = false;
- return 0;
+ return -1;
}
}
diff --git a/src/box/errcode.h b/src/box/errcode.h
index c63191fb6..d93820e96 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -275,6 +275,8 @@ struct errcode_record {
/*220 */_(ER_TOO_EARLY_SUBSCRIBE, "Can't subscribe non-anonymous replica %s until join is done") \
/*221 */_(ER_SQL_CANT_ADD_AUTOINC, "Can't add AUTOINCREMENT: space %s can't feature more than one AUTOINCREMENT field") \
/*222 */_(ER_QUORUM_WAIT, "Couldn't wait for quorum %d: %s") \
+ /*223 */_(ER_INTERFERING_PROMOTE, "Instance with replica id %u was promoted first") \
+ /*224 */_(ER_RAFT_DISABLED, "Elections were turned off while running box.ctl.promote()")\
/*
* !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/raft.c b/src/box/raft.c
index 285dbe4fd..425353207 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -91,11 +91,11 @@ box_raft_update_synchro_queue(struct raft *raft)
* If the node became a leader, it means it will ignore all records from
* all the other nodes, and won't get late CONFIRM messages anyway. Can
* clear the queue without waiting for confirmations.
- * It's alright that the user may have called clear_synchro_queue
- * manually. In this case the call below will exit immediately and we'll
- * simply log a warning.
+ * In case these are manual elections, we are already in the middle of a
+ * `clear_synchro_queue` call. No need to call it once again.
*/
- if (raft->state == RAFT_STATE_LEADER) {
+ if (raft->state == RAFT_STATE_LEADER &&
+ box_election_mode != ELECTION_MODE_MANUAL) {
int rc = 0;
uint32_t errcode = 0;
do {
@@ -336,6 +336,28 @@ fail:
panic("Could not write a raft request to WAL\n");
}
+static int
+box_raft_wait_leader_found_f(struct trigger *trig, void *event)
+{
+ struct raft *raft = event;
+ assert(raft == box_raft());
+ struct fiber *waiter = trig->data;
+ if (raft->leader != REPLICA_ID_NIL || !raft->is_enabled)
+ fiber_wakeup(waiter);
+ return 0;
+}
+
+void
+box_raft_wait_leader_found(void)
+{
+ struct trigger trig;
+ trigger_create(&trig, box_raft_wait_leader_found_f, fiber(), NULL);
+ raft_on_update(box_raft(), &trig);
+ fiber_yield();
+ assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
+ trigger_clear(&trig);
+}
+
void
box_raft_init(void)
{
diff --git a/src/box/raft.h b/src/box/raft.h
index 15f4e80d9..8fce423e1 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -97,6 +97,9 @@ box_raft_checkpoint_remote(struct raft_request *req);
int
box_raft_process(struct raft_request *req, uint32_t source);
+void
+box_raft_wait_leader_found(void);
+
void
box_raft_init(void);
diff --git a/test/box/error.result b/test/box/error.result
index 7761c6949..cc8cbaaa9 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -441,6 +441,8 @@ t;
| 220: box.error.TOO_EARLY_SUBSCRIBE
| 221: box.error.SQL_CANT_ADD_AUTOINC
| 222: box.error.QUORUM_WAIT
+ | 223: box.error.INTERFERING_PROMOTE
+ | 224: box.error.RAFT_DISABLED
| ...
test_run:cmd("setopt delimiter ''");
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 11/12] box: remove parameter from clear_synchro_queue
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (9 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue() Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
` (4 subsequent siblings)
15 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
The `try_wait` parameter became redundant with the introduction of the
manual elections concept. Whether the node should wait for pending
confirmations can be determined from the election mode, so remove the
parameter.
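Put differently, after this patch the former argument is a function of the election mode alone. A sketch of that mapping follows; the `toy_*` names are hypothetical and the real code keeps this as a local flag inside box_clear_synchro_queue().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for enum election_mode. */
enum toy_mode {
	TOY_MODE_OFF,
	TOY_MODE_VOTER,
	TOY_MODE_MANUAL,
	TOY_MODE_CANDIDATE,
};

/*
 * Wait for pending confirmations only when elections are off.
 * With elections on, the caller is (or is about to become) the
 * elected leader and does not wait.
 */
static bool
toy_try_wait(enum toy_mode mode)
{
	return mode == TOY_MODE_OFF;
}
```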
Part of #3055
---
src/box/box.cc | 5 +++--
src/box/box.h | 2 +-
src/box/lua/ctl.c | 2 +-
src/box/raft.c | 5 +----
4 files changed, 6 insertions(+), 8 deletions(-)
diff --git a/src/box/box.cc b/src/box/box.cc
index fcd812c09..be7234302 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1504,7 +1504,7 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
}
int
-box_clear_synchro_queue(bool try_wait)
+box_clear_synchro_queue(void)
{
/* A guard to block multiple simultaneous function invocations. */
static bool in_clear_synchro_queue = false;
@@ -1523,9 +1523,11 @@ box_clear_synchro_queue(bool try_wait)
return 0;
bool run_elections = false;
+ bool try_wait = false;
switch (box_election_mode) {
case ELECTION_MODE_OFF:
+ try_wait = true;
break;
case ELECTION_MODE_VOTER:
assert(box_raft()->state == RAFT_STATE_FOLLOWER);
@@ -1535,7 +1537,6 @@ box_clear_synchro_queue(bool try_wait)
case ELECTION_MODE_MANUAL:
assert(box_raft()->state == RAFT_STATE_FOLLOWER);
run_elections = true;
- try_wait = false;
break;
case ELECTION_MODE_CANDIDATE:
/*
diff --git a/src/box/box.h b/src/box/box.h
index e2321b9b0..90facd189 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
typedef struct tuple box_tuple_t;
int
-box_clear_synchro_queue(bool try_wait);
+box_clear_synchro_queue(void);
/* box_select is private and used only by FFI */
API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index d039a059f..5b8d0d0e4 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -84,7 +84,7 @@ lbox_ctl_on_schema_init(struct lua_State *L)
static int
lbox_ctl_clear_synchro_queue(struct lua_State *L)
{
- if (box_clear_synchro_queue(true) != 0)
+ if (box_clear_synchro_queue() != 0)
return luaT_error(L);
return 0;
}
diff --git a/src/box/raft.c b/src/box/raft.c
index 425353207..9a67a7cb0 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -88,9 +88,6 @@ box_raft_update_synchro_queue(struct raft *raft)
{
assert(raft == box_raft());
/*
- * If the node became a leader, it means it will ignore all records from
- * all the other nodes, and won't get late CONFIRM messages anyway. Can
- * clear the queue without waiting for confirmations.
* In case these are manual elections, we are already in the middle of a
* `clear_synchro_queue` call. No need to call it once again.
*/
@@ -99,7 +96,7 @@ box_raft_update_synchro_queue(struct raft *raft)
int rc = 0;
uint32_t errcode = 0;
do {
- rc = box_clear_synchro_queue(false);
+ rc = box_clear_synchro_queue();
if (rc != 0) {
struct error *err = diag_last_error(diag_get());
errcode = box_error_code(err);
--
2.24.3 (Apple Git-128)
* [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (10 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 11/12] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
@ 2021-04-16 16:25 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 12:00 ` [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start Serge Petrenko via Tarantool-patches
` (3 subsequent siblings)
15 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-16 16:25 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
The new function name is `box.ctl.promote()`. It's much shorter and
better reflects the function's now extended functionality.
The old name `box.ctl.clear_synchro_queue()` remains in Lua for the sake of
backward compatibility.
Follow-up #5445
Closes #3055
@TarantoolBot document
Title: deprecate `box.ctl.clear_synchro_queue()` in favor of `box.ctl.promote()`
Replace all mentions of `box.ctl.clear_synchro_queue()` with
`box.ctl.promote()` and add a note that `box.ctl.clear_synchro_queue()`
is a deprecated alias for `box.ctl.promote()`.
---
changelogs/unreleased/box-ctl-promote.md | 8 ++
src/box/box.cc | 20 ++--
src/box/box.h | 2 +-
src/box/lua/ctl.c | 8 +-
src/box/raft.c | 4 +-
.../gh-3055-election-promote.result | 105 ++++++++++++++++++
.../gh-3055-election-promote.test.lua | 43 +++++++
test/replication/suite.cfg | 1 +
8 files changed, 175 insertions(+), 16 deletions(-)
create mode 100644 changelogs/unreleased/box-ctl-promote.md
create mode 100644 test/replication/gh-3055-election-promote.result
create mode 100644 test/replication/gh-3055-election-promote.test.lua
diff --git a/changelogs/unreleased/box-ctl-promote.md b/changelogs/unreleased/box-ctl-promote.md
new file mode 100644
index 000000000..15f6fb206
--- /dev/null
+++ b/changelogs/unreleased/box-ctl-promote.md
@@ -0,0 +1,8 @@
+## feature/replication
+
+* Introduce `box.ctl.promote()` and the concept of manual elections (enabled
+ with `election_mode='manual'`). Once the instance is in `manual` election
+ mode, it acts like a `voter` most of the time, but may trigger elections and
+ become a leader, once `box.ctl.promote()` is called.
+ When `election_mode ~= 'manual'`, `box.ctl.promote()` replaces
+ `box.ctl.clear_synchro_queue()`, which is now deprecated (gh-3055).
diff --git a/src/box/box.cc b/src/box/box.cc
index be7234302..70cb2bd53 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1504,12 +1504,12 @@ box_wait_quorum(uint32_t lead_id, int64_t target_lsn, int quorum,
}
int
-box_clear_synchro_queue(void)
+box_promote(void)
{
/* A guard to block multiple simultaneous function invocations. */
- static bool in_clear_synchro_queue = false;
- if (in_clear_synchro_queue) {
- diag_set(ClientError, ER_UNSUPPORTED, "clear_synchro_queue",
+ static bool in_promote = false;
+ if (in_promote) {
+ diag_set(ClientError, ER_UNSUPPORTED, "box.ctl.promote",
"simultaneous invocations");
return -1;
}
@@ -1558,7 +1558,7 @@ box_clear_synchro_queue(void)
int64_t wait_lsn = txn_limbo.confirmed_lsn;
int rc = 0;
int quorum = replication_synchro_quorum;
- in_clear_synchro_queue = true;
+ in_promote = true;
if (run_elections) {
/*
@@ -1580,13 +1580,13 @@ box_clear_synchro_queue(void)
raft_stop_candidate(box_raft(), false);
if (!box_raft()->is_enabled) {
diag_set(ClientError, ER_RAFT_DISABLED);
- in_clear_synchro_queue = false;
+ in_promote = false;
return -1;
}
if (box_raft()->state != RAFT_STATE_LEADER) {
diag_set(ClientError, ER_INTERFERING_PROMOTE,
box_raft()->leader);
- in_clear_synchro_queue = false;
+ in_promote = false;
return -1;
}
}
@@ -1610,13 +1610,13 @@ box_clear_synchro_queue(void)
if (former_leader_id != txn_limbo.owner_id) {
diag_set(ClientError, ER_INTERFERING_PROMOTE,
txn_limbo.owner_id);
- in_clear_synchro_queue = false;
+ in_promote = false;
return -1;
}
}
/*
- * clear_synchro_queue() is a no-op on the limbo owner, so all the rows
+ * promote() is a no-op on the limbo owner, so all the rows
* in the limbo must've come through the applier meaning they already
* have an lsn assigned, even if their WAL write hasn't finished yet.
*/
@@ -1651,7 +1651,7 @@ promote:
assert(txn_limbo_is_empty(&txn_limbo));
}
}
- in_clear_synchro_queue = false;
+ in_promote = false;
return rc;
}
diff --git a/src/box/box.h b/src/box/box.h
index 90facd189..04bdd397d 100644
--- a/src/box/box.h
+++ b/src/box/box.h
@@ -274,7 +274,7 @@ extern "C" {
typedef struct tuple box_tuple_t;
int
-box_clear_synchro_queue(void);
+box_promote(void);
/* box_select is private and used only by FFI */
API_EXPORT int
diff --git a/src/box/lua/ctl.c b/src/box/lua/ctl.c
index 5b8d0d0e4..368b9ab60 100644
--- a/src/box/lua/ctl.c
+++ b/src/box/lua/ctl.c
@@ -82,9 +82,9 @@ lbox_ctl_on_schema_init(struct lua_State *L)
}
static int
-lbox_ctl_clear_synchro_queue(struct lua_State *L)
+lbox_ctl_promote(struct lua_State *L)
{
- if (box_clear_synchro_queue() != 0)
+ if (box_promote() != 0)
return luaT_error(L);
return 0;
}
@@ -124,7 +124,9 @@ static const struct luaL_Reg lbox_ctl_lib[] = {
{"wait_rw", lbox_ctl_wait_rw},
{"on_shutdown", lbox_ctl_on_shutdown},
{"on_schema_init", lbox_ctl_on_schema_init},
- {"clear_synchro_queue", lbox_ctl_clear_synchro_queue},
+ {"promote", lbox_ctl_promote},
+ /* An old alias. */
+ {"clear_synchro_queue", lbox_ctl_promote},
{"is_recovery_finished", lbox_ctl_is_recovery_finished},
{"set_on_shutdown_timeout", lbox_ctl_set_on_shutdown_timeout},
{NULL, NULL}
diff --git a/src/box/raft.c b/src/box/raft.c
index 9a67a7cb0..6e9770072 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -89,14 +89,14 @@ box_raft_update_synchro_queue(struct raft *raft)
assert(raft == box_raft());
/*
* In case these are manual elections, we are already in the middle of a
- * `clear_synchro_queue` call. No need to call it once again.
+ * `promote` call. No need to call it once again.
*/
if (raft->state == RAFT_STATE_LEADER &&
box_election_mode != ELECTION_MODE_MANUAL) {
int rc = 0;
uint32_t errcode = 0;
do {
- rc = box_clear_synchro_queue();
+ rc = box_promote();
if (rc != 0) {
struct error *err = diag_last_error(diag_get());
errcode = box_error_code(err);
diff --git a/test/replication/gh-3055-election-promote.result b/test/replication/gh-3055-election-promote.result
new file mode 100644
index 000000000..6f5af13bc
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.result
@@ -0,0 +1,105 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+ | ---
+ | ...
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+ | ---
+ | ...
+test_run:wait_fullmesh(SERVERS)
+ | ---
+ | ...
+
+cfg_set_manual =\
+ "box.cfg{election_mode='manual'} "..\
+ "assert(box.info.election.state == 'follower') "..\
+ "assert(box.info.ro)"
+ | ---
+ | ...
+
+for _, server in pairs(SERVERS) do\
+ ok, res = test_run:eval(server, cfg_set_manual)\
+ assert(ok)\
+end
+ | ---
+ | ...
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+ | ---
+ | - true
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+ | ---
+ | - true
+ | ...
+term = box.info.election.term
+ | ---
+ | ...
+assert(box.info.election.state == 'follower')
+ | ---
+ | - true
+ | ...
+assert(box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.leader ~= 0)
+ | ---
+ | - true
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+assert(box.info.election.state == 'leader')
+ | ---
+ | - true
+ | ...
+assert(not box.info.ro)
+ | ---
+ | - true
+ | ...
+assert(box.info.election.term > term)
+ | ---
+ | - true
+ | ...
+
+-- Cleanup.
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:drop_cluster(SERVERS)
+ | ---
+ | ...
diff --git a/test/replication/gh-3055-election-promote.test.lua b/test/replication/gh-3055-election-promote.test.lua
new file mode 100644
index 000000000..cbc3ed206
--- /dev/null
+++ b/test/replication/gh-3055-election-promote.test.lua
@@ -0,0 +1,43 @@
+test_run = require('test_run').new()
+
+--
+-- gh-3055 box.ctl.promote(). Call on instance with election_mode='manual'
+-- in order to promote it to leader.
+SERVERS = {'election_replica1', 'election_replica2', 'election_replica3'}
+-- Start in candidate state in order for bootstrap to work.
+test_run:create_cluster(SERVERS, 'replication', {args='2 0.1 candidate'})
+test_run:wait_fullmesh(SERVERS)
+
+cfg_set_manual =\
+ "box.cfg{election_mode='manual'} "..\
+ "assert(box.info.election.state == 'follower') "..\
+ "assert(box.info.ro)"
+
+for _, server in pairs(SERVERS) do\
+ ok, res = test_run:eval(server, cfg_set_manual)\
+ assert(ok)\
+end
+
+-- Promote without living leader.
+test_run:switch('election_replica1')
+assert(box.info.election.state == 'follower')
+term = box.info.election.term
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Test promote when there's a live leader.
+test_run:switch('election_replica2')
+term = box.info.election.term
+assert(box.info.election.state == 'follower')
+assert(box.info.ro)
+assert(box.info.election.leader ~= 0)
+box.ctl.promote()
+assert(box.info.election.state == 'leader')
+assert(not box.info.ro)
+assert(box.info.election.term > term)
+
+-- Cleanup.
+test_run:switch('default')
+test_run:drop_cluster(SERVERS)
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 8b185ce7e..dc39e2f74 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -2,6 +2,7 @@
"anon.test.lua": {},
"anon_register_gap.test.lua": {},
"gh-2991-misc-asserts-on-update.test.lua": {},
+ "gh-3055-election-promote.test.lua": {},
"gh-3111-misc-rebootstrap-from-ro-master.test.lua": {},
"gh-3160-misc-heartbeats-on-master-changes.test.lua": {},
"gh-3247-misc-iproto-sequence-value-not-replicated.test.lua": {},
--
2.24.3 (Apple Git-128)
* Re: [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
@ 2021-04-16 22:12 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:24 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-16 22:12 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Good job on the patch!
See 2 comments below.
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index c96e497c6..2346331c7 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -317,19 +317,21 @@ txn_limbo_write_cb(struct journal_entry *entry)
> }
>
> static void
> -txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn)
> +txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
> + uint64_t term)
> {
> - assert(lsn > 0);
> + assert(lsn >= 0);
>
> struct synchro_request req = {
> .type = type,
> .replica_id = limbo->owner_id,
> .lsn = lsn,
> + .term = term,
> };
>
> /*
> - * This is a synchronous commit so we can
> - * allocate everything on a stack.
> + * This is a synchronous commit so we can allocate everything on a
> + * stack. Note, that promote body includes synchro body.
1. I think this might be discarded now. They have the same encoder
in this version. Up to you.
> */
> char body[XROW_SYNCHRO_BODY_LEN_MAX];
> struct xrow_header row;
> @@ -464,6 +466,37 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
> box_update_ro_summary();
> }
>
> +void
> +txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
> +{
> + limbo->confirmed_lsn = lsn;
> + limbo->is_in_rollback = true;
> + /*
> + * We make sure that promote is only written once everything this
> + * instance has may be confirmed.
> + */
> + struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
> + assert(e == NULL || e->lsn <= lsn);
> + (void) e;
> + txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
> + limbo->is_in_rollback = false;
> +}
> +
> +/**
> + * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and rollback all
> + * entries > @req.lsn.
2. For referencing parameters in doxygen style you need to use
'@a <name>'. So it would be '@a req.lsn'.
> + */
> +static void
> +txn_limbo_read_promote(struct txn_limbo *limbo,
> + const struct synchro_request *req)
> +{
> + txn_limbo_read_confirm(limbo, req->lsn);
> + txn_limbo_read_rollback(limbo, req->lsn + 1);
> + assert(txn_limbo_is_empty(&txn_limbo));
> + limbo->owner_id = req->origin_id;
> + limbo->confirmed_lsn = 0;
> +}
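The PROMOTE semantics quoted above (confirm everything with lsn <= req.lsn, roll back the rest, end with an empty limbo) can be modeled in isolation. The sketch below is only an illustration: `model_entry`, `entry_state` and `model_read_promote` are hypothetical names, not part of the patch, and the real limbo keeps a linked queue rather than an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a limbo entry, not the real txn_limbo_entry. */
enum entry_state { ENTRY_PENDING, ENTRY_CONFIRMED, ENTRY_ROLLED_BACK };

struct model_entry {
	int64_t lsn;
	enum entry_state state;
};

/*
 * Resolve every pending entry against promote_lsn: entries with
 * lsn <= promote_lsn get confirmed, all later ones get rolled back.
 * Returns the number of entries still pending afterwards, which is
 * expected to be 0, like the txn_limbo_is_empty() assertion in the patch.
 */
static size_t
model_read_promote(struct model_entry *queue, size_t count,
		   int64_t promote_lsn)
{
	for (size_t i = 0; i < count; i++) {
		if (queue[i].state != ENTRY_PENDING)
			continue;
		queue[i].state = queue[i].lsn <= promote_lsn ?
				 ENTRY_CONFIRMED : ENTRY_ROLLED_BACK;
	}
	size_t pending = 0;
	for (size_t i = 0; i < count; i++) {
		if (queue[i].state == ENTRY_PENDING)
			pending++;
	}
	return pending;
}
```

With pending entries at LSNs 10, 11 and 12 and a promote LSN of 11, the first two end up confirmed and the last one rolled back.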
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
@ 2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:49 ` Serge Petrenko via Tarantool-patches
2021-04-18 15:44 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 16:27 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 20:29 ` Serge Petrenko via Tarantool-patches
2 siblings, 2 replies; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-16 22:21 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
I appreciate the work you did here!
> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
> index e447f6634..a5f7e08d9 100644
> --- a/src/lib/raft/raft.h
> +++ b/src/lib/raft/raft.h
> @@ -207,6 +207,19 @@ struct raft {
> * subsystems, such as Raft.
> */
> const struct vclock *vclock;
> + /**
> + * The biggest term seen by this instance and persisted in WAL as part
> + * of a PROMOTE request. May be smaller than @a term, while there are
> + * ongoing elections, or the leader is already known, but this instance
> + * hasn't read its PROMOTE request yet.
> + * During other times must be equal to @a term.
> + */
> + uint64_t greatest_term;
> + /**
> + * Latest terms received with PROMOTE entries from remote instances.
> + * Raft uses them to determine data from which sources may be applied.
> + */
> + struct vclock term_map;
I am sorry for not noticing this first time, but I realized the
names are still not perfect - they give an impression the terms are
collected on any term bump. But they are only for promotions. So
they should probably be greatest_promote_term, and promote_term_map.
Another issue I see after that rename - they depend on something not
related to raft. Raft does not write PROMOTEs. You can see that these
2 members are not used in raft code at all. Only in the limbo and
box. On the other hand, they don't remove terms dependency from the
limbo, because they are part of PROMOTE, which is part of the limbo.
That means, we introduced an explicit dependency on raft in the
limbo just to store some numbers in struct raft.
Maybe move these 2 members to the limbo? They have nothing to do with
the leader election as we can see, and our lib/raft is only about that.
They are for filtering once the leader is elected already, which is
synchronous replication's job, and which in turn is the limbo.
This also makes us closer to the idea I mentioned about lsn map
and promote term map merged into something new inside of the limbo.
I tried to deal with that idea myself, and it resulted in a commit
I pushed on top of your branch, and pasted below.
I made it so the limbo does not depend on raft anymore (on its API). It
only uses term numbers. Box is the link between raft and limbo - it
passes the raft terms to the new promote entries in box.ctl.promote().
If you agree, please squash. Otherwise let's discuss. I didn't delete
the unit test about this new map yet, only commented it out. You would
need to drop it if you squash.
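The bookkeeping proposed here (a per-replica map of PROMOTE terms plus the greatest term seen) can be exercised without Raft or the real limbo. The following is a minimal sketch under assumptions: `model_limbo` and the `model_*` helpers are hypothetical, and a plain array stands in for `struct vclock`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bound; the real code is limited by the vclock size. */
enum { MODEL_REPLICA_MAX = 32 };

struct model_limbo {
	/* Latest PROMOTE term seen from each replica id. */
	uint64_t promote_term_map[MODEL_REPLICA_MAX];
	/* Greatest PROMOTE term seen from any replica. */
	uint64_t promote_greatest_term;
};

static uint64_t
model_replica_term(const struct model_limbo *limbo, uint32_t replica_id)
{
	assert(replica_id < MODEL_REPLICA_MAX);
	return limbo->promote_term_map[replica_id];
}

/* Remember a PROMOTE term; terms never decrease. */
static void
model_process_term(struct model_limbo *limbo, uint32_t origin, uint64_t term)
{
	if (model_replica_term(limbo, origin) >= term)
		return;
	limbo->promote_term_map[origin] = term;
	if (term > limbo->promote_greatest_term)
		limbo->promote_greatest_term = term;
}

/* A replica is outdated once someone else promoted in a newer term. */
static bool
model_is_replica_outdated(const struct model_limbo *limbo,
			  uint32_t replica_id)
{
	return model_replica_term(limbo, replica_id) <
	       limbo->promote_greatest_term;
}
```

The same scenarios as in the commented-out raft unit test apply to this model: a zeroed structure reports nobody as outdated, and a stale term (99 after 100) is ignored.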
====================
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 61d53fdec..b0e8fbba7 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -967,6 +967,59 @@ apply_final_join_tx(struct stailq *rows)
return rc;
}
+/*
+ * When elections are enabled we must filter out synchronous rows coming
+ * from an instance that fell behind the current leader. This includes
+ * both synchronous tx rows and rows for txs following unconfirmed
+ * synchronous transactions.
+ * The rows are replaced with NOPs to preserve the vclock consistency.
+ */
+static void
+applier_synchro_filter_tx(struct applier *applier, struct stailq *rows)
+{
+ /*
+ * XXX: in case raft is disabled, synchronous replication still works
+ * but without any filtering. That might lead to issues with
+ * unpredictable confirms after rollbacks which are supposed to be
+ * fixed by the filtering.
+ */
+ if (!raft_is_enabled(box_raft()))
+ return;
+ if (!txn_limbo_is_replica_outdated(&txn_limbo, applier->instance_id))
+ return;
+
+ struct xrow_header *row;
+ row = &stailq_last_entry(rows, struct applier_tx_row, next)->row;
+ if (row->wait_sync)
+ goto nopify;
+
+ row = &stailq_first_entry(rows, struct applier_tx_row, next)->row;
+ /*
+ * Not waiting for sync and not a synchro request - this makes it either
+ * a NOP or an asynchronous transaction not depending on any synchronous
+ * ones - let it go as is.
+ */
+ if (!iproto_type_is_synchro_request(row->type))
+ return;
+ /*
+ * Do not NOPify promotion, otherwise we won't even know who the limbo
+ * owner is now.
+ */
+ if (iproto_type_is_promote_request(row->type))
+ return;
+nopify:;
+ struct applier_tx_row *item;
+ stailq_foreach_entry(item, rows, next) {
+ row = &item->row;
+ row->type = IPROTO_NOP;
+ /*
+ * Row body is saved to fiber's region and will be freed
+ * on next fiber_gc() call.
+ */
+ row->bodycnt = 0;
+ }
+}
+
/**
* Apply all rows in the rows queue as a single transaction.
*
@@ -1026,29 +1079,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
}
}
}
-
- /*
- * When elections are enabled we must filter out synchronous rows coming
- * from an instance that fell behind the current leader. This includes
- * both synchronous tx rows and rows for txs following unconfirmed
- * synchronous transactions.
- * The rows are replaced with NOPs to preserve the vclock consistency.
- */
- struct applier_tx_row *item;
- if (raft_is_node_outdated(box_raft(), applier->instance_id) &&
- (last_row->wait_sync ||
- (iproto_type_is_synchro_request(first_row->type) &&
- !iproto_type_is_promote_request(first_row->type)))) {
- stailq_foreach_entry(item, rows, next) {
- struct xrow_header *row = &item->row;
- row->type = IPROTO_NOP;
- /*
- * Row body is saved to fiber's region and will be freed
- * on next fiber_gc() call.
- */
- row->bodycnt = 0;
- }
- }
+ applier_synchro_filter_tx(applier, rows);
if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
/*
* Synchro messages are not transactions, in terms
diff --git a/src/box/box.cc b/src/box/box.cc
index 70cb2bd53..cc68f0168 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1516,10 +1516,12 @@ box_promote(void)
/*
* Do nothing when box isn't configured and when PROMOTE was already
- * written for this term.
+ * written for this term (synchronous replication and leader election
+ * are in sync, and both chose this node as a leader).
*/
- if (!is_box_configured ||
- raft_node_term(box_raft(), instance_id) == box_raft()->term)
+ if (!is_box_configured)
+ return 0;
+ if (txn_limbo_replica_term(&txn_limbo, instance_id) == box_raft()->term)
return 0;
bool run_elections = false;
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 0726b5a04..bafb47aaa 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -34,7 +34,6 @@
#include "iproto_constants.h"
#include "journal.h"
#include "box.h"
-#include "raft.h"
struct txn_limbo txn_limbo;
@@ -46,6 +45,8 @@ txn_limbo_create(struct txn_limbo *limbo)
limbo->owner_id = REPLICA_ID_NIL;
fiber_cond_create(&limbo->wait_cond);
vclock_create(&limbo->vclock);
+ vclock_create(&limbo->promote_term_map);
+ limbo->promote_greatest_term = 0;
limbo->confirmed_lsn = 0;
limbo->rollback_count = 0;
limbo->is_in_rollback = false;
@@ -644,8 +645,13 @@ complete:
void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
- /* It's ok to process an empty term. It'll just get ignored. */
- raft_process_term(box_raft(), req->origin_id, req->term);
+ uint64_t term = req->term;
+ uint32_t origin = req->origin_id;
+ if (txn_limbo_replica_term(limbo, origin) < term) {
+ vclock_follow(&limbo->promote_term_map, origin, term);
+ if (term > limbo->promote_greatest_term)
+ limbo->promote_greatest_term = term;
+ }
if (req->replica_id != limbo->owner_id) {
/*
* Ignore CONFIRM/ROLLBACK messages for a foreign master.
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index f35771dc9..e409ac657 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -129,6 +129,24 @@ struct txn_limbo {
* transactions, created on the limbo's owner node.
*/
struct vclock vclock;
+ /**
+ * Latest terms received with PROMOTE entries from remote instances.
+ * Limbo uses them to filter out the transactions coming not from the
+ * limbo owner, but so outdated that they are rolled back everywhere
+ * except outdated nodes.
+ */
+ struct vclock promote_term_map;
+ /**
+ * The biggest PROMOTE term seen by the instance and persisted in WAL.
+ * It is related to raft term, but not the same. Synchronous replication
+ * represented by the limbo is interested only in the won elections
+ * ended with PROMOTE request.
+ * It means the limbo's term might be smaller than the raft term, while
+ * there are ongoing elections, or the leader is already known and this
+ * instance hasn't read its PROMOTE request yet. During other times the
+ * limbo and raft are in sync and the terms are the same.
+ */
+ uint64_t promote_greatest_term;
/**
* Maximal LSN gathered quorum and either already confirmed in WAL, or
* whose confirmation is in progress right now. Any attempt to confirm
@@ -193,6 +211,28 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
in_queue);
}
+/**
+ * Return the latest term as seen in PROMOTE requests from instance with id
+ * @a replica_id.
+ */
+static inline uint64_t
+txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
+{
+ return vclock_get(&limbo->promote_term_map, replica_id);
+}
+
+/**
+ * Check whether replica with id @a replica_id is too old to apply synchronous
+ * data from it. The check is only valid when elections are enabled.
+ */
+static inline bool
+txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
+ uint32_t replica_id)
+{
+ return txn_limbo_replica_term(limbo, replica_id) <
+ limbo->promote_greatest_term;
+}
+
/**
* Return the last synchronous transaction in the limbo or NULL when it is
* empty.
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index b21693642..874e9157e 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -1012,7 +1012,6 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
.death_timeout = 5,
.vtab = vtab,
};
- vclock_create(&raft->term_map);
raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
0, 0);
raft->timer.data = raft;
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index 69dec63c6..f7bc205d2 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -207,19 +207,6 @@ struct raft {
* subsystems, such as Raft.
*/
const struct vclock *vclock;
- /**
- * The biggest term seen by this instance and persisted in WAL as part
- * of a PROMOTE request. May be smaller than @a term, while there are
- * ongoing elections, or the leader is already known, but this instance
- * hasn't read its PROMOTE request yet.
- * During other times must be equal to @a term.
- */
- uint64_t greatest_term;
- /**
- * Latest terms received with PROMOTE entries from remote instances.
- * Raft uses them to determine data from which sources may be applied.
- */
- struct vclock term_map;
/** State machine timed event trigger. */
struct ev_timer timer;
/** Configured election timeout in seconds. */
@@ -256,39 +243,6 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
return !raft->is_enabled || raft->leader == source_id;
}
-/**
- * Return the latest term as seen in PROMOTE requests from instance with id
- * @a source_id.
- */
-static inline uint64_t
-raft_node_term(const struct raft *raft, uint32_t source_id)
-{
- assert(source_id < VCLOCK_MAX);
- return vclock_get(&raft->term_map, source_id);
-}
-
-/**
- * Check whether replica with id @a source_id is too old to apply synchronous
- * data from it. The check is only valid when elections are enabled.
- */
-static inline bool
-raft_is_node_outdated(const struct raft *raft, uint32_t source_id)
-{
- uint64_t source_term = raft_node_term(raft, source_id);
- return raft->is_enabled && source_term < raft->greatest_term;
-}
-
-/** Remember the last term seen for replica with id @a source_id. */
-static inline void
-raft_process_term(struct raft *raft, uint32_t source_id, uint64_t term)
-{
- if (raft_node_term(raft, source_id) >= term)
- return;
- vclock_follow(&raft->term_map, source_id, term);
- if (term > raft->greatest_term)
- raft->greatest_term = term;
-}
-
/** Check if Raft is enabled. */
static inline bool
raft_is_enabled(const struct raft *raft)
diff --git a/test/unit/raft.c b/test/unit/raft.c
index 575886932..4214dbc4c 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1267,38 +1267,38 @@ raft_test_too_long_wal_write(void)
raft_finish_test();
}
-static void
-raft_test_term_filter(void)
-{
- raft_start_test(9);
- struct raft_node node;
- raft_node_create(&node);
-
- is(raft_node_term(&node.raft, 1), 0, "empty node term");
- ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
-
- raft_process_term(&node.raft, 1, 1);
- is(raft_node_term(&node.raft, 1), 1, "node term updated");
- ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
-
- raft_process_term(&node.raft, 2, 100);
- ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
- "have greater term");
- ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
- "isn't outdated");
-
- raft_process_term(&node.raft, 3, 100);
- ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
- "others have the same term");
-
- raft_process_term(&node.raft, 3, 99);
- is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
- ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
- "outdated");
-
- raft_node_destroy(&node);
- raft_finish_test();
-}
+// static void
+// raft_test_term_filter(void)
+// {
+// raft_start_test(9);
+// struct raft_node node;
+// raft_node_create(&node);
+
+// is(raft_node_term(&node.raft, 1), 0, "empty node term");
+// ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
+
+// raft_process_term(&node.raft, 1, 1);
+// is(raft_node_term(&node.raft, 1), 1, "node term updated");
+// ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
+
+// raft_process_term(&node.raft, 2, 100);
+// ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
+// "have greater term");
+// ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
+// "isn't outdated");
+
+// raft_process_term(&node.raft, 3, 100);
+// ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
+// "others have the same term");
+
+// raft_process_term(&node.raft, 3, 99);
+// is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
+// ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
+// "outdated");
+
+// raft_node_destroy(&node);
+// raft_finish_test();
+// }
static void
raft_test_start_stop_candidate(void)
@@ -1332,7 +1332,7 @@ raft_test_start_stop_candidate(void)
static int
main_f(va_list ap)
{
- raft_start_test(15);
+ raft_start_test(14);
(void) ap;
fakeev_init();
@@ -1350,7 +1350,7 @@ main_f(va_list ap)
raft_test_death_timeout();
raft_test_enable_disable();
raft_test_too_long_wal_write();
- raft_test_term_filter();
+ //raft_test_term_filter();
raft_test_start_stop_candidate();
fakeev_free();
diff --git a/test/unit/raft.result b/test/unit/raft.result
index bb799936b..f9a8f249b 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -1,5 +1,5 @@
*** main_f ***
-1..15
+1..14
*** raft_test_leader_election ***
1..24
ok 1 - 1 pending message at start
@@ -220,25 +220,12 @@ ok 12 - subtests
ok 8 - became candidate
ok 13 - subtests
*** raft_test_too_long_wal_write: done ***
- *** raft_test_term_filter ***
- 1..9
- ok 1 - empty node term
- ok 2 - not outdated initially
- ok 3 - node term updated
- ok 4 - other nodes are outdated
- ok 5 - node outdated when others have greater term
- ok 6 - node with greatest term isn't outdated
- ok 7 - node not outdated when others have the same term
- ok 8 - node term isn't decreased
- ok 9 - node doesn't become outdated
-ok 14 - subtests
- *** raft_test_term_filter: done ***
*** raft_test_start_stop_candidate ***
1..4
ok 1 - became leader after start_candidate
ok 2 - remain leader after stop_candidate
ok 3 - vote request from 2
ok 4 - demote once new election starts
-ok 15 - subtests
+ok 14 - subtests
*** raft_test_start_stop_candidate: done ***
*** main_f: done ***
* Re: [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate Serge Petrenko via Tarantool-patches
@ 2021-04-16 22:23 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:59 ` Serge Petrenko via Tarantool-patches
2021-04-19 12:52 ` Serge Petrenko via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-16 22:23 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for working on this!
See 3 comments below.
> src/lib/raft/raft.c | 83 ++++++++++++++++++++++++++++---------------
> src/lib/raft/raft.h | 13 +++++++
> test/unit/raft.c | 33 +++++++++++++++--
> test/unit/raft.result | 10 +++++-
> 4 files changed, 108 insertions(+), 31 deletions(-)
>
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index e9ce8cade..b21693642 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -848,38 +848,65 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
<...>
> +
> +void
> +raft_stop_candidate(struct raft *raft, bool demote)
1. For flags we usually use 'is', 'do', 'has' and similar prefixes.
> +{
> + if (!raft->is_candidate)
> + return;
> + raft->is_candidate = false;
> + if (raft->state != RAFT_STATE_LEADER) {
> + /* Do not wait for anything while being a voter. */
> + raft_ev_timer_stop(raft_loop(), &raft->timer);
> + }
> + if (raft->state != RAFT_STATE_FOLLOWER) {
> + if (raft->state == RAFT_STATE_LEADER) {
> + if (!demote) {
> + /*
> + * Remain leader until someone
> + * triggers new elections.
> + */
> + return;
> + }
> + raft->leader = 0;
> }
> + raft->state = RAFT_STATE_FOLLOWER;
> + /* State is visible and changed - broadcast. */
> + raft_schedule_broadcast(raft);
> }
> }
>
> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
> index a5f7e08d9..69dec63c6 100644
> --- a/src/lib/raft/raft.h
> +++ b/src/lib/raft/raft.h
> @@ -327,6 +327,19 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
> void
> raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
>
> +/**
> + * Make the instance a candidate.
> + */
> +void
> +raft_start_candidate(struct raft *raft);
> +
> +/**
> + * Make the instance stop taking part in new elections.
> + * @param demote whether to stop being a leader immediately or not.
> + */
> +void
> +raft_stop_candidate(struct raft *raft, bool demote);
2. Double whitespace after 'struct'.
> +
> /** Configure Raft leader election timeout. */
> void
> raft_cfg_election_timeout(struct raft *raft, double timeout);
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index 0306cefcd..575886932 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -1296,15 +1296,43 @@ raft_test_term_filter(void)
> ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
> "outdated");
>
> -
3. This probably should be in the previous commit.
> raft_node_destroy(&node);
> raft_finish_test();
> }
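The state transitions of raft_stop_candidate() quoted above can be checked with a small standalone model. This is a sketch under assumptions: `model_raft` and `model_stop_candidate` are hypothetical names, the election timer is left out, and a counter replaces raft_schedule_broadcast(). The flag is spelled `do_demote`, following comment 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum model_state { MODEL_FOLLOWER, MODEL_CANDIDATE, MODEL_LEADER };

/* Hypothetical reduced view of struct raft. */
struct model_raft {
	enum model_state state;
	bool is_candidate;
	/* 0 means "no known leader". */
	uint32_t leader;
	/* Counts how many state broadcasts were scheduled. */
	int broadcasts;
};

static void
model_stop_candidate(struct model_raft *raft, bool do_demote)
{
	if (!raft->is_candidate)
		return;
	raft->is_candidate = false;
	/* A follower has nothing visible to change. */
	if (raft->state == MODEL_FOLLOWER)
		return;
	if (raft->state == MODEL_LEADER) {
		/* Remain leader until someone triggers new elections. */
		if (!do_demote)
			return;
		raft->leader = 0;
	}
	raft->state = MODEL_FOLLOWER;
	/* State is visible and changed, so it would be broadcast. */
	raft->broadcasts++;
}
```

In this model a leader stopped without demotion keeps its state and leader id, while a candidate always falls back to follower and triggers one broadcast.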
* Re: [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue() Serge Petrenko via Tarantool-patches
@ 2021-04-16 22:24 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 9:26 ` Serge Petrenko via Tarantool-patches
2021-04-19 12:47 ` Serge Petrenko via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-16 22:24 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
See 1 comment below.
> diff --git a/src/box/box.cc b/src/box/box.cc
> index d5a55a30a..fcd812c09 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1521,12 +1521,75 @@ box_clear_synchro_queue(bool try_wait)
> if (!is_box_configured ||
> raft_node_term(box_raft(), instance_id) == box_raft()->term)
> return 0;
> +
> + bool run_elections = false;
> +
> + switch (box_election_mode) {
> + case ELECTION_MODE_OFF:
> + break;
> + case ELECTION_MODE_VOTER:
> + assert(box_raft()->state == RAFT_STATE_FOLLOWER);
> + diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
> + "manual elections");
> + return -1;
> + case ELECTION_MODE_MANUAL:
> + assert(box_raft()->state == RAFT_STATE_FOLLOWER);
> + run_elections = true;
> + try_wait = false;
> + break;
> + case ELECTION_MODE_CANDIDATE:
> + /*
> + * Leader elections are enabled, and this instance is allowed to
> + * promote only if it's already an elected leader. No manual
> + * elections.
> + */
> + if (box_raft()->state != RAFT_STATE_LEADER) {
> + diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
> + "'candidate'", "manual elections");
> + return -1;
> + }
> + break;
> + default:
> + unreachable();
> + }
> +
> uint32_t former_leader_id = txn_limbo.owner_id;
> int64_t wait_lsn = txn_limbo.confirmed_lsn;
> int rc = 0;
> int quorum = replication_synchro_quorum;
> in_clear_synchro_queue = true;
>
> + if (run_elections) {
> + /*
> + * Make this instance a candidate and run until some leader, not
> + * necessarily this instance, emerges.
> + */
> + raft_start_candidate(box_raft());
> + /*
> + * Trigger new elections without waiting for an old leader to
> + * disappear.
> + */
> + raft_new_term(box_raft());
> + box_raft_wait_leader_found();
Shouldn't we wait for election_timeout?
Also what if the fiber is canceled before the leader is found? It
seems box_raft_wait_leader_found() would fail on an assertion because
raft is still enabled, but leader_id is nil.
> + /*
> + * Do not reset raft mode if it was changed while running the
> + * elections.
> + */
> + if (box_election_mode == ELECTION_MODE_MANUAL)
> + raft_stop_candidate(box_raft(), false);
> + if (!box_raft()->is_enabled) {
> + diag_set(ClientError, ER_RAFT_DISABLED);
> + in_clear_synchro_queue = false;
> + return -1;
> + }
> + if (box_raft()->state != RAFT_STATE_LEADER) {
> + diag_set(ClientError, ER_INTERFERING_PROMOTE,
> + box_raft()->leader);
> + in_clear_synchro_queue = false;
> + return -1;
> + }
> + }
> +
> if (txn_limbo_is_empty(&txn_limbo))
> goto promote;
>
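The per-mode decision made by the switch in the quoted hunk can be summarized as a pure function. The sketch below is an illustration only: the `model_*` enums and `model_promote_action` are hypothetical, and it skips the earlier "PROMOTE already written for this term" shortcut as well as the election wait itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the four election modes in the patch. */
enum model_mode { MODE_OFF, MODE_VOTER, MODE_MANUAL, MODE_CANDIDATE };

enum model_action {
	/* Fail with an "unsupported" error. */
	PROMOTE_REJECT,
	/* Go straight to clearing the limbo and writing PROMOTE. */
	PROMOTE_DIRECT,
	/* Start a new term and wait for a leader first. */
	PROMOTE_RUN_ELECTIONS,
};

static enum model_action
model_promote_action(enum model_mode mode, bool is_leader)
{
	switch (mode) {
	case MODE_OFF:
		return PROMOTE_DIRECT;
	case MODE_VOTER:
		return PROMOTE_REJECT;
	case MODE_MANUAL:
		return PROMOTE_RUN_ELECTIONS;
	case MODE_CANDIDATE:
		/* Only an already elected leader may promote. */
		return is_leader ? PROMOTE_DIRECT : PROMOTE_REJECT;
	}
	/* All enum values are handled above. */
	assert(false);
	return PROMOTE_REJECT;
}
```

Only the 'manual' mode leads to a new election round; a 'candidate' instance may promote only when automatic elections have already made it the leader.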
* Re: [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
2021-04-16 22:12 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-18 8:24 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-18 8:24 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
17.04.2021 01:12, Vladislav Shpilevoy wrote:
> Good job on the patch!
>
> See 2 comments below.
>
>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>> index c96e497c6..2346331c7 100644
>> --- a/src/box/txn_limbo.c
>> +++ b/src/box/txn_limbo.c
>> @@ -317,19 +317,21 @@ txn_limbo_write_cb(struct journal_entry *entry)
>> }
>>
>> static void
>> -txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn)
>> +txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
>> + uint64_t term)
>> {
>> - assert(lsn > 0);
>> + assert(lsn >= 0);
>>
>> struct synchro_request req = {
>> .type = type,
>> .replica_id = limbo->owner_id,
>> .lsn = lsn,
>> + .term = term,
>> };
>>
>> /*
>> - * This is a synchronous commit so we can
>> - * allocate everything on a stack.
>> + * This is a synchronous commit so we can allocate everything on a
>> + * stack. Note, that promote body includes synchro body.
> 1. I think this might be discarded now. They have the same encoder
> in this version. Up to you.
Sure, reverted to the original version.
>
>> */
>> char body[XROW_SYNCHRO_BODY_LEN_MAX];
>> struct xrow_header row;
>> @@ -464,6 +466,37 @@ txn_limbo_read_rollback(struct txn_limbo *limbo, int64_t lsn)
>> box_update_ro_summary();
>> }
>>
>> +void
>> +txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
>> +{
>> + limbo->confirmed_lsn = lsn;
>> + limbo->is_in_rollback = true;
>> + /*
>> + * We make sure that promote is only written once everything this
>> + * instance has may be confirmed.
>> + */
>> + struct txn_limbo_entry *e = txn_limbo_last_synchro_entry(limbo);
>> + assert(e == NULL || e->lsn <= lsn);
>> + (void) e;
>> + txn_limbo_write_synchro(limbo, IPROTO_PROMOTE, lsn, term);
>> + limbo->is_in_rollback = false;
>> +}
>> +
>> +/**
>> + * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and rollback all
>> + * entries > @req.lsn.
> 2. For referencing parameters in doxygen style you need to use
> '@a <name>'. So it would be '@a req.lsn'.
Thanks! Fixed. The incremental diff is below.
>
>> + */
>> +static void
>> +txn_limbo_read_promote(struct txn_limbo *limbo,
>> + const struct synchro_request *req)
>> +{
>> + txn_limbo_read_confirm(limbo, req->lsn);
>> + txn_limbo_read_rollback(limbo, req->lsn + 1);
>> + assert(txn_limbo_is_empty(&txn_limbo));
>> + limbo->owner_id = req->origin_id;
>> + limbo->confirmed_lsn = 0;
>> +}
===========================================================
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 2346331c7..0d2d274f6 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -330,8 +330,8 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
};
/*
- * This is a synchronous commit so we can allocate everything on a
- * stack. Note, that promote body includes synchro body.
+ * This is a synchronous commit so we can
+ * allocate everything on a stack.
*/
char body[XROW_SYNCHRO_BODY_LEN_MAX];
struct xrow_header row;
@@ -483,8 +483,8 @@ txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
}
/**
- * Process a PROMOTE request, i.e. confirm all entries <= @req.lsn and rollback all
- * entries > @req.lsn.
+ * Process a PROMOTE request, i.e. confirm all entries <= @a req.lsn and
+ * rollback all entries > @a req.lsn.
*/
static void
txn_limbo_read_promote(struct txn_limbo *limbo,
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-18 8:49 ` Serge Petrenko via Tarantool-patches
2021-04-18 15:44 ` Vladislav Shpilevoy via Tarantool-patches
1 sibling, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-18 8:49 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
17.04.2021 01:21, Vladislav Shpilevoy wrote:
> I appreciate the work you did here!
>
>> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
>> index e447f6634..a5f7e08d9 100644
>> --- a/src/lib/raft/raft.h
>> +++ b/src/lib/raft/raft.h
>> @@ -207,6 +207,19 @@ struct raft {
>> * subsystems, such as Raft.
>> */
>> const struct vclock *vclock;
>> + /**
>> + * The biggest term seen by this instance and persisted in WAL as part
>> + * of a PROMOTE request. May be smaller than @a term, while there are
>> + * ongoing elections, or the leader is already known, but this instance
>> + * hasn't read its PROMOTE request yet.
>> + * During other times must be equal to @a term.
>> + */
>> + uint64_t greatest_term;
>> + /**
>> + * Latest terms received with PROMOTE entries from remote instances.
>> + * Raft uses them to determine data from which sources may be applied.
>> + */
>> + struct vclock term_map;
> I am sorry for not noticing this the first time, but I realized the
> names are still not perfect - they give an impression the terms are
> collected on any term bump. But they are only for promotions. So
> they should probably be greatest_promote_term, and promote_term_map.
>
> Another issue I see after that rename - they depend on something not
> related to raft. Raft does not write PROMOTEs. You can see that these
> 2 members are not used in raft code at all. Only in the limbo and
> box. On the other hand, they don't remove terms dependency from the
> limbo, because they are part of PROMOTE, which is part of the limbo.
>
> That means, we introduced an explicit dependency on raft in the
> limbo just to store some numbers in struct raft.
>
> Maybe move these 2 members to the limbo? They have nothing to do with
> the leader election as we can see, and our lib/raft is only about that.
>
> They are for filtering once the leader is elected already, which is
> synchronous replication's job, and which in turn is the limbo.
>
> This also makes us closer to the idea I mentioned about lsn map
> and promote term map merged into something new inside of the limbo.
>
> I tried to deal with that idea myself, and it resulted in a commit
> I pushed on top of your branch, and pasted below.
>
> I made so the limbo does not depend on raft anymore (on its API). It
> only uses term numbers. Box is the link between raft and limbo - it
> passes the raft terms to the new promote entries in box.ctl.promote().
>
> If you agree, please squash. Otherwise let's discuss. I didn't delete
> the unit test about this new map yet, only commented it out. You would
> need to drop it if you squash.
I see what you mean. It was hard to see that these methods belong to
txn_limbo, at first. But I agree with you now.
Your version looks good. Thanks for the help! Squashed.
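For the record, here is a minimal standalone sketch of the bookkeeping that
moved into the limbo. It is not the squashed code: a plain array stands in
for struct vclock, and struct promote_terms, SKETCH_REPLICA_MAX and the
promote_terms_* names are made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for VCLOCK_MAX; the limbo itself uses a vclock. */
enum { SKETCH_REPLICA_MAX = 32 };

struct promote_terms {
	/* Latest PROMOTE term seen from each replica id. */
	uint64_t term_map[SKETCH_REPLICA_MAX];
	/* Greatest PROMOTE term seen from any replica. */
	uint64_t greatest_term;
};

/* Latest PROMOTE term seen from the given replica, 0 if none. */
static inline uint64_t
promote_terms_get(const struct promote_terms *t, uint32_t replica_id)
{
	assert(replica_id < SKETCH_REPLICA_MAX);
	return t->term_map[replica_id];
}

/* Remember a PROMOTE term for a replica; terms never decrease. */
static inline void
promote_terms_process(struct promote_terms *t, uint32_t origin, uint64_t term)
{
	if (promote_terms_get(t, origin) >= term)
		return;
	t->term_map[origin] = term;
	if (term > t->greatest_term)
		t->greatest_term = term;
}

/* A replica is outdated when someone was promoted in a newer term. */
static inline bool
promote_terms_is_outdated(const struct promote_terms *t, uint32_t replica_id)
{
	return promote_terms_get(t, replica_id) < t->greatest_term;
}
```

With a zero-initialized struct nobody is outdated; once replica 2 is seen
with term 100, replica 1 with term 1 becomes outdated, and a later term 99
from replica 3 (already at 100) is ignored.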
> ====================
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 61d53fdec..b0e8fbba7 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -967,6 +967,59 @@ apply_final_join_tx(struct stailq *rows)
> return rc;
> }
>
> +/*
> + * When elections are enabled we must filter out synchronous rows coming
> + * from an instance that fell behind the current leader. This includes
> + * both synchronous tx rows and rows for txs following unconfirmed
> + * synchronous transactions.
> + * The rows are replaced with NOPs to preserve the vclock consistency.
> + */
> +static void
> +applier_synchro_filter_tx(struct applier *applier, struct stailq *rows)
> +{
> + /*
> + * XXX: in case raft is disabled, synchronous replication still works
> + * but without any filtering. That might lead to issues with
> + * unpredictable confirms after rollbacks which are supposed to be
> + * fixed by the filtering.
> + */
> + if (!raft_is_enabled(box_raft()))
> + return;
> + if (!txn_limbo_is_replica_outdated(&txn_limbo, applier->instance_id))
> + return;
> +
> + struct xrow_header *row;
> + row = &stailq_last_entry(rows, struct applier_tx_row, next)->row;
> + if (row->wait_sync)
> + goto nopify;
> +
> + row = &stailq_first_entry(rows, struct applier_tx_row, next)->row;
> + /*
> + * Not waiting for sync and not a synchro request - this makes it already
> + * NOP or an asynchronous transaction not depending on any synchronous
> + * ones - let it go as is.
> + */
> + if (!iproto_type_is_synchro_request(row->type))
> + return;
> + /*
> + * Do not NOPify promotion, otherwise won't even know who is the limbo
> + * owner now.
> + */
> + if (iproto_type_is_promote_request(row->type))
> + return;
> +nopify:;
> + struct applier_tx_row *item;
> + stailq_foreach_entry(item, rows, next) {
> + row = &item->row;
> + row->type = IPROTO_NOP;
> + /*
> + * Row body is saved to fiber's region and will be freed
> + * on next fiber_gc() call.
> + */
> + row->bodycnt = 0;
> + }
> +}
> +
> /**
> * Apply all rows in the rows queue as a single transaction.
> *
> @@ -1026,29 +1079,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
> }
> }
> }
> -
> - /*
> - * When elections are enabled we must filter out synchronous rows coming
> - * from an instance that fell behind the current leader. This includes
> - * both synchronous tx rows and rows for txs following unconfirmed
> - * synchronous transactions.
> - * The rows are replaced with NOPs to preserve the vclock consistency.
> - */
> - struct applier_tx_row *item;
> - if (raft_is_node_outdated(box_raft(), applier->instance_id) &&
> - (last_row->wait_sync ||
> - (iproto_type_is_synchro_request(first_row->type) &&
> - !iproto_type_is_promote_request(first_row->type)))) {
> - stailq_foreach_entry(item, rows, next) {
> - struct xrow_header *row = &item->row;
> - row->type = IPROTO_NOP;
> - /*
> - * Row body is saved to fiber's region and will be freed
> - * on next fiber_gc() call.
> - */
> - row->bodycnt = 0;
> - }
> - }
> + applier_synchro_filter_tx(applier, rows);
> if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
> /*
> * Synchro messages are not transactions, in terms
> diff --git a/src/box/box.cc b/src/box/box.cc
> index 70cb2bd53..cc68f0168 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1516,10 +1516,12 @@ box_promote(void)
>
> /*
> * Do nothing when box isn't configured and when PROMOTE was already
> - * written for this term.
> + * written for this term (synchronous replication and leader election
> + * are in sync, and both chose this node as a leader).
> */
> - if (!is_box_configured ||
> - raft_node_term(box_raft(), instance_id) == box_raft()->term)
> + if (!is_box_configured)
> + return 0;
> + if (txn_limbo_replica_term(&txn_limbo, instance_id) == box_raft()->term)
> return 0;
>
> bool run_elections = false;
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 0726b5a04..bafb47aaa 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -34,7 +34,6 @@
> #include "iproto_constants.h"
> #include "journal.h"
> #include "box.h"
> -#include "raft.h"
>
> struct txn_limbo txn_limbo;
>
> @@ -46,6 +45,8 @@ txn_limbo_create(struct txn_limbo *limbo)
> limbo->owner_id = REPLICA_ID_NIL;
> fiber_cond_create(&limbo->wait_cond);
> vclock_create(&limbo->vclock);
> + vclock_create(&limbo->promote_term_map);
> + limbo->promote_greatest_term = 0;
> limbo->confirmed_lsn = 0;
> limbo->rollback_count = 0;
> limbo->is_in_rollback = false;
> @@ -644,8 +645,13 @@ complete:
> void
> txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> {
> - /* It's ok to process an empty term. It'll just get ignored. */
> - raft_process_term(box_raft(), req->origin_id, req->term);
> + uint64_t term = req->term;
> + uint32_t origin = req->origin_id;
> + if (txn_limbo_replica_term(limbo, origin) < term) {
> + vclock_follow(&limbo->promote_term_map, origin, term);
> + if (term > limbo->promote_greatest_term)
> + limbo->promote_greatest_term = term;
> + }
> if (req->replica_id != limbo->owner_id) {
> /*
> * Ignore CONFIRM/ROLLBACK messages for a foreign master.
> diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
> index f35771dc9..e409ac657 100644
> --- a/src/box/txn_limbo.h
> +++ b/src/box/txn_limbo.h
> @@ -129,6 +129,24 @@ struct txn_limbo {
> * transactions, created on the limbo's owner node.
> */
> struct vclock vclock;
> + /**
> + * Latest terms received with PROMOTE entries from remote instances.
> + * Limbo uses them to filter out the transactions coming not from the
> + * limbo owner, but so outdated that they are rolled back everywhere
> + * except outdated nodes.
> + */
> + struct vclock promote_term_map;
> + /**
> + * The biggest PROMOTE term seen by the instance and persisted in WAL.
> + * It is related to raft term, but not the same. Synchronous replication
> + * represented by the limbo is interested only in the won elections
> + * ended with PROMOTE request.
> + * It means the limbo's term might be smaller than the raft term, while
> + * there are ongoing elections, or the leader is already known and this
> + * instance hasn't read its PROMOTE request yet. During other times the
> + * limbo and raft are in sync and the terms are the same.
> + */
> + uint64_t promote_greatest_term;
> /**
> * Maximal LSN gathered quorum and either already confirmed in WAL, or
> * whose confirmation is in progress right now. Any attempt to confirm
> @@ -193,6 +211,28 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
> in_queue);
> }
>
> +/**
> + * Return the latest term as seen in PROMOTE requests from instance with id
> + * @a replica_id.
> + */
> +static inline uint64_t
> +txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
> +{
> + return vclock_get(&limbo->promote_term_map, replica_id);
> +}
> +
> +/**
> + * Check whether replica with id @a replica_id is too old to apply synchronous
> + * data from it. The check is only valid when elections are enabled.
> + */
> +static inline bool
> +txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
> + uint32_t replica_id)
> +{
> + return txn_limbo_replica_term(limbo, replica_id) <
> + limbo->promote_greatest_term;
> +}
> +
> /**
> * Return the last synchronous transaction in the limbo or NULL when it is
> * empty.
> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
> index b21693642..874e9157e 100644
> --- a/src/lib/raft/raft.c
> +++ b/src/lib/raft/raft.c
> @@ -1012,7 +1012,6 @@ raft_create(struct raft *raft, const struct raft_vtab *vtab)
> .death_timeout = 5,
> .vtab = vtab,
> };
> - vclock_create(&raft->term_map);
> raft_ev_timer_init(&raft->timer, raft_sm_schedule_new_election_cb,
> 0, 0);
> raft->timer.data = raft;
> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
> index 69dec63c6..f7bc205d2 100644
> --- a/src/lib/raft/raft.h
> +++ b/src/lib/raft/raft.h
> @@ -207,19 +207,6 @@ struct raft {
> * subsystems, such as Raft.
> */
> const struct vclock *vclock;
> - /**
> - * The biggest term seen by this instance and persisted in WAL as part
> - * of a PROMOTE request. May be smaller than @a term, while there are
> - * ongoing elections, or the leader is already known, but this instance
> - * hasn't read its PROMOTE request yet.
> - * During other times must be equal to @a term.
> - */
> - uint64_t greatest_term;
> - /**
> - * Latest terms received with PROMOTE entries from remote instances.
> - * Raft uses them to determine data from which sources may be applied.
> - */
> - struct vclock term_map;
> /** State machine timed event trigger. */
> struct ev_timer timer;
> /** Configured election timeout in seconds. */
> @@ -256,39 +243,6 @@ raft_is_source_allowed(const struct raft *raft, uint32_t source_id)
> return !raft->is_enabled || raft->leader == source_id;
> }
>
> -/**
> - * Return the latest term as seen in PROMOTE requests from instance with id
> - * @a source_id.
> - */
> -static inline uint64_t
> -raft_node_term(const struct raft *raft, uint32_t source_id)
> -{
> - assert(source_id < VCLOCK_MAX);
> - return vclock_get(&raft->term_map, source_id);
> -}
> -
> -/**
> - * Check whether replica with id @a source_id is too old to apply synchronous
> - * data from it. The check is only valid when elections are enabled.
> - */
> -static inline bool
> -raft_is_node_outdated(const struct raft *raft, uint32_t source_id)
> -{
> - uint64_t source_term = raft_node_term(raft, source_id);
> - return raft->is_enabled && source_term < raft->greatest_term;
> -}
> -
> -/** Remember the last term seen for replica with id @a source_id. */
> -static inline void
> -raft_process_term(struct raft *raft, uint32_t source_id, uint64_t term)
> -{
> - if (raft_node_term(raft, source_id) >= term)
> - return;
> - vclock_follow(&raft->term_map, source_id, term);
> - if (term > raft->greatest_term)
> - raft->greatest_term = term;
> -}
> -
> /** Check if Raft is enabled. */
> static inline bool
> raft_is_enabled(const struct raft *raft)
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index 575886932..4214dbc4c 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -1267,38 +1267,38 @@ raft_test_too_long_wal_write(void)
> raft_finish_test();
> }
>
> -static void
> -raft_test_term_filter(void)
> -{
> - raft_start_test(9);
> - struct raft_node node;
> - raft_node_create(&node);
> -
> - is(raft_node_term(&node.raft, 1), 0, "empty node term");
> - ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
> -
> - raft_process_term(&node.raft, 1, 1);
> - is(raft_node_term(&node.raft, 1), 1, "node term updated");
> - ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
> -
> - raft_process_term(&node.raft, 2, 100);
> - ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
> - "have greater term");
> - ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
> - "isn't outdated");
> -
> - raft_process_term(&node.raft, 3, 100);
> - ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
> - "others have the same term");
> -
> - raft_process_term(&node.raft, 3, 99);
> - is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
> - ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
> - "outdated");
> -
> - raft_node_destroy(&node);
> - raft_finish_test();
> -}
> +// static void
> +// raft_test_term_filter(void)
> +// {
> +// raft_start_test(9);
> +// struct raft_node node;
> +// raft_node_create(&node);
> +
> +// is(raft_node_term(&node.raft, 1), 0, "empty node term");
> +// ok(!raft_is_node_outdated(&node.raft, 1), "not outdated initially");
> +
> +// raft_process_term(&node.raft, 1, 1);
> +// is(raft_node_term(&node.raft, 1), 1, "node term updated");
> +// ok(raft_is_node_outdated(&node.raft, 2), "other nodes are outdated");
> +
> +// raft_process_term(&node.raft, 2, 100);
> +// ok(raft_is_node_outdated(&node.raft, 1), "node outdated when others "
> +// "have greater term");
> +// ok(!raft_is_node_outdated(&node.raft, 2), "node with greatest term "
> +// "isn't outdated");
> +
> +// raft_process_term(&node.raft, 3, 100);
> +// ok(!raft_is_node_outdated(&node.raft, 2), "node not outdated when "
> +// "others have the same term");
> +
> +// raft_process_term(&node.raft, 3, 99);
> +// is(raft_node_term(&node.raft, 3), 100, "node term isn't decreased");
> +// ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
> +// "outdated");
> +
> +// raft_node_destroy(&node);
> +// raft_finish_test();
> +// }
>
> static void
> raft_test_start_stop_candidate(void)
> @@ -1332,7 +1332,7 @@ raft_test_start_stop_candidate(void)
> static int
> main_f(va_list ap)
> {
> - raft_start_test(15);
> + raft_start_test(14);
>
> (void) ap;
> fakeev_init();
> @@ -1350,7 +1350,7 @@ main_f(va_list ap)
> raft_test_death_timeout();
> raft_test_enable_disable();
> raft_test_too_long_wal_write();
> - raft_test_term_filter();
> + //raft_test_term_filter();
> raft_test_start_stop_candidate();
>
> fakeev_free();
> diff --git a/test/unit/raft.result b/test/unit/raft.result
> index bb799936b..f9a8f249b 100644
> --- a/test/unit/raft.result
> +++ b/test/unit/raft.result
> @@ -1,5 +1,5 @@
> *** main_f ***
> -1..15
> +1..14
> *** raft_test_leader_election ***
> 1..24
> ok 1 - 1 pending message at start
> @@ -220,25 +220,12 @@ ok 12 - subtests
> ok 8 - became candidate
> ok 13 - subtests
> *** raft_test_too_long_wal_write: done ***
> - *** raft_test_term_filter ***
> - 1..9
> - ok 1 - empty node term
> - ok 2 - not outdated initially
> - ok 3 - node term updated
> - ok 4 - other nodes are outdated
> - ok 5 - node outdated when others have greater term
> - ok 6 - node with greatest term isn't outdated
> - ok 7 - node not outdated when others have the same term
> - ok 8 - node term isn't decreased
> - ok 9 - node doesn't become outdated
> -ok 14 - subtests
> - *** raft_test_term_filter: done ***
> *** raft_test_start_stop_candidate ***
> 1..4
> ok 1 - became leader after start_candidate
> ok 2 - remain leader after stop_candidate
> ok 3 - vote request from 2
> ok 4 - demote once new election starts
> -ok 15 - subtests
> +ok 14 - subtests
> *** raft_test_start_stop_candidate: done ***
> *** main_f: done ***
>
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-16 22:23 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-18 8:59 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-18 8:59 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
17.04.2021 01:23, Vladislav Shpilevoy wrote:
> Thanks for working on this!
>
> See 3 comments below.
Thanks for the review!
>
>> src/lib/raft/raft.c | 83 ++++++++++++++++++++++++++++---------------
>> src/lib/raft/raft.h | 13 +++++++
>> test/unit/raft.c | 33 +++++++++++++++--
>> test/unit/raft.result | 10 +++++-
>> 4 files changed, 108 insertions(+), 31 deletions(-)
>>
>> diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
>> index e9ce8cade..b21693642 100644
>> --- a/src/lib/raft/raft.c
>> +++ b/src/lib/raft/raft.c
>> @@ -848,38 +848,65 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled)
> <...>
>
>> +
>> +void
>> +raft_stop_candidate(struct raft *raft, bool demote)
> 1. For flags we usually use 'is', 'do', 'has' and similar prefixes.
Ok, let it be do_demote then.
>
>> +{
>> + if (!raft->is_candidate)
>> + return;
>> + raft->is_candidate = false;
>> + if (raft->state != RAFT_STATE_LEADER) {
>> + /* Do not wait for anything while being a voter. */
>> + raft_ev_timer_stop(raft_loop(), &raft->timer);
>> + }
>> + if (raft->state != RAFT_STATE_FOLLOWER) {
>> + if (raft->state == RAFT_STATE_LEADER) {
>> + if (!demote) {
>> + /*
>> + * Remain leader until someone
>> + * triggers new elections.
>> + */
>> + return;
>> + }
>> + raft->leader = 0;
>> }
>> + raft->state = RAFT_STATE_FOLLOWER;
>> + /* State is visible and changed - broadcast. */
>> + raft_schedule_broadcast(raft);
>> }
>> }
>>
>> diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
>> index a5f7e08d9..69dec63c6 100644
>> --- a/src/lib/raft/raft.h
>> +++ b/src/lib/raft/raft.h
>> @@ -327,6 +327,19 @@ raft_cfg_is_enabled(struct raft *raft, bool is_enabled);
>> void
>> raft_cfg_is_candidate(struct raft *raft, bool is_candidate);
>>
>> +/**
>> + * Make the instance a candidate.
>> + */
>> +void
>> +raft_start_candidate(struct raft *raft);
>> +
>> +/**
>> + * Make the instance stop taking part in new elections.
>> + * @param demote whether to stop being a leader immediately or not.
>> + */
>> +void
>> +raft_stop_candidate(struct raft *raft, bool demote);
> 2. Double whitespace after 'struct'.
Fixed.
>
>> +
>> /** Configure Raft leader election timeout. */
>> void
>> raft_cfg_election_timeout(struct raft *raft, double timeout);
>> diff --git a/test/unit/raft.c b/test/unit/raft.c
>> index 0306cefcd..575886932 100644
>> --- a/test/unit/raft.c
>> +++ b/test/unit/raft.c
>> @@ -1296,15 +1296,43 @@ raft_test_term_filter(void)
>> ok(!raft_is_node_outdated(&node.raft, 3), "node doesn't become "
>> "outdated");
>>
>> -
> 3. This probably should be in the previous commit.
This became obsolete after I squashed your commit, but
thanks for noticing!
>
>> raft_node_destroy(&node);
>> raft_finish_test();
>> }
=================================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 874e9157e..3f900db9a 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -865,7 +865,7 @@ raft_start_candidate(struct raft *raft)
assert(raft->state != RAFT_STATE_CANDIDATE);
/*
* May still be the leader after raft_stop_candidate
- * with demote = false.
+ * with do_demote = false.
*/
if (raft->state == RAFT_STATE_LEADER)
return;
@@ -884,7 +884,7 @@ raft_start_candidate(struct raft *raft)
}
void
-raft_stop_candidate(struct raft *raft, bool demote)
+raft_stop_candidate(struct raft *raft, bool do_demote)
{
if (!raft->is_candidate)
return;
@@ -895,7 +895,7 @@ raft_stop_candidate(struct raft *raft, bool demote)
}
if (raft->state != RAFT_STATE_FOLLOWER) {
if (raft->state == RAFT_STATE_LEADER) {
- if (!demote) {
+ if (!do_demote) {
/*
* Remain leader until someone
* triggers new elections.
diff --git a/src/lib/raft/raft.h b/src/lib/raft/raft.h
index f7bc205d2..a8da564b0 100644
--- a/src/lib/raft/raft.h
+++ b/src/lib/raft/raft.h
@@ -289,10 +289,10 @@ raft_start_candidate(struct raft *raft);
/**
* Make the instance stop taking part in new elections.
- * @param demote whether to stop being a leader immediately or not.
+ * @param do_demote whether to stop being a leader immediately or not.
*/
void
-raft_stop_candidate(struct raft *raft, bool demote);
+raft_stop_candidate(struct raft *raft, bool do_demote);
/** Configure Raft leader election timeout. */
void
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-16 22:24 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-18 9:26 ` Serge Petrenko via Tarantool-patches
2021-04-18 16:07 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-18 9:26 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
17.04.2021 01:24, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> See 1 comment below.
>
>> diff --git a/src/box/box.cc b/src/box/box.cc
>> index d5a55a30a..fcd812c09 100644
>> --- a/src/box/box.cc
>> +++ b/src/box/box.cc
>> @@ -1521,12 +1521,75 @@ box_clear_synchro_queue(bool try_wait)
>> if (!is_box_configured ||
>> raft_node_term(box_raft(), instance_id) == box_raft()->term)
>> return 0;
>> +
>> + bool run_elections = false;
>> +
>> + switch (box_election_mode) {
>> + case ELECTION_MODE_OFF:
>> + break;
>> + case ELECTION_MODE_VOTER:
>> + assert(box_raft()->state == RAFT_STATE_FOLLOWER);
>> + diag_set(ClientError, ER_UNSUPPORTED, "election_mode='voter'",
>> + "manual elections");
>> + return -1;
>> + case ELECTION_MODE_MANUAL:
>> + assert(box_raft()->state == RAFT_STATE_FOLLOWER);
>> + run_elections = true;
>> + try_wait = false;
>> + break;
>> + case ELECTION_MODE_CANDIDATE:
>> + /*
>> + * Leader elections are enabled, and this instance is allowed to
>> + * promote only if it's already an elected leader. No manual
>> + * elections.
>> + */
>> + if (box_raft()->state != RAFT_STATE_LEADER) {
>> + diag_set(ClientError, ER_UNSUPPORTED, "election_mode="
>> + "'candidate'", "manual elections");
>> + return -1;
>> + }
>> + break;
>> + default:
>> + unreachable();
>> + }
>> +
>> uint32_t former_leader_id = txn_limbo.owner_id;
>> int64_t wait_lsn = txn_limbo.confirmed_lsn;
>> int rc = 0;
>> int quorum = replication_synchro_quorum;
>> in_clear_synchro_queue = true;
>>
>> + if (run_elections) {
>> + /*
>> + * Make this instance a candidate and run until some leader, not
>> + * necessarily this instance, emerges.
>> + */
>> + raft_start_candidate(box_raft());
>> + /*
>> + * Trigger new elections without waiting for an old leader to
>> + * disappear.
>> + */
>> + raft_new_term(box_raft());
>> + box_raft_wait_leader_found();
> Shouldn't we wait for election_timeout?
I think not. Let's wait for however long it takes to elect a leader.
Several terms may pass before the leader is finally elected.
I mean, IMO it would be simpler for the user to do:
```
box.ctl.promote()
-- term1, split vote
-- term2, split vote
-- term3, leader found
-- success
```
rather than
```
box.ctl.promote()
-- error, split vote
box.ctl.promote()
-- error, split vote
box.ctl.promote()
-- success
```
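To make the first variant a bit more concrete, here is a toy model in C.
It only illustrates the intended behaviour and is not taken from the patch:
enum term_outcome and sketch_promote_wait() are hypothetical names, and the
array of outcomes replaces real Raft events.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of a single election term. */
enum term_outcome { TERM_SPLIT_VOTE, TERM_LEADER_FOUND };

/*
 * Model of a single promote call: keep going through terms until a
 * leader emerges. Returns how many terms it took, or -1 when the given
 * outcomes run out without a leader (the real call would keep waiting).
 */
static int
sketch_promote_wait(const enum term_outcome *outcomes, size_t count)
{
	assert(outcomes != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		if (outcomes[i] == TERM_LEADER_FOUND)
			return (int)(i + 1);
		/* Split vote: do not fail, just move on to the next term. */
	}
	return -1;
}
```

The user sees one call that succeeds on the third term instead of two
errors followed by a success.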
>
> Also what if the fiber is canceled before the leader is found? It
> seems box_raft_wait_leader_found() would fail on an assertion because
> raft is still enabled, but leader_id is nil.
Thanks for noticing! Will fix.
Diff:
==================================
diff --git a/src/box/box.cc b/src/box/box.cc
index 962f649c3..797aa86b5 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1572,13 +1572,17 @@ box_clear_synchro_queue(bool try_wait)
* disappear.
*/
raft_new_term(box_raft());
- box_raft_wait_leader_found();
+ rc = box_raft_wait_leader_found();
/*
* Do not reset raft mode if it was changed while running the
* elections.
*/
if (box_election_mode == ELECTION_MODE_MANUAL)
raft_stop_candidate(box_raft(), false);
+ if (rc != 0) {
+ in_clear_synchro_queue = false;
+ return -1;
+ }
if (!box_raft()->is_enabled) {
diag_set(ClientError, ER_RAFT_DISABLED);
in_clear_synchro_queue = false;
diff --git a/src/box/raft.c b/src/box/raft.c
index 425353207..61fa9f91b 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -347,15 +347,20 @@ box_raft_wait_leader_found_f(struct trigger *trig, void *event)
return 0;
}
-void
+int
box_raft_wait_leader_found(void)
{
struct trigger trig;
trigger_create(&trig, box_raft_wait_leader_found_f, fiber(), NULL);
raft_on_update(box_raft(), &trig);
fiber_yield();
- assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
trigger_clear(&trig);
+ if (fiber_is_cancelled()) {
+ diag_set(FiberIsCancelled);
+ return -1;
+ }
+ assert(box_raft()->leader != REPLICA_ID_NIL || !box_raft()->is_enabled);
+ return 0;
}
void
diff --git a/src/box/raft.h b/src/box/raft.h
index 8fce423e1..6b6136510 100644
--- a/src/box/raft.h
+++ b/src/box/raft.h
@@ -97,7 +97,8 @@ box_raft_checkpoint_remote(struct raft_request *req);
int
box_raft_process(struct raft_request *req, uint32_t source);
-void
+/** Block this fiber until Raft leader is known. */
+int
box_raft_wait_leader_found();
void
>
>> + /*
>> + * Do not reset raft mode if it was changed while running the
>> + * elections.
>> + */
>> + if (box_election_mode == ELECTION_MODE_MANUAL)
>> + raft_stop_candidate(box_raft(), false);
>> + if (!box_raft()->is_enabled) {
>> + diag_set(ClientError, ER_RAFT_DISABLED);
>> + in_clear_synchro_queue = false;
>> + return -1;
>> + }
>> + if (box_raft()->state != RAFT_STATE_LEADER) {
>> + diag_set(ClientError, ER_INTERFERING_PROMOTE,
>> + box_raft()->leader);
>> + in_clear_synchro_queue = false;
>> + return -1;
>> + }
>> + }
>> +
>> if (txn_limbo_is_empty(&txn_limbo))
>> goto promote;
>>
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (11 preceding siblings ...)
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-18 12:00 ` Serge Petrenko via Tarantool-patches
2021-04-18 16:03 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 22:37 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
` (2 subsequent siblings)
15 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-18 12:00 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
It may happen that a Raft leader fails to send a broadcast to a
freshly connected follower.
Here's what happens: a follower subscribes to a candidate during
ongoing elections. box_process_subscribe() sends out the current node's
Raft state, which is candidate. Suppose a relay from the follower to the
candidate is already set up. The follower immediately responds to the
vote request. This makes the candidate become the leader. But the
candidate's relay is not yet ready to process Raft messages, and the
is_leader message from the candidate gets rejected. Once the relay
starts, it relays all the xlogs, but the follower rejects all the data,
because it hasn't received the is_leader notification from the candidate.
Fix this by sending the last rejected message as soon as the relay starts
dispatching Raft messages.
Follow-up #5445
---
Hey, guys, take a look please. This fixes flaky
replication/gh-5445-leader-inconsistency
and should probably fix replication/election_qsync_stress as well.
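In case the idea is easier to follow without the cbus plumbing, below is a
rough single-threaded sketch of it. Everything prefixed with sketch_ is
hypothetical and exists only in this example; the real patch passes the
saved request between the tx and relay threads. Clearing the pending flag
after the flush is a choice made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down Raft message. */
struct sketch_raft_msg {
	uint64_t term;
	bool is_leader;
};

struct sketch_relay {
	/* Whether the relay is ready to dispatch Raft messages. */
	bool is_raft_enabled;
	/* Set by a broadcast which came while the relay was not ready. */
	bool has_pending_broadcast;
	/* The last such broadcast; older ones are overwritten. */
	struct sketch_raft_msg pending_broadcast;
	/* What "went out on the wire", for the sake of the example. */
	int sent_count;
	struct sketch_raft_msg last_sent;
};

static void
sketch_relay_send(struct sketch_relay *relay, const struct sketch_raft_msg *msg)
{
	assert(relay != NULL && msg != NULL);
	relay->last_sent = *msg;
	relay->sent_count++;
}

/* Broadcast: send when ready, otherwise remember only the latest message. */
static void
sketch_relay_push_raft(struct sketch_relay *relay,
		       const struct sketch_raft_msg *msg)
{
	if (!relay->is_raft_enabled) {
		relay->has_pending_broadcast = true;
		relay->pending_broadcast = *msg;
		return;
	}
	sketch_relay_send(relay, msg);
}

/* The relay becomes ready: flush the remembered message, if any. */
static void
sketch_relay_enable_raft(struct sketch_relay *relay)
{
	relay->is_raft_enabled = true;
	if (relay->has_pending_broadcast) {
		relay->has_pending_broadcast = false;
		sketch_relay_send(relay, &relay->pending_broadcast);
	}
}
```

Pushing a candidate state and then an is_leader state before the relay is
enabled results in a single send of the is_leader message once it starts.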
src/box/relay.cc | 79 ++++++++++++++++++++++++++++++++++++++----------
1 file changed, 63 insertions(+), 16 deletions(-)
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7be33ee31..9fdd02bc1 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -160,6 +160,16 @@ struct relay {
* anonymous replica, for example.
*/
bool is_raft_enabled;
+ /** Is set to true by the first Raft broadcast which comes while
+ * the relay is not yet ready to dispatch Raft messages.
+ */
+ bool has_pending_broadcast;
+ /**
+ * A Raft broadcast which should be pushed once relay notifies
+ * tx it needs Raft updates. Otherwise this message would be
+ * lost until some new Raft event happens.
+ */
+ struct raft_request pending_broadcast;
} tx;
};
@@ -626,6 +636,10 @@ struct relay_is_raft_enabled_msg {
bool value;
/** Flag to wait for the flag being set, in a relay thread. */
bool is_finished;
+ /** Whether this message carries a pending raft broadcast to relay. */
+ bool has_pending_broadcast;
+ /** The raft request relay should send upon this message's return. */
+ struct raft_request req;
};
/** TX thread part of the Raft flag setting, first hop. */
@@ -635,14 +649,28 @@ tx_set_is_raft_enabled(struct cmsg *base)
struct relay_is_raft_enabled_msg *msg =
(struct relay_is_raft_enabled_msg *)base;
msg->relay->tx.is_raft_enabled = msg->value;
+ if (msg->relay->tx.has_pending_broadcast) {
+ msg->has_pending_broadcast = true;
+ msg->req = msg->relay->tx.pending_broadcast;
+ }
}
+static void
+relay_send_raft(struct relay *relay, struct raft_request *req);
+
/** Relay thread part of the Raft flag setting, second hop. */
static void
relay_set_is_raft_enabled(struct cmsg *base)
{
struct relay_is_raft_enabled_msg *msg =
(struct relay_is_raft_enabled_msg *)base;
+ /*
+ * There might have been some pending Raft broadcasts. Send the last of
+ * them as soon as relay is set up.
+ */
+ if (msg->has_pending_broadcast)
+ relay_send_raft(msg->relay, &msg->req);
+
msg->is_finished = true;
}
@@ -938,25 +966,41 @@ struct relay_raft_msg {
struct relay *relay;
};
+/**
+ * Send a Raft message to the peer. This is done asynchronously, out of scope
+ * of recover_remaining_wals loop.
+ */
static void
-relay_raft_msg_push(struct cmsg *base)
+relay_send_raft(struct relay *relay, struct raft_request *req)
{
- struct relay_raft_msg *msg = (struct relay_raft_msg *)base;
struct xrow_header row;
- xrow_encode_raft(&row, &fiber()->gc, &msg->req);
+ xrow_encode_raft(&row, &fiber()->gc, req);
try {
- /*
- * Send the message before restarting the recovery. Otherwise
- * all the rows would be sent from under a non-leader role and
- * would be ignored again.
- */
- relay_send(msg->relay, &row);
- if (msg->req.state == RAFT_STATE_LEADER)
- relay_restart_recovery(msg->relay);
+ relay_send(relay, &row);
} catch (Exception *e) {
- relay_set_error(msg->relay, e);
+ relay_set_error(relay, e);
fiber_cancel(fiber());
}
+}
+
+static void
+relay_raft_msg_push(struct cmsg *base)
+{
+ struct relay_raft_msg *msg = (struct relay_raft_msg *)base;
+ /*
+ * Send the message before restarting the recovery. Otherwise
+ * all the rows would be sent from under a non-leader role and
+ * would be ignored again.
+ */
+ relay_send_raft(msg->relay, &msg->req);
+ if (msg->req.state == RAFT_STATE_LEADER) {
+ try {
+ relay_restart_recovery(msg->relay);
+ } catch (Exception *e) {
+ relay_set_error(msg->relay, e);
+ fiber_cancel(fiber());
+ }
+ }
free(msg);
}
@@ -964,12 +1008,15 @@ void
relay_push_raft(struct relay *relay, const struct raft_request *req)
{
/*
- * Raft updates don't stack. They are thrown away if can't be pushed
- * now. This is fine, as long as relay's live much longer that the
- * timeouts in Raft are set.
+ * Remember the latest Raft update. It might be a notification that
+ * this node is a leader. If sometime later we find out this node needs
+ * Raft updates, we need to send is_leader notification.
*/
- if (!relay->tx.is_raft_enabled)
+ if (!relay->tx.is_raft_enabled) {
+ relay->tx.has_pending_broadcast = true;
+ relay->tx.pending_broadcast = *req;
return;
+ }
/*
* XXX: the message should be preallocated. It should
* work like Kharon in IProto. Relay should have 2 raft
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:49 ` Serge Petrenko via Tarantool-patches
@ 2021-04-18 15:44 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:31 ` Serge Petrenko via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-18 15:44 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
I accidentally updated curl submodule (forgot to update the
submodules before doing my amendments), we need to revert it back.
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 61d53fdec..b0e8fbba7 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -967,6 +967,59 @@ apply_final_join_tx(struct stailq *rows)
> return rc;
> }
>
> +/*
Also I forgot to make it /** instead of /*.
> + * When elections are enabled we must filter out synchronous rows coming
> + * from an instance that fell behind the current leader. This includes
> + * both synchronous tx rows and rows for txs following unconfirmed
> + * synchronous transactions.
> + * The rows are replaced with NOPs to preserve the vclock consistency.
> + */
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-18 12:00 ` [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start Serge Petrenko via Tarantool-patches
@ 2021-04-18 16:03 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 12:11 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-18 16:03 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Good job on the patch!
See 4 comments below.
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 7be33ee31..9fdd02bc1 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -160,6 +160,16 @@ struct relay {
> * anonymous replica, for example.
> */
> bool is_raft_enabled;
> + /** Is set to true by the first Raft broadcast which comes while
1. Should be a new line after /**.
> + * the relay is not yet ready to dispatch Raft messages.
> + */
> + bool has_pending_broadcast;
> + /**
> + * A Raft broadcast which should be pushed once relay notifies
> + * tx it needs Raft updates. Otherwise this message would be
> + * lost until some new Raft event happens.
> + */
> + struct raft_request pending_broadcast;
2. I wouldn't call them 'broadcasts'. Relay sends a single message to
the remote node, not to all the nodes. This is a broadcast on the raft
level. On relay level it is just a single message to one node.
> } tx;
> };
> @@ -635,14 +649,28 @@ tx_set_is_raft_enabled(struct cmsg *base)
> struct relay_is_raft_enabled_msg *msg =
> (struct relay_is_raft_enabled_msg *)base;
> msg->relay->tx.is_raft_enabled = msg->value;
> + if (msg->relay->tx.has_pending_broadcast) {
> + msg->has_pending_broadcast = true;
> + msg->req = msg->relay->tx.pending_broadcast;
3. Since you will deliver the broadcast now, it is not pending
anymore. Hence there must be msg->relay->tx.has_pending_broadcast = false
in the end.
> + }
> }
> @@ -964,12 +1008,15 @@ void
> relay_push_raft(struct relay *relay, const struct raft_request *req)
> {
> /*
> - * Raft updates don't stack. They are thrown away if can't be pushed
> - * now. This is fine, as long as relay's live much longer that the
> - * timeouts in Raft are set.
> + * Remember the latest Raft update. It might be a notification that
> + * this node is a leader. If sometime later we find out this node needs
> + * Raft updates, we need to send is_leader notification.
> */
> - if (!relay->tx.is_raft_enabled)
> + if (!relay->tx.is_raft_enabled) {
> + relay->tx.has_pending_broadcast = true;
> + relay->tx.pending_broadcast = *req;
4. Vclock memory does not belong to the request. This is why below we copy
it into the message's memory. You might need to do the same here.
> return;
> + }
> /*
> * XXX: the message should be preallocated. It should
> * work like Kharon in IProto. Relay should have 2 raft
We could also fix it like described in this XXX, could we?
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-18 9:26 ` Serge Petrenko via Tarantool-patches
@ 2021-04-18 16:07 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:32 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-18 16:07 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
>>> + /*
>>> + * Make this instance a candidate and run until some leader, not
>>> + * necessarily this instance, emerges.
>>> + */
>>> + raft_start_candidate(box_raft());
>>> + /*
>>> + * Trigger new elections without waiting for an old leader to
>>> + * disappear.
>>> + */
>>> + raft_new_term(box_raft());
>>> + box_raft_wait_leader_found();
>> Shouldn't we wait for election_timeout?
>
> I think not. Let's wait for however long it takes to elect a leader.
> Several terms may pass before the leader is finally elected.
>
> I mean, IMO it would be simpler for the user to do:
>
> ```
> box.ctl.promote()
> -- term1, split vote
> -- term2, split vote
> -- term3, leader found
> -- success
> ```
> rather than
> ```
> box.ctl.promote()
> -- error, split vote
>
> box.ctl.promote()
> -- error, split vote
>
> box.ctl.promote()
> -- success
> ```
The first option looks simpler, but it is infinite, this is the problem.
In case of not enough voters alive box.ctl.promote() would hang until there
are enough voters. But yeah, split vote is also an issue.
Maybe we could leave it like this for now, then make split vote detection,
and then wait for a timeout. It should not break backward compatibility.
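
To make the trade-off concrete, here is a minimal sketch of the two waiting policies discussed above. It is not Tarantool code: `promote_wait_sketch`, `term_result` and `promote_rc` are hypothetical names, and the scripted list of term outcomes only stands in for real elections. A negative budget models the current "wait however long it takes" behaviour, a non-negative one models waiting for a timeout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome of one election term; not a Tarantool type. */
enum term_result { TERM_SPLIT_VOTE, TERM_LEADER_FOUND };

/* Hypothetical return codes used only by this sketch. */
enum promote_rc { PROMOTE_OK = 0, PROMOTE_TIMEOUT = -1 };

/*
 * Walk through a scripted sequence of term outcomes. Each term takes
 * term_ms milliseconds. Stop with PROMOTE_TIMEOUT once the next term
 * would not fit into budget_ms. A negative budget_ms means "no limit".
 * The number of terms actually spent is stored in *terms_used.
 */
static enum promote_rc
promote_wait_sketch(const enum term_result *terms, int term_count,
		    int term_ms, int budget_ms, int *terms_used)
{
	assert(terms != NULL && terms_used != NULL);
	int elapsed_ms = 0;
	*terms_used = 0;
	for (int i = 0; i < term_count; i++) {
		if (budget_ms >= 0 && elapsed_ms + term_ms > budget_ms)
			return PROMOTE_TIMEOUT;
		elapsed_ms += term_ms;
		(*terms_used)++;
		if (terms[i] == TERM_LEADER_FOUND)
			return PROMOTE_OK;
	}
	/*
	 * The script ended without a leader. A real unbounded wait would
	 * keep hanging here; the sketch reports it as a timeout.
	 */
	return PROMOTE_TIMEOUT;
}
```

With the script "split vote, split vote, leader found", the unbounded variant succeeds after three terms, while a budget of two terms gives up before the third one starts.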
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-18 16:27 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:30 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:29 ` Serge Petrenko via Tarantool-patches
2 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-18 16:27 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
> diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
> new file mode 100644
> index 000000000..5c6169f50
> --- /dev/null
> +++ b/test/replication/gh-5445-leader-inconsistency.result
> @@ -0,0 +1,292 @@
<...>
> +-- Old leader returns and old unconfirmed rows from it must be ignored.
> +-- Note, it wins the elections fairly.
> +test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
> + | ---
> + | - true
> + | ...
> +test_run:wait_lsn(leader, next_leader)
> + | ---
> + | ...
> +test_run:switch(leader)
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.space.test:get{2} == nil end)
> + | ---
> + | - true
> + | ...
> +box.cfg{election_mode='candidate'}
> + | ---
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:switch(next_leader)
You might want to add a check here, that the 'next_leader'
node didn't re-apply any rows from the old leader. So {2} still
does not exist even after these two nodes are in full sync.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-18 16:27 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 9:30 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 9:30 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
18.04.2021 19:27, Vladislav Shpilevoy wrote:
>> diff --git a/test/replication/gh-5445-leader-inconsistency.result b/test/replication/gh-5445-leader-inconsistency.result
>> new file mode 100644
>> index 000000000..5c6169f50
>> --- /dev/null
>> +++ b/test/replication/gh-5445-leader-inconsistency.result
>> @@ -0,0 +1,292 @@
> <...>
>
>> +-- Old leader returns and old unconfirmed rows from it must be ignored.
>> +-- Note, it wins the elections fairly.
>> +test_run:cmd('start server '..leader..' with args="3 0.4 voter"')
>> + | ---
>> + | - true
>> + | ...
>> +test_run:wait_lsn(leader, next_leader)
>> + | ---
>> + | ...
>> +test_run:switch(leader)
>> + | ---
>> + | - true
>> + | ...
>> +test_run:wait_cond(function() return box.space.test:get{2} == nil end)
>> + | ---
>> + | - true
>> + | ...
>> +box.cfg{election_mode='candidate'}
>> + | ---
>> + | ...
>> +
>> +test_run:switch('default')
>> + | ---
>> + | - true
>> + | ...
>> +test_run:switch(next_leader)
> You might want to add a check here, that the 'next_leader'
> node didn't re-apply any rows from the old leader. So {2} still
> does not exist even after these two nodes are in full sync.
>
But there is such a check below (at the very end):
test_run:switch(next_leader)
| ---
| - true
| ...
test_run:wait_upstream(1, {status='follow'})
| ---
| - true
| ...
box.space.test:select{} -- 1
| ---
| - - [1]
| ...
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-18 15:44 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 9:31 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 9:31 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
18.04.2021 18:44, Vladislav Shpilevoy wrote:
> I accidentally updated curl submodule (forgot to update the
> submodules before doing my amendments), we need to revert it back.
Yep. Fixed.
>
>> diff --git a/src/box/applier.cc b/src/box/applier.cc
>> index 61d53fdec..b0e8fbba7 100644
>> --- a/src/box/applier.cc
>> +++ b/src/box/applier.cc
>> @@ -967,6 +967,59 @@ apply_final_join_tx(struct stailq *rows)
>> return rc;
>> }
>>
>> +/*
> Also I forgot to make it /** instead of /*.
Thanks! Fixed as well.
The incremental diff is below.
>
>> + * When elections are enabled we must filter out synchronous rows coming
>> + * from an instance that fell behind the current leader. This includes
>> + * both synchronous tx rows and rows for txs following unconfirmed
>> + * synchronous transactions.
>> + * The rows are replaced with NOPs to preserve the vclock consistency.
>> + */
I've also fixed the issue we discussed verbally: the filter now checks
row->replica_id rather than applier->instance_id.
This is important because once a new leader is elected, it may be
outdated until it sends us its promote request. But there may be
valid rows from the previous leader that we need to accept.
The old leader is not outdated until we receive the new leader's promote,
so we have to apply its rows.
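
A rough sketch of the idea, for illustration only. It assumes that "outdated" means "the last promote term we know for this replica id is lower than the greatest promote term we have seen"; that definition and all the `*_sketch` names are assumptions of the sketch, not the real txn_limbo API. It also covers only the wait_sync branch of the real filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SKETCH_REPLICA_MAX = 32 };

/* Hypothetical, simplified stand-in for the limbo's term bookkeeping. */
struct limbo_sketch {
	/* Last promote term known for each replica id. */
	uint64_t promote_term[SKETCH_REPLICA_MAX];
	/* Greatest promote term seen from any replica. */
	uint64_t greatest_term;
};

/* Hypothetical, simplified row: only the fields the filter looks at. */
struct row_sketch {
	uint32_t replica_id;
	bool wait_sync;
	bool is_nop;
};

/* Record a promote request issued by origin_id in the given term. */
static void
limbo_sketch_on_promote(struct limbo_sketch *l, uint32_t origin_id,
			uint64_t term)
{
	assert(origin_id < SKETCH_REPLICA_MAX);
	l->promote_term[origin_id] = term;
	if (term > l->greatest_term)
		l->greatest_term = term;
}

static bool
limbo_sketch_is_outdated(const struct limbo_sketch *l, uint32_t replica_id)
{
	assert(replica_id < SKETCH_REPLICA_MAX);
	return l->promote_term[replica_id] < l->greatest_term;
}

/*
 * Decide by the origin of the rows (rows[0].replica_id), not by the peer
 * the rows arrived from. Returns true if the transaction was turned into
 * NOPs.
 */
static bool
filter_tx_sketch(const struct limbo_sketch *l, struct row_sketch *rows,
		 int count)
{
	assert(count > 0);
	if (!limbo_sketch_is_outdated(l, rows[0].replica_id))
		return false;
	if (!rows[count - 1].wait_sync)
		return false;
	for (int i = 0; i < count; i++)
		rows[i].is_nop = true;
	return true;
}
```

In this model a synchronous transaction from the old leader is still applied before the new leader's promote arrives and is replaced with NOPs afterwards, whichever node it was received through.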
================================================
diff --git a/src/box/applier.cc b/src/box/applier.cc
index b0e8fbba7..dc05c91d3 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -967,7 +967,7 @@ apply_final_join_tx(struct stailq *rows)
return rc;
}
-/*
+/**
* When elections are enabled we must filter out synchronous rows coming
* from an instance that fell behind the current leader. This includes
* both synchronous tx rows and rows for txs following unconfirmed
@@ -975,7 +975,7 @@ apply_final_join_tx(struct stailq *rows)
* The rows are replaced with NOPs to preserve the vclock consistency.
*/
static void
-applier_synchro_filter_tx(struct applier *applier, struct stailq *rows)
+applier_synchro_filter_tx(struct stailq *rows)
{
/*
* XXX: in case raft is disabled, synchronous replication still works
@@ -985,15 +985,18 @@ applier_synchro_filter_tx(struct applier *applier, struct stailq *rows)
*/
if (!raft_is_enabled(box_raft()))
return;
- if (!txn_limbo_is_replica_outdated(&txn_limbo, applier->instance_id))
+ struct xrow_header *row;
+ /*
+ * It may happen that we receive the instance's rows via some third
+ * node, so cannot check for applier->instance_id here.
+ */
+ row = &stailq_first_entry(rows, struct applier_tx_row, next)->row;
+ if (!txn_limbo_is_replica_outdated(&txn_limbo, row->replica_id))
return;
- struct xrow_header *row;
- row = &stailq_last_entry(rows, struct applier_tx_row, next)->row;
- if (row->wait_sync)
+ if (stailq_last_entry(rows, struct applier_tx_row, next)->row.wait_sync)
goto nopify;
- row = &stailq_first_entry(rows, struct applier_tx_row, next)->row;
/*
* Not waiting for sync and not a synchro request - this make it already
* NOP or an asynchronous transaction not depending on any synchronous
@@ -1079,7 +1082,7 @@ applier_apply_tx(struct applier *applier, struct stailq *rows)
}
}
}
- applier_synchro_filter_tx(applier, rows);
+ applier_synchro_filter_tx(rows);
if (unlikely(iproto_type_is_synchro_request(first_row->type))) {
/*
* Synchro messages are not transactions, in terms
diff --git a/third_party/curl b/third_party/curl
index 12af024bc..3266b35bb 160000
--- a/third_party/curl
+++ b/third_party/curl
@@ -1 +1 @@
-Subproject commit 12af024bc85606b14ffc415413a7e86e6bbee7eb
+Subproject commit 3266b35bbe21c68dea0dc7ccd991eb028e6d360c
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-18 16:07 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 9:32 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 9:32 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
18.04.2021 19:07, Vladislav Shpilevoy wrote:
>>>> + /*
>>>> + * Make this instance a candidate and run until some leader, not
>>>> + * necessarily this instance, emerges.
>>>> + */
>>>> + raft_start_candidate(box_raft());
>>>> + /*
>>>> + * Trigger new elections without waiting for an old leader to
>>>> + * disappear.
>>>> + */
>>>> + raft_new_term(box_raft());
>>>> + box_raft_wait_leader_found();
>>> Shouldn't we wait for election_timeout?
>> I think not. Let's wait for however long it takes to elect a leader.
>> Several terms may pass before the leader is finally elected.
>>
>> I mean, IMO it would be simpler for the user to do:
>>
>> ```
>> box.ctl.promote()
>> -- term1, split vote
>> -- term2, split vote
>> -- term3, leader found
>> -- success
>> ```
>> rather than
>> ```
>> box.ctl.promote()
>> -- error, split vote
>>
>> box.ctl.promote()
>> -- error, split vote
>>
>> box.ctl.promote()
>> -- success
>> ```
> The first option looks simpler, but it is infinite, this is the problem.
> In case of not enough voters alive box.ctl.promote() would hang until there
> are enough voters. But yeah, split vote is also an issue.
Besides, now the user may cancel the fiber issuing the promote.
>
> Maybe we could leave it like this for now, then make split vote detection,
> and then wait for a timeout. It should not break backward compatibility.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-18 16:03 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 12:11 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:36 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 12:11 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
18.04.2021 19:03, Vladislav Shpilevoy wrote:
> Good job on the patch!
>
> See 4 comments below.
>
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 7be33ee31..9fdd02bc1 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -160,6 +160,16 @@ struct relay {
>> * anonymous replica, for example.
>> */
>> bool is_raft_enabled;
>> + /** Is set to true by the first Raft broadcast which comes while
Hi! Thanks for the review!
My answers are mostly moot, since I took your advice from the last
comment and reworked the commit. Anyway, here they are, and the new
patch is below.
> 1. Should be a new line after /**.
Sorry for the misprint, fixed.
>
>> + * the relay is not yet ready to dispatch Raft messages.
>> + */
>> + bool has_pending_broadcast;
>> + /**
>> + * A Raft broadcast which should be pushed once relay notifies
>> + * tx it needs Raft updates. Otherwise this message would be
>> + * lost until some new Raft event happens.
>> + */
>> + struct raft_request pending_broadcast;
> 2. I wouldn't call them 'broadcasts'. Relay sends a single message to
> the remote node, not to all the nodes. This is a broadcast on the raft
> level. On relay level it is just a single message to one node.
Ok, let it be `pending_raft_msg` and `has_pending_raft_msg` then.
>
>> } tx;
>> };
>> @@ -635,14 +649,28 @@ tx_set_is_raft_enabled(struct cmsg *base)
>> struct relay_is_raft_enabled_msg *msg =
>> (struct relay_is_raft_enabled_msg *)base;
>> msg->relay->tx.is_raft_enabled = msg->value;
>> + if (msg->relay->tx.has_pending_broadcast) {
>> + msg->has_pending_broadcast = true;
>> + msg->req = msg->relay->tx.pending_broadcast;
> 3. Since you will deliver the broadcast now, it is not pending
> anymore. Hence there must be msg->relay->tx.has_pending_broadcast = false
> in the end.
Yep, fixed.
>
>> + }
>> }
>> @@ -964,12 +1008,15 @@ void
>> relay_push_raft(struct relay *relay, const struct raft_request *req)
>> {
>> /*
>> - * Raft updates don't stack. They are thrown away if can't be pushed
>> - * now. This is fine, as long as relay's live much longer that the
>> - * timeouts in Raft are set.
>> + * Remember the latest Raft update. It might be a notification that
>> + * this node is a leader. If sometime later we find out this node needs
>> + * Raft updates, we need to send is_leader notification.
>> */
>> - if (!relay->tx.is_raft_enabled)
>> + if (!relay->tx.is_raft_enabled) {
>> + relay->tx.has_pending_broadcast = true;
>> + relay->tx.pending_broadcast = *req;
> 4. Vclock memory does not belong to the request. This is why below we copy
> it into the message's memory. You might need to do the same here.
Yes, indeed. Thanks!
>
>> return;
>> + }
>> /*
>> * XXX: the message should be preallocated. It should
>> * work like Kharon in IProto. Relay should have 2 raft
> We could also fix it like described in this XXX, could we?
Yep. I didn't realise that at first. The new patch is below.
==================================
replication: send accumulated Raft messages after relay start
It may happen that a Raft leader fails to send a broadcast to the
freshly connected follower.
Here's what happens: a follower subscribes to a candidate during
on-going elections. box_process_subscribe() sends out current node's
Raft state, which is candidate. Suppose a relay from follower to
candidate is already set up. Follower immediately responds to the vote
request. This makes the candidate become leader. But candidate's relay
is not yet ready to process Raft messages, and is_leader message from
the candidate gets rejected. Once relay starts, it relays all the xlogs,
but the follower rejects all the data, because it hasn't received
is_leader notification from the candidate.
Fix this by sending the last rejected message as soon as relay starts
dispatching Raft messages.
Also, while we're at it, rework relay_push_raft to use a pair of
pre-allocated raft messages instead of allocating a new one on every
raft state update.
Follow-up #5445
---
src/box/relay.cc | 122 ++++++++++++++++++++++++++++++++---------------
1 file changed, 83 insertions(+), 39 deletions(-)
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 7be33ee31..85f335cd7 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -87,6 +87,19 @@ struct relay_gc_msg {
struct vclock vclock;
};
+/**
+ * Cbus message to push raft messages to relay.
+ */
+struct relay_raft_msg {
+ struct cmsg base;
+ struct cmsg_hop route[2];
+ struct raft_request req;
+ struct vclock vclock;
+ bool do_restart_recovery;
+ struct relay *relay;
+};
+
+
/** State of a replication relay. */
struct relay {
/** The thread in which we relay data to the replica. */
@@ -160,6 +173,24 @@ struct relay {
* anonymous replica, for example.
*/
bool is_raft_enabled;
+ /**
+ * A pair of raft messages travelling between tx and relay
+ * threads. While one is en route, the other is ready to save
+ * the next incoming raft message.
+ */
+ struct relay_raft_msg raft_msgs[2];
+ /**
+ * Id of the raft message waiting in tx thread and ready to
+ * save Raft requests. May be either 0 or 1.
+ */
+ int raft_ready_msg;
+ /** Whether raft_ready_msg holds a saved Raft message */
+ bool is_raft_push_pending;
+ /**
+ * Whether any of the messages is en route between tx and
+ * relay.
+ */
+ bool is_raft_push_sent;
} tx;
};
@@ -628,13 +659,38 @@ struct relay_is_raft_enabled_msg {
bool is_finished;
};
+static void
+relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
+{
+ if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
+ return;
+ struct relay_raft_msg *msg =
+ &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
+ msg->do_restart_recovery = do_restart_recovery;
+ cpipe_push(&relay->relay_pipe, &msg->base);
+ relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
+ relay->tx.is_raft_push_sent = true;
+ relay->tx.is_raft_push_pending = false;
+}
+
/** TX thread part of the Raft flag setting, first hop. */
static void
tx_set_is_raft_enabled(struct cmsg *base)
{
struct relay_is_raft_enabled_msg *msg =
(struct relay_is_raft_enabled_msg *)base;
- msg->relay->tx.is_raft_enabled = msg->value;
+ struct relay *relay = msg->relay;
+ relay->tx.is_raft_enabled = msg->value;
+ /*
+ * Send saved raft message as soon as relay becomes operational.
+ * Do not restart recovery upon the message arrival. Recovery is
+ * positioned at replica_clock initially, i.e. already "restarted" and
+ * restarting it once again would position it at the oldest xlog
+ * possible, because relay reader hasn't received replica vclock yet.
+ */
+ if (relay->tx.is_raft_push_pending) {
+ relay_push_raft_msg(msg->relay, false);
+ }
}
/** Relay thread part of the Raft flag setting, second hop. */
@@ -930,14 +986,10 @@ relay_restart_recovery(struct relay *relay)
recover_remaining_wals(relay->r, &relay->stream, NULL, true);
}
-struct relay_raft_msg {
- struct cmsg base;
- struct cmsg_hop route;
- struct raft_request req;
- struct vclock vclock;
- struct relay *relay;
-};
-
+/**
+ * Send a Raft message to the peer. This is done asynchronously, out of scope
+ * of recover_remaining_wals loop.
+ */
static void
relay_raft_msg_push(struct cmsg *base)
{
@@ -951,54 +1003,46 @@ relay_raft_msg_push(struct cmsg *base)
* would be ignored again.
*/
relay_send(msg->relay, &row);
- if (msg->req.state == RAFT_STATE_LEADER)
+ if (msg->req.state == RAFT_STATE_LEADER &&
+ msg->do_restart_recovery)
relay_restart_recovery(msg->relay);
} catch (Exception *e) {
relay_set_error(msg->relay, e);
fiber_cancel(fiber());
}
- free(msg);
+}
+
+static void
+tx_raft_msg_return(struct cmsg *base)
+{
+ struct relay_raft_msg *msg = (struct relay_raft_msg *)base;
+ msg->relay->tx.is_raft_push_sent = false;
+ if (msg->relay->tx.is_raft_push_pending)
+ relay_push_raft_msg(msg->relay, true);
}
void
relay_push_raft(struct relay *relay, const struct raft_request *req)
{
+ struct relay_raft_msg *msg =
+ &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
/*
- * Raft updates don't stack. They are thrown away if can't be pushed
- * now. This is fine, as long as relay's live much longer that the
- * timeouts in Raft are set.
- */
- if (!relay->tx.is_raft_enabled)
- return;
- /*
- * XXX: the message should be preallocated. It should
- * work like Kharon in IProto. Relay should have 2 raft
- * messages rotating. When one is sent, the other can be
- * updated and a flag is set. When the first message is
- * sent, the control returns to TX thread, sees the set
- * flag, rotates the buffers, and sends it again. And so
- * on. This is how it can work in future, with 0 heap
- * allocations. Current solution with alloc-per-update is
- * good enough as a start. Another option - wait until all
- * is moved to WAL thread, where this will all happen
- * in one thread and will be much simpler.
+ * Overwrite the request in raft_ready_msg. Only the latest raft request
+ * is saved.
*/
- struct relay_raft_msg *msg =
- (struct relay_raft_msg *)malloc(sizeof(*msg));
- if (msg == NULL) {
- panic("Couldn't allocate raft message");
- return;
- }
msg->req = *req;
if (req->vclock != NULL) {
msg->req.vclock = &msg->vclock;
vclock_copy(&msg->vclock, req->vclock);
}
- msg->route.f = relay_raft_msg_push;
- msg->route.pipe = NULL;
- cmsg_init(&msg->base, &msg->route);
+ msg->route[0].f = relay_raft_msg_push;
+ msg->route[0].pipe = &relay->tx_pipe;
+ msg->route[1].f = tx_raft_msg_return;
+ msg->route[1].pipe = NULL;
+ cmsg_init(&msg->base, msg->route);
msg->relay = relay;
- cpipe_push(&relay->relay_pipe, &msg->base);
+ relay->tx.is_raft_push_pending = true;
+ relay_push_raft_msg(relay, true);
}
/** Send a single row to the client. */
--
2.24.3 (Apple Git-128)
--
Serge Petrenko
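
For readers following the rotation logic, here is a single-threaded model of the two-message scheme from the patch above. It is only a sketch: the `slot_pair` type and the `pair_*` functions are hypothetical, an `int` payload stands in for `struct raft_request`, and the function's return value stands in for `cpipe_push()`. Unlike the patch, the sketch checks the pending flag inside the send helper rather than in its callers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the pair of preallocated messages. */
struct slot_pair {
	/* Payloads of the two rotating messages. */
	int slots[2];
	/* Index of the slot owned by the producer (0 or 1). */
	int ready;
	/* The ready slot holds a payload which was not sent yet. */
	bool is_pending;
	/* The other slot is currently in flight. */
	bool is_sent;
	/* The consumer accepts messages. */
	bool is_enabled;
};

/* Send the ready slot if possible. Returns its index, or -1. */
static int
pair_try_send(struct slot_pair *p)
{
	assert(p != NULL);
	if (!p->is_enabled || p->is_sent || !p->is_pending)
		return -1;
	int sent = p->ready;
	p->ready = (p->ready + 1) % 2;
	p->is_sent = true;
	p->is_pending = false;
	return sent;
}

/* Overwrite the ready slot: only the latest payload is kept. */
static int
pair_save(struct slot_pair *p, int value)
{
	p->slots[p->ready] = value;
	p->is_pending = true;
	return pair_try_send(p);
}

/* The in-flight slot came back: flush whatever accumulated meanwhile. */
static int
pair_on_return(struct slot_pair *p)
{
	p->is_sent = false;
	return pair_try_send(p);
}

/* The consumer became operational: flush the saved payload, if any. */
static int
pair_enable(struct slot_pair *p)
{
	p->is_enabled = true;
	return pair_try_send(p);
}
```

In this model, payloads saved before `pair_enable()` collapse into the latest one, which is sent as soon as the consumer is enabled. Payloads saved while a message is in flight collapse the same way and go out from `pair_on_return()`, with no heap allocation.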
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue()
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue() Serge Petrenko via Tarantool-patches
2021-04-16 22:24 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 12:47 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 12:47 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
16.04.2021 19:25, Serge Petrenko wrote:
Follow-up fixes:
=============================
s.petrenko@spetrenko:~/Source/tarantool$ git diff
diff --git a/src/box/box.cc b/src/box/box.cc
index 797aa86b5..358aedd78 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1521,9 +1521,6 @@ box_clear_synchro_queue(bool try_wait)
*/
if (!is_box_configured)
return 0;
- if (txn_limbo_replica_term(&txn_limbo, instance_id) == box_raft()->term)
- return 0;
-
bool run_elections = false;
switch (box_election_mode) {
@@ -1535,7 +1532,8 @@ box_clear_synchro_queue(bool try_wait)
"manual elections");
return -1;
case ELECTION_MODE_MANUAL:
- assert(box_raft()->state == RAFT_STATE_FOLLOWER);
+ if (box_raft()->state == RAFT_STATE_LEADER)
+ return 0;
run_elections = true;
try_wait = false;
break;
@@ -1550,6 +1548,10 @@ box_clear_synchro_queue(bool try_wait)
"'candidate'", "manual elections");
return -1;
}
+ if (txn_limbo_replica_term(&txn_limbo, instance_id) ==
+ box_raft()->term)
+ return 0;
+
break;
default:
unreachable();
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate Serge Petrenko via Tarantool-patches
2021-04-16 22:23 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-19 12:52 ` Serge Petrenko via Tarantool-patches
1 sibling, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 12:52 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
16.04.2021 19:25, Serge Petrenko wrote:
> Extract raft_start_candidate and raft_stop_candidate functions from
> raft_cfg_is_candidate.
>
> These functions will be used in manual elections.
>
> Prerequisite #3055
>
Follow-up fixes:
=============================
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index 3f900db9a..46a30149f 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -886,7 +886,11 @@ raft_start_candidate(struct raft *raft)
void
raft_stop_candidate(struct raft *raft, bool do_demote)
{
- if (!raft->is_candidate)
+ /*
+ * May still be the leader after raft_stop_candidate
+ * with do_demote = false.
+ */
+ if (!raft->is_candidate && raft->state != RAFT_STATE_LEADER)
return;
raft->is_candidate = false;
if (raft->state != RAFT_STATE_LEADER) {
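
For illustration only, here is a minimal sketch of the behaviour this hunk is after. It is not the real raft code: `sketch_raft`, `sketch_start_candidate` and `sketch_stop_candidate` are hypothetical stand-ins, and a quorum of 1 is assumed so that the node wins its own election immediately. The point is that a node which stopped being a candidate with do_demote = false is still a leader, so a later call with do_demote = true must not return early.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins for the raft structures. */
enum sketch_state { SKETCH_FOLLOWER, SKETCH_CANDIDATE, SKETCH_LEADER };

struct sketch_raft {
	bool is_candidate;
	enum sketch_state state;
};

static void
sketch_start_candidate(struct sketch_raft *raft)
{
	raft->is_candidate = true;
	/* Assumption: quorum is 1, so the election is won at once. */
	raft->state = SKETCH_LEADER;
}

static void
sketch_stop_candidate(struct sketch_raft *raft, bool do_demote)
{
	/*
	 * Return early only when there is nothing to stop: the node is
	 * neither a candidate nor a leftover leader from an earlier
	 * call with do_demote = false.
	 */
	if (!raft->is_candidate && raft->state != SKETCH_LEADER)
		return;
	raft->is_candidate = false;
	/* A leader keeps its role unless demotion is requested. */
	if (raft->state != SKETCH_LEADER || do_demote)
		raft->state = SKETCH_FOLLOWER;
}
```

With the old check (`!raft->is_candidate` alone) the second call in the sequence start, stop(false), stop(true) would return early and leave the node a leader.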
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
@ 2021-04-19 13:39 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-19 13:39 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
16.04.2021 19:25, Serge Petrenko wrote:
> PROMOTE entry will be used to mark limbo ownership transition besides
> emptying the limbo. So it has to be written every time
> `box.ctl.clear_synchro_queue()` succeeds. Even when the limbo was
> already empty.
>
> Part of #5445
>
Follow-up fixes:
==========================
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 0d2d274f6..8b5c76f28 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -643,7 +643,13 @@ complete:
void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
- if (req->replica_id != limbo->owner_id) {
+ if (req->replica_id == REPLICA_ID_NIL) {
+ /*
+ * The limbo was empty on the instance issuing the request.
+ * This means this instance must empty its limbo as well.
+ */
+ assert(req->lsn == 0 && req->type == IPROTO_PROMOTE);
+ } else if (req->replica_id != limbo->owner_id) {
/*
* Ignore CONFIRM/ROLLBACK messages for a foreign master.
* These are most likely outdated messages for already confirmed
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
@ 2021-04-19 22:34 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 9:25 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-19 22:34 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Hi! Thanks for working on this!
It seems that starting from this commit the election stress test
hangs on my machine in 100% of cases. I haven't had time to
investigate why yet.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-18 8:59 ` Serge Petrenko via Tarantool-patches
@ 2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 9:28 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-19 22:35 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Good job on the patch!
You didn't cover the stop_candidate call with the demote flag set to true
in the unit tests. I added my commit on top of this one. See it on the branch
and below. Squash if you agree. Otherwise let's discuss.
====================
diff --git a/test/unit/raft.c b/test/unit/raft.c
index a718ab3f4..29e48da13 100644
--- a/test/unit/raft.c
+++ b/test/unit/raft.c
@@ -1270,29 +1270,50 @@ raft_test_too_long_wal_write(void)
static void
raft_test_start_stop_candidate(void)
{
- raft_start_test(4);
+ raft_start_test(8);
struct raft_node node;
raft_node_create(&node);
raft_node_cfg_is_candidate(&node, false);
raft_node_cfg_election_quorum(&node, 1);
- raft_start_candidate(&node.raft);
+ raft_node_start_candidate(&node);
raft_run_next_event();
is(node.raft.state, RAFT_STATE_LEADER, "became leader after "
- "start_candidate");
- raft_stop_candidate(&node.raft, false);
+ "start candidate");
+
+ raft_node_stop_candidate(&node);
raft_run_for(node.cfg_death_timeout);
is(node.raft.state, RAFT_STATE_LEADER, "remain leader after "
- "stop_candidate");
+ "stop candidate");
+
+ raft_node_demote_candidate(&node);
+ is(node.raft.state, RAFT_STATE_FOLLOWER, "demote drops a non-candidate "
+ "leader to a follower");
+
+ /*
+ * Ensure the non-candidate leader is demoted when sees a new term, and
+ * does not try election again.
+ */
+ raft_node_start_candidate(&node);
+ raft_run_next_event();
+ raft_node_stop_candidate(&node);
+ is(node.raft.state, RAFT_STATE_LEADER, "non-candidate but still "
+ "leader");
is(raft_node_send_vote_request(&node,
- 3 /* Term. */,
+ 4 /* Term. */,
"{}" /* Vclock. */,
2 /* Source. */
), 0, "vote request from 2");
is(node.raft.state, RAFT_STATE_FOLLOWER, "demote once new election "
- "starts");
+ "starts");
+
+ raft_run_for(node.cfg_election_timeout * 2);
+ is(node.raft.state, RAFT_STATE_FOLLOWER, "still follower");
+ is(node.raft.term, 4, "still the same term");
+
+ raft_node_destroy(&node);
raft_finish_test();
}
diff --git a/test/unit/raft.result b/test/unit/raft.result
index f9a8f249b..3a3dc5dd2 100644
--- a/test/unit/raft.result
+++ b/test/unit/raft.result
@@ -221,11 +221,15 @@ ok 12 - subtests
ok 13 - subtests
*** raft_test_too_long_wal_write: done ***
*** raft_test_start_stop_candidate ***
- 1..4
- ok 1 - became leader after start_candidate
- ok 2 - remain leader after stop_candidate
- ok 3 - vote request from 2
- ok 4 - demote once new election starts
+ 1..8
+ ok 1 - became leader after start candidate
+ ok 2 - remain leader after stop candidate
+ ok 3 - demote drops a non-candidate leader to a follower
+ ok 4 - non-candidate but still leader
+ ok 5 - vote request from 2
+ ok 6 - demote once new election starts
+ ok 7 - still follower
+ ok 8 - still the same term
ok 14 - subtests
*** raft_test_start_stop_candidate: done ***
*** main_f: done ***
diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
index b8735f373..452c05c81 100644
--- a/test/unit/raft_test_utils.c
+++ b/test/unit/raft_test_utils.c
@@ -387,6 +387,27 @@ raft_node_unblock(struct raft_node *node)
}
}
+void
+raft_node_start_candidate(struct raft_node *node)
+{
+ assert(raft_node_is_started(node));
+ raft_start_candidate(&node->raft);
+}
+
+void
+raft_node_stop_candidate(struct raft_node *node)
+{
+ assert(raft_node_is_started(node));
+ raft_stop_candidate(&node->raft, false);
+}
+
+void
+raft_node_demote_candidate(struct raft_node *node)
+{
+ assert(raft_node_is_started(node));
+ raft_stop_candidate(&node->raft, true);
+}
+
void
raft_node_cfg_is_enabled(struct raft_node *node, bool value)
{
diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
index bc3db0c2a..5f8618716 100644
--- a/test/unit/raft_test_utils.h
+++ b/test/unit/raft_test_utils.h
@@ -208,6 +208,23 @@ raft_node_block(struct raft_node *node);
void
raft_node_unblock(struct raft_node *node);
+/**
+ * Make the node candidate, and maybe start election if a leader is not known.
+ */
+void
+raft_node_start_candidate(struct raft_node *node);
+
+/**
+ * Make the node non-candidate for next elections, but if it is a leader right
+ * now, it will stay a leader.
+ */
+void
+raft_node_stop_candidate(struct raft_node *node);
+
+/** Stop the candidate and remove its leader role if present. */
+void
+raft_node_demote_candidate(struct raft_node *node);
+
/** Configuration methods. */
void
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
@ 2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 10:22 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-19 22:35 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
The 3055 test fails in this CI job:
https://github.com/tarantool/tarantool/runs/2381269164
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-19 12:11 ` Serge Petrenko via Tarantool-patches
@ 2021-04-19 22:36 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 10:38 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-19 22:36 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
See 2 comments below.
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 7be33ee31..85f335cd7 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -628,13 +659,38 @@ struct relay_is_raft_enabled_msg {
> bool is_finished;
> };
>
> +static void
> +relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
1. Why is the recovery restart flag ignored if a message is already
sent? This might lead to the loss of a recovery restart, if I am not mistaken.
> +{
> + if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
> + return;
> + struct relay_raft_msg *msg =
> + &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
> + msg->do_restart_recovery = do_restart_recovery;
> + cpipe_push(&relay->relay_pipe, &msg->base);
> + relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
> + relay->tx.is_raft_push_sent = true;
> + relay->tx.is_raft_push_pending = false;
> +}
> +
> /** TX thread part of the Raft flag setting, first hop. */
> static void
> tx_set_is_raft_enabled(struct cmsg *base)
> {
> struct relay_is_raft_enabled_msg *msg =
> (struct relay_is_raft_enabled_msg *)base;
> - msg->relay->tx.is_raft_enabled = msg->value;
> + struct relay *relay = msg->relay;
> + relay->tx.is_raft_enabled = msg->value;
> + /*
> + * Send saved raft message as soon as relay becomes operational.
> + * Do not restart recovery upon the message arrival. Recovery is
> + * positioned at replica_clock initially, i.e. already "restarted" and
> + * restarting it once again would position it at the oldest xlog
> + * possible, because relay reader hasn't received replica vclock yet.
> + */
> + if (relay->tx.is_raft_push_pending) {
> + relay_push_raft_msg(msg->relay, false);
2. I don't understand. Why wasn't there such a problem before? Recovery
must be restarted when the node becomes a leader. If you do not restart
it, the data would be ignored by the replicas. How do you know it is
positioned right now at replica_clock? You are in tx thread, you can't
tell. What do I miss?
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (12 preceding siblings ...)
2021-04-18 12:00 ` [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start Serge Petrenko via Tarantool-patches
@ 2021-04-19 22:37 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 17:38 ` [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
15 siblings, 0 replies; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-19 22:37 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
On top of the branch I made this:
====================
diff --git a/src/box/box.cc b/src/box/box.cc
index 59925962d..b026dfe05 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -1582,16 +1582,21 @@ box_promote(void)
*/
if (box_election_mode == ELECTION_MODE_MANUAL)
raft_stop_candidate(box_raft(), false);
+ else
+ assert(false);
if (rc != 0) {
+ assert(false);
in_promote = false;
return -1;
}
if (!box_raft()->is_enabled) {
+ assert(false);
diag_set(ClientError, ER_RAFT_DISABLED);
in_promote = false;
return -1;
}
if (box_raft()->state != RAFT_STATE_LEADER) {
+ assert(false);
diag_set(ClientError, ER_INTERFERING_PROMOTE,
box_raft()->leader);
in_promote = false;
====================
And all the tests passed (except the hanging election qsync stress test,
but it didn't crash). This means there are not enough tests. But I haven't
had time yet to help add any that cover these assertions.
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
2021-04-19 22:34 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 9:25 ` Serge Petrenko via Tarantool-patches
2021-04-20 17:37 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 9:25 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 01:34, Vladislav Shpilevoy wrote:
> Hi! Thanks for working on this!
>
> It seems starting from this commit the election stress test
> hangs on my machine in 100% cases. I didn't have time to
> investigate why yet.
Yes, you're correct. I also see this. It's not 100% of cases, though.
On my machine the test doesn't hang at all (at least in the first 20 runs)
until commit "txn_limbo: filter rows based on known peer terms".
Starting with that commit, one or two of the 20 runs hang and get restarted.
I need some time to investigate this. Will return once I have some results.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 9:28 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 9:28 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 01:35, Vladislav Shpilevoy wrote:
> Good job on the patch!
>
> You didn't cover the stop_candidate call with demote flag true in the
> unit tests. I added my commit on top of this one. See it on the branch
> and below. Squash if you agree. Otherwise lets discuss.
>
> ====================
Thanks for the help!
Your changes look good, I squashed them.
> diff --git a/test/unit/raft.c b/test/unit/raft.c
> index a718ab3f4..29e48da13 100644
> --- a/test/unit/raft.c
> +++ b/test/unit/raft.c
> @@ -1270,29 +1270,50 @@ raft_test_too_long_wal_write(void)
> static void
> raft_test_start_stop_candidate(void)
> {
> - raft_start_test(4);
> + raft_start_test(8);
> struct raft_node node;
> raft_node_create(&node);
>
> raft_node_cfg_is_candidate(&node, false);
> raft_node_cfg_election_quorum(&node, 1);
>
> - raft_start_candidate(&node.raft);
> + raft_node_start_candidate(&node);
> raft_run_next_event();
> is(node.raft.state, RAFT_STATE_LEADER, "became leader after "
> - "start_candidate");
> - raft_stop_candidate(&node.raft, false);
> + "start candidate");
> +
> + raft_node_stop_candidate(&node);
> raft_run_for(node.cfg_death_timeout);
> is(node.raft.state, RAFT_STATE_LEADER, "remain leader after "
> - "stop_candidate");
> + "stop candidate");
> +
> + raft_node_demote_candidate(&node);
> + is(node.raft.state, RAFT_STATE_FOLLOWER, "demote drops a non-candidate "
> + "leader to a follower");
> +
> + /*
> + * Ensure the non-candidate leader is demoted when sees a new term, and
> + * does not try election again.
> + */
> + raft_node_start_candidate(&node);
> + raft_run_next_event();
> + raft_node_stop_candidate(&node);
> + is(node.raft.state, RAFT_STATE_LEADER, "non-candidate but still "
> + "leader");
>
> is(raft_node_send_vote_request(&node,
> - 3 /* Term. */,
> + 4 /* Term. */,
> "{}" /* Vclock. */,
> 2 /* Source. */
> ), 0, "vote request from 2");
> is(node.raft.state, RAFT_STATE_FOLLOWER, "demote once new election "
> - "starts");
> + "starts");
> +
> + raft_run_for(node.cfg_election_timeout * 2);
> + is(node.raft.state, RAFT_STATE_FOLLOWER, "still follower");
> + is(node.raft.term, 4, "still the same term");
> +
> + raft_node_destroy(&node);
> raft_finish_test();
> }
>
> diff --git a/test/unit/raft.result b/test/unit/raft.result
> index f9a8f249b..3a3dc5dd2 100644
> --- a/test/unit/raft.result
> +++ b/test/unit/raft.result
> @@ -221,11 +221,15 @@ ok 12 - subtests
> ok 13 - subtests
> *** raft_test_too_long_wal_write: done ***
> *** raft_test_start_stop_candidate ***
> - 1..4
> - ok 1 - became leader after start_candidate
> - ok 2 - remain leader after stop_candidate
> - ok 3 - vote request from 2
> - ok 4 - demote once new election starts
> + 1..8
> + ok 1 - became leader after start candidate
> + ok 2 - remain leader after stop candidate
> + ok 3 - demote drops a non-candidate leader to a follower
> + ok 4 - non-candidate but still leader
> + ok 5 - vote request from 2
> + ok 6 - demote once new election starts
> + ok 7 - still follower
> + ok 8 - still the same term
> ok 14 - subtests
> *** raft_test_start_stop_candidate: done ***
> *** main_f: done ***
> diff --git a/test/unit/raft_test_utils.c b/test/unit/raft_test_utils.c
> index b8735f373..452c05c81 100644
> --- a/test/unit/raft_test_utils.c
> +++ b/test/unit/raft_test_utils.c
> @@ -387,6 +387,27 @@ raft_node_unblock(struct raft_node *node)
> }
> }
>
> +void
> +raft_node_start_candidate(struct raft_node *node)
> +{
> + assert(raft_node_is_started(node));
> + raft_start_candidate(&node->raft);
> +}
> +
> +void
> +raft_node_stop_candidate(struct raft_node *node)
> +{
> + assert(raft_node_is_started(node));
> + raft_stop_candidate(&node->raft, false);
> +}
> +
> +void
> +raft_node_demote_candidate(struct raft_node *node)
> +{
> + assert(raft_node_is_started(node));
> + raft_stop_candidate(&node->raft, true);
> +}
> +
> void
> raft_node_cfg_is_enabled(struct raft_node *node, bool value)
> {
> diff --git a/test/unit/raft_test_utils.h b/test/unit/raft_test_utils.h
> index bc3db0c2a..5f8618716 100644
> --- a/test/unit/raft_test_utils.h
> +++ b/test/unit/raft_test_utils.h
> @@ -208,6 +208,23 @@ raft_node_block(struct raft_node *node);
> void
> raft_node_unblock(struct raft_node *node);
>
> +/**
> + * Make the node candidate, and maybe start election if a leader is not known.
> + */
> +void
> +raft_node_start_candidate(struct raft_node *node);
> +
> +/**
> + * Make the node non-candidate for next elections, but if it is a leader right
> + * now, it will stay a leader.
> + */
> +void
> +raft_node_stop_candidate(struct raft_node *node);
> +
> +/** Stop the candidate and remove its leader role if present. */
> +void
> +raft_node_demote_candidate(struct raft_node *node);
> +
> /** Configuration methods. */
>
> void
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 10:22 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 10:22 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 01:35, Vladislav Shpilevoy wrote:
> The 3055 test fails in this CI job:
> https://github.com/tarantool/tarantool/runs/2381269164
I checked the artifacts and it seems one of the instances chose
the wrong instance as a bootstrap leader, even though it
had a direct connection to the bootstrap leader.
This looks like a problem not related to the patch, but something
with replicaset_round().
I've run the test locally 256 times with 16 workers and everything
seems to work.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-19 22:36 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 10:38 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 10:38 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 01:36, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> See 2 comments below.
>
>> diff --git a/src/box/relay.cc b/src/box/relay.cc
>> index 7be33ee31..85f335cd7 100644
>> --- a/src/box/relay.cc
>> +++ b/src/box/relay.cc
>> @@ -628,13 +659,38 @@ struct relay_is_raft_enabled_msg {
>> bool is_finished;
>> };
>>
>> +static void
>> +relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
> 1. Why is the recovery restart flag is ignored if a message is already
> sent? This might lead to recovery restart loss if I am not mistaken.
I think it's okay. As long as the message is pushed from relay_push_raft()
rather than from tx_set_is_raft_enabled(), we may freely restart the
recovery.
So we only care whether do_restart_recovery is set when the message
gets pushed in the same call.
We don't care whether do_restart_recovery is set when the call exits
without pushing the message. The next call will have the correct value
for do_restart_recovery anyway.
Please see a more detailed explanation below.
>
>> +{
>> + if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
>> + return;
>> + struct relay_raft_msg *msg =
>> + &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
>> + msg->do_restart_recovery = do_restart_recovery;
>> + cpipe_push(&relay->relay_pipe, &msg->base);
>> + relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
>> + relay->tx.is_raft_push_sent = true;
>> + relay->tx.is_raft_push_pending = false;
>> +}
>> +
>> /** TX thread part of the Raft flag setting, first hop. */
>> static void
>> tx_set_is_raft_enabled(struct cmsg *base)
>> {
>> struct relay_is_raft_enabled_msg *msg =
>> (struct relay_is_raft_enabled_msg *)base;
>> - msg->relay->tx.is_raft_enabled = msg->value;
>> + struct relay *relay = msg->relay;
>> + relay->tx.is_raft_enabled = msg->value;
>> + /*
>> + * Send saved raft message as soon as relay becomes operational.
>> + * Do not restart recovery upon the message arrival. Recovery is
>> + * positioned at replica_clock initially, i.e. already "restarted" and
>> + * restarting it once again would position it at the oldest xlog
>> + * possible, because relay reader hasn't received replica vclock yet.
>> + */
>> + if (relay->tx.is_raft_push_pending) {
>> + relay_push_raft_msg(msg->relay, false);
> 2. I don't understand. Why wasn't there such a problem before? Recovery
> must be restarted when the node becomes a leader. If you do not restart
> it, the data would be ignored by the replicas. How do you know it is
> positioned right now at replica_clock? You are in tx thread, you can't
> tell. What do I miss?
This is because this `relay_push_raft_msg` is delivered before
`relay_set_is_raft_enabled`.
Both messages get processed by the cbus_process() loop that waits
for `relay_set_is_raft_enabled`.
This happens in relay_send_is_raft_enabled(), even before
the relay reader fiber is created, so recv_vclock is zero.
Restarting recovery here would reset it to the first WAL this
instance has ever had, which is wrong.
Such a problem might have existed before, but it was extremely
hard to catch: relay_push_raft_msg() wasn't called until
relay->tx.is_raft_enabled was set. And when tx.is_raft_enabled
was set, it most probably meant that relay_set_is_raft_enabled
had already been delivered and the relay had exited this first
cbus_process() loop, which runs before the reader fiber is created.
To solve the problem some other way, I would need to
make relay_push_raft_msg() deliver the message to the
second cbus_process() loop, the main one. And I couldn't
come up with a way to do that.
The message should be pushed right in tx_set_is_raft_enabled,
and this means it will get delivered before relay_set_is_raft_enabled.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual"
2021-04-20 9:25 ` Serge Petrenko via Tarantool-patches
@ 2021-04-20 17:37 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 17:37 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 12:25, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 01:34, Vladislav Shpilevoy wrote:
>> Hi! Thanks for working on this!
>>
>> It seems starting from this commit the election stress test
>> hangs on my machine in 100% cases. I didn't have time to
>> investigate why yet.
> Yes, you're correct. I also see this. It's not 100% cases though.
>
> On my machine the test doesn't hang at all (at least the first 20 runs)
> until commit "txn_limbo: filter rows based on known peer terms"
>
> Starting with commit "txn_limbo: filter rows based on known peer terms"
> one or two of the 20 runs hang and get restarted.
>
> I need some time to investigate this. Will return once I have some
> results.
>
Ok, seems like the case is closed.
So, here are a few facts that lead to the test hang:
1) The instance may still write CONFIRM for its own transactions after
restart. It may do so even before receiving a CONFIRM from some remote
instance which took ownership of the limbo later.
This fact alone would be ok, but:
a) the instance doesn't count its own WAL write as the first ack
after restart, so if the quorum is M it waits for M+1 acks from
remote instances before writing CONFIRM
b) the instance writes CONFIRM unconditionally, even before getting
in sync with other replicas, which could have already written
CONFIRM for its rows (this may be fine).
There's an issue related to the cause, but it needs some reformulation:
https://github.com/tarantool/tarantool/issues/5856
2) Any failure in txn_commit_try_async is treated as a WAL write error
by mistake, and the actual reason for the rollback is lost. I've opened
a ticket for this:
https://github.com/tarantool/tarantool/issues/6027
ER_WAL_IO is unrecoverable and breaks the connection between master and
replica.
(We might make it recoverable as well? Why not retry the WAL write after
some time? It may work out this time).
3) NOPs are added to txn_limbo when it isn't empty.
And here's what happened when the test hung:
1) Some instance used to be the leader and got restarted before
writing CONFIRM for its own transactions
2) Once the instance got restarted, its relays were faster than
its appliers, meaning it first gathered 2 acks for the old
transaction, and wrote CONFIRM right away, and received CONFIRM
from a remote instance later
3) This instance was elected the leader once again. Once this
happened other 2 instances started accepting rows from this
instance
4) The first row remote instances got was this CONFIRM which the
instance wrote after restart
5) The instance was considered outdated, because while it was an
elected leader, it hasn't yet sent PROMOTE to the other
instances (PROMOTE comes right after that notorious CONFIRM)
6) Like any row from an outdated instance, CONFIRM was replaced
with a NOP
7) Other instances try to insert that NOP to their limbos, which
aren't empty, due to the nature of the test (and would get
emptied with PROMOTE). Insertion fails with
ER_UNCOMMITTED_FOREIGN_SYNC_TXNS
8) ER_UNCOMMITTED_FOREIGN_SYNC_TXNS is replaced with ER_WAL_IO by
applier's on_rollback trigger. This is an unrecoverable error,
so both the remote instances' appliers break connection to
the leader.
9) Now there's an infinite loop of elections. This node never
votes for any of the remote nodes, because they are behind it.
What I've done to fix this is I've allowed transactions that consist
of NOPs solely to pass through limbo without waiting even when it's
non-empty.
The test's now rock-solid on my machine. 0 failures in 100 runs.
(with 1 worker, to be honest, but that's still better than a couple of
failures in 20 runs with 1 worker).
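
The fix described above can be sketched roughly as follows. This is a simplified model, not the actual txn.c code: `sketch_stmt`, `sketch_txn_flags` and the `SKETCH_*` constants are hypothetical names, and the behaviour of the non-empty-limbo branch (only a "wait sync" flag is set) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical simplified model of a transaction statement. */
enum sketch_row_type { SKETCH_ROW_NOP, SKETCH_ROW_DML };

struct sketch_stmt {
	enum sketch_row_type type;
	/* Whether the statement's space is synchronous. */
	bool space_is_sync;
};

enum {
	SKETCH_WAIT_SYNC = 1 << 0,
	SKETCH_WAIT_ACK = 1 << 1,
};

/* Decide which limbo-related flags a transaction gets. */
static unsigned
sketch_txn_flags(const struct sketch_stmt *stmts, size_t count,
		 bool force_async, bool limbo_is_empty)
{
	bool is_sync = false;
	bool is_fully_nop = true;
	for (size_t i = 0; i < count; i++) {
		/* NOP rows never make a transaction synchronous. */
		if (stmts[i].type == SKETCH_ROW_NOP)
			continue;
		is_fully_nop = false;
		is_sync = is_sync || stmts[i].space_is_sync;
	}
	/* A NOP-only transaction bypasses the limbo entirely. */
	if (force_async || is_fully_nop)
		return 0;
	if (is_sync)
		return SKETCH_WAIT_SYNC | SKETCH_WAIT_ACK;
	/* Assumption: async txns queue behind pending sync ones. */
	if (!limbo_is_empty)
		return SKETCH_WAIT_SYNC;
	return 0;
}
```

In this model a transaction made only of NOPs gets no flags even when the limbo is non-empty, so it never reaches the code path that fails with ER_UNCOMMITTED_FOREIGN_SYNC_TXNS.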
I've sent the new patch as [PATCH v4 14/12] in reply to this series.
Please, take a look.
--
Serge Petrenko
^ permalink raw reply [flat|nested] 57+ messages in thread
* [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (13 preceding siblings ...)
2021-04-19 22:37 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 17:38 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 22:30 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
15 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 17:38 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
When a transaction consists solely of NOPs, it shouldn't wait for other
synchronous transactions to finish. It may be committed right away.
Such transactions may appear when the applier filters out synchronous rows
from an outdated instance. Appending such transactions to the limbo
could lead to the ER_UNCOMMITTED_FOREIGN_SYNC_TXNS error, which we tried to
avoid in the first place when we replaced tx rows with NOPs.
Follow-up #5445
---
src/box/txn.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)
diff --git a/src/box/txn.c b/src/box/txn.c
index a71ccadd0..8be102666 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -601,6 +601,11 @@ txn_journal_entry_new(struct txn *txn)
struct xrow_header **remote_row = req->rows;
struct xrow_header **local_row = req->rows + txn->n_applier_rows;
bool is_sync = false;
+ /*
+ * A transaction which consists of NOPs solely should pass through the
+ * limbo without waiting. Even when the limbo is not empty.
+ */
+ bool is_fully_nop = true;
stailq_foreach_entry(stmt, &txn->stmts, next) {
if (stmt->has_triggers) {
@@ -612,8 +617,11 @@ txn_journal_entry_new(struct txn *txn)
if (stmt->row == NULL)
continue;
- is_sync = is_sync || (stmt->space != NULL &&
- stmt->space->def->opts.is_sync);
+ if (stmt->row->type != IPROTO_NOP) {
+ is_fully_nop = false;
+ is_sync = is_sync || (stmt->space != NULL &&
+ stmt->space->def->opts.is_sync);
+ }
if (stmt->row->replica_id == 0)
*local_row++ = stmt->row;
@@ -627,7 +635,7 @@ txn_journal_entry_new(struct txn *txn)
* space can't be synchronous. So if there is at least one
* synchronous space, the transaction is not local.
*/
- if (!txn_has_flag(txn, TXN_FORCE_ASYNC)) {
+ if (!txn_has_flag(txn, TXN_FORCE_ASYNC) && !is_fully_nop) {
if (is_sync) {
txn_set_flags(txn, TXN_WAIT_SYNC | TXN_WAIT_ACK);
} else if (!txn_limbo_is_empty(&txn_limbo)) {
--
2.24.3 (Apple Git-128)
^ permalink raw reply [flat|nested] 57+ messages in thread
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 16:27 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 20:29 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:31 ` Serge Petrenko via Tarantool-patches
2 siblings, 1 reply; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 20:29 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
16.04.2021 19:25, Serge Petrenko wrote:
> Start writing the actual leader term together with the PROMOTE request
> and process terms in PROMOTE requests on receiver side.
>
> Make applier only apply synchronous transactions from the instance which
> has the greatest term as received in PROMOTE requests.
>
> Closes #5445
>
A couple of fixes on top:
Only apply PROMOTE when it's for a greater term than already received.
If PROMOTE tries to confirm entries for an instance id other than
limbo->owner_id, roll back everything that's unconfirmed.
=========================================
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 14e87cd3d..5f72c891f 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -643,14 +643,21 @@ complete:
}
void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
+txn_limbo_process(struct txn_limbo *limbo, struct synchro_request *req)
{
uint64_t term = req->term;
uint32_t origin = req->origin_id;
if (txn_limbo_replica_term(limbo, origin) < term) {
vclock_follow(&limbo->promote_term_map, origin, term);
- if (term > limbo->promote_greatest_term)
+ if (term > limbo->promote_greatest_term) {
limbo->promote_greatest_term = term;
+ } else if (req->type == IPROTO_PROMOTE) {
+ /*
+ * PROMOTE for an old term should be ignored.
+ * For CONFIRM and ROLLBACK term is unused.
+ */
+ return;
+ }
}
if (req->replica_id == REPLICA_ID_NIL) {
/*
@@ -665,7 +672,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
* data from an old leader, who has just started and written
* confirm right on synchronous transaction recovery.
*/
- return;
+ if (req->type != IPROTO_PROMOTE) {
+ return;
+ } else {
+ /*
+ * A PROMOTE request for a foreign master - roll back
+ * everything in limbo.
+ */
+ req->lsn = 0;
+ }
}
switch (req->type) {
case IPROTO_CONFIRM:
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index e409ac657..a06fabccc 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -302,7 +302,7 @@ txn_limbo_wait_complete(struct txn_limbo *limbo,
struct txn_limbo_entry *entry);
/** Execute a synchronous replication request. */
void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
+txn_limbo_process(struct txn_limbo *limbo, struct synchro_request *req);
/**
* Waiting for confirmation of all "sync" transactions
===========================================
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-20 20:29 ` Serge Petrenko via Tarantool-patches
@ 2021-04-20 20:31 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:55 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
0 siblings, 2 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 20:31 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 23:29, Serge Petrenko via Tarantool-patches wrote:
>
>
> 16.04.2021 19:25, Serge Petrenko wrote:
>> Start writing the actual leader term together with the PROMOTE request
>> and process terms in PROMOTE requests on receiver side.
>>
>> Make applier only apply synchronous transactions from the instance which
>> has the greatest term as received in PROMOTE requests.
>>
>> Closes #5445
>>
>
> A couple of fixes on top:
> Only apply PROMOTE when it's for a greater term than already received.
> If promote tries to confirm entries for instance id other than
> limbo->owner_id
> rollback everything that's unconfirmed.
Sorry, discard this. election_qsync_stress hangs again with this change.
I haven't pushed it yet anyway.
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-20 20:31 ` Serge Petrenko via Tarantool-patches
@ 2021-04-20 20:55 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
1 sibling, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-20 20:55 UTC (permalink / raw)
To: v.shpilevoy, gorcunov; +Cc: tarantool-patches
20.04.2021 23:31, Serge Petrenko via Tarantool-patches wrote:
>
>
> 20.04.2021 23:29, Serge Petrenko via Tarantool-patches wrote:
>>
>>
>> 16.04.2021 19:25, Serge Petrenko wrote:
>>> Start writing the actual leader term together with the PROMOTE request
>>> and process terms in PROMOTE requests on receiver side.
>>>
>>> Make applier only apply synchronous transactions from the instance
>>> which
>>> has the greatest term as received in PROMOTE requests.
>>>
>>> Closes #5445
>>>
>>
>> A couple of fixes on top:
>> Only apply PROMOTE when it's for a greater term than already received.
>> If promote tries to confirm entries for instance id other than
>> limbo->owner_id
>> rollback everything that's unconfirmed.
>
> Sorry, discard this. election_qsync_stress hangs again with this change.
>
> I haven't pushed it yet anyway.
>
This part of the change is good. I applied it and pushed the branch:
==============================
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 14e87cd3d..ad5093750 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -649,8 +649,14 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
uint32_t origin = req->origin_id;
if (txn_limbo_replica_term(limbo, origin) < term) {
vclock_follow(&limbo->promote_term_map, origin, term);
- if (term > limbo->promote_greatest_term)
+ if (term > limbo->promote_greatest_term) {
limbo->promote_greatest_term = term;
+ } else if (req->type == IPROTO_PROMOTE) {
+ /*
+ * PROMOTE for outdated term. Ignore.
+ */
+ return;
+ }
}
if (req->replica_id == REPLICA_ID_NIL) {
/*
===============================
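The term check in the hunk above can be read as a small filter. Below is a minimal standalone sketch of it, assuming a fixed-size per-instance term table in place of the vclock-based `promote_term_map`; every `flt_*` name is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical upper bound on instance ids, for the sketch only. */
enum { FLT_MAX_PEERS = 32 };

enum flt_req_type { FLT_CONFIRM, FLT_ROLLBACK, FLT_PROMOTE };

struct flt_limbo {
	/* Last term seen from each instance (promote_term_map). */
	uint64_t term_map[FLT_MAX_PEERS];
	/* Greatest term seen from anyone (promote_greatest_term). */
	uint64_t greatest_term;
};

/* Returns true when the request should be processed further. */
static bool
flt_limbo_filter(struct flt_limbo *limbo, enum flt_req_type type,
		 uint32_t origin, uint64_t term)
{
	assert(origin < FLT_MAX_PEERS);
	if (limbo->term_map[origin] < term) {
		/* Remember the newer term for this origin. */
		limbo->term_map[origin] = term;
		if (term > limbo->greatest_term) {
			limbo->greatest_term = term;
		} else if (type == FLT_PROMOTE) {
			/* PROMOTE for an outdated term. Ignore. */
			return false;
		}
	}
	return true;
}
```

Note that, as in the hunk, the per-origin term is still recorded before an outdated PROMOTE is dropped, and CONFIRM and ROLLBACK are never filtered by term here.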
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
` (14 preceding siblings ...)
2021-04-20 17:38 ` [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous Serge Petrenko via Tarantool-patches
@ 2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 6:01 ` Serge Petrenko via Tarantool-patches
15 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-20 22:30 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
I pushed my fixes to this branch: sp/gh-5445-election-fixes-review
as separate commits among your commits.
This does not mean the patch is ready to push: tests are still
missing, and there are a few things we still need to fix,
such as:
- Limbo ownership should not change automatically. This is the only
way to make it more or less stable;
- The NOP filter should work regardless of election mode. Otherwise
we still have 5445 issue when it is off;
- There are no tests for the NOP filter not working when election is off;
for a few places in box.cc in box.ctl.promote(); and for why the NOP
filter must use the txn origin ID instead of the applier instance ID.
This might lead to data loss if we did something wrong.
My commits are marked as [tosquash].
For the problems above I will create tickets if the patch is pushed as
in the branch sp/gh-5445-election-fixes-review.
* Re: [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
2021-04-16 22:12 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-20 22:30 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
Consider this commit on top of yours on the
branch sp/gh-5445-election-fixes-review:
====================
[tosquash] Pass args explicitly to read_promote
Firstly, to be consistent with the other read_*() functions.
Secondly, one of the next patches is going to change promote
lsn, and it should not change the original struct synchro_request
or copy it for a change.
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 0d2d274f6..8668eb964 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -487,13 +487,13 @@ txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
* rollback all entries > @a req.lsn.
*/
static void
-txn_limbo_read_promote(struct txn_limbo *limbo,
- const struct synchro_request *req)
+txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id,
+ int64_t lsn)
{
- txn_limbo_read_confirm(limbo, req->lsn);
- txn_limbo_read_rollback(limbo, req->lsn + 1);
+ txn_limbo_read_confirm(limbo, lsn);
+ txn_limbo_read_rollback(limbo, lsn + 1);
assert(txn_limbo_is_empty(&txn_limbo));
- limbo->owner_id = req->origin_id;
+ limbo->owner_id = replica_id;
limbo->confirmed_lsn = 0;
}
@@ -660,7 +660,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
txn_limbo_read_rollback(limbo, req->lsn);
break;
case IPROTO_PROMOTE:
- txn_limbo_read_promote(limbo, req);
+ txn_limbo_read_promote(limbo, req->origin_id, req->lsn);
break;
default:
unreachable();
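As a reading aid, the semantics of txn_limbo_read_promote() with explicit arguments (confirm everything up to `lsn`, roll back everything after it, then hand the queue to `replica_id`) can be sketched on a plain array. This is an illustration only: the `prm_*` types are hypothetical, and the real limbo uses its own entry list and also resets `confirmed_lsn`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum prm_state { PRM_PENDING, PRM_CONFIRMED, PRM_ROLLED_BACK };

struct prm_entry {
	int64_t lsn;
	enum prm_state state;
};

struct prm_queue {
	struct prm_entry *entries;
	size_t count;
	uint32_t owner_id;
};

/*
 * Confirm all pending entries with lsn <= @a lsn, roll back the rest,
 * and transfer ownership of the queue to @a replica_id.
 */
static void
prm_read_promote(struct prm_queue *q, uint32_t replica_id, int64_t lsn)
{
	for (size_t i = 0; i < q->count; i++) {
		struct prm_entry *e = &q->entries[i];
		if (e->state != PRM_PENDING)
			continue;
		e->state = e->lsn <= lsn ? PRM_CONFIRMED : PRM_ROLLED_BACK;
	}
	q->owner_id = replica_id;
}
```

Because the function takes `lsn` by value, a caller can pass a different LSN than the one in the request without touching or copying the request itself, which is the second reason given in the commit message.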
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-20 20:31 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:55 ` Serge Petrenko via Tarantool-patches
@ 2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
1 sibling, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-20 22:30 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
>> A couple of fixes on top:
>> Only apply PROMOTE when it's for a greater term than already received.
>> If promote tries to confirm entries for instance id other than limbo->owner_id
>> rollback everything that's unconfirmed.
>
> Sorry, discard this. election_qsync_stress hangs again with this change.
>
> I haven't pushed it yet anyway.
As we noticed in a private discussion, it hung on top of this commit, but
does not hang on top of the branch, because the last commit contains a
fix for the test. So the initial change was good. I returned it in this
commit (pushed on my branch sp/gh-5445-election-fixes-review):
====================
[tosquash] Foreign promote should rollback local limbo
See the comments why.
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 31f4c1b1e..c22bd6665 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -658,12 +658,13 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
return;
}
}
+ int64_t lsn = req->lsn;
if (req->replica_id == REPLICA_ID_NIL) {
/*
* The limbo was empty on the instance issuing the request.
* This means this instance must empty its limbo as well.
*/
- assert(req->lsn == 0 && req->type == IPROTO_PROMOTE);
+ assert(lsn == 0 && req->type == IPROTO_PROMOTE);
} else if (req->replica_id != limbo->owner_id) {
/*
* Ignore CONFIRM/ROLLBACK messages for a foreign master.
@@ -671,17 +672,25 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
* data from an old leader, who has just started and written
* confirm right on synchronous transaction recovery.
*/
- return;
+ if (req->type != IPROTO_PROMOTE)
+ return;
+ /*
+ * Promote has a bigger term, and tries to steal the limbo. It
+ * means it probably was elected with a quorum, and it makes no
+ * sense to wait here for confirmations. The other nodes already
+ * elected a new leader. Rollback all the local txns.
+ */
+ lsn = 0;
}
switch (req->type) {
case IPROTO_CONFIRM:
- txn_limbo_read_confirm(limbo, req->lsn);
+ txn_limbo_read_confirm(limbo, lsn);
break;
case IPROTO_ROLLBACK:
- txn_limbo_read_rollback(limbo, req->lsn);
+ txn_limbo_read_rollback(limbo, lsn);
break;
case IPROTO_PROMOTE:
- txn_limbo_read_promote(limbo, req->origin_id, req->lsn);
+ txn_limbo_read_promote(limbo, req->origin_id, lsn);
break;
default:
unreachable();
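The decision made in the hunks above (which requests are ignored, and which LSN is finally passed down) can be sketched as one small function. This is a simplified illustration with hypothetical `fp_*` names; it assumes the nil replica id is 0 and leaves out the term filter that runs before this point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fp_req_type { FP_CONFIRM, FP_ROLLBACK, FP_PROMOTE };

/* Assumed value of REPLICA_ID_NIL, for the sketch only. */
enum { FP_REPLICA_ID_NIL = 0 };

struct fp_request {
	enum fp_req_type type;
	/* Instance whose limbo entries the request refers to. */
	uint32_t replica_id;
	int64_t lsn;
};

/*
 * Returns false when the request must be ignored. Otherwise stores
 * the LSN to confirm up to (and roll back after) in @a lsn.
 */
static bool
fp_effective_lsn(const struct fp_request *req, uint32_t owner_id,
		 int64_t *lsn)
{
	*lsn = req->lsn;
	if (req->replica_id == FP_REPLICA_ID_NIL) {
		/* The sender's limbo was empty: only PROMOTE with lsn 0. */
		assert(req->lsn == 0 && req->type == FP_PROMOTE);
	} else if (req->replica_id != owner_id) {
		/* CONFIRM/ROLLBACK for a foreign master is ignored. */
		if (req->type != FP_PROMOTE)
			return false;
		/* A foreign PROMOTE rolls back all local entries. */
		*lsn = 0;
	}
	return true;
}
```

Passing 0 down to the promote handler means nothing is confirmed and every pending entry (all of which have a positive LSN) is rolled back before ownership changes.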
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-20 10:38 ` Serge Petrenko via Tarantool-patches
@ 2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-20 22:31 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
As discussed in private, there is a way to drop this flag, which I
did in a separate commit on top of this one, see below and on the
branch sp/gh-5445-election-fixes-review:
====================
[tosquash] Remove flag do_restart_recovery
diff --git a/src/box/relay.cc b/src/box/relay.cc
index 85f335cd7..ff43c2fc7 100644
--- a/src/box/relay.cc
+++ b/src/box/relay.cc
@@ -95,7 +95,6 @@ struct relay_raft_msg {
struct cmsg_hop route[2];
struct raft_request req;
struct vclock vclock;
- bool do_restart_recovery;
struct relay *relay;
};
@@ -433,6 +432,12 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock,
relay_delete(relay);
});
+ /*
+ * Save the first vclock as 'received'. Because firstly, it was really
+ * received. Secondly, recv_vclock is used by recovery restart and must
+ * always be valid.
+ */
+ vclock_copy(&relay->recv_vclock, start_vclock);
relay->r = recovery_new(wal_dir(), false, start_vclock);
vclock_copy(&relay->stop_vclock, stop_vclock);
@@ -660,13 +665,12 @@ struct relay_is_raft_enabled_msg {
};
static void
-relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
+relay_push_raft_msg(struct relay *relay)
{
if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
return;
struct relay_raft_msg *msg =
&relay->tx.raft_msgs[relay->tx.raft_ready_msg];
- msg->do_restart_recovery = do_restart_recovery;
cpipe_push(&relay->relay_pipe, &msg->base);
relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
relay->tx.is_raft_push_sent = true;
@@ -681,16 +685,9 @@ tx_set_is_raft_enabled(struct cmsg *base)
(struct relay_is_raft_enabled_msg *)base;
struct relay *relay = msg->relay;
relay->tx.is_raft_enabled = msg->value;
- /*
- * Send saved raft message as soon as relay becomes operational.
- * Do not restart recovery upon the message arrival. Recovery is
- * positioned at replica_clock initially, i.e. already "restarted" and
- * restarting it once again would position it at the oldest xlog
- * possible, because relay reader hasn't received replica vclock yet.
- */
- if (relay->tx.is_raft_push_pending) {
- relay_push_raft_msg(msg->relay, false);
- }
+ /* Send saved raft message as soon as relay becomes operational. */
+ if (relay->tx.is_raft_push_pending)
+ relay_push_raft_msg(msg->relay);
}
/** Relay thread part of the Raft flag setting, second hop. */
@@ -901,6 +898,12 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
});
vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);
+ /*
+ * Save the first vclock as 'received'. Because firstly, it was really
+ * received. Secondly, recv_vclock is used by recovery restart and must
+ * always be valid.
+ */
+ vclock_copy(&relay->recv_vclock, replica_clock);
relay->r = recovery_new(wal_dir(), false, replica_clock);
vclock_copy(&relay->tx.vclock, replica_clock);
relay->version_id = replica_version_id;
@@ -1003,8 +1006,7 @@ relay_raft_msg_push(struct cmsg *base)
* would be ignored again.
*/
relay_send(msg->relay, &row);
- if (msg->req.state == RAFT_STATE_LEADER &&
- msg->do_restart_recovery)
+ if (msg->req.state == RAFT_STATE_LEADER)
relay_restart_recovery(msg->relay);
} catch (Exception *e) {
relay_set_error(msg->relay, e);
@@ -1018,7 +1020,7 @@ tx_raft_msg_return(struct cmsg *base)
struct relay_raft_msg *msg = (struct relay_raft_msg *)base;
msg->relay->tx.is_raft_push_sent = false;
if (msg->relay->tx.is_raft_push_pending)
- relay_push_raft_msg(msg->relay, true);
+ relay_push_raft_msg(msg->relay);
}
void
@@ -1042,7 +1044,7 @@ relay_push_raft(struct relay *relay, const struct raft_request *req)
cmsg_init(&msg->base, msg->route);
msg->relay = relay;
relay->tx.is_raft_push_pending = true;
- relay_push_raft_msg(relay, true);
+ relay_push_raft_msg(relay);
}
/** Send a single row to the client. */
* Re: [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous
2021-04-20 17:38 ` [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous Serge Petrenko via Tarantool-patches
@ 2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
0 siblings, 1 reply; 57+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-04-20 22:31 UTC (permalink / raw)
To: Serge Petrenko, gorcunov; +Cc: tarantool-patches
Thanks for the patch!
I added a test + an explanation in the comments in my commit. See
below and on the branch (sp/gh-5445-election-fixes-review):
====================
[tosquash] Add reason and test for nop limbo bypass
diff --git a/src/box/txn.c b/src/box/txn.c
index 8be102666..03b39e0de 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -603,7 +603,10 @@ txn_journal_entry_new(struct txn *txn)
bool is_sync = false;
/*
* A transaction which consists of NOPs solely should pass through the
- * limbo without waiting. Even when the limbo is not empty.
+ * limbo without waiting. Even when the limbo is not empty. This is
+ * because otherwise they might fail with the limbo being not owned by
+ * the NOPs owner. But it does not matter, because they just need to
+ * bump vclock. There is nothing to confirm or rollback in them.
*/
bool is_fully_nop = true;
diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result
index 3457d2cc9..7e711ba13 100644
--- a/test/replication/qsync_basic.result
+++ b/test/replication/qsync_basic.result
@@ -637,6 +637,67 @@ box.space.sync:count()
| - 0
| ...
+--
+-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
+-- instances, but also works for local rows.
+--
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000}
+ | ---
+ | ...
+f = fiber.create(function() box.space.sync:replace{1} end)
+ | ---
+ | ...
+test_run:wait_lsn('replica', 'default')
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+function skip_row() return nil end
+ | ---
+ | ...
+old_lsn = box.info.lsn
+ | ---
+ | ...
+_ = box.space.sync:before_replace(skip_row)
+ | ---
+ | ...
+box.space.sync:replace{2}
+ | ---
+ | ...
+box.space.sync:before_replace(nil, skip_row)
+ | ---
+ | ...
+assert(box.space.sync:get{2} == nil)
+ | ---
+ | - true
+ | ...
+assert(box.space.sync:get{1} ~= nil)
+ | ---
+ | - true
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+box.cfg{replication_synchro_quorum = 2}
+ | ---
+ | ...
+test_run:wait_cond(function() return f:status() == 'dead' end)
+ | ---
+ | - true
+ | ...
+box.space.sync:truncate()
+ | ---
+ | ...
+
--
-- gh-5191: test box.info.synchro interface. For
-- this sake we stop the replica and initiate data
diff --git a/test/replication/qsync_basic.test.lua b/test/replication/qsync_basic.test.lua
index a604d80ee..75c9b222b 100644
--- a/test/replication/qsync_basic.test.lua
+++ b/test/replication/qsync_basic.test.lua
@@ -248,6 +248,29 @@ for i = 1, 100 do box.space.sync:delete{i} end
test_run:cmd('switch replica')
box.space.sync:count()
+--
+-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
+-- instances, but also works for local rows.
+--
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000}
+f = fiber.create(function() box.space.sync:replace{1} end)
+test_run:wait_lsn('replica', 'default')
+
+test_run:switch('replica')
+function skip_row() return nil end
+old_lsn = box.info.lsn
+_ = box.space.sync:before_replace(skip_row)
+box.space.sync:replace{2}
+box.space.sync:before_replace(nil, skip_row)
+assert(box.space.sync:get{2} == nil)
+assert(box.space.sync:get{1} ~= nil)
+
+test_run:switch('default')
+box.cfg{replication_synchro_quorum = 2}
+test_run:wait_cond(function() return f:status() == 'dead' end)
+box.space.sync:truncate()
+
--
-- gh-5191: test box.info.synchro interface. For
-- this sake we stop the replica and initiate data
* Re: [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-21 5:58 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
21.04.2021 01:30, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> Consider this commit on top of yours on the
> branch sp/gh-5445-election-fixes-review:
Thanks! Looks good, I squashed it.
>
> ====================
> [tosquash] Pass args explicitly to read_promote
>
> Firstly, to be consistent with the other read_*() functions.
> Secondly, one of the next patches is going to change promote
> lsn, and it should not change the original struct synchro_request
> or copy it for a change.
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 0d2d274f6..8668eb964 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -487,13 +487,13 @@ txn_limbo_write_promote(struct txn_limbo *limbo, int64_t lsn, uint64_t term)
> * rollback all entries > @a req.lsn.
> */
> static void
> -txn_limbo_read_promote(struct txn_limbo *limbo,
> - const struct synchro_request *req)
> +txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id,
> + int64_t lsn)
> {
> - txn_limbo_read_confirm(limbo, req->lsn);
> - txn_limbo_read_rollback(limbo, req->lsn + 1);
> + txn_limbo_read_confirm(limbo, lsn);
> + txn_limbo_read_rollback(limbo, lsn + 1);
> assert(txn_limbo_is_empty(&txn_limbo));
> - limbo->owner_id = req->origin_id;
> + limbo->owner_id = replica_id;
> limbo->confirmed_lsn = 0;
> }
>
> @@ -660,7 +660,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> txn_limbo_read_rollback(limbo, req->lsn);
> break;
> case IPROTO_PROMOTE:
> - txn_limbo_read_promote(limbo, req);
> + txn_limbo_read_promote(limbo, req->origin_id, req->lsn);
> break;
> default:
> unreachable();
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-21 5:58 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
21.04.2021 01:30, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
>>> A couple of fixes on top:
>>> Only apply PROMOTE when it's for a greater term than already received.
>>> If promote tries to confirm entries for instance id other than limbo->owner_id
>>> rollback everything that's unconfirmed.
>> Sorry, discard this. election_qsync_stress hangs again with this change.
>>
>> I haven't pushed it yet anyway.
> As we noticed in a private discussion, it hung on top of this commit, but
> does not hang on top of the branch. Because the last commit contains a
> fix for the test. So the initial change was good. I returned it in this
> commit (pushed on my branch sp/gh-5445-election-fixes-review):
Yes, indeed. Thanks for noticing!
Squashed your commit.
>
> ====================
> [tosquash] Foreign promote should rollback local limbo
>
> See the comments why.
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 31f4c1b1e..c22bd6665 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -658,12 +658,13 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> return;
> }
> }
> + int64_t lsn = req->lsn;
> if (req->replica_id == REPLICA_ID_NIL) {
> /*
> * The limbo was empty on the instance issuing the request.
> * This means this instance must empty its limbo as well.
> */
> - assert(req->lsn == 0 && req->type == IPROTO_PROMOTE);
> + assert(lsn == 0 && req->type == IPROTO_PROMOTE);
> } else if (req->replica_id != limbo->owner_id) {
> /*
> * Ignore CONFIRM/ROLLBACK messages for a foreign master.
> @@ -671,17 +672,25 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> * data from an old leader, who has just started and written
> * confirm right on synchronous transaction recovery.
> */
> - return;
> + if (req->type != IPROTO_PROMOTE)
> + return;
> + /*
> + * Promote has a bigger term, and tries to steal the limbo. It
> + * means it probably was elected with a quorum, and it makes no
> + * sense to wait here for confirmations. The other nodes already
> + * elected a new leader. Rollback all the local txns.
> + */
> + lsn = 0;
> }
> switch (req->type) {
> case IPROTO_CONFIRM:
> - txn_limbo_read_confirm(limbo, req->lsn);
> + txn_limbo_read_confirm(limbo, lsn);
> break;
> case IPROTO_ROLLBACK:
> - txn_limbo_read_rollback(limbo, req->lsn);
> + txn_limbo_read_rollback(limbo, lsn);
> break;
> case IPROTO_PROMOTE:
> - txn_limbo_read_promote(limbo, req->origin_id, req->lsn);
> + txn_limbo_read_promote(limbo, req->origin_id, lsn);
> break;
> default:
> unreachable();
>
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-21 5:59 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
21.04.2021 01:31, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> As discussed in private, there is a way to drop this flag, which I
> did in a separate commit on top of this one, see below and on the
> branch sp/gh-5445-election-fixes-review:
Yes, thanks for reviewing this!
Squashed.
> ====================
> [tosquash] Remove flag do_restart_recovery
>
> diff --git a/src/box/relay.cc b/src/box/relay.cc
> index 85f335cd7..ff43c2fc7 100644
> --- a/src/box/relay.cc
> +++ b/src/box/relay.cc
> @@ -95,7 +95,6 @@ struct relay_raft_msg {
> struct cmsg_hop route[2];
> struct raft_request req;
> struct vclock vclock;
> - bool do_restart_recovery;
> struct relay *relay;
> };
>
> @@ -433,6 +432,12 @@ relay_final_join(int fd, uint64_t sync, struct vclock *start_vclock,
> relay_delete(relay);
> });
>
> + /*
> + * Save the first vclock as 'received'. Because firstly, it was really
> + * received. Secondly, recv_vclock is used by recovery restart and must
> + * always be valid.
> + */
> + vclock_copy(&relay->recv_vclock, start_vclock);
> relay->r = recovery_new(wal_dir(), false, start_vclock);
> vclock_copy(&relay->stop_vclock, stop_vclock);
>
> @@ -660,13 +665,12 @@ struct relay_is_raft_enabled_msg {
> };
>
> static void
> -relay_push_raft_msg(struct relay *relay, bool do_restart_recovery)
> +relay_push_raft_msg(struct relay *relay)
> {
> if (!relay->tx.is_raft_enabled || relay->tx.is_raft_push_sent)
> return;
> struct relay_raft_msg *msg =
> &relay->tx.raft_msgs[relay->tx.raft_ready_msg];
> - msg->do_restart_recovery = do_restart_recovery;
> cpipe_push(&relay->relay_pipe, &msg->base);
> relay->tx.raft_ready_msg = (relay->tx.raft_ready_msg + 1) % 2;
> relay->tx.is_raft_push_sent = true;
> @@ -681,16 +685,9 @@ tx_set_is_raft_enabled(struct cmsg *base)
> (struct relay_is_raft_enabled_msg *)base;
> struct relay *relay = msg->relay;
> relay->tx.is_raft_enabled = msg->value;
> - /*
> - * Send saved raft message as soon as relay becomes operational.
> - * Do not restart recovery upon the message arrival. Recovery is
> - * positioned at replica_clock initially, i.e. already "restarted" and
> - * restarting it once again would position it at the oldest xlog
> - * possible, because relay reader hasn't received replica vclock yet.
> - */
> - if (relay->tx.is_raft_push_pending) {
> - relay_push_raft_msg(msg->relay, false);
> - }
> + /* Send saved raft message as soon as relay becomes operational. */
> + if (relay->tx.is_raft_push_pending)
> + relay_push_raft_msg(msg->relay);
> }
>
> /** Relay thread part of the Raft flag setting, second hop. */
> @@ -901,6 +898,12 @@ relay_subscribe(struct replica *replica, int fd, uint64_t sync,
> });
>
> vclock_copy(&relay->local_vclock_at_subscribe, &replicaset.vclock);
> + /*
> + * Save the first vclock as 'received'. Because firstly, it was really
> + * received. Secondly, recv_vclock is used by recovery restart and must
> + * always be valid.
> + */
> + vclock_copy(&relay->recv_vclock, replica_clock);
> relay->r = recovery_new(wal_dir(), false, replica_clock);
> vclock_copy(&relay->tx.vclock, replica_clock);
> relay->version_id = replica_version_id;
> @@ -1003,8 +1006,7 @@ relay_raft_msg_push(struct cmsg *base)
> * would be ignored again.
> */
> relay_send(msg->relay, &row);
> - if (msg->req.state == RAFT_STATE_LEADER &&
> - msg->do_restart_recovery)
> + if (msg->req.state == RAFT_STATE_LEADER)
> relay_restart_recovery(msg->relay);
> } catch (Exception *e) {
> relay_set_error(msg->relay, e);
> @@ -1018,7 +1020,7 @@ tx_raft_msg_return(struct cmsg *base)
> struct relay_raft_msg *msg = (struct relay_raft_msg *)base;
> msg->relay->tx.is_raft_push_sent = false;
> if (msg->relay->tx.is_raft_push_pending)
> - relay_push_raft_msg(msg->relay, true);
> + relay_push_raft_msg(msg->relay);
> }
>
> void
> @@ -1042,7 +1044,7 @@ relay_push_raft(struct relay *relay, const struct raft_request *req)
> cmsg_init(&msg->base, msg->route);
> msg->relay = relay;
> relay->tx.is_raft_push_pending = true;
> - relay_push_raft_msg(relay, true);
> + relay_push_raft_msg(relay);
> }
>
> /** Send a single row to the client. */
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-21 5:59 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
21.04.2021 01:31, Vladislav Shpilevoy wrote:
> Thanks for the patch!
>
> I added a test + an explanation in the comments in my commit. See
> below and on the branch (sp/gh-5445-election-fixes-review):
Looks good, squashed as well.
>
> ====================
> [tosquash] Add reason and test for nop limbo bypass
>
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 8be102666..03b39e0de 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -603,7 +603,10 @@ txn_journal_entry_new(struct txn *txn)
> bool is_sync = false;
> /*
> * A transaction which consists of NOPs solely should pass through the
> - * limbo without waiting. Even when the limbo is not empty.
> + * limbo without waiting. Even when the limbo is not empty. This is
> + * because otherwise they might fail with the limbo being not owned by
> + * the NOPs owner. But it does not matter, because they just need to
> + * bump vclock. There is nothing to confirm or rollback in them.
> */
> bool is_fully_nop = true;
>
> diff --git a/test/replication/qsync_basic.result b/test/replication/qsync_basic.result
> index 3457d2cc9..7e711ba13 100644
> --- a/test/replication/qsync_basic.result
> +++ b/test/replication/qsync_basic.result
> @@ -637,6 +637,67 @@ box.space.sync:count()
> | - 0
> | ...
>
> +--
> +-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
> +-- instances, but also works for local rows.
> +--
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000}
> + | ---
> + | ...
> +f = fiber.create(function() box.space.sync:replace{1} end)
> + | ---
> + | ...
> +test_run:wait_lsn('replica', 'default')
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +function skip_row() return nil end
> + | ---
> + | ...
> +old_lsn = box.info.lsn
> + | ---
> + | ...
> +_ = box.space.sync:before_replace(skip_row)
> + | ---
> + | ...
> +box.space.sync:replace{2}
> + | ---
> + | ...
> +box.space.sync:before_replace(nil, skip_row)
> + | ---
> + | ...
> +assert(box.space.sync:get{2} == nil)
> + | ---
> + | - true
> + | ...
> +assert(box.space.sync:get{1} ~= nil)
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +box.cfg{replication_synchro_quorum = 2}
> + | ---
> + | ...
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> + | ---
> + | - true
> + | ...
> +box.space.sync:truncate()
> + | ---
> + | ...
> +
> --
> -- gh-5191: test box.info.synchro interface. For
> -- this sake we stop the replica and initiate data
> diff --git a/test/replication/qsync_basic.test.lua b/test/replication/qsync_basic.test.lua
> index a604d80ee..75c9b222b 100644
> --- a/test/replication/qsync_basic.test.lua
> +++ b/test/replication/qsync_basic.test.lua
> @@ -248,6 +248,29 @@ for i = 1, 100 do box.space.sync:delete{i} end
> test_run:cmd('switch replica')
> box.space.sync:count()
>
> +--
> +-- gh-5445: NOPs bypass the limbo for the sake of vclock bumps from foreign
> +-- instances, but also works for local rows.
> +--
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 3, replication_synchro_timeout = 1000}
> +f = fiber.create(function() box.space.sync:replace{1} end)
> +test_run:wait_lsn('replica', 'default')
> +
> +test_run:switch('replica')
> +function skip_row() return nil end
> +old_lsn = box.info.lsn
> +_ = box.space.sync:before_replace(skip_row)
> +box.space.sync:replace{2}
> +box.space.sync:before_replace(nil, skip_row)
> +assert(box.space.sync:get{2} == nil)
> +assert(box.space.sync:get{1} ~= nil)
> +
> +test_run:switch('default')
> +box.cfg{replication_synchro_quorum = 2}
> +test_run:wait_cond(function() return f:status() == 'dead' end)
> +box.space.sync:truncate()
> +
> --
> -- gh-5191: test box.info.synchro interface. For
> -- this sake we stop the replica and initiate data
--
Serge Petrenko
* Re: [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions
2021-04-20 22:30 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
@ 2021-04-21 6:01 ` Serge Petrenko via Tarantool-patches
0 siblings, 0 replies; 57+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-04-21 6:01 UTC (permalink / raw)
To: Vladislav Shpilevoy, gorcunov; +Cc: tarantool-patches
On 21.04.2021 01:30, Vladislav Shpilevoy wrote:
> I pushed my fixes to this branch: sp/gh-5445-election-fixes-review
> as separate commits among your commits.
Thanks! I squashed all of your fixes on sp/gh-5445-election-fixes-review
>
> This does not mean the patch is ready to push: we still have
> known issues with missing tests, and some points we still need
> to fix, such as:
>
> - Limbo ownership should not change automatically. This is the only
> way to make it more or less stable;
>
> - The NOP filter should work regardless of election mode. Otherwise
> we still have the gh-5445 issue when election is off;
>
> - There are no tests for the NOP filter not working when election is off;
> for a few places in box.cc in box.ctl.promote();
> for why the NOP filter must use the txn origin ID instead of the applier
> instance ID. This might lead to data loss if we did something wrong.
I agree.
> My commits are marked as [tosquash].
>
> For the problems above I will create tickets if the patch is pushed as
> in the branch sp/gh-5445-election-fixes-review.
--
Serge Petrenko
end of thread, other threads:[~2021-04-21 6:01 UTC | newest]
Thread overview: 57+ messages
2021-04-16 16:25 [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 01/12] wal: make wal_assign_lsn accept journal entry Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 02/12] xrow: enrich row's meta information with sync replication flags Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 03/12] xrow: introduce a PROMOTE entry Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 04/12] box: actualise iproto_key_type array Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 05/12] box: make clear_synchro_queue() write a PROMOTE entry instead of CONFIRM + ROLLBACK Serge Petrenko via Tarantool-patches
2021-04-16 22:12 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:24 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 06/12] box: write PROMOTE even for empty limbo Serge Petrenko via Tarantool-patches
2021-04-19 13:39 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 07/12] raft: filter rows based on known peer terms Serge Petrenko via Tarantool-patches
2021-04-16 22:21 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:49 ` Serge Petrenko via Tarantool-patches
2021-04-18 15:44 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:31 ` Serge Petrenko via Tarantool-patches
2021-04-18 16:27 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:30 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:29 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:31 ` Serge Petrenko via Tarantool-patches
2021-04-20 20:55 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:58 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 08/12] election: introduce a new election mode: "manual" Serge Petrenko via Tarantool-patches
2021-04-19 22:34 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 9:25 ` Serge Petrenko via Tarantool-patches
2021-04-20 17:37 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 09/12] raft: introduce raft_start/stop_candidate Serge Petrenko via Tarantool-patches
2021-04-16 22:23 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 8:59 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 9:28 ` Serge Petrenko via Tarantool-patches
2021-04-19 12:52 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 10/12] election: support manual elections in clear_synchro_queue() Serge Petrenko via Tarantool-patches
2021-04-16 22:24 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-18 9:26 ` Serge Petrenko via Tarantool-patches
2021-04-18 16:07 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 9:32 ` Serge Petrenko via Tarantool-patches
2021-04-19 12:47 ` Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 11/12] box: remove parameter from clear_synchro_queue Serge Petrenko via Tarantool-patches
2021-04-16 16:25 ` [Tarantool-patches] [PATCH v4 12/12] box.ctl: rename clear_synchro_queue to promote Serge Petrenko via Tarantool-patches
2021-04-19 22:35 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 10:22 ` Serge Petrenko via Tarantool-patches
2021-04-18 12:00 ` [Tarantool-patches] [PATCH v4 13/12] replication: send accumulated Raft messages after relay start Serge Petrenko via Tarantool-patches
2021-04-18 16:03 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-19 12:11 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:36 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-20 10:38 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
2021-04-19 22:37 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
2021-04-20 17:38 ` [Tarantool-patches] [PATCH v4 14/12] txn: make NOPs fully asynchronous Serge Petrenko via Tarantool-patches
2021-04-20 22:31 ` Vladislav Shpilevoy via Tarantool-patches
2021-04-21 5:59 ` Serge Petrenko via Tarantool-patches
2021-04-20 22:30 ` [Tarantool-patches] [PATCH v4 00/12] raft: introduce manual elections and fix a bug with re-applying rolled back transactions Vladislav Shpilevoy via Tarantool-patches
2021-04-21 6:01 ` Serge Petrenko via Tarantool-patches