[Tarantool-patches] [PATCH v2 6/6] txn_limbo: ignore CONFIRM/ROLLBACK for a foreign master
Serge Petrenko
sergepetrenko at tarantool.org
Wed Dec 23 14:59:24 MSK 2020
We designed the limbo so that it errors on receiving a CONFIRM or ROLLBACK
for another instance's data. In fact, this error is pointless, and even
harmful. Here's why:
Imagine you have 3 instances, 1, 2 and 3.
First 1 writes some synchronous transactions, but dies before writing CONFIRM.
Now 2 has to write CONFIRM instead of 1 to take limbo ownership.
From now on 2 is the limbo owner and, under high enough load, it constantly
has some data in the limbo.
Once 1 restarts, it first recovers its xlogs, and fills its limbo with
its own unconfirmed transactions from the previous run. Now replication
between 1, 2 and 3 is started and the first thing 1 sees is that 2 and 3
ack its old transactions. So 1 writes CONFIRM for its own transactions
even before the same CONFIRM written by 2 reaches it.
Once the CONFIRM written by 1 is replicated to 2 and 3, they error and
stop replication, since their limbo contains entries from 2, not from 1.
Actually, there's no need to error, since it's just a really old CONFIRM
that has already been processed by both 2 and 3.
So, ignore CONFIRM/ROLLBACK when it references a wrong limbo owner.
The issue was discovered with test replication/election_qsync_stress.
Follow-up #5435
---
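The scenario above can be modelled with a minimal, self-contained sketch.
It is only an illustration: every toy_* name is hypothetical and is not
part of the Tarantool sources, and the toy limbo tracks nothing but its
owner and two LSNs. It shows that a CONFIRM from a non-owner (the
restarted instance 1) leaves the limbo owned by instance 2 untouched,
while a CONFIRM from the owner itself is applied.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real request and limbo structures. */
enum toy_req_type { TOY_CONFIRM, TOY_ROLLBACK };

struct toy_request {
	enum toy_req_type type;
	uint32_t replica_id;	/* instance whose transactions are referenced */
	int64_t lsn;
};

struct toy_limbo {
	uint32_t owner_id;	/* instance whose transactions are in the limbo */
	int64_t confirmed_lsn;	/* last confirmed LSN of the owner */
	int64_t rollback_lsn;	/* last rollback LSN of the owner, -1 if none */
};

/* Apply a request; silently skip requests for a foreign master. */
static void
toy_limbo_process(struct toy_limbo *limbo, const struct toy_request *req)
{
	if (req->replica_id != limbo->owner_id)
		return;
	switch (req->type) {
	case TOY_CONFIRM:
		if (req->lsn > limbo->confirmed_lsn)
			limbo->confirmed_lsn = req->lsn;
		break;
	case TOY_ROLLBACK:
		limbo->rollback_lsn = req->lsn;
		break;
	}
}

/*
 * A limbo owned by instance 2 with confirmed_lsn == 10 receives a CONFIRM
 * from sender_id. Returns the resulting confirmed_lsn.
 */
static int64_t
toy_confirm_from(uint32_t sender_id, int64_t lsn)
{
	struct toy_limbo limbo = {
		.owner_id = 2, .confirmed_lsn = 10, .rollback_lsn = -1,
	};
	struct toy_request req = {
		.type = TOY_CONFIRM, .replica_id = sender_id, .lsn = lsn,
	};
	toy_limbo_process(&limbo, &req);
	return limbo.confirmed_lsn;
}
```

With this model, toy_confirm_from(1, 100) leaves confirmed_lsn at 10,
because the request references instance 1, while toy_confirm_from(2, 15)
advances it to 15.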
src/box/applier.cc | 3 +--
src/box/box.cc | 3 +--
src/box/txn_limbo.c | 14 +++++++++-----
src/box/txn_limbo.h | 2 +-
4 files changed, 12 insertions(+), 10 deletions(-)
diff --git a/src/box/applier.cc b/src/box/applier.cc
index fb2f5d130..553db76fc 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -861,8 +861,7 @@ apply_synchro_row(struct xrow_header *row)
if (xrow_decode_synchro(row, &req) != 0)
goto err;
- if (txn_limbo_process(&txn_limbo, &req))
- goto err;
+ txn_limbo_process(&txn_limbo, &req);
struct synchro_entry *entry;
entry = synchro_entry_new(row, &req);
diff --git a/src/box/box.cc b/src/box/box.cc
index 38bf4034e..fc4888955 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -383,8 +383,7 @@ apply_wal_row(struct xstream *stream, struct xrow_header *row)
struct synchro_request syn_req;
if (xrow_decode_synchro(row, &syn_req) != 0)
diag_raise();
- if (txn_limbo_process(&txn_limbo, &syn_req) != 0)
- diag_raise();
+ txn_limbo_process(&txn_limbo, &syn_req);
return;
}
if (iproto_type_is_raft_request(row->type)) {
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 9272f5227..9498c7a44 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -634,13 +634,17 @@ complete:
return 0;
}
-int
+void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
if (req->replica_id != limbo->owner_id) {
- diag_set(ClientError, ER_SYNC_MASTER_MISMATCH,
- req->replica_id, limbo->owner_id);
- return -1;
+ /*
+ * Ignore CONFIRM/ROLLBACK messages for a foreign master.
+ * These are most likely outdated messages for already confirmed
+ * data from an old leader, who has just started and written
+ * confirm right on synchronous transaction recovery.
+ */
+ return;
}
switch (req->type) {
case IPROTO_CONFIRM:
@@ -652,7 +656,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
default:
unreachable();
}
- return 0;
+ return;
}
void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index a49356c14..c28b5666d 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -257,7 +257,7 @@ int
txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
/** Execute a synchronous replication request. */
-int
+void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
/**
--
2.24.3 (Apple Git-128)