[Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering

Tarantool development patches archive
 help / color / mirror / Atom feed

* [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering
@ 2021-09-10 15:29 Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Guys, take a look please, once time permit. The questionable moments:

- use filter disabling procedure for join/recovery: we make it so since
  snapshot has promote record which fills initial limbo state

- need more tests to cover all possible scenarios

- I keep filter_confirm_rollback() as is but rereading Vlad's comment
  >
  > 9. What if rollback is for LSN > limbo's last LSN? It
  > also means nothing to do. The same for confirm LSN < limbo's
  > first LSN.
  >
  I presume I need to traverse limbo and test if incoming LSN is
  present inside current queue.

Anyway I send this version early to gather more comments, I hope
not that much left to implement to be ready for merging.

previous series https://lists.tarantool.org/tarantool-patches/20210804190752.488147-1-gorcunov@gmail.com/
branch gorcunov/gh-6036-rollback-confirm-14
issue https://github.com/tarantool/tarantool/issues/6036

v6:
 - use txn_limbo_terms name for structure
 - rebase on fresh sp/gh-6034-empty-limbo-transition branch
 - rework filtering chains
v8:
 - add ability to disable filtering for local recovery
   and join stages
 - update tests
v9:
 - opencode terms tracking
 - fix tests to use wait function since log output might
   be deferred by OS
v10:
 - rework FILTER_IN and FILTER_PROMOTE chains with more
   detailed packets inspection
 - preserve old naming for terms manipulations
 - require the packet's replica_id to match limbo owner_id
   all the time
v11-13: internal
v14:
 - use straightforward packet inspection by their type
   without more general type routing
 - tried to hide locking api inside limbo level but since
   journal completion is called from inside of sched fiber
   the lock owner get migrated which cause error thus leave
   explicit locking instead
 - added updating of limbo::confirmed_lsn since we need it
   for proper validation
 - added new error code to distinguish filter errors from
   anything else
 - use say_error instead of say_info
 - keep disabling of filtration inside initial join/recovery
   because we're filling initial limbo state

Cyrill Gorcunov (6):
  qsync: track confirmed lsn number on reads
  qsync: update confirmed lsn on initial promote request
  latch: add latch_is_locked helper
  qsync: order access to the limbo terms
  qsync: filter incoming synchro requests
  test: add replication/gh-6036-rollback-confirm

 .../gh-6036-qsync-filter-packets.md           |   9 +
 src/box/applier.cc                            |  26 +-
 src/box/box.cc                                |  30 +-
 src/box/errcode.h                             |   1 +
 src/box/memtx_engine.cc                       |   3 +-
 src/box/txn_limbo.c                           | 337 +++++++++++++++---
 src/box/txn_limbo.h                           |  85 ++++-
 src/lib/core/latch.h                          |  11 +
 test/box/error.result                         |   1 +
 test/replication/gh-6036-master.lua           |   1 +
 test/replication/gh-6036-node.lua             |  33 ++
 test/replication/gh-6036-replica.lua          |   1 +
 .../gh-6036-rollback-confirm.result           | 180 ++++++++++
 .../gh-6036-rollback-confirm.test.lua         |  92 +++++
 14 files changed, 747 insertions(+), 63 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6036-qsync-filter-packets.md
 create mode 120000 test/replication/gh-6036-master.lua
 create mode 100644 test/replication/gh-6036-node.lua
 create mode 120000 test/replication/gh-6036-replica.lua
 create mode 100644 test/replication/gh-6036-rollback-confirm.result
 create mode 100644 test/replication/gh-6036-rollback-confirm.test.lua


base-commit: b0431cf8f47e9d081f6a402bc18edb1d6ad49847
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-13  8:50   ` Serge Petrenko via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

We will use this lsn for requests validation
in next patches for sake of split-brain detection.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/txn_limbo.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 70447caaf..cca2ce493 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 		assert(e->txn->signature >= 0);
 		txn_complete_success(e->txn);
 	}
+
+	/*
+	 * We use confirmed lsn number to verify requests and
+	 * reject ones coming from split-brain cluster configurations,
+	 * so update it even if there were no entries to process.
+	 */
+	limbo->confirmed_lsn = lsn;
 }
 
 /**
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

When promote request is handled we drop last confirmed
lsn to zero because its value make sense for sync queue
owner only. Still the case where we become queue owner
for the first time is special - we need to fetch the
obtained lsn from the request and remember it so we
will be able to filter any next malformed requests
with wrong lsn numbers (see queue filtering procedure
in next patch).

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/txn_limbo.c | 8 +++++++-
 src/box/txn_limbo.h | 6 ++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index cca2ce493..08463219d 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -50,6 +50,7 @@ txn_limbo_create(struct txn_limbo *limbo)
 	limbo->confirmed_lsn = 0;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
+	limbo->has_initial_promote = false;
 }
 
 bool
@@ -521,8 +522,13 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id,
 	txn_limbo_read_rollback(limbo, lsn + 1);
 	assert(txn_limbo_is_empty(&txn_limbo));
 	limbo->owner_id = replica_id;
+	if (likely(limbo->has_initial_promote)) {
+		limbo->confirmed_lsn = 0;
+	} else {
+		limbo->confirmed_lsn = lsn;
+		limbo->has_initial_promote = true;
+	}
 	box_update_ro_summary();
-	limbo->confirmed_lsn = 0;
 }
 
 void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 53e52f676..e0d17de4b 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -179,6 +179,12 @@ struct txn_limbo {
 	 * by the 'reversed rollback order' rule - contradiction.
 	 */
 	bool is_in_rollback;
+	/**
+	 * Whether the limbo received initial PROMOTE request. It is needed to
+	 * update confirmed_lsn appropriately and pass packet validation/filtering
+	 * procedure.
+	 */
+	bool has_initial_promote;
 };
 
 /**
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

To test if latch is locked.

In-scope-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/lib/core/latch.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/lib/core/latch.h b/src/lib/core/latch.h
index 49c59cf63..0aaa8b634 100644
--- a/src/lib/core/latch.h
+++ b/src/lib/core/latch.h
@@ -95,6 +95,17 @@ latch_owner(struct latch *l)
 	return l->owner;
 }
 
+/**
+ * Return true if the latch is locked.
+ *
+ * @param l - latch to be tested.
+ */
+static inline bool
+latch_is_locked(const struct latch *l)
+{
+	return l->owner != NULL;
+}
+
 /**
  * Lock a latch. If the latch is already locked by another fiber,
  * waits for timeout.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Limbo terms tracking is shared between appliers and when
one of appliers is waiting for write to complete inside
journal_write() routine, an other may need to access read
term value to figure out if promote request is valid to
apply. Due to cooperative multitasking access to the terms
is not consistent so we need to be sure that other fibers
either read up to date terms (ie written to the WAL).

For this sake we use latching mechanism, when one fiber
took terms lock for updating other readers are waiting
until the operation is complete.

For example here is a call graph of two appliers

applier 1
---------
applier_apply_tx
  (promote term = 3
   current max term = 2)
  applier_synchro_filter_tx
  apply_synchro_row
    journal_write
      (sleeping)

at this moment another applier comes in with obsolete
data and term 2

                              applier 2
                              ---------
                              applier_apply_tx
                                (term 2)
                                applier_synchro_filter_tx
                                  txn_limbo_is_replica_outdated -> false
                                journal_write (sleep)

applier 1
---------
journal wakes up
  apply_synchro_row_cb
    set max term to 3

So the applier 2 didn't notice that term 3 is already seen
and wrote obsolete data. With locking the applier 2 will
wait until applier 1 has finished its write.

Part-of #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc  |  5 ++++-
 src/box/txn_limbo.c | 17 ++++++++++++++--
 src/box/txn_limbo.h | 48 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 62 insertions(+), 8 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index b981bd436..845a7d015 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -857,7 +857,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 		applier_rollback_by_wal_io(entry->res);
 	} else {
 		replica_txn_wal_write_cb(synchro_entry->rcb);
-		txn_limbo_process(&txn_limbo, synchro_entry->req);
+		txn_limbo_process_locked(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
 	fiber_wakeup(synchro_entry->owner);
@@ -873,6 +873,7 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
+	txn_limbo_term_lock(&txn_limbo);
 	struct replica_cb_data rcb_data;
 	struct synchro_entry entry;
 	/*
@@ -915,8 +916,10 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 		diag_set_journal_res(entry.base.res);
 		goto err;
 	}
+	txn_limbo_term_unlock(&txn_limbo);
 	return 0;
 err:
+	txn_limbo_term_unlock(&txn_limbo);
 	diag_log();
 	return -1;
 }
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 08463219d..65fbd0cac 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -47,6 +47,7 @@ txn_limbo_create(struct txn_limbo *limbo)
 	vclock_create(&limbo->vclock);
 	vclock_create(&limbo->promote_term_map);
 	limbo->promote_greatest_term = 0;
+	latch_create(&limbo->promote_latch);
 	limbo->confirmed_lsn = 0;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
@@ -737,11 +738,14 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout)
 }
 
 void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
+txn_limbo_process_locked(struct txn_limbo *limbo,
+			 const struct synchro_request *req)
 {
+	assert(latch_is_locked(&limbo->promote_latch));
+
 	uint64_t term = req->term;
 	uint32_t origin = req->origin_id;
-	if (txn_limbo_replica_term(limbo, origin) < term) {
+	if (txn_limbo_replica_term_locked(limbo, origin) < term) {
 		vclock_follow(&limbo->promote_term_map, origin, term);
 		if (term > limbo->promote_greatest_term)
 			limbo->promote_greatest_term = term;
@@ -799,6 +803,15 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 	return;
 }
 
+void
+txn_limbo_process(struct txn_limbo *limbo,
+		  const struct synchro_request *req)
+{
+	txn_limbo_term_lock(limbo);
+	txn_limbo_process_locked(limbo, req);
+	txn_limbo_term_unlock(limbo);
+}
+
 void
 txn_limbo_on_parameters_change(struct txn_limbo *limbo)
 {
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index e0d17de4b..1ee815d1c 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -31,6 +31,7 @@
  */
 #include "small/rlist.h"
 #include "vclock/vclock.h"
+#include "latch.h"
 
 #include <stdint.h>
 
@@ -147,6 +148,10 @@ struct txn_limbo {
 	 * limbo and raft are in sync and the terms are the same.
 	 */
 	uint64_t promote_greatest_term;
+	/**
+	 * To order access to the promote data.
+	 */
+	struct latch promote_latch;
 	/**
 	 * Maximal LSN gathered quorum and either already confirmed in WAL, or
 	 * whose confirmation is in progress right now. Any attempt to confirm
@@ -217,14 +222,39 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
 				in_queue);
 }
 
+/** Lock promote data. */
+static inline void
+txn_limbo_term_lock(struct txn_limbo *limbo)
+{
+	latch_lock(&limbo->promote_latch);
+}
+
+/** Unlock promote data. */
+static inline void
+txn_limbo_term_unlock(struct txn_limbo *limbo)
+{
+	latch_unlock(&limbo->promote_latch);
+}
+
+/** Fetch replica's term with lock taken. */
+static inline uint64_t
+txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id)
+{
+	assert(latch_is_locked(&limbo->promote_latch));
+	return vclock_get(&limbo->promote_term_map, replica_id);
+}
+
 /**
  * Return the latest term as seen in PROMOTE requests from instance with id
  * @a replica_id.
  */
 static inline uint64_t
-txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
+txn_limbo_replica_term(struct txn_limbo *limbo, uint32_t replica_id)
 {
-	return vclock_get(&limbo->promote_term_map, replica_id);
+	txn_limbo_term_lock(limbo);
+	uint64_t v = txn_limbo_replica_term_locked(limbo, replica_id);
+	txn_limbo_term_unlock(limbo);
+	return v;
 }
 
 /**
@@ -232,11 +262,14 @@ txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
  * data from it. The check is only valid when elections are enabled.
  */
 static inline bool
-txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
+txn_limbo_is_replica_outdated(struct txn_limbo *limbo,
 			      uint32_t replica_id)
 {
-	return txn_limbo_replica_term(limbo, replica_id) <
-	       limbo->promote_greatest_term;
+	txn_limbo_term_lock(limbo);
+	bool res = txn_limbo_replica_term_locked(limbo, replica_id) <
+		limbo->promote_greatest_term;
+	txn_limbo_term_unlock(limbo);
+	return res;
 }
 
 /**
@@ -308,6 +341,11 @@ txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
 
 /** Execute a synchronous replication request. */
 void
+txn_limbo_process_locked(struct txn_limbo *limbo,
+			 const struct synchro_request *req);
+
+/** Lock limbo terms and execute a synchronous replication request. */
+void
 txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
 
 /**
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:43 ` [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

When we receive synchro requests we can't just apply
them blindly because in worse case they may come from
split-brain configuration (where a cluster split into
several clusters and each one has own leader elected,
then clusters are trying to merge back into the original
one). We need to do our best to detect such disunity
and force these nodes to rejoin from the scratch for
data consistency sake.

Thus when we're processing requests we pass them to the
packet filter first which validates their contents and
refuse to apply if they are not matched.

Filter logic depends on request type.

First there is a common chain for any synchro packet, this
is kind of a general pre-validation:
 1) Zero LSN allowed for PROMOTE | DEMOTE packets, since
    CONFIRM | ROLLBACK has to proceed some real data with
    LSN already assigned.
 2) request:replica_id = 0 allowed for PROMOTE request only.
 3) request:replica_id should match limbo:owner_id, iow the
    limbo migration should be noticed by all instances in the
    cluster.

For CONFIRM and ROLLBACK packets:
 1) Both of them can't be considered if limbo is already empty,
    ie there is no data in a local queue and everything is
    processed already. The request is obviously from the node which
    is out of date.

For PROMOTE and DEMOTE packets:
 1) The requests should come in with nonzero term, otherwise
    the packet is corrupted.
 2) The request's term should not be less than maximal known
    one, iow it should not come in from nodes which didn't notice
    raft epoch changes and living in the past.
 3) If LSN of the request matches current confirmed LSN the packet
    is obviously correct to process.
 4) If LSN is less than confirmed LSN then the request is wrong,
    we have processed the requested LSN already.
 5) If LSN is less than confirmed LSN then
    a) If limbo is empty we can't do anything, since data is already
       processed and should issue an error;
    b) If there is some data in the limbo then requested LSN should
       be in range of limbo's [first; last] LSNs, thus the request
       will be able to commit and rollback limbo queue.

Because snapshot have promote packet we disable filtering at moment
of joining to the leader node and similarly due to recovery. The thing
is that our filtration procedure implies that limbo is already
initialized to some valid state otherwise we will have to distinguish
initial states from working ones, this can be done actuially but will
make code more complex. Thus for now lets leave filtration on and off.

Closes #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 .../gh-6036-qsync-filter-packets.md           |   9 +
 src/box/applier.cc                            |  21 +-
 src/box/box.cc                                |  30 +-
 src/box/errcode.h                             |   1 +
 src/box/memtx_engine.cc                       |   3 +-
 src/box/txn_limbo.c                           | 309 +++++++++++++++---
 src/box/txn_limbo.h                           |  33 +-
 test/box/error.result                         |   1 +
 8 files changed, 350 insertions(+), 57 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6036-qsync-filter-packets.md

diff --git a/changelogs/unreleased/gh-6036-qsync-filter-packets.md b/changelogs/unreleased/gh-6036-qsync-filter-packets.md
new file mode 100644
index 000000000..0db629e83
--- /dev/null
+++ b/changelogs/unreleased/gh-6036-qsync-filter-packets.md
@@ -0,0 +1,9 @@
+## feature/replication
+
+* Implemented incoming synchronous packets filtration to discard
+  requests from outdated cluster nodes. This can happen when
+  replication cluster is partitioned on a transport level and
+  two or more sub-clusters are running simultaneously for some
+  time, then they are trying to merge back. Since the subclusters
+  had own leaders they should not be able to form original cluster
+  because data is not longer consistent (gh-6036).
diff --git a/src/box/applier.cc b/src/box/applier.cc
index 845a7d015..45098e3dd 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier)
 				struct synchro_request req;
 				if (xrow_decode_synchro(&row, &req) != 0)
 					diag_raise();
-				txn_limbo_process(&txn_limbo, &req);
+				if (txn_limbo_process(&txn_limbo, &req) != 0)
+					diag_raise();
 			} else if (iproto_type_is_raft_request(row.type)) {
 				struct raft_request req;
 				if (xrow_decode_raft(&row, &req, NULL) != 0)
@@ -514,6 +515,11 @@ applier_fetch_snapshot(struct applier *applier)
 	struct ev_io *coio = &applier->io;
 	struct xrow_header row;
 
+	txn_limbo_filter_disable(&txn_limbo);
+	auto filter_guard = make_scoped_guard([&]{
+		txn_limbo_filter_enable(&txn_limbo);
+	});
+
 	memset(&row, 0, sizeof(row));
 	row.type = IPROTO_FETCH_SNAPSHOT;
 	coio_write_xrow(coio, &row);
@@ -587,6 +593,11 @@ applier_register(struct applier *applier, bool was_anon)
 	struct ev_io *coio = &applier->io;
 	struct xrow_header row;
 
+	txn_limbo_filter_disable(&txn_limbo);
+	auto filter_guard = make_scoped_guard([&]{
+		txn_limbo_filter_enable(&txn_limbo);
+	});
+
 	memset(&row, 0, sizeof(row));
 	/*
 	 * Send this instance's current vclock together
@@ -620,6 +631,11 @@ applier_join(struct applier *applier)
 	struct xrow_header row;
 	uint64_t row_count;
 
+	txn_limbo_filter_disable(&txn_limbo);
+	auto filter_guard = make_scoped_guard([&]{
+		txn_limbo_filter_enable(&txn_limbo);
+	});
+
 	xrow_encode_join_xc(&row, &INSTANCE_UUID);
 	coio_write_xrow(coio, &row);
 
@@ -874,6 +890,9 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 		goto err;
 
 	txn_limbo_term_lock(&txn_limbo);
+	if (txn_limbo_filter_locked(&txn_limbo, &req) != 0)
+		goto err;
+
 	struct replica_cb_data rcb_data;
 	struct synchro_entry entry;
 	/*
diff --git a/src/box/box.cc b/src/box/box.cc
index 7b11d56d6..f134dc8bb 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -424,8 +424,7 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
 		say_error("couldn't decode a synchro request");
 		return -1;
 	}
-	txn_limbo_process(&txn_limbo, &syn_req);
-	return 0;
+	return txn_limbo_process(&txn_limbo, &syn_req);
 }
 
 static int
@@ -1671,7 +1670,7 @@ box_wait_limbo_acked(double timeout)
 }
 
 /** Write and process a PROMOTE request. */
-static void
+static int
 box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
 {
 	struct raft *raft = box_raft();
@@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
 		.lsn = promote_lsn,
 		.term = raft->term,
 	};
-	txn_limbo_process(&txn_limbo, &req);
+	if (txn_limbo_process(&txn_limbo, &req) != 0)
+		return -1;
 	assert(txn_limbo_is_empty(&txn_limbo));
+	return 0;
 }
 
 /** A guard to block multiple simultaneous promote()/demote() invocations. */
 static bool is_in_box_promote = false;
 
 /** Write and process a DEMOTE request. */
-static void
+static int
 box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
 {
 	assert(box_raft()->volatile_term == box_raft()->term);
@@ -1708,8 +1709,10 @@ box_issue_demote(uint32_t prev_leader_id, int64_t promote_lsn)
 		.lsn = promote_lsn,
 		.term = box_raft()->term,
 	};
-	txn_limbo_process(&txn_limbo, &req);
+	if (txn_limbo_process(&txn_limbo, &req) != 0)
+		return -1;
 	assert(txn_limbo_is_empty(&txn_limbo));
+	return 0;
 }
 
 int
@@ -1732,8 +1735,7 @@ box_promote_qsync(void)
 		diag_set(ClientError, ER_NOT_LEADER, raft->leader);
 		return -1;
 	}
-	box_issue_promote(txn_limbo.owner_id, wait_lsn);
-	return 0;
+	return box_issue_promote(txn_limbo.owner_id, wait_lsn);
 }
 
 int
@@ -1789,9 +1791,7 @@ box_promote(void)
 	if (wait_lsn < 0)
 		return -1;
 
-	box_issue_promote(txn_limbo.owner_id, wait_lsn);
-
-	return 0;
+	return box_issue_promote(txn_limbo.owner_id, wait_lsn);
 }
 
 int
@@ -1826,8 +1826,7 @@ box_demote(void)
 	int64_t wait_lsn = box_wait_limbo_acked(replication_synchro_timeout);
 	if (wait_lsn < 0)
 		return -1;
-	box_issue_demote(txn_limbo.owner_id, wait_lsn);
-	return 0;
+	return box_issue_demote(txn_limbo.owner_id, wait_lsn);
 }
 
 int
@@ -3296,6 +3295,11 @@ local_recovery(const struct tt_uuid *instance_uuid,
 
 	say_info("instance uuid %s", tt_uuid_str(&INSTANCE_UUID));
 
+	txn_limbo_filter_disable(&txn_limbo);
+	auto filter_guard = make_scoped_guard([&]{
+		txn_limbo_filter_enable(&txn_limbo);
+	});
+
 	struct wal_stream wal_stream;
 	wal_stream_create(&wal_stream);
 	auto stream_guard = make_scoped_guard([&]{
diff --git a/src/box/errcode.h b/src/box/errcode.h
index a6f096698..002fcc1e1 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -283,6 +283,7 @@ struct errcode_record {
 	/*228 */_(ER_SYNC_QUEUE_FOREIGN,	"The synchronous transaction queue belongs to other instance with id %u")\
 	/*226 */_(ER_UNABLE_TO_PROCESS_IN_STREAM, "Unable to process %s request in stream") \
 	/*227 */_(ER_UNABLE_TO_PROCESS_OUT_OF_STREAM, "Unable to process %s request out of stream") \
+	/*228 */_(ER_CLUSTER_SPLIT,		"Cluster split detected. %s") \
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/memtx_engine.cc b/src/box/memtx_engine.cc
index de918c335..de4298929 100644
--- a/src/box/memtx_engine.cc
+++ b/src/box/memtx_engine.cc
@@ -238,7 +238,8 @@ memtx_engine_recover_synchro(const struct xrow_header *row)
 	 * because all its rows have a zero replica_id.
 	 */
 	req.origin_id = req.replica_id;
-	txn_limbo_process(&txn_limbo, &req);
+	if (txn_limbo_process(&txn_limbo, &req) != 0)
+		return -1;
 	return 0;
 }
 
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 65fbd0cac..925f401e7 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -52,6 +52,7 @@ txn_limbo_create(struct txn_limbo *limbo)
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
 	limbo->has_initial_promote = false;
+	limbo->is_filtering = true;
 }
 
 bool
@@ -737,6 +738,261 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout)
 	return 0;
 }
 
+/**
+ * Fill the reject reason with request data.
+ * The function is not reenterable, use with caution.
+ */
+static char *
+reject_str(const struct synchro_request *req)
+{
+	static char prefix[128];
+
+	snprintf(prefix, sizeof(prefix), "RAFT: rejecting %s (%d) "
+		 "request from origin_id %u replica_id %u term %llu",
+		 iproto_type_name(req->type), req->type,
+		 req->origin_id, req->replica_id,
+		 (long long)req->term);
+
+	return prefix;
+}
+
+/**
+ * Common chain for any incoming packet.
+ */
+static int
+filter_in(struct txn_limbo *limbo, const struct synchro_request *req)
+{
+	assert(limbo->is_filtering);
+
+	/*
+	 * Zero LSN are allowed for PROMOTE
+	 * and DEMOTE requests only.
+	 */
+	if (req->lsn == 0 && !iproto_type_is_promote_request(req->type)) {
+		say_error("%s. Zero lsn detected", reject_str(req));
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Zero LSN on promote/demote");
+		return -1;
+	}
+
+	/*
+	 * Zero @a replica_id is allowed for PROMOTE packets only.
+	 */
+	if (req->replica_id == REPLICA_ID_NIL &&
+	    req->type != IPROTO_RAFT_PROMOTE) {
+		say_error("%s. Zero replica_id detected",
+			  reject_str(req));
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Zero replica_id");
+		return -1;
+	}
+
+	/*
+	 * Incoming packets should esteem limbo owner,
+	 * if it doesn't match it means the sender
+	 * missed limbo owner migrations and out of date.
+	 */
+	if (req->replica_id != limbo->owner_id) {
+		say_error("%s. Limbo owner mismatch, owner_id %u",
+			  reject_str(req), limbo->owner_id);
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Sync queue silent owner migration");
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * Filter CONFIRM and ROLLBACK packets.
+ */
+static int
+filter_confirm_rollback(struct txn_limbo *limbo,
+			const struct synchro_request *req)
+{
+	assert(limbo->is_filtering);
+
+	/*
+	 * When limbo is empty we have nothing to
+	 * confirm/commit and if this request comes
+	 * in it means the split brain has happened.
+	 */
+	if (!txn_limbo_is_empty(limbo))
+		return 0;
+
+	say_error("%s. Empty limbo detected", reject_str(req));
+	diag_set(ClientError, ER_CLUSTER_SPLIT,
+		 "Confirm/rollback with empty limbo");
+	return -1;
+}
+
+/**
+ * Filter PROMOTE and DEMOTE packets.
+ */
+static int
+filter_promote_demote(struct txn_limbo *limbo,
+		      const struct synchro_request *req)
+{
+	assert(limbo->is_filtering);
+
+	/*
+	 * PROMOTE and DEMOTE packets must not have zero
+	 * term supplied, otherwise it is a broken packet.
+	 */
+	if (req->term == 0) {
+		say_error("%s. Zero term detected", reject_str(req));
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Request with zero term");
+		return -1;
+	}
+
+	/*
+	 * If the term is already seen it means it comes
+	 * from a node which didn't notice new elections,
+	 * thus been living in subdomain and its data is
+	 * no longer consistent.
+	 */
+	if (limbo->promote_greatest_term > req->term) {
+		say_error("%s. Max term seen is %llu", reject_str(req),
+			  (long long)limbo->promote_greatest_term);
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Obsolete terms");
+		return -1;
+	}
+
+	int64_t promote_lsn = req->lsn;
+
+	/*
+	 * Easy case -- processed LSN matches the new
+	 * one which comes inside request, everything
+	 * is consistent.
+	 */
+	if (limbo->confirmed_lsn == promote_lsn)
+		return 0;
+
+	/*
+	 * Explicit split brain situation. Promote
+	 * comes in with an old LSN which we've already
+	 * processed.
+	 */
+	if (limbo->confirmed_lsn > promote_lsn) {
+		say_error("%s. confirmed_lsn %lld > promote_lsn %lld",
+			  reject_str(req),
+			  (long long)limbo->confirmed_lsn,
+			  (long long)promote_lsn);
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Backward promote LSN");
+		return -1;
+	}
+
+	/*
+	 * The last case requires a few subcases.
+	 */
+	assert(limbo->confirmed_lsn < promote_lsn);
+
+	if (txn_limbo_is_empty(limbo)) {
+		/*
+		 * Transactions are rolled back already,
+		 * since the limbo is empty.
+		 */
+		say_error("%s. confirmed_lsn %lld < promote_lsn %lld "
+			  "and empty limbo", reject_str(req),
+			  (long long)limbo->confirmed_lsn,
+			  (long long)promote_lsn);
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Forward promote LSN");
+		return -1;
+	}
+
+	/*
+	 * Some entries are present in the limbo,
+	 * we need to make sure the @a promote_lsn
+	 * lays inside limbo [first; last] range.
+	 * So that the promote request has some
+	 * queued data to process, otherwise it
+	 * means the request comes from split
+	 * brained node.
+	 */
+	struct txn_limbo_entry *first, *last;
+
+	first = txn_limbo_first_entry(limbo);
+	last = txn_limbo_last_entry(limbo);
+
+	if (first->lsn > promote_lsn || last->lsn < promote_lsn) {
+		say_error("%s. promote_lsn %lld out of "
+			  "range [%lld; %lld]",
+			  reject_str(req),
+			  (long long)promote_lsn,
+			  (long long)first->lsn,
+			  (long long)last->lsn);
+		diag_set(ClientError, ER_CLUSTER_SPLIT,
+			 "Promote LSN out of queue range");
+		return -1;
+	}
+
+	return 0;
+}
+
+int
+txn_limbo_filter_locked(struct txn_limbo *limbo,
+			const struct synchro_request *req)
+{
+	assert(latch_is_locked(&limbo->promote_latch));
+
+#ifndef NDEBUG
+	say_info("limbo: filter %s replica_id %u origin_id %u "
+		 "term %lld lsn %lld, queue owner_id %u len %lld "
+		 "promote_greatest_term %lld confirmed_lsn %lld (%s)",
+		 iproto_type_name(req->type),
+		 req->replica_id, req->origin_id,
+		 (long long)req->term, (long long)req->lsn,
+		 limbo->owner_id, (long long)limbo->len,
+		 (long long)limbo->promote_greatest_term,
+		 (long long)limbo->confirmed_lsn,
+		 limbo->is_filtering ? "on" : "off");
+#endif
+
+	/*
+	 * Our filtering engine implies that limbo is
+	 * in "steady" state where variables are initialized,
+	 * thus filtering prevent wrong data to step in. Still
+	 * there are stages such as local recovery and joining
+	 * to another leader node where we fetch an initial state
+	 * of the limbo such as we can't apply the filtering rules
+	 * at this moment.
+	 */
+	if (!limbo->is_filtering)
+		return 0;
+
+	if (filter_in(limbo, req) != 0)
+		return -1;
+
+	switch (req->type) {
+	case IPROTO_RAFT_CONFIRM:
+	case IPROTO_RAFT_ROLLBACK:
+		if (filter_confirm_rollback(limbo, req) != 0)
+			return -1;
+		break;
+	case IPROTO_RAFT_PROMOTE:
+	case IPROTO_RAFT_DEMOTE:
+		if (filter_promote_demote(limbo, req) != 0)
+			return -1;
+		break;
+	default:
+		say_error("RAFT: rejecting unexpected %d "
+			  "request from instance id %u "
+			  "for term %llu.",
+			  req->type, req->origin_id,
+			  (long long)req->term);
+		diag_set(ClientError, ER_UNSUPPORTED,
+			 "Replication",
+			 "unexpected request type");
+		return -1;
+	}
+
+	return 0;
+}
+
 void
 txn_limbo_process_locked(struct txn_limbo *limbo,
 			 const struct synchro_request *req)
@@ -745,71 +1001,42 @@ txn_limbo_process_locked(struct txn_limbo *limbo,
 
 	uint64_t term = req->term;
 	uint32_t origin = req->origin_id;
+
 	if (txn_limbo_replica_term_locked(limbo, origin) < term) {
 		vclock_follow(&limbo->promote_term_map, origin, term);
 		if (term > limbo->promote_greatest_term)
 			limbo->promote_greatest_term = term;
-	} else if (iproto_type_is_promote_request(req->type) &&
-		   limbo->promote_greatest_term > 1) {
-		/* PROMOTE for outdated term. Ignore. */
-		say_info("RAFT: ignoring %s request from instance "
-			 "id %u for term %llu. Greatest term seen "
-			 "before (%llu) is bigger.",
-			 iproto_type_name(req->type), origin, (long long)term,
-			 (long long)limbo->promote_greatest_term);
-		return;
 	}
 
-	int64_t lsn = req->lsn;
-	if (req->replica_id == REPLICA_ID_NIL) {
-		/*
-		 * The limbo was empty on the instance issuing the request.
-		 * This means this instance must empty its limbo as well.
-		 */
-		assert(lsn == 0 && iproto_type_is_promote_request(req->type));
-	} else if (req->replica_id != limbo->owner_id) {
-		/*
-		 * Ignore CONFIRM/ROLLBACK messages for a foreign master.
-		 * These are most likely outdated messages for already confirmed
-		 * data from an old leader, who has just started and written
-		 * confirm right on synchronous transaction recovery.
-		 */
-		if (!iproto_type_is_promote_request(req->type))
-			return;
-		/*
-		 * Promote has a bigger term, and tries to steal the limbo. It
-		 * means it probably was elected with a quorum, and it makes no
-		 * sense to wait here for confirmations. The other nodes already
-		 * elected a new leader. Rollback all the local txns.
-		 */
-		lsn = 0;
-	}
 	switch (req->type) {
 	case IPROTO_RAFT_CONFIRM:
-		txn_limbo_read_confirm(limbo, lsn);
+		txn_limbo_read_confirm(limbo, req->lsn);
 		break;
 	case IPROTO_RAFT_ROLLBACK:
-		txn_limbo_read_rollback(limbo, lsn);
+		txn_limbo_read_rollback(limbo, req->lsn);
 		break;
 	case IPROTO_RAFT_PROMOTE:
-		txn_limbo_read_promote(limbo, req->origin_id, lsn);
+		txn_limbo_read_promote(limbo, req->origin_id, req->lsn);
 		break;
 	case IPROTO_RAFT_DEMOTE:
-		txn_limbo_read_demote(limbo, lsn);
+		txn_limbo_read_demote(limbo, req->lsn);
 		break;
 	default:
-		unreachable();
+		panic("limbo: unexpected request type %d", req->type);
+		break;
 	}
-	return;
 }
 
-void
+int
 txn_limbo_process(struct txn_limbo *limbo,
 		  const struct synchro_request *req)
 {
 	txn_limbo_term_lock(limbo);
-	txn_limbo_process_locked(limbo, req);
+	int rc = txn_limbo_filter_locked(limbo, req);
+	if (rc == 0)
+		txn_limbo_process_locked(limbo, req);
 	txn_limbo_term_unlock(limbo);
+	return rc;
 }
 
 void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 1ee815d1c..74c77c16b 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -190,6 +190,14 @@ struct txn_limbo {
 	 * procedure.
 	 */
 	bool has_initial_promote;
+	/**
+	 * Whether the limbo should filter incoming requests.
+	 * The phases of local recovery from WAL file and on applier's
+	 * join phase we are in complete trust of incoming data because
+	 * this data forms an initial limbo state and should not
+	 * filter out requests.
+	 */
+	bool is_filtering;
 };
 
 /**
@@ -339,15 +347,38 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn);
 int
 txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
 
+/**
+ * Verify if the request is valid for processing.
+ */
+int
+txn_limbo_filter_locked(struct txn_limbo *limbo,
+			const struct synchro_request *req);
+
 /** Execute a synchronous replication request. */
 void
 txn_limbo_process_locked(struct txn_limbo *limbo,
 			 const struct synchro_request *req);
 
 /** Lock limbo terms and execute a synchronous replication request. */
-void
+int
 txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
 
+/** Enable filtering of synchro requests. */
+static inline void
+txn_limbo_filter_enable(struct txn_limbo *limbo)
+{
+	limbo->is_filtering = true;
+	say_info("limbo: filter enabled");
+}
+
+/** Disable filtering of synchro requests. */
+static inline void
+txn_limbo_filter_disable(struct txn_limbo *limbo)
+{
+	limbo->is_filtering = false;
+	say_info("limbo: filter disabled");
+}
+
 /**
  * Waiting for confirmation of all "sync" transactions
  * during confirm timeout or fail.
diff --git a/test/box/error.result b/test/box/error.result
index bc804197a..45ea7714c 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -449,6 +449,7 @@ t;
  |   228: box.error.SYNC_QUEUE_FOREIGN
  |   229: box.error.UNABLE_TO_PROCESS_IN_STREAM
  |   230: box.error.UNABLE_TO_PROCESS_OUT_OF_STREAM
+ |   231: box.error.CLUSTER_SPLIT
  | ...
 
 test_run:cmd("setopt delimiter ''");
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches
@ 2021-09-10 15:29 ` Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-12 15:43 ` [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-10 15:29 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Follow-up #6036

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 test/replication/gh-6036-master.lua           |   1 +
 test/replication/gh-6036-node.lua             |  33 ++++
 test/replication/gh-6036-replica.lua          |   1 +
 .../gh-6036-rollback-confirm.result           | 180 ++++++++++++++++++
 .../gh-6036-rollback-confirm.test.lua         |  92 +++++++++
 5 files changed, 307 insertions(+)
 create mode 120000 test/replication/gh-6036-master.lua
 create mode 100644 test/replication/gh-6036-node.lua
 create mode 120000 test/replication/gh-6036-replica.lua
 create mode 100644 test/replication/gh-6036-rollback-confirm.result
 create mode 100644 test/replication/gh-6036-rollback-confirm.test.lua

diff --git a/test/replication/gh-6036-master.lua b/test/replication/gh-6036-master.lua
new file mode 120000
index 000000000..65baed5de
--- /dev/null
+++ b/test/replication/gh-6036-master.lua
@@ -0,0 +1 @@
+gh-6036-node.lua
\ No newline at end of file
diff --git a/test/replication/gh-6036-node.lua b/test/replication/gh-6036-node.lua
new file mode 100644
index 000000000..ac701b7a2
--- /dev/null
+++ b/test/replication/gh-6036-node.lua
@@ -0,0 +1,33 @@
+local INSTANCE_ID = string.match(arg[0], "gh%-6036%-(.+)%.lua")
+
+local function unix_socket(name)
+    return "unix/:./" .. name .. '.sock';
+end
+
+require('console').listen(os.getenv('ADMIN'))
+
+if INSTANCE_ID == "master" then
+    box.cfg({
+        listen = unix_socket("master"),
+        replication_connect_quorum = 0,
+        election_mode = 'candidate',
+        replication_synchro_quorum = 3,
+        replication_synchro_timeout = 1000,
+    })
+elseif INSTANCE_ID == "replica" then
+    box.cfg({
+        listen = unix_socket("replica"),
+        replication = {
+            unix_socket("master"),
+            unix_socket("replica")
+        },
+        read_only = true,
+        election_mode = 'voter',
+        replication_synchro_quorum = 2,
+        replication_synchro_timeout = 1000,
+    })
+end
+
+box.once("bootstrap", function()
+    box.schema.user.grant('guest', 'super')
+end)
diff --git a/test/replication/gh-6036-replica.lua b/test/replication/gh-6036-replica.lua
new file mode 120000
index 000000000..65baed5de
--- /dev/null
+++ b/test/replication/gh-6036-replica.lua
@@ -0,0 +1 @@
+gh-6036-node.lua
\ No newline at end of file
diff --git a/test/replication/gh-6036-rollback-confirm.result b/test/replication/gh-6036-rollback-confirm.result
new file mode 100644
index 000000000..e85f6af37
--- /dev/null
+++ b/test/replication/gh-6036-rollback-confirm.result
@@ -0,0 +1,180 @@
+-- test-run result file version 2
+--
+-- gh-6036: Test for record collision detection. We have a cluster
+-- of two nodes: master and replica. The master initiates syncho write
+-- but fails to gather a quorum. Before it rolls back the record the
+-- network breakage occurs and replica lives with dirty data while
+-- master node goes offline. The replica becomes a new raft leader
+-- and commits the dirty data, same time master node rolls back this
+-- record and tries to connect to the new raft leader back. Such
+-- connection should be refused because old master node is not longer
+-- consistent.
+--
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+test_run:cmd('create server master with script="replication/gh-6036-master.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('create server replica with script="replication/gh-6036-replica.lua"')
+ | ---
+ | - true
+ | ...
+
+test_run:cmd('start server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+--
+-- Connect master to the replica and write a record. Since the quorum
+-- value is bigger than number of nodes in a cluster it will be rolled
+-- back later.
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication = {                             \
+            "unix/:./master.sock",              \
+            "unix/:./replica.sock",             \
+    },                                          \
+})
+ | ---
+ | ...
+_ = box.schema.create_space('sync', {is_sync = true})
+ | ---
+ | ...
+_ = box.space.sync:create_index('pk')
+ | ---
+ | ...
+
+--
+-- Wait the record to appear on the master.
+f = require('fiber').create(function() box.space.sync:replace{1} end)
+ | ---
+ | ...
+test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)
+ | ---
+ | - true
+ | ...
+box.space.sync:select{}
+ | ---
+ | - - [1]
+ | ...
+
+--
+-- Wait the record from master get written and then
+-- drop the replication.
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)
+ | ---
+ | - true
+ | ...
+box.space.sync:select{}
+ | ---
+ | - - [1]
+ | ...
+box.cfg{replication = {}}
+ | ---
+ | ...
+
+--
+-- Then we jump back to the master and drop the replication,
+-- thus unconfirmed record get rolled back.
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication = {},                           \
+    replication_synchro_timeout = 0.001,        \
+    election_mode = 'manual',                   \
+})
+ | ---
+ | ...
+while f:status() ~= 'dead' do require('fiber').sleep(0.1) end
+ | ---
+ | ...
+test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100)
+ | ---
+ | - true
+ | ...
+
+--
+-- Force the replica to become a RAFT leader and
+-- commit this new record.
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication_synchro_quorum = 1,             \
+    election_mode = 'manual'                    \
+})
+ | ---
+ | ...
+box.ctl.promote()
+ | ---
+ | ...
+box.space.sync:select{}
+ | ---
+ | - - [1]
+ | ...
+
+--
+-- Connect master back to the replica, it should
+-- be refused.
+test_run:switch('master')
+ | ---
+ | - true
+ | ...
+box.cfg({                                       \
+    replication = {                             \
+            "unix/:./replica.sock",             \
+    },                                          \
+})
+ | ---
+ | ...
+box.space.sync:select{}
+ | ---
+ | - []
+ | ...
+test_run:wait_cond(function() return            \
+    test_run:grep_log('master',                 \
+        'rejecting PROMOTE') ~= nil end, 100)   \
+test_run:wait_cond(function() return            \
+    test_run:grep_log('master',                 \
+        'ER_CLUSTER_SPLIT') ~= nil end, 100)
+ | ---
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server master')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
diff --git a/test/replication/gh-6036-rollback-confirm.test.lua b/test/replication/gh-6036-rollback-confirm.test.lua
new file mode 100644
index 000000000..6eca23d8b
--- /dev/null
+++ b/test/replication/gh-6036-rollback-confirm.test.lua
@@ -0,0 +1,92 @@
+--
+-- gh-6036: Test for record collision detection. We have a cluster
+-- of two nodes: master and replica. The master initiates syncho write
+-- but fails to gather a quorum. Before it rolls back the record the
+-- network breakage occurs and replica lives with dirty data while
+-- master node goes offline. The replica becomes a new raft leader
+-- and commits the dirty data, same time master node rolls back this
+-- record and tries to connect to the new raft leader back. Such
+-- connection should be refused because old master node is not longer
+-- consistent.
+--
+test_run = require('test_run').new()
+
+test_run:cmd('create server master with script="replication/gh-6036-master.lua"')
+test_run:cmd('create server replica with script="replication/gh-6036-replica.lua"')
+
+test_run:cmd('start server master')
+test_run:cmd('start server replica')
+
+--
+-- Connect master to the replica and write a record. Since the quorum
+-- value is bigger than number of nodes in a cluster it will be rolled
+-- back later.
+test_run:switch('master')
+box.cfg({                                       \
+    replication = {                             \
+            "unix/:./master.sock",              \
+            "unix/:./replica.sock",             \
+    },                                          \
+})
+_ = box.schema.create_space('sync', {is_sync = true})
+_ = box.space.sync:create_index('pk')
+
+--
+-- Wait the record to appear on the master.
+f = require('fiber').create(function() box.space.sync:replace{1} end)
+test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)
+box.space.sync:select{}
+
+--
+-- Wait the record from master get written and then
+-- drop the replication.
+test_run:switch('replica')
+test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)
+box.space.sync:select{}
+box.cfg{replication = {}}
+
+--
+-- Then we jump back to the master and drop the replication,
+-- thus unconfirmed record get rolled back.
+test_run:switch('master')
+box.cfg({                                       \
+    replication = {},                           \
+    replication_synchro_timeout = 0.001,        \
+    election_mode = 'manual',                   \
+})
+while f:status() ~= 'dead' do require('fiber').sleep(0.1) end
+test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100)
+
+--
+-- Force the replica to become a RAFT leader and
+-- commit this new record.
+test_run:switch('replica')
+box.cfg({                                       \
+    replication_synchro_quorum = 1,             \
+    election_mode = 'manual'                    \
+})
+box.ctl.promote()
+box.space.sync:select{}
+
+--
+-- Connect master back to the replica, it should
+-- be refused.
+test_run:switch('master')
+box.cfg({                                       \
+    replication = {                             \
+            "unix/:./replica.sock",             \
+    },                                          \
+})
+box.space.sync:select{}
+test_run:wait_cond(function() return            \
+    test_run:grep_log('master',                 \
+        'rejecting PROMOTE') ~= nil end, 100)   \
+test_run:wait_cond(function() return            \
+    test_run:grep_log('master',                 \
+        'ER_CLUSTER_SPLIT') ~= nil end, 100)
+
+test_run:switch('default')
+test_run:cmd('stop server master')
+test_run:cmd('delete server master')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering
  2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:43 ` Vladislav Shpilevoy via Tarantool-patches
  6 siblings, 0 replies; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:43 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the fixes!

On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote:
> Guys, take a look please, once time permit. The questionable moments:
> 
> - use filter disabling procedure for join/recovery: we make it so since
>   snapshot has promote record which fills initial limbo state

I still don't understand why do you need the disabling. Before the snapshot
is recovered, the limbo is empty and not owner by anybody. It is fully
valid and working, like if DEMOTE is called. Correct?

Snapshot's promote should make it belong to the owner persisted in promote.
Also correct?

The next rows just replay already applied data. Also correct?

It managed to apply first time and must manage to do so again. Agree?

In what of these statements I miss a mistake which makes you disable the
filtering?

> - need more tests to cover all possible scenarios
> 
> - I keep filter_confirm_rollback() as is but rereading Vlad's comment
>   >
>   > 9. What if rollback is for LSN > limbo's last LSN? It
>   > also means nothing to do. The same for confirm LSN < limbo's
>   > first LSN.
>   >
>   I presume I need to traverse limbo and test if incoming LSN is
>   present inside current queue.

It should be enough to know the LSN range in there AFAIU.

Additionally, I tried the test from the ticket again. It still
does not behave as expected. I remind, on the last review I also
tried:

	On top of the branch I tried the test I pasted in the
	ticket's description.

	I see the connection now breaks in one direction. But the
	new leader still follows the old leader somewhy.

And you said:

	I'll take more precise look, thanks!
	https://lists.tarantool.org/tarantool-patches/YRBUww6p1dUNL0mx@grain/

So what are the news on that? The new leader should not follow the old
one. If anything, even the vice-versa situation would be fine I
suppose - the old one following the new one. But the current way does
not look valid. The old leader could send all kinds of irrelevant
garbage and the new leader would happily swallow it.

The same happens in this test (on top of the last commit of this
patchset):
https://github.com/tarantool/tarantool/issues/5295#issuecomment-912106680
The new leader still replicates from the old broken one.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-12 22:18     ` Cyrill Gorcunov via Tarantool-patches
  2021-09-13  8:33     ` Serge Petrenko via Tarantool-patches
  2021-09-13  8:50   ` Serge Petrenko via Tarantool-patches
  1 sibling, 2 replies; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote:
> We will use this lsn for requests validation
> in next patches for sake of split-brain detection.

I don't understand. How exactly will it help?

> Part-of #6036
> 
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>  src/box/txn_limbo.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 70447caaf..cca2ce493 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>  		assert(e->txn->signature >= 0);
>  		txn_complete_success(e->txn);
>  	}
> +
> +	/*
> +	 * We use confirmed lsn number to verify requests and
> +	 * reject ones coming from split-brain cluster configurations,
> +	 * so update it even if there were no entries to process.
> +	 */
> +	limbo->confirmed_lsn = lsn;

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-12 22:25     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

On 10.09.2021 17:29, Cyrill Gorcunov wrote:
> When promote request is handled we drop last confirmed
> lsn to zero because its value make sense for sync queue
> owner only. Still the case where we become queue owner
> for the first time is special - we need to fetch the
> obtained lsn from the request and remember it so we
> will be able to filter any next malformed requests
> with wrong lsn numbers (see queue filtering procedure
> in next patch).

I don't understand anything. Why isn't it needed always? And
how exactly will it help to filter stuff?

> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index cca2ce493..08463219d 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -521,8 +522,13 @@ txn_limbo_read_promote(struct txn_limbo *limbo, uint32_t replica_id,
>  	txn_limbo_read_rollback(limbo, lsn + 1);
>  	assert(txn_limbo_is_empty(&txn_limbo));
>  	limbo->owner_id = replica_id;
> +	if (likely(limbo->has_initial_promote)) {
> +		limbo->confirmed_lsn = 0;
> +	} else {
> +		limbo->confirmed_lsn = lsn;
> +		limbo->has_initial_promote = true;
> +	}
>  	box_update_ro_summary();
> -	limbo->confirmed_lsn = 0;
>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-13 10:52     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

See 3 comments below.

> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index b981bd436..845a7d015 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -915,8 +916,10 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
>  		diag_set_journal_res(entry.base.res);
>  		goto err;
>  	}
> +	txn_limbo_term_unlock(&txn_limbo);
>  	return 0;
>  err:
> +	txn_limbo_term_unlock(&txn_limbo);
>  	diag_log();

1. This function can go to 'err' before the lock is taken.


2. As for your complaint about the begin/commit/rollback API
being not working because you can't unlock from a non-owner
fiber - well, it works in your patch somehow, doesn't it?

Why do you in your patch unlock here, but in the newly proposed
API you only tried to unlock in the trigger?

You could call commit/rollback from this function, like you
do with unlock now.

>  	return -1;
>  }
> diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
> index e0d17de4b..1ee815d1c 100644
> --- a/src/box/txn_limbo.h
> +++ b/src/box/txn_limbo.h
> @@ -217,14 +222,39 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
>  				in_queue);
>  }
>  
> +/** Lock promote data. */
> +static inline void
> +txn_limbo_term_lock(struct txn_limbo *limbo)
> +{
> +	latch_lock(&limbo->promote_latch);
> +}
> +
> +/** Unlock promote data. */
> +static inline void
> +txn_limbo_term_unlock(struct txn_limbo *limbo)
> +{
> +	latch_unlock(&limbo->promote_latch);
> +}
> +
> +/** Fetch replica's term with lock taken. */
> +static inline uint64_t
> +txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id)

3. Limbo can be made const here.

> +{
> +	assert(latch_is_locked(&limbo->promote_latch));
> +	return vclock_get(&limbo->promote_term_map, replica_id);
> +}

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-14 19:41     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

See 8 comments below.

On 10.09.2021 17:29, Cyrill Gorcunov wrote:
> When we receive synchro requests we can't just apply
> them blindly because in worse case they may come from
> split-brain configuration (where a cluster split into
> several clusters and each one has own leader elected,
> then clusters are trying to merge back into the original
> one). We need to do our best to detect such disunity
> and force these nodes to rejoin from the scratch for
> data consistency sake.
> 
> Thus when we're processing requests we pass them to the
> packet filter first which validates their contents and
> refuse to apply if they are not matched.
> 
> Filter logic depends on request type.
> 
> First there is a common chain for any synchro packet, this
> is kind of a general pre-validation:
>  1) Zero LSN allowed for PROMOTE | DEMOTE packets, since
>     CONFIRM | ROLLBACK has to proceed some real data with
>     LSN already assigned.
>  2) request:replica_id = 0 allowed for PROMOTE request only.
>  3) request:replica_id should match limbo:owner_id, iow the
>     limbo migration should be noticed by all instances in the
>     cluster.
> 
> For CONFIRM and ROLLBACK packets:
>  1) Both of them can't be considered if limbo is already empty,
>     ie there is no data in a local queue and everything is
>     processed already. The request is obviously from the node which
>     is out of date.

1. It is not just about empty. They might try to touch a range
of transactions out of the LSN range waiting in the limbo. Then
their effect is also void. The question remains from the previous
review. What is the resolution here?

Besides, I don't know how could such requests happen, but I don't
how to get the ones in your example either TBH. An theoretical examle?
A test?

> For PROMOTE and DEMOTE packets:
>  1) The requests should come in with nonzero term, otherwise
>     the packet is corrupted.
>  2) The request's term should not be less than maximal known
>     one, iow it should not come in from nodes which didn't notice
>     raft epoch changes and living in the past.
>  3) If LSN of the request matches current confirmed LSN the packet
>     is obviously correct to process.
>  4) If LSN is less than confirmed LSN then the request is wrong,
>     we have processed the requested LSN already.
>  5) If LSN is less than confirmed LSN then

2. You didn't fix the typo from the previous review. Still two
points say "less than confirmed LSN". Please, re-read the comments
of the previous review and address them all. As I already told not
once, it would be best if you would respond to the comments individually
and with diff if possible. Otherwise you will continue missing them.

>     a) If limbo is empty we can't do anything, since data is already
>        processed and should issue an error;
>     b) If there is some data in the limbo then requested LSN should
>        be in range of limbo's [first; last] LSNs, thus the request
>        will be able to commit and rollback limbo queue.
> 
> Because snapshot have promote packet we disable filtering at moment
> of joining to the leader node and similarly due to recovery. The thing
> is that our filtration procedure implies that limbo is already
> initialized to some valid state otherwise we will have to distinguish
> initial states from working ones, this can be done actuially but will
> make code more complex.

3. How 'more complex'? And why do you distinguish between 'initial' and
'working' states? All states should work. Initial only means the limbo
does not belong to anybody.

Currently I only see complexity coming from the filtering being turned
on/off.

> Thus for now lets leave filtration on and off.

Please, find the real reason why is it needed. All states should be
working. 'Initial' state is not any different than for example when
DEMOTE was called.

> diff --git a/changelogs/unreleased/gh-6036-qsync-filter-packets.md b/changelogs/unreleased/gh-6036-qsync-filter-packets.md
> new file mode 100644
> index 000000000..0db629e83
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6036-qsync-filter-packets.md
> @@ -0,0 +1,9 @@
> +## feature/replication

4. It is a bugfix, not feature.

> +
> +* Implemented incoming synchronous packets filtration to discard
> +  requests from outdated cluster nodes. This can happen when
> +  replication cluster is partitioned on a transport level and
> +  two or more sub-clusters are running simultaneously for some
> +  time, then they are trying to merge back. Since the subclusters
> +  had own leaders they should not be able to form original cluster
> +  because data is not longer consistent (gh-6036).> diff --git a/src/box/box.cc b/src/box/box.cc
> index 7b11d56d6..f134dc8bb 100644
> --- a/src/box/box.cc
> +++ b/src/box/box.cc
> @@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
>  		.lsn = promote_lsn,
>  		.term = raft->term,
>  	};
> -	txn_limbo_process(&txn_limbo, &req);
> +	if (txn_limbo_process(&txn_limbo, &req) != 0)
> +		return -1;

5. There was already done txn_limbo_write_promote() above. If you
bail now, you have an inconsistent state - in WAL the promote is
written, in the limbo it is not applied. What will happen on recovery?

It seems you need to lock box_promote(), box_promote_qsync(),
and box_demote(). Otherwise you have the exact same problem with
local promotions vs coming from the applier as the one you tried
to fix for applier vs applier.

>  	assert(txn_limbo_is_empty(&txn_limbo));
> +	return 0;
>  }
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 65fbd0cac..925f401e7 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -737,6 +738,261 @@ txn_limbo_wait_empty(struct txn_limbo *limbo, double timeout)
>  	return 0;
>  }
>  
> +/**
> + * Fill the reject reason with request data.
> + * The function is not reenterable, use with caution.
> + */
> +static char *
> +reject_str(const struct synchro_request *req)
> +{
> +	static char prefix[128];

6. Please, don't try to re-invent the static buffer. Just use it.

> +
> +	snprintf(prefix, sizeof(prefix), "RAFT: rejecting %s (%d) "
> +		 "request from origin_id %u replica_id %u term %llu",
> +		 iproto_type_name(req->type), req->type,
> +		 req->origin_id, req->replica_id,
> +		 (long long)req->term);
> +
> +	return prefix;
> +}
> +
> +/**
> + * Common chain for any incoming packet.
> + */
> +static int
> +filter_in(struct txn_limbo *limbo, const struct synchro_request *req)
> +{
> +	assert(limbo->is_filtering);
> +
> +	/*
> +	 * Zero LSN are allowed for PROMOTE
> +	 * and DEMOTE requests only.
> +	 */
> +	if (req->lsn == 0 && !iproto_type_is_promote_request(req->type)) {
> +		say_error("%s. Zero lsn detected", reject_str(req));
> +		diag_set(ClientError, ER_CLUSTER_SPLIT,
> +			 "Zero LSN on promote/demote");
> +		return -1;
> +	}
> +
> +	/*
> +	 * Zero @a replica_id is allowed for PROMOTE packets only.

7. Why not for DEMOTE?

> +	 */
> +	if (req->replica_id == REPLICA_ID_NIL &&
> +	    req->type != IPROTO_RAFT_PROMOTE) {
> +		say_error("%s. Zero replica_id detected",
> +			  reject_str(req));
> +		diag_set(ClientError, ER_CLUSTER_SPLIT,
> +			 "Zero replica_id");
> +		return -1;
> +	}
> +
> +	/*
> +	 * Incoming packets should esteem limbo owner,
> +	 * if it doesn't match it means the sender
> +	 * missed limbo owner migrations and out of date.
> +	 */
> +	if (req->replica_id != limbo->owner_id) {
> +		say_error("%s. Limbo owner mismatch, owner_id %u",
> +			  reject_str(req), limbo->owner_id);
> +		diag_set(ClientError, ER_CLUSTER_SPLIT,
> +			 "Sync queue silent owner migration");
> +		return -1;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * Filter CONFIRM and ROLLBACK packets.
> + */
> +static int
> +filter_confirm_rollback(struct txn_limbo *limbo,

8. The limbo can be const in all filter functions.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches
@ 2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-09-12 15:44 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Thanks for the patch!

See 3 comments below.

> diff --git a/test/replication/gh-6036-rollback-confirm.result b/test/replication/gh-6036-rollback-confirm.result
> new file mode 100644
> index 000000000..e85f6af37
> --- /dev/null
> +++ b/test/replication/gh-6036-rollback-confirm.result

<...>

> +-- Connect master to the replica and write a record. Since the quorum
> +-- value is bigger than number of nodes in a cluster it will be rolled
> +-- back later.
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +box.cfg({                                       \
> +    replication = {                             \
> +            "unix/:./master.sock",              \
> +            "unix/:./replica.sock",             \
> +    },                                          \
> +})
> + | ---
> + | ...
> +_ = box.schema.create_space('sync', {is_sync = true})
> + | ---
> + | ...
> +_ = box.space.sync:create_index('pk')
> + | ---
> + | ...
> +
> +--
> +-- Wait the record to appear on the master.
> +f = require('fiber').create(function() box.space.sync:replace{1} end)
> + | ---
> + | ...
> +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)

1. Why do you need a custom wait_cond timeout?

> + | ---
> + | - true
> + | ...
> +box.space.sync:select{}
> + | ---
> + | - - [1]
> + | ...
> +
> +--
> +-- Wait the record from master get written and then
> +-- drop the replication.
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_cond(function() return box.space.sync:get({1}) ~= nil end, 100)
> + | ---
> + | - true
> + | ...
> +box.space.sync:select{}

2. You don't need the waiting on the master if you wait for the same on
the replica. It couldn't get there before master itself.

> + | ---
> + | - - [1]
> + | ...
> +box.cfg{replication = {}}
> + | ---
> + | ...
> +
> +--
> +-- Then we jump back to the master and drop the replication,
> +-- thus unconfirmed record get rolled back.
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +box.cfg({                                       \
> +    replication = {},                           \
> +    replication_synchro_timeout = 0.001,        \
> +    election_mode = 'manual',                   \
> +})
> + | ---
> + | ...
> +while f:status() ~= 'dead' do require('fiber').sleep(0.1) end
> + | ---
> + | ...
> +test_run:wait_cond(function() return box.space.sync:get({1}) == nil end, 100)
> + | ---
> + | - true
> + | ...
> +
> +--
> +-- Force the replica to become a RAFT leader and
> +-- commit this new record.
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.cfg({                                       \
> +    replication_synchro_quorum = 1,             \
> +    election_mode = 'manual'                    \
> +})
> + | ---
> + | ...
> +box.ctl.promote()
> + | ---
> + | ...
> +box.space.sync:select{}
> + | ---
> + | - - [1]
> + | ...
> +
> +--
> +-- Connect master back to the replica, it should
> +-- be refused.
> +test_run:switch('master')
> + | ---
> + | - true
> + | ...
> +box.cfg({                                       \
> +    replication = {                             \
> +            "unix/:./replica.sock",             \
> +    },                                          \
> +})
> + | ---
> + | ...
> +box.space.sync:select{}
> + | ---
> + | - []
> + | ...
> +test_run:wait_cond(function() return            \
> +    test_run:grep_log('master',                 \
> +        'rejecting PROMOTE') ~= nil end, 100)   \
> +test_run:wait_cond(function() return            \
> +    test_run:grep_log('master',                 \
> +        'ER_CLUSTER_SPLIT') ~= nil end, 100)

3. Why do you need these 2 conds with \? The only
reason for using \ between multiple statements is
to prevent yields. Why can't you have yields between
the two wait_cond() calls?

Also could you make one cond with 'grep_log() and grep_log()'?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-12 22:18     ` Cyrill Gorcunov via Tarantool-patches
  2021-09-13  8:33     ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 22:18 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:00PM +0200, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote:
> > We will use this lsn for requests validation
> > in next patches for sake of split-brain detection.
> 
> I don't understand. How exactly will it help?

Sorry for not putting more detailed explanation. Here it is: we've
a test ./test-run replication/qsync_advanced.test.lua where limbo
owner is migrated in result our filter refuses to accept new limbo
owner

> txn_limbo.c:872 E> RAFT: rejecting PROMOTE (31) request from origin_id 2 replica_id 1 term 3. confirmed_lsn 72 > promote_lsn 0
> ER_CLUSTER_SPLIT: Cluster split detected. Backward promote LSN

become promote request comes in with LSN = 0 when confirmed_lsn is bigger,
which in turn happens because we update LSN on write operation only. In this
test we have two nodes "default" and "replica". Initially "default" node
is limbo owner, which writes some data into sync space. Then we wait until
this sync data get replicated (node the "default" has confirmed_lsn > 0
because it been writting the data). Then we jump to "replica" node and
call box.promote() there which initiate PROMOTE request with lsn = 0 and
send it back to "default" node which has been limbo owner before and
has confirmed_lsn=72. When this request comes in the filtration fails.
(the replica node didn't write _anything_ locally and its confirmed_lsn = 0,
which we send in promote body).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-12 22:25     ` Cyrill Gorcunov via Tarantool-patches
  2021-09-13  8:52       ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-12 22:25 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:04PM +0200, Vladislav Shpilevoy wrote:
> Thanks for the patch!
> 
> On 10.09.2021 17:29, Cyrill Gorcunov wrote:
> > When promote request is handled we drop last confirmed
> > lsn to zero because its value make sense for sync queue
> > owner only. Still the case where we become queue owner
> > for the first time is special - we need to fetch the
> > obtained lsn from the request and remember it so we
> > will be able to filter any next malformed requests
> > with wrong lsn numbers (see queue filtering procedure
> > in next patch).
> 
> I don't understand anything. Why isn't it needed always? And
> how exactly will it help to filter stuff?

This problem is revealed when run of ./test-run replication/gh-6034-qsync-limbo-ownership.test.lua
with filteration turned on. The confirmed_lsn make sence in bound
with limbo owner as far as I understand. And in test we have
two nodes "default" and "replica". Initially default gets up, filled
with some data into sync space and then we start up a replica node.
The replica get subscribed and then we call box.promote() on it,
since replica itself has not been writting anything its confirmed_lsn = 0,
which we send back to the "default" inside promote request body. And
it get rejected because "default" has non-zero confirmed_lsn. I've
been talking to Serge a lot about this problem and if I'm not missing
somthing obvious updating this filed on first promote arrival is
mostly correct way to handle the issue. I presume we should get
a meeting and talk again.

Or maybe better via email. Serge could you please write the details here?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
  2021-09-12 22:18     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-09-13  8:33     ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-13  8:33 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Cyrill Gorcunov, tml



12.09.2021 18:44, Vladislav Shpilevoy via Tarantool-patches пишет:
> Thanks for the patch!
>
> On 10.09.2021 17:29, Cyrill Gorcunov via Tarantool-patches wrote:
>> We will use this lsn for requests validation
>> in next patches for sake of split-brain detection.
> I don't understand. How exactly will it help?
Confirmed_lsn wasn't tracked during recovery and while
following a master. So, essentially, only the living master could
detect splitbrains by comparing confirmed_lsn to something else.
>
>> Part-of #6036
>>
>> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
>> ---
>>   src/box/txn_limbo.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
>> index 70447caaf..cca2ce493 100644
>> --- a/src/box/txn_limbo.c
>> +++ b/src/box/txn_limbo.c
>> @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>>   		assert(e->txn->signature >= 0);
>>   		txn_complete_success(e->txn);
>>   	}
>> +
>> +	/*
>> +	 * We use confirmed lsn number to verify requests and
>> +	 * reject ones coming from split-brain cluster configurations,
>> +	 * so update it even if there were no entries to process.
>> +	 */
>> +	limbo->confirmed_lsn = lsn;

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads
  2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-13  8:50   ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-13  8:50 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy

10.09.2021 18:29, Cyrill Gorcunov пишет:
> We will use this lsn for requests validation
> in next patches for sake of split-brain detection.
>
> Part-of #6036
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---
>   src/box/txn_limbo.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
> index 70447caaf..cca2ce493 100644
> --- a/src/box/txn_limbo.c
> +++ b/src/box/txn_limbo.c
> @@ -437,6 +437,13 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
>   		assert(e->txn->signature >= 0);
>   		txn_complete_success(e->txn);
>   	}
> +
> +	/*
> +	 * We use confirmed lsn number to verify requests and
> +	 * reject ones coming from split-brain cluster configurations,
> +	 * so update it even if there were no entries to process.
> +	 */
> +	limbo->confirmed_lsn = lsn;
>   }
>   
>   /**

I guess there'll be problems on master with this approach.

Say, a pair of CONFIRM requests is written, with lsns
N and N+1. So you first enter write_confirm(N), then
write_confirm(N+1). Now both fibers issuing the requests yield
waiting for the write to happen, and confirmed_lsn is N+1.

Once the first CONFIRM (N) is written, you reset confirmed_lsn to N
right in read_confirm.

So until the second CONFIRM (N+1) is written, there's a window
when confirmed_lsn is N, but it should be N+1.

I think read_confirm should set confirmed_lsn on replica only.
On master this task is performed by write_confirm.
You may split read_confirm in two parts:
set confirmed lsn (used only on replica) and
apply_confirm (everything read_confirm did before your patch)

-- 
Serge Petrenko

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request
  2021-09-12 22:25     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-09-13  8:52       ` Serge Petrenko via Tarantool-patches
  2021-09-13 14:20         ` [Tarantool-patches] [RFC] qsync: overall design Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-09-13  8:52 UTC (permalink / raw)
  To: Cyrill Gorcunov, Vladislav Shpilevoy; +Cc: tml



13.09.2021 01:25, Cyrill Gorcunov пишет:
> On Sun, Sep 12, 2021 at 05:44:04PM +0200, Vladislav Shpilevoy wrote:
>> Thanks for the patch!
>>
>> On 10.09.2021 17:29, Cyrill Gorcunov wrote:
>>> When promote request is handled we drop last confirmed
>>> lsn to zero because its value make sense for sync queue
>>> owner only. Still the case where we become queue owner
>>> for the first time is special - we need to fetch the
>>> obtained lsn from the request and remember it so we
>>> will be able to filter any next malformed requests
>>> with wrong lsn numbers (see queue filtering procedure
>>> in next patch).
>> I don't understand anything. Why isn't it needed always? And
>> how exactly will it help to filter stuff?
> This problem is revealed when run of ./test-run replication/gh-6034-qsync-limbo-ownership.test.lua
> with filteration turned on. The confirmed_lsn make sence in bound
> with limbo owner as far as I understand. And in test we have
> two nodes "default" and "replica". Initially default gets up, filled
> with some data into sync space and then we start up a replica node.
> The replica get subscribed and then we call box.promote() on it,
> since replica itself has not been writting anything its confirmed_lsn = 0,
> which we send back to the "default" inside promote request body. And
> it get rejected because "default" has non-zero confirmed_lsn. I've
> been talking to Serge a lot about this problem and if I'm not missing
> somthing obvious updating this filed on first promote arrival is
> mostly correct way to handle the issue. I presume we should get
> a meeting and talk again.
>
> Or maybe better via email. Serge could you please write the details here?

It would be easier to discuss this verbally, I think.


-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-13 10:52     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 10:52 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:11PM +0200, Vladislav Shpilevoy wrote:
> >  err:
> > +	txn_limbo_term_unlock(&txn_limbo);
> >  	diag_log();
> 
> 1. This function can go to 'err' before the lock is taken.

yup, thanks!

> 
> 
> 2. As for your complaint about the begin/commit/rollback API
> being not working because you can't unlock from a non-owner
> fiber - well, it works in your patch somehow, doesn't it?

Yes, it works with my patch because journal_write() are ordered,
we take and release lock explicitly inside caller code, ie inside
one same fiber(). Imagine two appliers running simultaneously

applier 1                       applier 2                       sched
---------                       ---------                       -----
apply_synchro_row
  txn_limbo_term_lock
    journal_write
      context-switch -->        apply_synchro_row
                                  txn_limbo_term_lock
                                    wait for release
                                                                apply_synchro_row_cb
                                                                context-switch
  txn_limbo_term_unlock <--                                     --+
  return
                                txn_limbo_term_lock finishes
                                ...

> Why do you in your patch unlock here, but in the newly proposed
> API you only tried to unlock in the trigger?

Because you proposed to *hide* locking completely inside limbo code,
so the caller won't know anything about it. So our commit/rollback
would handle locking internally, unfortunately this doesn't work.

> 
> You could call commit/rollback from this function, like you
> do with unlock now.

Not sure if I follow you here. Our journal engine implies completions
to be called. We pass such completion into journal entry creation. With
my patch everything remains as is except we take a lock explicitly and
release it then. Could you please point more explicitly what you've in mind?

> > +
> > +/** Fetch replica's term with lock taken. */
> > +static inline uint64_t
> > +txn_limbo_replica_term_locked(struct txn_limbo *limbo, uint32_t replica_id)
> 
> 3. Limbo can be made const here.

ok

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Tarantool-patches] [RFC] qsync: overall design
  2021-09-13  8:52       ` Serge Petrenko via Tarantool-patches
@ 2021-09-13 14:20         ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-13 14:20 UTC (permalink / raw)
  To: Serge Petrenko, Vladislav Shpilevoy; +Cc: tml

On Mon, Sep 13, 2021 at 11:52:39AM +0300, Serge Petrenko wrote:
> > 
> > Or maybe better via email. Serge could you please write the details here?
> 
> It would be easier to discuss this verbally, I think.

Verbal meetings are good indeed, but maybe I could summarize all the
problems found so far and imprint them here. Guys, please comment,
I would really appreciate.

Terms accessing ordering
------------------------

We've found that fibers can read old terms which are already updated
but not yet written into WAL. To address this we order term reading
so appliers will wait until write to WAL is complete. While everyone
is agree that there is an issue and ordering solves it we're not yet
complete clear about internal design.

I proposed to use explicit locking via txn_limbo_term_lock/unlock calls.
The calls are used inside applier's code

apply_synchro_row
  txn_limbo_term_lock
    journal_write
  txn_limbo_term_unlock

the key moment is journal_write() call which queues completion to run and the
completion code is called from inside of sched() fiber, ie not the fiber which
took the lock (and such lock migration is prohibited in our latch-lock engine).

The propose was to hide the locking mechanism inside limbo internals code
completely, so the callers won't know about it. When I tried to make so I hit
the problem with lock context migration and had to step back to use explicit
locks as in code above.

Still Vlad's question remains

 | 2. As for your complaint about the begin/commit/rollback API
 | being not working because you can't unlock from a non-owner
 | fiber - well, it works in your patch somehow, doesn't it?

I already explained why it works with explicit locking.
https://lists.tarantool.org/tarantool-patches/YT8tZ0CuIDKwzcC4@grain/
In short - we take and release the lock in same context.

 |
 | Why do you in your patch unlock here, but in the newly proposed
 | API you only tried to unlock in the trigger?

Because our commit/rollback are called from inside of sched() fiber,
and we will have to provide some helper like completion of completion
where second completion will be called from inside of applier context
to unlock terms. As to me this is a way more messy than explicit locking
scheme.

 |
 | You could call commit/rollback from this function, like you
 | do with unlock now.

This moment I don't understand. We already have commit/rollback helpers,
so I ask Vlad to write some pseudocode, to figure out what exactly you
have in mind.

Limbo's confirmed_lsn update upon CONFIRM request read
------------------------------------------------------

Currently we update limbo::confirmed_lsn when node _writes_ this
request into the WAL. This is done on limbo owner node only, ie
transaction initiator. In result when the node which has not been
leader at all takes limbo ownership it sends own "confirmed_lsn = 0"
inside PROMOTE request, and when this request reaches previous leader
node we don't allow to proceed (due to our filtration rules where
we require the LSN to be > than current confirmed_lsn). Also Serge
pointed out

 a)
 | Confirmed_lsn wasn't tracked during recovery and while
 | following a master. So, essentially, only the living master could
 | detect splitbrains by comparing confirmed_lsn to something else.

 b)
 | Say, a pair of CONFIRM requests is written, with lsns
 | N and N+1. So you first enter write_confirm(N), then
 | write_confirm(N+1). Now both fibers issuing the requests yield
 | waiting for the write to happen, and confirmed_lsn is N+1.
 |
 | Once the first CONFIRM (N) is written, you reset confirmed_lsn to N
 | right in read_confirm.
 |
 | So until the second CONFIRM (N+1) is written, there's a window
 | when confirmed_lsn is N, but it should be N+1.
 |
 | I think read_confirm should set confirmed_lsn on replica only.
 | On master this task is performed by write_confirm.
 | You may split read_confirm in two parts:
 |  - set confirmed lsn (used only on replica) and
 |  - apply_confirm (everything read_confirm did before your patch)

Thus I seems need to rework this aspect.

Update confirmed_lsn on first PROMOTE request arrival
-----------------------------------------------------

Detailed explanation what I've seen is there

https://lists.tarantool.org/tarantool-patches/YT5+YqCJuAh0HAQg@grain/

I must confess I don't like much this moment as well since
this is a bit vague point for me so we gonna look into it
soon on a meeting.

Filtration procedure itself (split detector)
-------------------------------------------

When CONFIRM or ROLLBACK packet comes in it is not enough to
test for limbo emptiness only. We should rather traverse the
queue and figure out if LSN inside the packet belongs to the
current queue.

So the *preliminary* conclusion is the following: when CONFIRM
or ROLLBACK is coming in
a) queue is empty -- then such request is invalid and we should
   exit with error
b) queue is not empty -- then LSN should belong to a range covered
   by the queue
c) it is unclear how to test this scenario

Filtration disabling for joining and local recovery
---------------------------------------------------

When joining or recovery happens the limbo is in empty state then
our filtration start triggering false positives. For example

> autobootstrap1.sock I> limbo: filter PROMOTE replica_id 0 origin_id 0
> term 0 lsn 0, queue owner_id 0 len 0 promote_greatest_term 0 confirmed_lsn 0

This is because we require the term to be nonzero when cluster is running.

	/*
	 * PROMOTE and DEMOTE packets must not have zero
	 * term supplied, otherwise it is a broken packet.
	 */
	if (req->term == 0) {
		say_error("%s. Zero term detected", reject_str(req));
		diag_set(ClientError, ER_CLUSTER_SPLIT,
			 "Request with zero term");
		return -1;
	}

If we want to not disable filtration at all then we need to introduce
some state machine which would cover initial -> working state. I think
better to start with simpler approach where we don't verify data on
join/recovery and then extend filtration if needed.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests
  2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-09-14 19:41     ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-09-14 19:41 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Sep 12, 2021 at 05:44:15PM +0200, Vladislav Shpilevoy wrote:
> > @@ -1686,15 +1685,17 @@ box_issue_promote(uint32_t prev_leader_id, int64_t promote_lsn)
> >  		.lsn = promote_lsn,
> >  		.term = raft->term,
> >  	};
> > -	txn_limbo_process(&txn_limbo, &req);
> > +	if (txn_limbo_process(&txn_limbo, &req) != 0)
> > +		return -1;
> 
> 5. There was already done txn_limbo_write_promote() above. If you
> bail now, you have an inconsistent state - in WAL the promote is
> written, in the limbo it is not applied. What will happen on recovery?
> 
> It seems you need to lock box_promote(), box_promote_qsync(),
> and box_demote(). Otherwise you have the exact same problem with
> local promotions vs coming from the applier as the one you tried
> to fix for applier vs applier.

That's a good point, but as you pointed in previous reviews we
should try to remove locking from api (which i did in new yet
not sent patches) thus we need some kind of a safe filter which
would take a lock, filter request, and release the lock then...
As to me our try to hide locking was a mistake in first place,
locks I proposed simply serialize access to terms they are underlated
to begin/commit/rollback semantics. Actually for now I'm not sure
which approach would fit your architecture ideas. Maybe some
txn_limbo_filter() helper exposed?

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-09-14 19:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-10 15:29 [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Cyrill Gorcunov via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 1/6] qsync: track confirmed lsn number on reads Cyrill Gorcunov via Tarantool-patches
2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-09-12 22:18     ` Cyrill Gorcunov via Tarantool-patches
2021-09-13  8:33     ` Serge Petrenko via Tarantool-patches
2021-09-13  8:50   ` Serge Petrenko via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 2/6] qsync: update confirmed lsn on initial promote request Cyrill Gorcunov via Tarantool-patches
2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-09-12 22:25     ` Cyrill Gorcunov via Tarantool-patches
2021-09-13  8:52       ` Serge Petrenko via Tarantool-patches
2021-09-13 14:20         ` [Tarantool-patches] [RFC] qsync: overall design Cyrill Gorcunov via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 3/6] latch: add latch_is_locked helper Cyrill Gorcunov via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 4/6] qsync: order access to the limbo terms Cyrill Gorcunov via Tarantool-patches
2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-09-13 10:52     ` Cyrill Gorcunov via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 5/6] qsync: filter incoming synchro requests Cyrill Gorcunov via Tarantool-patches
2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-09-14 19:41     ` Cyrill Gorcunov via Tarantool-patches
2021-09-10 15:29 ` [Tarantool-patches] [PATCH v14 6/6] test: add replication/gh-6036-rollback-confirm Cyrill Gorcunov via Tarantool-patches
2021-09-12 15:44   ` Vladislav Shpilevoy via Tarantool-patches
2021-09-12 15:43 ` [Tarantool-patches] [PATCH v14 0/6] qsync: implement packets filtering Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox