Tarantool development patches archive
* [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
@ 2021-07-10 22:28 Cyrill Gorcunov via Tarantool-patches
  2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12  7:54 ` Serge Petrenko via Tarantool-patches
  0 siblings, 2 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-10 22:28 UTC (permalink / raw)
  To: tml; +Cc: Vladislav Shpilevoy

Guys, this is an early RFC since I would like to discuss the
design first before going further. Currently we don't reject
incoming synchro requests, which doesn't allow us to detect a cluster
split-brain situation. As we discussed verbally, there are
a number of signs to detect it, and we need to stop receiving data
from obsolete nodes.

The main problem, though, is that such filtering of incoming packets
should happen at a moment when we can still step back and
inform the peer that the data has been declined, but currently our
applier code processes synchro requests inside the WAL completion
trigger, i.e. when the data is already applied or being rolled back.

Thus we need to separate the "filter" and "apply" stages of processing.
What is more interesting is that we filter incoming requests via an
in-memory vclock and update it immediately. Thus the following situation
is possible -- a PROMOTE request comes in, we remember its term inside
promote_term_map, but then the write to the WAL fails and we never
revert promote_term_map, so the other peer won't be able to
resend us this PROMOTE request because now we think that we've
already applied it.

To solve this I split the processing routine into stages:

 - filter stage: we inspect in-flight packets and remember
   their terms in the @terms_infly vclock
 - apply stage: the data is verified and we try to apply it
   and write it to disk; once everything is fine
   we update the @terms_applied vclock, which serves as a backup
   of @terms_infly
 - error stage: applying the data failed and we revert the
   @terms_infly vclock to the previous value

The stages are processed by the txn_limbo_apply routine, which takes
a mask of the stages it should execute. The old txn_limbo_process is
simply an inline wrapper with the appropriate flags.

Please note that I _didn't_ put the complete conditions into
limbo_op_filter yet, I only moved the current code there.

So take a look once time permits, maybe there is an easier
approach to the code structure.

branch gorcunov/gh-6036-rollback-confirm-notest

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
---
 src/box/applier.cc  |  13 +++-
 src/box/box.cc      |   6 +-
 src/box/txn_limbo.c | 149 ++++++++++++++++++++++++++++++++++++++------
 src/box/txn_limbo.h |  97 ++++++++++++++++++++++------
 4 files changed, 223 insertions(+), 42 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 978383e64..8a44bf1b2 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier)
 				struct synchro_request req;
 				if (xrow_decode_synchro(&row, &req) != 0)
 					diag_raise();
-				txn_limbo_process(&txn_limbo, &req);
+				if (txn_limbo_process(&txn_limbo, &req) != 0)
+					diag_raise();
 			} else if (iproto_type_is_raft_request(row.type)) {
 				struct raft_request req;
 				if (xrow_decode_raft(&row, &req, NULL) != 0)
@@ -850,11 +851,16 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	assert(entry->complete_data != NULL);
 	struct synchro_entry *synchro_entry =
 		(struct synchro_entry *)entry->complete_data;
+	struct synchro_request *req = synchro_entry->req;
+
 	if (entry->res < 0) {
+		txn_limbo_apply(&txn_limbo, req,
+				LIMBO_OP_ERROR | LIMBO_OP_ATTR_PANIC);
 		applier_rollback_by_wal_io(entry->res);
 	} else {
 		replica_txn_wal_write_cb(synchro_entry->rcb);
-		txn_limbo_process(&txn_limbo, synchro_entry->req);
+		txn_limbo_apply(&txn_limbo, synchro_entry->req,
+				LIMBO_OP_APPLY | LIMBO_OP_ATTR_PANIC);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
 	}
 	fiber_wakeup(synchro_entry->owner);
@@ -870,6 +876,9 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
 	if (xrow_decode_synchro(row, &req) != 0)
 		goto err;
 
+	if (txn_limbo_apply(&txn_limbo, &req, LIMBO_OP_FILTER) != 0)
+		goto err;
+
 	struct replica_cb_data rcb_data;
 	struct synchro_entry entry;
 	/*
diff --git a/src/box/box.cc b/src/box/box.cc
index bc68ee4c8..79bacfa08 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -428,7 +428,8 @@ wal_stream_apply_synchro_row(struct wal_stream *stream, struct xrow_header *row)
 		say_error("couldn't decode a synchro request");
 		return -1;
 	}
-	txn_limbo_process(&txn_limbo, &syn_req);
+	if (txn_limbo_process(&txn_limbo, &syn_req) != 0)
+		return -1;
 	return 0;
 }
 
@@ -1701,7 +1702,8 @@ box_clear_synchro_queue(bool demote)
 				.lsn = wait_lsn,
 				.term = term,
 			};
-			txn_limbo_process(&txn_limbo, &req);
+			if (txn_limbo_process(&txn_limbo, &req) != 0)
+				return -1;
 			assert(txn_limbo_is_empty(&txn_limbo));
 		}
 	}
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index d2e4dcb1e..0650702b9 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -37,6 +37,15 @@
 
 struct txn_limbo txn_limbo;
 
+static void
+txn_limbo_promote_create(struct txn_limbo_promote *pmt)
+{
+	vclock_create(&pmt->terms_applied);
+	vclock_create(&pmt->terms_infly);
+	pmt->term_max = 0;
+	pmt->term_max_infly = 0;
+}
+
 static inline void
 txn_limbo_create(struct txn_limbo *limbo)
 {
@@ -45,11 +54,11 @@ txn_limbo_create(struct txn_limbo *limbo)
 	limbo->owner_id = REPLICA_ID_NIL;
 	fiber_cond_create(&limbo->wait_cond);
 	vclock_create(&limbo->vclock);
-	vclock_create(&limbo->promote_term_map);
-	limbo->promote_greatest_term = 0;
 	limbo->confirmed_lsn = 0;
 	limbo->rollback_count = 0;
 	limbo->is_in_rollback = false;
+
+	txn_limbo_promote_create(&limbo->promote);
 }
 
 bool
@@ -305,10 +314,12 @@ void
 txn_limbo_checkpoint(const struct txn_limbo *limbo,
 		     struct synchro_request *req)
 {
+	const struct txn_limbo_promote *pmt = &limbo->promote;
+
 	req->type = IPROTO_PROMOTE;
 	req->replica_id = limbo->owner_id;
 	req->lsn = limbo->confirmed_lsn;
-	req->term = limbo->promote_greatest_term;
+	req->term = pmt->term_max_infly;
 }
 
 void
@@ -696,27 +707,38 @@ txn_limbo_wait_confirm(struct txn_limbo *limbo)
 	return 0;
 }
 
-void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
+static int
+limbo_op_filter(struct txn_limbo *limbo, const struct synchro_request *req)
 {
-	uint64_t term = req->term;
+	struct txn_limbo_promote *pmt = &limbo->promote;
 	uint32_t origin = req->origin_id;
+	uint64_t term = req->term;
+
 	if (txn_limbo_replica_term(limbo, origin) < term) {
-		vclock_follow(&limbo->promote_term_map, origin, term);
-		if (term > limbo->promote_greatest_term)
-			limbo->promote_greatest_term = term;
+		vclock_follow(&pmt->terms_infly, origin, term);
+		if (term > pmt->term_max_infly)
+			pmt->term_max_infly = term;
 	} else if (iproto_type_is_promote_request(req->type) &&
-		   limbo->promote_greatest_term > 1) {
-		/* PROMOTE for outdated term. Ignore. */
-		say_info("RAFT: ignoring %s request from instance "
+		   pmt->term_max_infly > 1) {
+		say_info("RAFT: declining %s request from instance "
 			 "id %u for term %llu. Greatest term seen "
 			 "before (%llu) is bigger.",
-			 iproto_type_name(req->type), origin, (long long)term,
-			 (long long)limbo->promote_greatest_term);
-		return;
+			 iproto_type_name(req->type), origin,
+			 (long long)term,
+			 (long long)pmt->term_max_infly);
+		diag_set(ClientError, ER_UNSUPPORTED, "RAFT",
+			 "backward terms");
+		return -1;
 	}
 
+	return 0;
+}
+
+static int
+limbo_op_apply(struct txn_limbo *limbo, const struct synchro_request *req)
+{
 	int64_t lsn = req->lsn;
+
 	if (req->replica_id == REPLICA_ID_NIL) {
 		/*
 		 * The limbo was empty on the instance issuing the request.
@@ -731,7 +753,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 		 * confirm right on synchronous transaction recovery.
 		 */
 		if (!iproto_type_is_promote_request(req->type))
-			return;
+			goto out;
 		/*
 		 * Promote has a bigger term, and tries to steal the limbo. It
 		 * means it probably was elected with a quorum, and it makes no
@@ -740,6 +762,7 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 		 */
 		lsn = 0;
 	}
+
 	switch (req->type) {
 	case IPROTO_CONFIRM:
 		txn_limbo_read_confirm(limbo, lsn);
@@ -754,9 +777,99 @@ txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
 		txn_limbo_read_demote(limbo, lsn);
 		break;
 	default:
-		unreachable();
+		panic("limbo: unreachable stage detected");
+		break;
 	}
-	return;
+
+
+out:
+	struct txn_limbo_promote *pmt = &limbo->promote;
+	uint32_t replica_id = req->origin_id;
+	uint64_t term = req->term;
+
+	uint64_t v = vclock_get(&pmt->terms_applied, replica_id);
+	if (v < term) {
+		vclock_follow(&pmt->terms_applied, replica_id, term);
+		if (term > pmt->term_max)
+			pmt->term_max = term;
+	}
+	return 0;
+}
+
+static int
+limbo_op_error(struct txn_limbo *limbo, const struct synchro_request *req)
+{
+	struct txn_limbo_promote *pmt = &limbo->promote;
+	uint32_t replica_id = req->origin_id;
+	/*
+	 * Restore to the applied value in case of error;
+	 * this allows the entry to be reapplied once the remote
+	 * node gets the error and resends the data.
+	 */
+	uint64_t v = vclock_get(&pmt->terms_applied, replica_id);
+	vclock_reset(&pmt->terms_infly, replica_id, v);
+
+	/*
+	 * The max value has to be recalculated in a linear
+	 * form, the errors should not happen frequently so
+	 * this is not a hot path.
+	 */
+	int64_t maxv = 0;
+	struct vclock_iterator it;
+	vclock_iterator_init(&it, &pmt->terms_infly);
+	vclock_foreach(&it, r)
+		maxv = r.lsn > maxv ? r.lsn : maxv;
+	pmt->term_max_infly = maxv;
+	return 0;
+}
+
+static int (*limbo_apply_ops[LIMBO_OP_MAX])
+	(struct txn_limbo *limbo, const struct synchro_request *req) = {
+	[LIMBO_OP_FILTER_BIT]	= limbo_op_filter,
+	[LIMBO_OP_APPLY_BIT]	= limbo_op_apply,
+	[LIMBO_OP_ERROR_BIT]	= limbo_op_error,
+};
+
+static const char *
+limbo_apply_op_str(unsigned int bit)
+{
+	static const char *str[] = {
+		[LIMBO_OP_FILTER_BIT]	= "LIMBO_OP_FILTER",
+		[LIMBO_OP_APPLY_BIT]	= "LIMBO_OP_APPLY",
+		[LIMBO_OP_ERROR_BIT]	= "LIMBO_OP_ERROR",
+	};
+
+	if (bit < lengthof(limbo_apply_ops))
+		return str[bit];
+
+	return "UNKNOWN";
+}
+
+int
+txn_limbo_apply(struct txn_limbo *limbo,
+		const struct synchro_request *req,
+		unsigned int op_mask)
+{
+	unsigned int mask = op_mask & LIMBO_OP_MASK;
+	unsigned int bit = LIMBO_OP_MIN;
+
+	while (mask) {
+		if (mask & 1) {
+			if (limbo_apply_ops[bit] == NULL)
+				panic("limbo: empty apply operation");
+			say_info("limbo: apply operation %s",
+				 limbo_apply_op_str(bit));
+			if (limbo_apply_ops[bit](limbo, req) != 0) {
+				if (op_mask & LIMBO_OP_ATTR_PANIC)
+					panic("limbo: panicing op");
+				return -1;
+			}
+		}
+		mask >>= 1;
+		bit++;
+	};
+
+	return 0;
 }
 
 void
diff --git a/src/box/txn_limbo.h b/src/box/txn_limbo.h
index 8e7315947..b93450f65 100644
--- a/src/box/txn_limbo.h
+++ b/src/box/txn_limbo.h
@@ -75,6 +75,40 @@ txn_limbo_entry_is_complete(const struct txn_limbo_entry *e)
 	return e->is_commit || e->is_rollback;
 }
 
+/**
+ * Keeps the state of PROMOTE requests to handle split-brain
+ * situations and other errors.
+ */
+struct txn_limbo_promote {
+	/**
+	 * Latest terms received with PROMOTE entries from remote instances.
+	 * Limbo uses them to filter out the transactions coming not from the
+	 * limbo owner, but so outdated that they are rolled back everywhere
+	 * except outdated nodes.
+	 */
+	struct vclock terms_applied;
+	/**
+	 * Infly representation of @a terms_applied; the term might
+	 * not yet be written to the WAL but is already seen.
+	 */
+	struct vclock terms_infly;
+	/**
+	 * The biggest PROMOTE term seen by the instance and persisted in WAL.
+	 * It is related to raft term, but not the same. Synchronous replication
+	 * represented by the limbo is interested only in the won elections
+	 * ended with PROMOTE request.
+	 * It means the limbo's term might be smaller than the raft term, while
+	 * there are ongoing elections, or the leader is already known and this
+	 * instance hasn't read its PROMOTE request yet. During other times the
+	 * limbo and raft are in sync and the terms are the same.
+	 */
+	uint64_t term_max;
+	/**
+	 * Infly representation of @a term_max.
+	 */
+	uint64_t term_max_infly;
+};
+
 /**
  * Limbo is a place where transactions are stored, which are
  * finished, but not committed nor rolled back. These are
@@ -130,23 +164,9 @@ struct txn_limbo {
 	 */
 	struct vclock vclock;
 	/**
-	 * Latest terms received with PROMOTE entries from remote instances.
-	 * Limbo uses them to filter out the transactions coming not from the
-	 * limbo owner, but so outdated that they are rolled back everywhere
-	 * except outdated nodes.
+	 * To track PROMOTE requests.
 	 */
-	struct vclock promote_term_map;
-	/**
-	 * The biggest PROMOTE term seen by the instance and persisted in WAL.
-	 * It is related to raft term, but not the same. Synchronous replication
-	 * represented by the limbo is interested only in the won elections
-	 * ended with PROMOTE request.
-	 * It means the limbo's term might be smaller than the raft term, while
-	 * there are ongoing elections, or the leader is already known and this
-	 * instance hasn't read its PROMOTE request yet. During other times the
-	 * limbo and raft are in sync and the terms are the same.
-	 */
-	uint64_t promote_greatest_term;
+	struct txn_limbo_promote promote;
 	/**
 	 * Maximal LSN gathered quorum and either already confirmed in WAL, or
 	 * whose confirmation is in progress right now. Any attempt to confirm
@@ -218,7 +238,8 @@ txn_limbo_last_entry(struct txn_limbo *limbo)
 static inline uint64_t
 txn_limbo_replica_term(const struct txn_limbo *limbo, uint32_t replica_id)
 {
-	return vclock_get(&limbo->promote_term_map, replica_id);
+	const struct txn_limbo_promote *pmt = &limbo->promote;
+	return vclock_get(&pmt->terms_infly, replica_id);
 }
 
 /**
@@ -229,8 +250,9 @@ static inline bool
 txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
 			      uint32_t replica_id)
 {
+	const struct txn_limbo_promote *pmt = &limbo->promote;
 	return txn_limbo_replica_term(limbo, replica_id) <
-	       limbo->promote_greatest_term;
+	       pmt->term_max_infly;
 }
 
 /**
@@ -300,9 +322,44 @@ txn_limbo_ack(struct txn_limbo *limbo, uint32_t replica_id, int64_t lsn);
 int
 txn_limbo_wait_complete(struct txn_limbo *limbo, struct txn_limbo_entry *entry);
 
+enum {
+	LIMBO_OP_MIN		= 0,
+	LIMBO_OP_FILTER_BIT	= 0,
+	LIMBO_OP_APPLY_BIT	= 1,
+	LIMBO_OP_ERROR_BIT	= 2,
+	LIMBO_OP_MAX,
+
+	LIMBO_OP_ATTR_MIN	= LIMBO_OP_MAX,
+	LIMBO_OP_ATTR_PANIC_BIT	= LIMBO_OP_ATTR_MIN + 1,
+	LIMBO_OP_ATTR_MAX,
+};
+
+enum {
+	LIMBO_OP_FILTER		= (1u << LIMBO_OP_FILTER_BIT),
+	LIMBO_OP_APPLY		= (1u << LIMBO_OP_APPLY_BIT),
+	LIMBO_OP_ERROR		= (1u << LIMBO_OP_ERROR_BIT),
+	LIMBO_OP_MASK		= (1u << LIMBO_OP_MAX) - 1,
+
+	LIMBO_OP_ATTR_PANIC	= (1u << LIMBO_OP_ATTR_PANIC_BIT),
+};
+
+/**
+ * Apply a synchronous replication request, the @a op_mask
+ * specifies stages to process.
+ */
+int
+txn_limbo_apply(struct txn_limbo *limbo,
+		const struct synchro_request *req,
+		unsigned int op_mask);
+
 /** Execute a synchronous replication request. */
-void
-txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req);
+static inline int
+txn_limbo_process(struct txn_limbo *limbo,
+		  const struct synchro_request *req)
+{
+	const int op_mask = LIMBO_OP_FILTER | LIMBO_OP_APPLY;
+	return txn_limbo_apply(limbo, req, op_mask);
+}
 
 /**
  * Waiting for confirmation of all "sync" transactions
-- 
2.31.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-10 22:28 [Tarantool-patches] [PATCH] limbo: introduce request processing hooks Cyrill Gorcunov via Tarantool-patches
@ 2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-11 18:22   ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
  2021-07-12  7:54 ` Serge Petrenko via Tarantool-patches
  1 sibling, 2 replies; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-11 14:00 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml

Hi! Thanks for the patch!

On 11.07.2021 00:28, Cyrill Gorcunov wrote:
> Guys, this is an early rfc since I would like to discuss the
> design first before going further. Currently we don't interrupt
> incoming syncro requests which doesn't allow us to detect cluster
> split-brain situation, as we were discussing verbally there are
> a number of sign to detect it and we need to stop receiving data
> from obsolete nodes.
> 
> The main problem though is that such filtering of incoming packets
> should happen at the moment where we still can do a step back and
> inform the peer that data has been declined, but currently our
> applier code process syncro requests inside WAL trigger, ie when
> data is already applied or rolling back.
> 
> Thus we need to separate "filer" and "apply" stages of processing.
> What is more interesting is that we filter incomings via in memory
> vclock and update them immediately. Thus the following situation
> is possible -- a promote request comes in, we remember it inside
> promote_term_map but then write to WAL fails and we never revert
> the promote_term_map variable, thus other peer won't be able to
> resend us this promote request because now we think that we've
> alreday applied the promote.

Well, I still don't understand what the issue is. We discussed it
privately already. You simply should not apply anything until WAL
write is done. And it is not happening now on the master. The
terms vclock is updated only **after** WAL write.

Why do you need all these new vclocks if you should not apply
anything before WAL write in the first place?

> write to WAL fails and we never revert
> the promote_term_map variable

This simply should not be possible. The term map is updated only
after WAL write is done. At least this is how it works now, doesn't
it? Why did you change it? (In case you did, because I am not sure,
I didn't review the code thoroughly).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-11 18:22   ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12  8:03     ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
  1 sibling, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-11 18:22 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Sun, Jul 11, 2021 at 04:00:35PM +0200, Vladislav Shpilevoy wrote:
> > 
> > Thus we need to separate "filer" and "apply" stages of processing.
> > What is more interesting is that we filter incomings via in memory
> > vclock and update them immediately. Thus the following situation
> > is possible -- a promote request comes in, we remember it inside
> > promote_term_map but then write to WAL fails and we never revert
> > the promote_term_map variable, thus other peer won't be able to
> > resend us this promote request because now we think that we've
> > alreday applied the promote.
> 
> Well, I still don't understand what the issue is. We discussed it
> privately already. You simply should not apply anything until WAL
> write is done. And it is not happening now on the master. The
> terms vclock is updated only **after** WAL write.

Currently we don't reject the data but silently ignore unexpected
requests after we've written them. In the new approach we should decide
whether an incoming packet can be applied and written to the WAL, or
whether we should filter it out and return an error to the client. If
we filter out a request, we must not modify our WAL content at all.

> Why do you need all these new vclocks if you should not apply
> anything before WAL write in the first place?
> 
> > write to WAL fails and we never revert
> > the promote_term_map variable
> 
> This simply should not be possible. The term map is updated only
> after WAL write is done. At least this is how it works now, doesn't
> it? Why did you change it? (In case you did, because I am not sure,
> I didn't review the code throughly).

Look, I think this may not be explicitly visible from the patch since
the diff is big enough. But here is the key moment. The current code is
called after the WAL is updated

void
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
	uint64_t term = req->term;
	uint32_t origin = req->origin_id;
	if (txn_limbo_replica_term(limbo, origin) < term) {
		vclock_follow(&limbo->promote_term_map, origin, term);
		if (term > limbo->promote_greatest_term) {
			limbo->promote_greatest_term = term;
		} else if (req->type == IPROTO_PROMOTE) {
			/*
			 * PROMOTE for outdated term. Ignore.
			 */
-->			return;
		}
	}

we exit early without any error.

In the new approach we start reporting an error when this situation
happens, because it is a split-brain. But we must not write this
term into our WAL file, thus we should validate the incoming packet
earlier.

Now imagine the following: we validated the incoming packet and
remembered its term in promote_term_map, then we started writing
this packet into our WAL and the write procedure failed. As a result,
our promote_term_map remains updated but the real data on disk
won't match it. What is worse, such a request may have
updated @promote_greatest_term as well. As a result, our
test for obsolete replicas will give a false positive

static inline bool
txn_limbo_is_replica_outdated(const struct txn_limbo *limbo,
			      uint32_t replica_id)
{
	return txn_limbo_replica_term(limbo, replica_id) <
	       limbo->promote_greatest_term;
}

because promote_greatest_term is already updated, but
not at the WAL level.

So I had to split our txn_limbo_process into two stages:
"filter" and "application". If the application stage fails,
we restore the original max term so the replica will be able
to resend us the PROMOTE request and we can try to write it
to the WAL again.
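
To make the false positive concrete, here is a tiny standalone toy
(illustration only, not Tarantool code; the plain variables stand in
for promote_term_map and promote_greatest_term):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for promote_term_map and promote_greatest_term. */
static uint64_t term_map[8];
static uint64_t greatest_term;

/* Same check as txn_limbo_is_replica_outdated(). */
static bool
is_replica_outdated(uint32_t replica_id)
{
	return term_map[replica_id] < greatest_term;
}

int
main(void)
{
	/* Replicas 1 and 2 are both at term 2. */
	term_map[1] = 2;
	term_map[2] = 2;
	greatest_term = 2;

	/*
	 * A PROMOTE with term 3 arrives from replica 1, passes
	 * validation and bumps the in-memory state...
	 */
	term_map[1] = 3;
	greatest_term = 3;
	/* ...but its WAL write fails and nothing is reverted. */

	/*
	 * Replica 1 cannot resend the PROMOTE (we think term 3 is
	 * already applied), and replica 2 now looks outdated even
	 * though no PROMOTE for term 3 ever reached the WAL.
	 */
	assert(term_map[1] == 3);
	assert(is_replica_outdated(2));
	return 0;
}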

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-10 22:28 [Tarantool-patches] [PATCH] limbo: introduce request processing hooks Cyrill Gorcunov via Tarantool-patches
  2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12  7:54 ` Serge Petrenko via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-07-12  7:54 UTC (permalink / raw)
  To: Cyrill Gorcunov, tml; +Cc: Vladislav Shpilevoy



11.07.2021 01:28, Cyrill Gorcunov пишет:
> Guys, this is an early rfc since I would like to discuss the
> design first before going further. Currently we don't interrupt
> incoming syncro requests which doesn't allow us to detect cluster
> split-brain situation, as we were discussing verbally there are
> a number of sign to detect it and we need to stop receiving data
> from obsolete nodes.
>
> The main problem though is that such filtering of incoming packets
> should happen at the moment where we still can do a step back and
> inform the peer that data has been declined, but currently our
> applier code process syncro requests inside WAL trigger, ie when
> data is already applied or rolling back.
>
> Thus we need to separate "filer" and "apply" stages of processing.
> What is more interesting is that we filter incomings via in memory
> vclock and update them immediately. Thus the following situation
> is possible -- a promote request comes in, we remember it inside
> promote_term_map but then write to WAL fails and we never revert
> the promote_term_map variable, thus other peer won't be able to
> resend us this promote request because now we think that we've
> alreday applied the promote.
>
> To solve this I split processing routine into stages:
>
>   - filter stage: we investigate infly packets and remember
>     their terms in @terms_infly vclock
>   - apply stage: the data is verified and we try to apply the
>     data and write it to the disk, once everything is fine
>     we update @terms_applied vclock which serves as a backup
>     of @terms_infly
>   - error stage: data application failed and we should revert
>     @terms_infly vclock to the previous value
>
> The stages are processed by txn_limbo_apply routine which takes
> a mask of stages it should execute. Old txn_limbo_process is
> simply an inline wrapper with appropriate flags.
>
> Please note that I _didn't_ put complete conditions into
> limbo_op_filter yet only moved current code there.
>
> So take a look once time permit, maybe there some easier
> approach in code structure.
>
> branch gorcunov/gh-6036-rollback-confirm-notest
>
> Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
> ---

Hi! Thanks for the patch!

The general idea looks good to me.
Please, find a couple of comments below.

>   src/box/applier.cc  |  13 +++-
>   src/box/box.cc      |   6 +-
>   src/box/txn_limbo.c | 149 ++++++++++++++++++++++++++++++++++++++------
>   src/box/txn_limbo.h |  97 ++++++++++++++++++++++------
>   4 files changed, 223 insertions(+), 42 deletions(-)
>
> diff --git a/src/box/applier.cc b/src/box/applier.cc
> index 978383e64..8a44bf1b2 100644
> --- a/src/box/applier.cc
> +++ b/src/box/applier.cc
> @@ -458,7 +458,8 @@ applier_wait_snapshot(struct applier *applier)
>   				struct synchro_request req;
>   				if (xrow_decode_synchro(&row, &req) != 0)
>   					diag_raise();
> -				txn_limbo_process(&txn_limbo, &req);
> +				if (txn_limbo_process(&txn_limbo, &req) != 0)
> +					diag_raise();
>   			} else if (iproto_type_is_raft_request(row.type)) {
>   				struct raft_request req;
>   				if (xrow_decode_raft(&row, &req, NULL) != 0)
> @@ -850,11 +851,16 @@ apply_synchro_row_cb(struct journal_entry *entry)
>   	assert(entry->complete_data != NULL);
>   	struct synchro_entry *synchro_entry =
>   		(struct synchro_entry *)entry->complete_data;
> +	struct synchro_request *req = synchro_entry->req;
> +
>   	if (entry->res < 0) {
> +		txn_limbo_apply(&txn_limbo, req,
> +				LIMBO_OP_ERROR | LIMBO_OP_ATTR_PANIC);

I don't like that you pass LIMBO_OP_ATTR_PANIC in a bitfield.
We usually don't do such things. It would be nice if you could
get rid of this flag or maybe pass it as bool.

Your approach might be totally fine though, that's just my POV.

>   		applier_rollback_by_wal_io(entry->res);
>   	} else {
>   		replica_txn_wal_write_cb(synchro_entry->rcb);
> -		txn_limbo_process(&txn_limbo, synchro_entry->req);
> +		txn_limbo_apply(&txn_limbo, synchro_entry->req,
> +				LIMBO_OP_APPLY | LIMBO_OP_ATTR_PANIC);
>   		trigger_run(&replicaset.applier.on_wal_write, NULL);
>   	}
>   	fiber_wakeup(synchro_entry->owner);
> @@ -870,6 +876,9 @@ apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
>   	if (xrow_decode_synchro(row, &req) != 0)
>   		goto err;
>   
> +	if (txn_limbo_apply(&txn_limbo, &req, LIMBO_OP_FILTER) != 0)
> +		goto err;
> +

Maybe simply call txn_limbo_filter() here?
You always know which function to call at compile time, so there's no
need to pass OP_FILTER, OP_APPLY, etc as a parameter IMO.

Same in other places.

>   	struct replica_cb_data rcb_data;
>   	struct synchro_entry entry;
>   	/*


...


> +
> +static int
> +limbo_op_error(struct txn_limbo *limbo, const struct synchro_request *req)
> +{
> +	struct txn_limbo_promote *pmt = &limbo->promote;
> +	uint32_t replica_id = req->origin_id;
> +	/*
> +	 * Restore to the applied value in case of error,
> +	 * this will allow to reapply the entry when remote
> +	 * node get error and try to resend data.
> +	 */
> +	uint64_t v = vclock_get(&pmt->terms_applied, replica_id);
> +	vclock_reset(&pmt->terms_infly, replica_id, v);

Looks like the following situation is possible:

The same instance writes two consecutive PROMOTEs for terms
i and i+1.

When some replica receives the promotes, they both pass the
filter stage, so vclock_infly[instance_id] = i+1
and vclock_applied[instance_id] = i - 1.

Then the first promote succeeds and the second one fails.

Isn't it possible that vclock_infly would be reset to i - 1
instead of i?

AFAICS in the tx_complete_batch() code in wal.c, rollbacks are processed
earlier than commits, so the situation above is possible.
Am I wrong? If not, we need to handle this somehow. Probably remember
the previous term to roll back to in each request after the filter() stage.
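
To illustrate that last suggestion: if the filter stage saved the value
it overwrites into the request itself, the error stage could restore
exactly that value regardless of completion order. A standalone toy
sketch (not the actual patch; prev_term and the toy_request struct are
hypothetical):

#include <assert.h>
#include <stdint.h>

/* Toy stand-ins for terms_infly / terms_applied of a single origin. */
static uint64_t term_infly;
static uint64_t term_applied;

struct toy_request {
	uint64_t term;
	/* Hypothetical: what terms_infly held when this request was filtered. */
	uint64_t prev_term;
};

/* Filter stage: remember what we are about to overwrite. */
static void
filter(struct toy_request *req)
{
	req->prev_term = term_infly;
	term_infly = req->term;
}

/* WAL commit: the term becomes durable. */
static void
on_commit(struct toy_request *req)
{
	term_applied = req->term;
}

/* WAL rollback: restore exactly what this request saw, not terms_applied. */
static void
on_error(struct toy_request *req)
{
	term_infly = req->prev_term;
}

int
main(void)
{
	struct toy_request p1 = { .term = 10, .prev_term = 0 }; /* PROMOTE, term i */
	struct toy_request p2 = { .term = 11, .prev_term = 0 }; /* PROMOTE, term i+1 */

	filter(&p1);
	filter(&p2);
	/* tx_complete_batch() delivers the rollback of p2 before the commit of p1. */
	on_error(&p2);
	on_commit(&p1);
	/*
	 * We end up at term i; a reset to terms_applied would have dropped
	 * terms_infly below i instead, which is the reordering above.
	 */
	assert(term_infly == 10 && term_applied == 10);
	return 0;
}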


> +
> +	/*
> +	 * The max value has to be recalculated in a linear
> +	 * form, the errors should not happen frequently so
> +	 * this is not a hot path.
> +	 */
> +	int64_t maxv = 0;
> +	struct vclock_iterator it;
> +	vclock_iterator_init(&it, &pmt->terms_infly);
> +	vclock_foreach(&it, r)
> +		maxv = r.lsn > maxv ? r.lsn : maxv;
> +	pmt->term_max_infly = maxv;
> +	return 0;
> +}
> +
>

...

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-11 18:22   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
  2021-07-12  8:04     ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12  9:43     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 2 replies; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-07-12  8:01 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Cyrill Gorcunov, tml



11.07.2021 17:00, Vladislav Shpilevoy пишет:
> Hi! Thanks for the patch!
>
> On 11.07.2021 00:28, Cyrill Gorcunov wrote:
>> Guys, this is an early rfc since I would like to discuss the
>> design first before going further. Currently we don't interrupt
>> incoming syncro requests which doesn't allow us to detect cluster
>> split-brain situation, as we were discussing verbally there are
>> a number of sign to detect it and we need to stop receiving data
>> from obsolete nodes.
>>
>> The main problem though is that such filtering of incoming packets
>> should happen at the moment where we still can do a step back and
>> inform the peer that data has been declined, but currently our
>> applier code process syncro requests inside WAL trigger, ie when
>> data is already applied or rolling back.
>>
>> Thus we need to separate "filer" and "apply" stages of processing.
>> What is more interesting is that we filter incomings via in memory
>> vclock and update them immediately. Thus the following situation
>> is possible -- a promote request comes in, we remember it inside
>> promote_term_map but then write to WAL fails and we never revert
>> the promote_term_map variable, thus other peer won't be able to
>> resend us this promote request because now we think that we've
>> alreday applied the promote.
> Well, I still don't understand what the issue is. We discussed it
> privately already. You simply should not apply anything until WAL
> write is done. And it is not happening now on the master. The
> terms vclock is updated only **after** WAL write.
>
> Why do you need all these new vclocks if you should not apply
> anything before WAL write in the first place?

If I understand correctly, the issue is that if we filter (and check for
the split brain) after the WAL write, we will end up with a conflicting
PROMOTE in our WAL. Cyrill is trying to avoid this, that's why
he's separating the filter stage. This way the error will reach
the remote peer before any WAL write, and the WAL write won't happen.

And if we filter before the WAL write, we need the second vclock, which
Cyrill has introduced.

We may leave conflicting PROMOTEs in the WAL (first write them and only
then check for conflicts). In this case this whole patch isn't
needed. But I personally don't like such an approach.
>
>> write to WAL fails and we never revert
>> the promote_term_map variable
> This simply should not be possible. The term map is updated only
> after WAL write is done. At least this is how it works now, doesn't
> it? Why did you change it? (In case you did, because I am not sure,
> I didn't review the code throughly).

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-11 18:22   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12  8:03     ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12  8:09       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-12  8:03 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

>> This simply should not be possible. The term map is updated only
>> after WAL write is done. At least this is how it works now, doesn't
>> it? Why did you change it? (In case you did, because I am not sure,
>> I didn't review the code throughly).
> 
> Look, I think this maybe not explicitly seen from the patch since
> diff is big enough. But here is a key moment. Current code is called
> after WAL is updated
> 
> void
> txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
> {
> 	uint64_t term = req->term;
> 	uint32_t origin = req->origin_id;
> 	if (txn_limbo_replica_term(limbo, origin) < term) {
> 		vclock_follow(&limbo->promote_term_map, origin, term);
> 		if (term > limbo->promote_greatest_term) {
> 			limbo->promote_greatest_term = term;
> 		} else if (req->type == IPROTO_PROMOTE) {
> 			/*
> 			 * PROMOTE for outdated term. Ignore.
> 			 */
> -->			return;
> 		}
> 	}
> 
> we exit early without any error.
> 
> In new approach we start reporting an error if this situation is
> happening, 'cause it is a split brain. But we shall not write this
> term into our WAL file, thus we should validate incoming packet
> earlier.
> 
> Now imagine the following: we validated the incoming packet and
> remember its term in promote_term_map, then we start writting
> this packet into our WAL and write procedure failed.

That is the core problem of your entire approach - why do you imagine
we update promote_term_map before writing to WAL? We do not do that.
And you should not do that. Before the WAL write there should not be any
changes. **Zero changes before the WAL write**. Before you write to the WAL,
you can only validate requests. Using `const struct txn_limbo *`. With
zero changes.
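
For what it's worth, a const-only filter along those lines might look
roughly like this (a hypothetical sketch of the suggestion, reusing
helpers that already exist in the tree; it is not the final condition
set and is not meant to compile on its own):

static int
txn_limbo_filter(const struct txn_limbo *limbo,
		 const struct synchro_request *req)
{
	/*
	 * Read-only check: decline a PROMOTE for an outdated term
	 * with an error, without touching any limbo state.
	 */
	if (iproto_type_is_promote_request(req->type) &&
	    txn_limbo_replica_term(limbo, req->origin_id) >= req->term) {
		diag_set(ClientError, ER_UNSUPPORTED, "RAFT",
			 "backward terms");
		return -1;
	}
	return 0;
}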

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
@ 2021-07-12  8:04     ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12  8:12       ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12  9:43     ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-12  8:04 UTC (permalink / raw)
  To: Serge Petrenko, Cyrill Gorcunov, tml

On 12.07.2021 10:01, Serge Petrenko via Tarantool-patches wrote:
> 
> 
> 11.07.2021 17:00, Vladislav Shpilevoy пишет:
>> Hi! Thanks for the patch!
>>
>> On 11.07.2021 00:28, Cyrill Gorcunov wrote:
>>> Guys, this is an early rfc since I would like to discuss the
>>> design first before going further. Currently we don't interrupt
>>> incoming syncro requests which doesn't allow us to detect cluster
>>> split-brain situation, as we were discussing verbally there are
>>> a number of sign to detect it and we need to stop receiving data
>>> from obsolete nodes.
>>>
>>> The main problem though is that such filtering of incoming packets
>>> should happen at the moment where we still can do a step back and
>>> inform the peer that data has been declined, but currently our
>>> applier code process syncro requests inside WAL trigger, ie when
>>> data is already applied or rolling back.
>>>
>>> Thus we need to separate "filer" and "apply" stages of processing.
>>> What is more interesting is that we filter incomings via in memory
>>> vclock and update them immediately. Thus the following situation
>>> is possible -- a promote request comes in, we remember it inside
>>> promote_term_map but then write to WAL fails and we never revert
>>> the promote_term_map variable, thus other peer won't be able to
>>> resend us this promote request because now we think that we've
>>> alreday applied the promote.
>> Well, I still don't understand what the issue is. We discussed it
>> privately already. You simply should not apply anything until WAL
>> write is done. And it is not happening now on the master. The
>> terms vclock is updated only **after** WAL write.
>>
>> Why do you need all these new vclocks if you should not apply
>> anything before WAL write in the first place?
> 
> If I understand correctly, the issue is that if we filter (and check for
> the split brain) after the WAL write, we will end up with a conflicting
> PROMOTE in our WAL. Cyrill is trying to avoid this, that's why
> he's separating the filter stage. This way the error will reach
> the remote peer before any WAL write, and the WAL write won't happen.
> 
> And if we filter before the WAL write, we need the second vclock, which
> Cyrill has introduced.

Why do you need a second vclock? Why can't you just filter by the
existing vclock and update it after WAL write like now?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:03     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12  8:09       ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12  8:09 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jul 12, 2021 at 10:03:32AM +0200, Vladislav Shpilevoy wrote:
> > 
> > Now imagine the following: we validated the incoming packet and
> > remember its term in promote_term_map, then we start writting
> > this packet into our WAL and write procedure failed.
> 
> That is the core problem of your entire approach - why do you imagine
> we update promote_term_map before writing to WAL? We do not do that.

We don't do that now, and this is what this patch is all about - we must
not write a PROMOTE from a split-brained node into our WAL file at all;
for exactly this reason I split the processing into phases: "filter" and
"application".

> And you should not do that. Before WAL write there should no be any
> changes. **Zero changes before WAL write**. Before you write to WAL,
> you can only validate requests. Using `const struct txn_limbo *`. With
> zero changes.

If we write broken PROMOTE requests into our WAL, then this patch is not
needed at all.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:04     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12  8:12       ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12  8:23         ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 2 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12  8:12 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jul 12, 2021 at 10:04:56AM +0200, Vladislav Shpilevoy wrote:
> > 
> > And if we filter before the WAL write, we need the second vclock, which
> > Cyrill has introduced.
> 
> Why do you need a second vclock? Why can't you just filter by the
> existing vclock and update it after WAL write like now?

Because the phases are no longer atomic. We can pass the "filter" stage
and update our terms, but then the WAL write fails (it doesn't matter for
what reason, maybe a single disk write failure), so we have to revert
the term to its former value so that the client will retry the operation
and resend us the PROMOTE.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:12       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12  8:23         ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
  1 sibling, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12  8:23 UTC (permalink / raw)
  To: Vladislav Shpilevoy, Serge Petrenko; +Cc: tml

On Mon, Jul 12, 2021 at 11:12:12AM +0300, Cyrill Gorcunov wrote:
> On Mon, Jul 12, 2021 at 10:04:56AM +0200, Vladislav Shpilevoy wrote:
> > > 
> > > And if we filter before the WAL write, we need the second vclock, which
> > > Cyrill has introduced.
> > 
> > Why do you need a second vclock? Why can't you just filter by the
> > existing vclock and update it after WAL write like now?
> 
> Because the phases are no longer atomic. We can pass "filter" stage,
> update our terms, but then WAL process failed (it doesn't matter for
> what reason, maybe single disk write failure) so we have to revert
> former term value back so the client will retry the operation and
> resend us the PROMOTE.

Here is a kind of top view over the code (the interesting snippet is
of course the apply_synchro_row routine).

Current code

apply_synchro_row
  journal_write
    apply_synchro_row_cb
      txn_limbo_apply
        - update terms map
        - ignore bad promotes

Thus we have just written a bad PROMOTE into our WAL.

New approach

apply_synchro_row
  txn_limbo_apply(LIMBO_OP_FILTER)
    - update terms_infly if allowed
    - exit with error if prohibited
  journal_write
    - apply_synchro_row_cb if everything is fine
    - journal write failed
        txn_limbo_apply(LIMBO_OP_ERROR)
          restore terms_infly to previous _written_ value

	Cyrill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
  2021-07-12  8:04     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12  9:43     ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 15:48       ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12  9:43 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tml

On Mon, Jul 12, 2021 at 11:01:29AM +0300, Serge Petrenko wrote:
> > Well, I still don't understand what the issue is. We discussed it
> > privately already. You simply should not apply anything until WAL
> > write is done. And it is not happening now on the master. The
> > terms vclock is updated only **after** WAL write.
> > 
> > Why do you need all these new vclocks if you should not apply
> > anything before WAL write in the first place?
> 
> If I understand correctly, the issue is that if we filter (and check for
> the split brain) after the WAL write, we will end up with a conflicting
> PROMOTE in our WAL. Cyrill is trying to avoid this, that's why
> he's separating the filter stage. This way the error will reach
> the remote peer before any WAL write, and the WAL write won't happen.
> 
> And if we filter before the WAL write, we need the second vclock, which
> Cyrill has introduced.
> 
> We may leave confligting PROMOTEs in WAL (first write them and only
> then check for conflicts). In this case this whole patch isn't
> needed. But I personally don't like such an approach.

Exactly. What is more interesting is that each term is involved
in the maximal term calculation.

Imagine we have several appliers running (with early filtering
of incoming packets):

applier 1                       applier 2
---------                       ---------
applier_apply_tx
  apply_synchro_row
    filter
      - update term
      - calculate term_max
    journal_write
      (sleeping)
                                applier_apply_tx
                                  applier_synchro_filter_tx(rows);
                                      - the term_max should be already
                                        updated so we could NOP data
                                    journal_write
                                      (will be woken after applier 1
                                       finished its write)

Actually I'm starting to think that we should not NOP the data until
applier 1's write has finished...
                                      


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  9:43     ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12 15:48       ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 16:49         ` Serge Petrenko via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12 15:48 UTC (permalink / raw)
  To: Serge Petrenko, Vladislav Shpilevoy; +Cc: tml

Guys, there are some moments even in the current code structure which
look somewhat strange, so I would like to discuss them. Let's consider
the following case: one replica sends us a PROMOTE request (assume
we're sitting in term 2 and the max term is 2).

applier 1
---------
applier_apply_tx
  (promote term = 3
   current max term = 2)
  applier_synchro_filter_tx
  apply_synchro_row
    journal_write
      (sleeping)

At this moment another applier comes in with obsolete
data and term 2:
                              applier 2
                              ---------
                              applier_apply_tx
                                (term 2)
                                applier_synchro_filter_tx
                                  txn_limbo_is_replica_outdated -> false
                                journal_write (sleep)

applier 1
---------
journal wakes up
  apply_synchro_row_cb
    set max term to 3
return to applier read and
applier 2 could finish its write
and wake up too

At this moment the data from applier 2 is actually queued
for writing as valid, but we just wrote term 3, so if we had
been updating the terms map earlier (before the journal write), the data
from applier 2 would have been NOPified. I think there is a problem
here: because of the journal write lag the data is not what it would be
if the terms map were updated early. Serge, Vlad, am I right? What
consequences can there be? IOW, shouldn't we track terms earlier even
without my filter series?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 15:48       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12 16:49         ` Serge Petrenko via Tarantool-patches
  2021-07-12 17:04           ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-07-12 16:49 UTC (permalink / raw)
  To: Cyrill Gorcunov, Vladislav Shpilevoy; +Cc: tml



12.07.2021 18:48, Cyrill Gorcunov пишет:
> Guys, there are some moments even in current code structure which
> looks somehow strange so I would like to discuss them. Lets consider
> the following case: one replic sends us a promote request (assume
> we're sitting in term 2 and max term is 2).
>
> applier 1
> ---------
> applier_apply_tx
>    (promote term = 3
>     current max term = 2)
>    applier_synchro_filter_tx
>    apply_synchro_row
>      journal_write
>        (sleeping)
>
> at this moment another applier comes in with obsolete
> data and term 2
>                                applier 2
>                                ---------
>                                applier_apply_tx
>                                  (term 2)
>                                  applier_synchro_filter_tx
>                                    txn_limbo_is_replica_outdated -> false
>                                  journal_write (sleep)
>
> applier 1
> ---------
> journal wakes up
>    apply_synchro_row_cb
>      set max term to 3
> return to applier read and
> applier 2 could finish its write
> and wake up too
>
> at this moment the data from applier 2 is actually queued
> for write as valid but we just wrote the term 3, so if we would
> had been updating terms map earlier (before jornal write) the data
> from applier 2 should be NOPified. I think there is some problem
> because due to journal write lag the data is not as it could be
> if terms map updated early. Serge, Vlad, am I right? Which consequences
> can be here? IOW, should not we track terms earlier even without
> my filter series?
Looks like a bug, indeed.

We may either introduce a limbo latch or start tracking terms before the
WAL write.

I'm starting to like the idea of a limbo latch more.

It's come up a couple of times already on various occasions, so maybe
it's time to finally implement it.
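
For reference, the latch variant could stay quite small. A rough sketch
only (assumptions: a hypothetical promote_latch field in struct
txn_limbo initialized in txn_limbo_create(), the existing
latch_lock()/latch_unlock() API, a txn_limbo_filter() helper as
suggested above, and a hypothetical write_synchro_entry() wrapper
around the journal write):

/*
 * Hypothetical: struct latch promote_latch added to struct txn_limbo
 * and initialized with latch_create() in txn_limbo_create().
 */
static int
apply_synchro_row(uint32_t replica_id, struct xrow_header *row)
{
	struct synchro_request req;
	if (xrow_decode_synchro(row, &req) != 0)
		return -1;
	/*
	 * Serialize filter + WAL write + terms update, so a concurrent
	 * applier can neither sneak an outdated request past the check
	 * nor observe a term that has not reached the WAL yet.
	 */
	latch_lock(&txn_limbo.promote_latch);
	int rc = -1;
	if (txn_limbo_filter(&txn_limbo, &req) == 0 &&
	    write_synchro_entry(replica_id, &req) == 0) {
		/* The WAL write succeeded; do the usual post-WAL processing. */
		txn_limbo_process(&txn_limbo, &req);
		rc = 0;
	}
	latch_unlock(&txn_limbo.promote_latch);
	return rc;
}

The obvious cost is holding the latch across a WAL write, which seems
acceptable as long as such requests stay rare on the hot path.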

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 16:49         ` Serge Petrenko via Tarantool-patches
@ 2021-07-12 17:04           ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12 21:20             ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12 17:04 UTC (permalink / raw)
  To: Serge Petrenko; +Cc: Vladislav Shpilevoy, tml

On Mon, Jul 12, 2021 at 07:49:49PM +0300, Serge Petrenko wrote:
> > 
> > at this moment the data from applier 2 is actually queued
> > for write as valid but we just wrote the term 3, so if we would
> > had been updating terms map earlier (before jornal write) the data
> > from applier 2 should be NOPified. I think there is some problem
> > because due to journal write lag the data is not as it could be
> > if terms map updated early. Serge, Vlad, am I right? Which consequences
> > can be here? IOW, should not we track terms earlier even without
> > my filter series?
> Looks like a bug, indeed.
> 
> We may either introduce a limbo latch or start tracking terms before the
> WAL write.
> 
> I'm starting to like the idea with limbo latch more.
> 
> It's come up a couple of times already for various occasions, so maybe it's
> time to finally implement it.

Yeah, I think this fits fully into my filter+apply stages,
and I'll address the latching there then. Thanks!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:09       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12 22:32           ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-12 21:20 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 12.07.2021 10:09, Cyrill Gorcunov via Tarantool-patches wrote:
> On Mon, Jul 12, 2021 at 10:03:32AM +0200, Vladislav Shpilevoy wrote:
>>>
>>> Now imagine the following: we validated the incoming packet and
>>> remember its term in promote_term_map, then we start writting
>>> this packet into our WAL and write procedure failed.
>>
>> That is the core problem of your entire approach - why do you imagine
>> we update promote_term_map before writing to WAL? We do not do that.
> 
> We don't do that now, and this is all this patch about - we must not
> write the PROMOTE from split-brained node into our wal file at all,
> for exactly this reason I splitted processing into phases: "filter" and
> "application".

I see I can't explain it in English. Yes, we don't do this now.
Why do you start doing it in your patch? Nothing should change
before the WAL write. Neither the vclock with the terms, nor anything
else. The limbo must stay const until the WAL write is finished.

I see that you did split the processing, but what you actually made is
not filter and application. You made apply part 1 and apply part 2. Both
of your stages modify the limbo. Why? Filter should not change anything
at all. Only check, and reject with an error if something is wrong.

>> And you should not do that. Before WAL write there should no be any
>> changes. **Zero changes before WAL write**. Before you write to WAL,
>> you can only validate requests. Using `const struct txn_limbo *`. With
>> zero changes.
> 
> If we write screwed PROMOTEs request into our WAL then this patch is not
> needed at all.

I didn't say we should write bad PROMOTEs. I am saying that you can
reject them with an error without making any changes to the limbo state
before the WAL write.

As I understand it, you are trying to protect against several PROMOTEs
arriving at the same time, so that while one is being written the others
are rejected right away? That is what the applier does for regular data -
it advances the cluster vclock before the WAL write and copies it into
replicaset.vclock if the write succeeded. So if the same transactions
arrive from other instances during the write, they are filtered out
immediately.

But the limbo has no such requirement that everything must be done in
parallel. For any "parallel" promotes, confirms, and other limbo state
changes it is better to simply take a lock on the limbo and be done with
it. These things should not arrive in parallel from different instances
during normal operation, so the performance of the most frequent case
won't suffer at all, while the code will be simplified considerably.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12  8:12       ` Cyrill Gorcunov via Tarantool-patches
  2021-07-12  8:23         ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12 22:34           ` Cyrill Gorcunov via Tarantool-patches
  1 sibling, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-12 21:20 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

On 12.07.2021 10:12, Cyrill Gorcunov wrote:
> On Mon, Jul 12, 2021 at 10:04:56AM +0200, Vladislav Shpilevoy wrote:
>>>
>>> And if we filter before the WAL write, we need the second vclock, which
>>> Cyrill has introduced.
>>
>> Why do you need a second vclock? Why can't you just filter by the
>> existing vclock and update it after WAL write like now?
> 
> Because the phases are no longer atomic. We can pass "filter" stage,
> update our terms,

The complexity of your patch comes from this very sentence - at the
filter stage you for some reason update the terms. It is not a filter
anymore if you start changing something in the limbo before the WAL
write. I am trying to get an answer from you on why it is done this way,
when it is possible to change nothing before the WAL write and still
filter properly. Filter for real, without changing anything. I wrote
about the latch in the other letter.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 17:04           ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-12 21:20             ` Vladislav Shpilevoy via Tarantool-patches
  2021-07-12 21:52               ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-12 21:20 UTC (permalink / raw)
  To: Cyrill Gorcunov, Serge Petrenko; +Cc: tml

On 12.07.2021 19:04, Cyrill Gorcunov wrote:
> On Mon, Jul 12, 2021 at 07:49:49PM +0300, Serge Petrenko wrote:
>>>
>>> at this moment the data from applier 2 is actually queued
>>> for write as valid but we just wrote the term 3, so if we would
>>> had been updating terms map earlier (before jornal write) the data
>>> from applier 2 should be NOPified. I think there is some problem
>>> because due to journal write lag the data is not as it could be
>>> if terms map updated early. Serge, Vlad, am I right? Which consequences
>>> can be here? IOW, should not we track terms earlier even without
>>> my filter series?
>> Looks like a bug, indeed.
>>
>> We may either introduce a limbo latch or start tracking terms before the
>> WAL write.
>>
>> I'm starting to like the idea with limbo latch more.
>>
>> It's come up a couple of times already for various occasions, so maybe it's
>> time to finally implement it.
> 
> Yeah, I think this fully incorporates into the my filter+apply stages
> and I'll address the latching there then. Thanks!

Exactly, yes. The latch should both make your patch 200 times simpler
and fix this bug.

Unless, of course, I am missing some problem hidden in the latch.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 21:20             ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12 21:52               ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12 21:52 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jul 12, 2021 at 11:20:21PM +0200, Vladislav Shpilevoy wrote:
> >>
> >> It's come up a couple of times already for various occasions, so maybe it's
> >> time to finally implement it.
> > 
> > Yeah, I think this fits fully into my filter+apply stages,
> > and I'll address the latching there then. Thanks!
> 
> Exactly, yes. The latch should both make your patch 200 times simpler
> and fix this bug.
> 
> Unless, of course, I am missing some hidden problem with the latch.

Yes, to start with I want to make a common latch and check that this
works, but in general the lock should be taken only when we know the
request will change the max term; then the delays will be minimal.
Anyway, I will keep thinking, and once it comes alive I will show a draft.
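
For illustration, a rough sketch of that conditional locking idea -
serialize only the requests which can actually move the greatest known
term. Apart from latch_lock()/latch_unlock() and
iproto_type_is_promote_request() every name here is a placeholder, and
whether skipping the latch for the other requests is safe is exactly
what still has to be checked:

/*
 * Sketch only: take the limbo latch just for requests which can move
 * the greatest known term. limbo->latch, limbo->greatest_term and
 * limbo_do_process() are hypothetical placeholders.
 */
static int
limbo_process_maybe_locked(struct txn_limbo *limbo,
			   const struct synchro_request *req)
{
	bool may_move_term = iproto_type_is_promote_request(req->type) &&
			     req->term > limbo->greatest_term;
	if (!may_move_term)
		return limbo_do_process(limbo, req);
	/* Hold the latch across the whole WAL write of the PROMOTE. */
	latch_lock(&limbo->latch);
	int rc = limbo_do_process(limbo, req);
	latch_unlock(&limbo->latch);
	return rc;
}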

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12 22:32           ` Cyrill Gorcunov via Tarantool-patches
  2021-07-13 19:32             ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12 22:32 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jul 12, 2021 at 11:20:17PM +0200, Vladislav Shpilevoy wrote:
> >>
> >> That is the core problem of your entire approach - why do you imagine
> >> we update promote_term_map before writing to WAL? We do not do that.
> > 
> > We don't do that now, and this is what this patch is all about - we must
> > not write a PROMOTE from a split-brained node into our WAL file at all;
> > for exactly this reason I split processing into phases: "filter" and
> > "application".
> 
> I see; I cannot explain it in English. Yes, we do not do that now. Why
> do you start doing it in your patch? Nothing should change before the
> WAL write. Neither the vclock with the terms nor anything else. The
> limbo must stay const until the WAL write is finished.
> 
> I see that you split processing, but what you actually made is not
> filter and application. You made apply part 1 and apply part 2. Both of
> your stages modify the limbo. Why? The filter must not change anything
> at all. Only check and reject with an error if something is wrong.

Because it is already impossible to bail out with an error from inside
the WAL write. Look, this is how the filtering works now (with my patch):

static int
limbo_op_filter(struct txn_limbo *limbo, const struct synchro_request *req)
{
	struct txn_limbo_promote *pmt = &limbo->promote;
	uint32_t origin = req->origin_id;
	uint64_t term = req->term;

	if (txn_limbo_replica_term(limbo, origin) < term) {
		vclock_follow(&pmt->terms_infly, origin, term);
		if (term > pmt->term_max)
			pmt->term_max = term;
	} else if (iproto_type_is_promote_request(req->type) &&
		   pmt->term_max > 1) {
		say_info("RAFT: declining %s request from instance "
			 "id %u for term %llu. Greatest term seen "
			 "before (%llu) is bigger.",
			 iproto_type_name(req->type), origin,
			 (long long)term,
			 (long long)pmt->term_max);
		diag_set(ClientError, ER_UNSUPPORTED, "RAFT",
			 "backward terms");
-->		return -1;
	}

	return 0;
}

This is done before the WAL write in order to catch PROMOTEs arriving
simultaneously from different replicas: let only the first valid one
through and reject the rest with an error. So you are right below, this
is a protection.

If I do not catch such data before the WAL write, the following happens:
a packet arrives which will modify txn_limbo_replica_term, it goes off
to the WAL write and we switch to another applier; from it comes a
PROMOTE which in theory we must not let through. But we do let it
through and also push it into the WAL write queue, simply because the
first WAL write has not completed yet. So there is a dependency on
timing: if the write completes quickly, the filtering catches the bad
PROMOTE; if not, it does not. As a result a PROMOTE which should not be
there can end up in the WAL. Right now we do not care about this
situation, because we simply ignore such things. But since we want to
fail with an error, we have to fail at an early stage, even before the
data reaches the WAL. Otherwise we have to be prepared for the WAL to
contain some PROMOTEs which must be ignored on restart.
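
A condensed sketch of that ordering, just to make the window explicit.
journal_write_blocking() is a hypothetical stand-in for the real
asynchronous journal machinery; txn_limbo_replica_term() and
txn_limbo_process() are the real names, everything else is simplified:

/*
 * Condensed sketch of the race: the term check runs before the WAL
 * write, but the term vclock only moves after it, so two appliers can
 * both pass the check for conflicting PROMOTEs.
 */
static int
apply_promote_sketch(struct txn_limbo *limbo, struct synchro_request *req)
{
	/* Filter against a vclock which is advanced only after WAL write. */
	if (txn_limbo_replica_term(limbo, req->origin_id) >= req->term)
		return -1;
	/*
	 * Yields until the write completes. Another applier fiber can run
	 * the same check against the still unchanged vclock and also queue
	 * its PROMOTE for the WAL.
	 */
	if (journal_write_blocking(req) != 0)
		return -1;
	/* Only now the term vclock catches up. */
	txn_limbo_process(limbo, req);
	return 0;
}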

> 
> >> And you should not do that. Before the WAL write there should not be any
> >> changes. **Zero changes before WAL write**. Before you write to WAL,
> >> you can only validate requests. Using `const struct txn_limbo *`. With
> >> zero changes.
> > 
> > If we write screwed PROMOTE requests into our WAL then this patch is not
> > needed at all.
> 
> I did not say that bad PROMOTEs should be written. I am saying that you
> can reject them with an error without making any changes to the limbo
> state before the WAL write.

Well, it looks like I cannot (not without locks, anyway).

> As I understand it, you are trying to somehow protect against several
> PROMOTEs arriving at the same time, so that while one is being written
> the rest must be rejected right away? That is what the applier does for
> regular data - it advances the cluster vclock before the WAL write and
> copies it into replicaset.vclock if the write succeeds. So if the same
> transactions arrive from other instances during the write, they are
> filtered out immediately.
> 
> But the limbo does not have the problem of everything having to be done
> in parallel. For any "parallel" promotes, confirms, and other limbo state
> changes it is better to simply take a lock on it, and that is all. These
> things should not arrive in parallel from different instances during
> normal operation, so this will not hurt the performance of the most
> common case at all, while it will simplify the code significantly.

Well, technically yes, we can take a lock/latch for the modifications of
the promote state and the term vclocks. That removes the need for the
in-fly modifications I made for the filter phase. I actually wanted to
avoid locks, but let's indeed try it with locks. If there suddenly is a
severe performance drop, we will go back to the scheme with the in-fly
variables updated after the WAL write. I will not delete the old branch
for now.
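
A minimal sketch of what the lock variant could look like, assuming the
latch lives right in struct txn_limbo; the field name promote_latch and
the helper txn_limbo_process_locked() are assumptions for the sketch,
not the actual patch:

/*
 * Minimal sketch: one latch serializes filter + WAL write + apply for
 * synchro requests, so the in-fly term vclock is no longer needed.
 * promote_latch and txn_limbo_process_locked() are hypothetical names.
 */
int
txn_limbo_process(struct txn_limbo *limbo, const struct synchro_request *req)
{
	latch_lock(&limbo->promote_latch);
	int rc = txn_limbo_process_locked(limbo, req);
	latch_unlock(&limbo->promote_latch);
	return rc;
}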

	Cyrill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-07-12 22:34           ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-07-12 22:34 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tml

On Mon, Jul 12, 2021 at 11:20:19PM +0200, Vladislav Shpilevoy wrote:
> > Because the phases are no longer atomic. We can pass "filter" stage,
> > update our terms,
> 
> The complexity of your patch stems from this very sentence - at the
> filter stage you update terms for some reason. It is no longer a filter
> if you start changing something in the limbo before the WAL write. I am
> trying to get an answer from you on why it is done this way,

I replied at https://lists.tarantool.org/tarantool-patches/YOzDFflt+MqiOjBG@grain/

> if it is possible to change nothing before the WAL write and still
> filter just fine. Filter for real, without modifying anything. I wrote
> about the latch in the other letter.
> 

	Cyrill

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Tarantool-patches] [PATCH] limbo: introduce request processing hooks
  2021-07-12 22:32           ` Cyrill Gorcunov via Tarantool-patches
@ 2021-07-13 19:32             ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 21+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-07-13 19:32 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tml

>>>> And you should not do that. Before the WAL write there should not be any
>>>> changes. **Zero changes before WAL write**. Before you write to WAL,
>>>> you can only validate requests. Using `const struct txn_limbo *`. With
>>>> zero changes.
>>>
>>> If we write screwed PROMOTE requests into our WAL then this patch is not
>>> needed at all.
>>
>> I did not say that bad PROMOTEs should be written. I am saying that you
>> can reject them with an error without making any changes to the limbo
>> state before the WAL write.
> 
> Well, it looks like I cannot (not without locks, anyway).
> 
>> As I understand it, you are trying to somehow protect against several
>> PROMOTEs arriving at the same time, so that while one is being written
>> the rest must be rejected right away? That is what the applier does for
>> regular data - it advances the cluster vclock before the WAL write and
>> copies it into replicaset.vclock if the write succeeds. So if the same
>> transactions arrive from other instances during the write, they are
>> filtered out immediately.
>>
>> But the limbo does not have the problem of everything having to be done
>> in parallel. For any "parallel" promotes, confirms, and other limbo state
>> changes it is better to simply take a lock on it, and that is all. These
>> things should not arrive in parallel from different instances during
>> normal operation, so this will not hurt the performance of the most
>> common case at all, while it will simplify the code significantly.
> 
> Well, technically yes, we can take a lock/latch for the modifications of
> the promote state and the term vclocks. That removes the need for the
> in-fly modifications I made for the filter phase. I actually wanted to
> avoid locks, but let's indeed try it with locks. If there suddenly is a
> severe performance drop, we will go back to the scheme with the in-fly
> variables updated after the WAL write. I will not delete the old branch
> for now.

Right now stability is above everything. If the patch with locks is much
simpler, we should go with it. Otherwise we will spend another month on
review and bug hunting, and we do not have that much time.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-07-13 19:32 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-10 22:28 [Tarantool-patches] [PATCH] limbo: introduce request processing hooks Cyrill Gorcunov via Tarantool-patches
2021-07-11 14:00 ` Vladislav Shpilevoy via Tarantool-patches
2021-07-11 18:22   ` Cyrill Gorcunov via Tarantool-patches
2021-07-12  8:03     ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12  8:09       ` Cyrill Gorcunov via Tarantool-patches
2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12 22:32           ` Cyrill Gorcunov via Tarantool-patches
2021-07-13 19:32             ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12  8:01   ` Serge Petrenko via Tarantool-patches
2021-07-12  8:04     ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12  8:12       ` Cyrill Gorcunov via Tarantool-patches
2021-07-12  8:23         ` Cyrill Gorcunov via Tarantool-patches
2021-07-12 21:20         ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12 22:34           ` Cyrill Gorcunov via Tarantool-patches
2021-07-12  9:43     ` Cyrill Gorcunov via Tarantool-patches
2021-07-12 15:48       ` Cyrill Gorcunov via Tarantool-patches
2021-07-12 16:49         ` Serge Petrenko via Tarantool-patches
2021-07-12 17:04           ` Cyrill Gorcunov via Tarantool-patches
2021-07-12 21:20             ` Vladislav Shpilevoy via Tarantool-patches
2021-07-12 21:52               ` Cyrill Gorcunov via Tarantool-patches
2021-07-12  7:54 ` Serge Petrenko via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox