[Tarantool-patches] [PATCH 00/13] Applier rollback reason

Tarantool development patches archive
 help / color / mirror / Atom feed

* [Tarantool-patches] [PATCH 00/13] Applier rollback reason
@ 2021-06-11 21:56 Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 01/13] error: introduce ER_CASCADE_ROLLBACK Vladislav Shpilevoy via Tarantool-patches
                   ` (14 more replies)
  0 siblings, 15 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6027-rollback-reason
Issue: https://github.com/tarantool/tarantool/issues/6027

Vladislav Shpilevoy (13):
  error: introduce ER_CASCADE_ROLLBACK
  test: remove replica-applier-rollback.lua
  journal: make journal_write() set diag on error
  wal: refactor wal_write_to_disk()
  diag: introduce diag_set_detailed()
  wal: encapsulate ER_WAL_IO
  txn: change limbo rollback check in the trigger
  journal: introduce proper error codes
  txn: assert after WAL write that txn is not done
  txn: install proper diag errors on txn fail
  wal: introduce JOURNAL_ENTRY_ERR_CASCADE
  txn: introduce TXN_SIGNATURE_ABORT
  txn: stop TXN_SIGNATURE_ABORT override

 .../unreleased/gh-6027-applier-lost-error.md  |  7 ++
 src/box/applier.cc                            | 18 ++--
 src/box/box.cc                                |  8 +-
 src/box/errcode.h                             |  3 +-
 src/box/journal.c                             | 16 ++++
 src/box/journal.h                             | 30 ++++++-
 src/box/memtx_engine.c                        |  2 +-
 src/box/raft.c                                | 10 ++-
 src/box/txn.c                                 | 87 +++++++++++++------
 src/box/txn.h                                 | 38 +++++++-
 src/box/txn_limbo.c                           | 37 ++++----
 src/box/vy_scheduler.c                        |  2 +-
 src/box/wal.c                                 | 50 ++++++-----
 src/lib/core/diag.h                           |  9 +-
 test/box/errinj.result                        |  2 +-
 test/box/error.result                         |  3 +-
 .../gh-4730-applier-rollback.result           |  2 +-
 .../gh-4730-applier-rollback.test.lua         |  2 +-
 .../gh-6027-applier-error-show.result         | 82 +++++++++++++++++
 .../gh-6027-applier-error-show.test.lua       | 31 +++++++
 test/replication/replica-applier-rollback.lua | 16 ----
 test/replication/suite.cfg                    |  1 +
 test/replication/suite.ini                    |  2 +-
 23 files changed, 348 insertions(+), 110 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6027-applier-lost-error.md
 create mode 100644 test/replication/gh-6027-applier-error-show.result
 create mode 100644 test/replication/gh-6027-applier-error-show.test.lua
 delete mode 100644 test/replication/replica-applier-rollback.lua

-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 01/13] error: introduce ER_CASCADE_ROLLBACK
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 10/13] txn: install proper diag errors on txn fail Vladislav Shpilevoy via Tarantool-patches
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

It was called ER_CHECKPOINT_ROLLBACK but was set only when there
is a cascade rollback in WAL.

The new error name is going to be used in the next patches, where
not only checkpoint can fail due to a cascade rollback.

Part of #6027
---
 src/box/errcode.h      | 2 +-
 src/box/wal.c          | 8 ++++----
 test/box/errinj.result | 2 +-
 test/box/error.result  | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/box/errcode.h b/src/box/errcode.h
index d93820e96..b143e91d9 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -186,7 +186,7 @@ struct errcode_record {
 	/*131 */_(ER_INVALID_INDEX_FILE,	"Invalid INDEX file %s: %s") \
 	/*132 */_(ER_INVALID_RUN_FILE,		"Invalid RUN file: %s") \
 	/*133 */_(ER_INVALID_VYLOG_FILE,	"Invalid VYLOG file: %s") \
-	/*134 */_(ER_CHECKPOINT_ROLLBACK,	"Can't start a checkpoint while in cascading rollback") \
+	/*134 */_(ER_CASCADE_ROLLBACK,		"WAL has a rollback in progress") \
 	/*135 */_(ER_VY_QUOTA_TIMEOUT,		"Timed out waiting for Vinyl memory quota") \
 	/*136 */_(ER_PARTIAL_KEY,		"%s index  does not support selects via a partial key (expected %u parts, got %u). Please Consider changing index type to TREE.") \
 	/*137 */_(ER_TRUNCATE_SYSTEM_SPACE,	"Can't truncate a system space, space '%s'") \
diff --git a/src/box/wal.c b/src/box/wal.c
index 63ff5584a..5c52142ef 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -606,7 +606,7 @@ wal_sync_f(struct cbus_call_msg *data)
 	struct wal_writer *writer = &wal_writer_singleton;
 	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set(ClientError, ER_CASCADE_ROLLBACK);
 		return -1;
 	}
 	vclock_copy(&msg->vclock, &writer->vclock);
@@ -629,7 +629,7 @@ wal_sync(struct vclock *vclock)
 	}
 	if (!stailq_empty(&writer->rollback)) {
 		/* We're rolling back a failed write. */
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set(ClientError, ER_CASCADE_ROLLBACK);
 		return -1;
 	}
 	bool cancellable = fiber_set_cancellable(false);
@@ -653,7 +653,7 @@ wal_begin_checkpoint_f(struct cbus_call_msg *data)
 		 * can't make a checkpoint - see the comment
 		 * in wal_begin_checkpoint() for the explanation.
 		 */
-		diag_set(ClientError, ER_CHECKPOINT_ROLLBACK);
+		diag_set(ClientError, ER_CASCADE_ROLLBACK);
 		return -1;
 	}
 	/*
@@ -691,7 +691,7 @@ wal_begin_checkpoint(struct wal_checkpoint *checkpoint)
 		 * the snapshot. So we abort checkpointing in this
 		 * case.
 		 */
-		diag_set(ClientError, ER_CHECKPOINT_ROLLBACK);
+		diag_set(ClientError, ER_CASCADE_ROLLBACK);
 		return -1;
 	}
 	bool cancellable = fiber_set_cancellable(false);
diff --git a/test/box/errinj.result b/test/box/errinj.result
index 44ecafc40..43daf5f0f 100644
--- a/test/box/errinj.result
+++ b/test/box/errinj.result
@@ -898,7 +898,7 @@ box.schema.user.revoke('guest', 'read', 'space', '_space')
 ...
 run()
 ---
-- error: Can't start a checkpoint while in cascading rollback
+- error: WAL has a rollback in progress
 ...
 ch:get()
 ---
diff --git a/test/box/error.result b/test/box/error.result
index cc8cbaaa9..6ef099153 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -353,7 +353,7 @@ t;
  |   131: box.error.INVALID_INDEX_FILE
  |   132: box.error.INVALID_RUN_FILE
  |   133: box.error.INVALID_VYLOG_FILE
- |   134: box.error.CHECKPOINT_ROLLBACK
+ |   134: box.error.CASCADE_ROLLBACK
  |   135: box.error.VY_QUOTA_TIMEOUT
  |   136: box.error.PARTIAL_KEY
  |   137: box.error.TRUNCATE_SYSTEM_SPACE
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 10/13] txn: install proper diag errors on txn fail
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 01/13] error: introduce ER_CASCADE_ROLLBACK Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 11/13] wal: introduce JOURNAL_ENTRY_ERR_CASCADE Vladislav Shpilevoy via Tarantool-patches
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Previously all journal and txn errors were turned into ER_WAL_IO
error code. It led to loss of the real error, which sometimes was
absolutely not related to IO. For example, a timeout in the limbo
for a synchronous transaction.

The patch makes journal/txn errors turn into proper diags.

Part of #6027
---
 src/box/errcode.h     |  1 +
 src/box/journal.c     | 10 +++++++---
 src/box/txn.c         | 17 ++++++++++++++++-
 test/box/error.result |  1 +
 4 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/box/errcode.h b/src/box/errcode.h
index b143e91d9..49aec4bf6 100644
--- a/src/box/errcode.h
+++ b/src/box/errcode.h
@@ -277,6 +277,7 @@ struct errcode_record {
 	/*222 */_(ER_QUORUM_WAIT,		"Couldn't wait for quorum %d: %s") \
 	/*223 */_(ER_INTERFERING_PROMOTE,	"Instance with replica id %u was promoted first") \
 	/*224 */_(ER_RAFT_DISABLED,		"Elections were turned off while running box.ctl.promote()")\
+	/*225 */_(ER_TXN_ROLLBACK,		"Transaction was rolled back") \
 
 /*
  * !IMPORTANT! Please follow instructions at start of the file
diff --git a/src/box/journal.c b/src/box/journal.c
index 35c99c1c6..32c3e4bd7 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -45,9 +45,13 @@ struct journal_queue journal_queue = {
 void
 diag_set_journal_res_detailed(const char *file, unsigned line, int64_t res)
 {
-	assert(res < 0 && res != JOURNAL_ENTRY_ERR_UNKNOWN);
-	(void)res;
-	diag_set_detailed(file, line, ClientError, ER_WAL_IO);
+	switch(res) {
+	case JOURNAL_ENTRY_ERR_IO:
+		diag_set_detailed(file, line, ClientError, ER_WAL_IO);
+		return;
+	}
+	panic("Journal result code %lld can't be converted to an error "
+	      "at %s:%u", (long long)res, file, line);
 }
 
 struct journal_entry *
diff --git a/src/box/txn.c b/src/box/txn.c
index 1c1eb15bc..b3819b8f9 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -251,7 +251,22 @@ txn_free(struct txn *txn)
 void
 diag_set_txn_sign_detailed(const char *file, unsigned line, int64_t signature)
 {
-	return diag_set_journal_res_detailed(file, line, signature);
+	if (signature >= JOURNAL_ENTRY_ERR_MIN)
+		return diag_set_journal_res_detailed(file, line, signature);
+	switch(signature) {
+	case TXN_SIGNATURE_ROLLBACK:
+		diag_set_detailed(file, line, ClientError, ER_TXN_ROLLBACK);
+		return;
+	case TXN_SIGNATURE_QUORUM_TIMEOUT:
+		diag_set_detailed(file, line, ClientError,
+				  ER_SYNC_QUORUM_TIMEOUT);
+		return;
+	case TXN_SIGNATURE_SYNC_ROLLBACK:
+		diag_set_detailed(file, line, ClientError, ER_SYNC_ROLLBACK);
+		return;
+	}
+	panic("Transaction signature %lld can't be converted to an error "
+	      "at %s:%u", (long long)signature, file, line);
 }
 
 struct txn *
diff --git a/test/box/error.result b/test/box/error.result
index 6ef099153..062a90399 100644
--- a/test/box/error.result
+++ b/test/box/error.result
@@ -443,6 +443,7 @@ t;
  |   222: box.error.QUORUM_WAIT
  |   223: box.error.INTERFERING_PROMOTE
  |   224: box.error.RAFT_DISABLED
+ |   225: box.error.TXN_ROLLBACK
  | ...
 
 test_run:cmd("setopt delimiter ''");
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 11/13] wal: introduce JOURNAL_ENTRY_ERR_CASCADE
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 01/13] error: introduce ER_CASCADE_ROLLBACK Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 10/13] txn: install proper diag errors on txn fail Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 12/13] txn: introduce TXN_SIGNATURE_ABORT Vladislav Shpilevoy via Tarantool-patches
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

A transaction in WAL thread could be rolled back not only due to
an IO error. But also if there was a cascading rollback in
progress.

The patch makes such case use a special error code turned into its
own diag when it reaches the TX thread. Usage of ER_WAL_IO wasn't
correct here.

Part of #6027
---
 src/box/journal.c |  3 +++
 src/box/journal.h |  5 +++++
 src/box/txn.h     |  1 +
 src/box/wal.c     | 17 ++++++++++++++---
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/box/journal.c b/src/box/journal.c
index 32c3e4bd7..7de149080 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -49,6 +49,9 @@ diag_set_journal_res_detailed(const char *file, unsigned line, int64_t res)
 	case JOURNAL_ENTRY_ERR_IO:
 		diag_set_detailed(file, line, ClientError, ER_WAL_IO);
 		return;
+	case JOURNAL_ENTRY_ERR_CASCADE:
+		diag_set_detailed(file, line, ClientError, ER_CASCADE_ROLLBACK);
+		return;
 	}
 	panic("Journal result code %lld can't be converted to an error "
 	      "at %s:%u", (long long)res, file, line);
diff --git a/src/box/journal.h b/src/box/journal.h
index 18767176e..857245779 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -49,6 +49,11 @@ enum {
 	JOURNAL_ENTRY_ERR_UNKNOWN = -1,
 	/** Tried to be written, but something happened related to IO. */
 	JOURNAL_ENTRY_ERR_IO = -2,
+	/**
+	 * Rollback because there is a not finished rollback of a previous
+	 * entry.
+	 */
+	JOURNAL_ENTRY_ERR_CASCADE = -3,
 	/**
 	 * Anchor for the structs built on top of journal entry so as they
 	 * could introduce their own unique errors. Set to a big value in
diff --git a/src/box/txn.h b/src/box/txn.h
index 037865ac6..7638854a7 100644
--- a/src/box/txn.h
+++ b/src/box/txn.h
@@ -110,6 +110,7 @@ enum {
 	 */
 	TXN_SIGNATURE_UNKNOWN = JOURNAL_ENTRY_ERR_UNKNOWN,
 	TXN_SIGNATURE_IO = JOURNAL_ENTRY_ERR_IO,
+	TXN_SIGNATURE_CASCADE = JOURNAL_ENTRY_ERR_CASCADE,
 	/**
 	 * The default signature value for failed transactions.
 	 * Indicates either write failure or any other failure
diff --git a/src/box/wal.c b/src/box/wal.c
index f59cc9113..8543c4a08 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1036,6 +1036,7 @@ wal_write_to_disk(struct cmsg *msg)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_msg *wal_msg = (struct wal_msg *) msg;
+	int err_code = JOURNAL_ENTRY_ERR_UNKNOWN;
 	struct stailq_entry *last_committed = NULL;
 	struct journal_entry *entry;
 	struct error *error;
@@ -1059,16 +1060,19 @@ wal_write_to_disk(struct cmsg *msg)
 
 	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
+		err_code = JOURNAL_ENTRY_ERR_CASCADE;
 		goto done;
 	}
 
 	/* Xlog is only rotated between queue processing  */
 	if (wal_opt_rotate(writer) != 0) {
+		err_code = JOURNAL_ENTRY_ERR_IO;
 		goto done;
 	}
 
 	/* Ensure there's enough disk space before writing anything. */
 	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
+		err_code = JOURNAL_ENTRY_ERR_IO;
 		goto done;
 	}
 
@@ -1104,8 +1108,10 @@ wal_write_to_disk(struct cmsg *msg)
 		entry->res = vclock_sum(&vclock_diff) +
 			     vclock_sum(&writer->vclock);
 		rc = xlog_write_entry(l, entry);
-		if (rc < 0)
+		if (rc < 0) {
+			err_code = JOURNAL_ENTRY_ERR_IO;
 			goto done;
+		}
 		if (rc > 0) {
 			writer->checkpoint_wal_size += rc;
 			last_committed = &entry->fifo;
@@ -1114,8 +1120,10 @@ wal_write_to_disk(struct cmsg *msg)
 		/* rc == 0: the write is buffered in xlog_tx */
 	}
 	rc = xlog_flush(l);
-	if (rc < 0)
+	if (rc < 0) {
+		err_code= JOURNAL_ENTRY_ERR_IO;
 		goto done;
+	}
 
 	writer->checkpoint_wal_size += rc;
 	last_committed = stailq_last(&wal_msg->commit);
@@ -1167,12 +1175,15 @@ done:
 	stailq_cut_tail(&wal_msg->commit, last_committed, &rollback);
 
 	if (!stailq_empty(&rollback)) {
+		assert(err_code != JOURNAL_ENTRY_ERR_UNKNOWN);
 		/* Update status of the successfully committed requests. */
 		stailq_foreach_entry(entry, &rollback, fifo)
-			entry->res = JOURNAL_ENTRY_ERR_IO;
+			entry->res = err_code;
 		/* Rollback unprocessed requests */
 		stailq_concat(&wal_msg->rollback, &rollback);
 		wal_begin_rollback();
+	} else {
+		assert(err_code == JOURNAL_ENTRY_ERR_UNKNOWN);
 	}
 	fiber_gc();
 	wal_notify_watchers(writer, WAL_EVENT_WRITE);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 12/13] txn: introduce TXN_SIGNATURE_ABORT
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (2 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 11/13] wal: introduce JOURNAL_ENTRY_ERR_CASCADE Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override Vladislav Shpilevoy via Tarantool-patches
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Sometimes a transaction can fail before it goes to WAL. Then the
signature didn't have any sign of it, as well as the journal_entry
result (which might be not even created yet).

Still if txn_commit/try_async() are called, they invoke
on_rollback triggers. The triggers only can see
TXN_SIGNATURE_ROLLBACK and can't distinguish it from a real
rollback like box.rollback().

Due to that some important errors like a transaction manager
conflict or OOM are lost.

The patch introduces a new error signature TXN_SIGNATURE_ABORT
which says the transaction didn't manage to try going to WAL and
for an error need to look at the global diag.

The next patch is going to stop overriding it with
TXN_SIGNATURE_ROLLBACK.

Part of #6027
---
 src/box/applier.cc     |  4 ++--
 src/box/box.cc         |  8 ++++----
 src/box/memtx_engine.c |  2 +-
 src/box/txn.c          | 26 +++++++++++++++++++++-----
 src/box/txn.h          | 14 ++++++++++++++
 src/box/vy_scheduler.c |  2 +-
 6 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 3fd71393d..10cea26a7 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -261,7 +261,7 @@ apply_snapshot_row(struct xrow_header *row)
 rollback_stmt:
 	txn_rollback_stmt(txn);
 rollback:
-	txn_rollback(txn);
+	txn_abort(txn);
 	fiber_gc();
 	return -1;
 }
@@ -942,7 +942,7 @@ apply_plain_tx(struct stailq *rows, bool skip_conflict, bool use_triggers)
 
 	return txn_commit_try_async(txn);
 fail:
-	txn_rollback(txn);
+	txn_abort(txn);
 	return -1;
 }
 
diff --git a/src/box/box.cc b/src/box/box.cc
index 6fca337bc..750262c04 100644
--- a/src/box/box.cc
+++ b/src/box/box.cc
@@ -251,7 +251,7 @@ box_process_rw(struct request *request, struct space *space,
 
 rollback:
 	if (is_autocommit) {
-		txn_rollback(txn);
+		txn_abort(txn);
 		fiber_gc();
 	}
 error:
@@ -391,7 +391,7 @@ wal_stream_abort(struct wal_stream *stream)
 {
 	struct txn *tx = in_txn();
 	if (tx != NULL)
-		txn_rollback(tx);
+		txn_abort(tx);
 	stream->tsn = 0;
 	fiber_gc();
 }
@@ -3220,9 +3220,9 @@ local_recovery(const struct tt_uuid *instance_uuid,
 	engine_begin_final_recovery_xc();
 	recover_remaining_wals(recovery, &wal_stream.base, NULL, false);
 	if (wal_stream_has_tx(&wal_stream)) {
-		wal_stream_abort(&wal_stream);
 		diag_set(XlogError, "found a not finished transaction "
 			 "in the log");
+		wal_stream_abort(&wal_stream);
 		if (!is_force_recovery)
 			diag_raise();
 		diag_log();
@@ -3247,9 +3247,9 @@ local_recovery(const struct tt_uuid *instance_uuid,
 		recovery_stop_local(recovery);
 		recover_remaining_wals(recovery, &wal_stream.base, NULL, true);
 		if (wal_stream_has_tx(&wal_stream)) {
-			wal_stream_abort(&wal_stream);
 			diag_set(XlogError, "found a not finished transaction "
 				 "in the log in hot standby mode");
+			wal_stream_abort(&wal_stream);
 			if (!is_force_recovery)
 				diag_raise();
 			diag_log();
diff --git a/src/box/memtx_engine.c b/src/box/memtx_engine.c
index 6c4982b9f..c662a3c8c 100644
--- a/src/box/memtx_engine.c
+++ b/src/box/memtx_engine.c
@@ -277,7 +277,7 @@ memtx_engine_recover_snapshot_row(struct memtx_engine *memtx,
 rollback_stmt:
 	txn_rollback_stmt(txn);
 rollback:
-	txn_rollback(txn);
+	txn_abort(txn);
 	fiber_gc();
 	return -1;
 }
diff --git a/src/box/txn.c b/src/box/txn.c
index b3819b8f9..5cae7b41d 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -264,6 +264,10 @@ diag_set_txn_sign_detailed(const char *file, unsigned line, int64_t signature)
 	case TXN_SIGNATURE_SYNC_ROLLBACK:
 		diag_set_detailed(file, line, ClientError, ER_SYNC_ROLLBACK);
 		return;
+	case TXN_SIGNATURE_ABORT:
+		if (diag_is_empty(diag_get()))
+			panic("Tried to get an absent transaction error");
+		return;
 	}
 	panic("Transaction signature %lld can't be converted to an error "
 	      "at %s:%u", (long long)signature, file, line);
@@ -870,20 +874,20 @@ txn_commit_try_async(struct txn *txn)
 
 rollback:
 	assert(txn->fiber == NULL);
-	txn_rollback(txn);
+	txn_abort(txn);
 	return -1;
 }
 
 int
 txn_commit(struct txn *txn)
 {
-	struct journal_entry *req;
+	struct journal_entry *req = NULL;
 	struct txn_limbo_entry *limbo_entry = NULL;
 
 	txn->fiber = fiber();
 
 	if (txn_prepare(txn) != 0)
-		goto rollback;
+		goto rollback_abort;
 
 	if (txn_commit_nop(txn)) {
 		txn_free(txn);
@@ -892,7 +896,7 @@ txn_commit(struct txn *txn)
 
 	req = txn_journal_entry_new(txn);
 	if (req == NULL)
-		goto rollback;
+		goto rollback_abort;
 	/*
 	 * Do not cache the flag value in a variable. The flag might be deleted
 	 * during WAL write. This can happen for async transactions created
@@ -914,7 +918,7 @@ txn_commit(struct txn *txn)
 		 */
 		limbo_entry = txn_limbo_append(&txn_limbo, origin_id, txn);
 		if (limbo_entry == NULL)
-			goto rollback;
+			goto rollback_abort;
 	}
 
 	fiber_set_txn(fiber(), NULL);
@@ -951,7 +955,10 @@ rollback_io:
 	diag_log();
 	if (txn_has_flag(txn, TXN_WAIT_SYNC))
 		txn_limbo_abort(&txn_limbo, limbo_entry);
+rollback_abort:
+	txn->signature = TXN_SIGNATURE_ABORT;
 rollback:
+	assert(txn->signature != TXN_SIGNATURE_UNKNOWN);
 	assert(txn->fiber != NULL);
 	if (!txn_has_flag(txn, TXN_IS_DONE)) {
 		fiber_set_txn(fiber(), txn);
@@ -985,6 +992,15 @@ txn_rollback(struct txn *txn)
 	fiber_set_txn(fiber(), NULL);
 }
 
+void
+txn_abort(struct txn *txn)
+{
+	assert(!diag_is_empty(diag_get()));
+	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
+	txn->signature = TXN_SIGNATURE_ABORT;
+	txn_rollback(txn);
+}
+
 int
 txn_check_singlestatement(struct txn *txn, const char *where)
 {
diff --git a/src/box/txn.h b/src/box/txn.h
index 7638854a7..8741dc6a1 100644
--- a/src/box/txn.h
+++ b/src/box/txn.h
@@ -129,6 +129,11 @@ enum {
 	 * read.
 	 */
 	TXN_SIGNATURE_SYNC_ROLLBACK = JOURNAL_ENTRY_ERR_MIN - 3,
+	/**
+	 * Aborted before it could be written due an error which is already
+	 * installed into the global diag.
+	 */
+	TXN_SIGNATURE_ABORT = JOURNAL_ENTRY_ERR_MIN - 4,
 };
 
 /**
@@ -491,6 +496,15 @@ txn_commit(struct txn *txn);
 void
 txn_rollback(struct txn *txn);
 
+/**
+ * Rollback a transaction due to an error which is already installed into the
+ * global diag. This is preferable over the plain rollback when there are
+ * already triggers installed and they might need to know the exact reason for
+ * the rollback.
+ */
+void
+txn_abort(struct txn *txn);
+
 /**
  * Submit a transaction to the journal.
  * @pre txn == in_txn()
diff --git a/src/box/vy_scheduler.c b/src/box/vy_scheduler.c
index 917b75f93..b3e4120cd 100644
--- a/src/box/vy_scheduler.c
+++ b/src/box/vy_scheduler.c
@@ -920,7 +920,7 @@ vy_deferred_delete_batch_process_f(struct cmsg *cmsg)
 	return;
 
 fail_rollback:
-	txn_rollback(txn);
+	txn_abort(txn);
 	fiber_gc();
 fail:
 	batch->is_failed = true;
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (3 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 12/13] txn: introduce TXN_SIGNATURE_ABORT Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15 13:44   ` Serge Petrenko via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 02/13] test: remove replica-applier-rollback.lua Vladislav Shpilevoy via Tarantool-patches
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

When txn_commit/try_async() failed before going to WAL thread,
they installed TXN_SIGNATURE_ABORT signature meaning that the
caller and the rollback triggers must look at the global diag.

But they called txn_rollback() before doing return and calling the
triggers, which overrode the signature with TXN_SIGNATURE_ROLLBACK
leading to the original error loss.

The patch makes TXN_SIGNATURE_ROLLBACK installed only when a real
rollback happens (via box_txn_rollback()).

This makes the original commit errors like a conflict in the
transaction manager and OOM not lost.

Besides, ERRINJ_TXN_COMMIT_ASYNC does not need its own diag_log()
anymore. Because since this commit the applier logs the correct
error instead of ER_WAL_IO/ER_TXN_ROLLBACK.

Closes #6027
---
 .../unreleased/gh-6027-applier-lost-error.md  |  7 ++
 src/box/txn.c                                 | 10 ++-
 .../gh-6027-applier-error-show.result         | 82 +++++++++++++++++++
 .../gh-6027-applier-error-show.test.lua       | 31 +++++++
 test/replication/suite.cfg                    |  1 +
 test/replication/suite.ini                    |  2 +-
 6 files changed, 129 insertions(+), 4 deletions(-)
 create mode 100644 changelogs/unreleased/gh-6027-applier-lost-error.md
 create mode 100644 test/replication/gh-6027-applier-error-show.result
 create mode 100644 test/replication/gh-6027-applier-error-show.test.lua

diff --git a/changelogs/unreleased/gh-6027-applier-lost-error.md b/changelogs/unreleased/gh-6027-applier-lost-error.md
new file mode 100644
index 000000000..9c765b8e2
--- /dev/null
+++ b/changelogs/unreleased/gh-6027-applier-lost-error.md
@@ -0,0 +1,7 @@
+## bugfix/replication
+
+* When an error happened during appliance of a transaction received from a
+  remote instance via replication, it was always reported as "Failed to write
+  to disk" regardless of what really happened. Now the correct error is shown.
+  For example, "Out of memory", or "Transaction has been aborted by conflict",
+  and so on (gh-6027).
diff --git a/src/box/txn.c b/src/box/txn.c
index 5cae7b41d..c2734c237 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -805,7 +805,6 @@ txn_commit_try_async(struct txn *txn)
 		 * Log it for the testing sake: we grep
 		 * output to mark this event.
 		 */
-		diag_log();
 		goto rollback;
 	});
 
@@ -983,11 +982,11 @@ void
 txn_rollback(struct txn *txn)
 {
 	assert(txn == in_txn());
+	assert(txn->signature != TXN_SIGNATURE_UNKNOWN);
 	txn->status = TXN_ABORTED;
 	trigger_clear(&txn->fiber_on_stop);
 	if (!txn_has_flag(txn, TXN_CAN_YIELD))
 		trigger_clear(&txn->fiber_on_yield);
-	txn->signature = TXN_SIGNATURE_ROLLBACK;
 	txn_complete_fail(txn);
 	fiber_set_txn(fiber(), NULL);
 }
@@ -1086,6 +1085,8 @@ box_txn_rollback(void)
 		diag_set(ClientError, ER_ROLLBACK_IN_SUB_STMT);
 		return -1;
 	}
+	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
+	txn->signature = TXN_SIGNATURE_ROLLBACK;
 	txn_rollback(txn); /* doesn't throw */
 	fiber_gc();
 	return 0;
@@ -1221,7 +1222,10 @@ txn_on_stop(struct trigger *trigger, void *event)
 {
 	(void) trigger;
 	(void) event;
-	txn_rollback(in_txn());                 /* doesn't yield or fail */
+	struct txn *txn = in_txn();
+	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
+	txn->signature = TXN_SIGNATURE_ROLLBACK;
+	txn_rollback(txn);
 	fiber_gc();
 	return 0;
 }
diff --git a/test/replication/gh-6027-applier-error-show.result b/test/replication/gh-6027-applier-error-show.result
new file mode 100644
index 000000000..c3a01ab50
--- /dev/null
+++ b/test/replication/gh-6027-applier-error-show.result
@@ -0,0 +1,82 @@
+-- test-run result file version 2
+test_run = require('test_run').new()
+ | ---
+ | ...
+
+--
+-- gh-6027: on attempt to a commit transaction its original error was lost.
+--
+
+box.schema.user.grant('guest', 'super')
+ | ---
+ | ...
+s = box.schema.create_space('test')
+ | ---
+ | ...
+_ = s:create_index('pk')
+ | ---
+ | ...
+
+test_run:cmd('create server replica with rpl_master=default, '..                \
+             'script="replication/replica.lua"')
+ | ---
+ | - true
+ | ...
+test_run:cmd('start server replica')
+ | ---
+ | - true
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', true)
+ | ---
+ | - ok
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+_ = s:replace{1}
+ | ---
+ | ...
+
+test_run:switch('replica')
+ | ---
+ | - true
+ | ...
+test_run:wait_upstream(1, {status = 'stopped'})
+ | ---
+ | - true
+ | ...
+-- Should be something about error injection.
+box.info.replication[1].upstream.message
+ | ---
+ | - Error injection 'txn commit async injection'
+ | ...
+
+test_run:switch('default')
+ | ---
+ | - true
+ | ...
+test_run:cmd('stop server replica')
+ | ---
+ | - true
+ | ...
+test_run:cmd('delete server replica')
+ | ---
+ | - true
+ | ...
+box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', false)
+ | ---
+ | - ok
+ | ...
+s:drop()
+ | ---
+ | ...
+box.schema.user.revoke('guest', 'super')
+ | ---
+ | ...
diff --git a/test/replication/gh-6027-applier-error-show.test.lua b/test/replication/gh-6027-applier-error-show.test.lua
new file mode 100644
index 000000000..8e17cdfa9
--- /dev/null
+++ b/test/replication/gh-6027-applier-error-show.test.lua
@@ -0,0 +1,31 @@
+test_run = require('test_run').new()
+
+--
+-- gh-6027: on attempt to a commit transaction its original error was lost.
+--
+
+box.schema.user.grant('guest', 'super')
+s = box.schema.create_space('test')
+_ = s:create_index('pk')
+
+test_run:cmd('create server replica with rpl_master=default, '..                \
+             'script="replication/replica.lua"')
+test_run:cmd('start server replica')
+
+test_run:switch('replica')
+box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', true)
+
+test_run:switch('default')
+_ = s:replace{1}
+
+test_run:switch('replica')
+test_run:wait_upstream(1, {status = 'stopped'})
+-- Should be something about error injection.
+box.info.replication[1].upstream.message
+
+test_run:switch('default')
+test_run:cmd('stop server replica')
+test_run:cmd('delete server replica')
+box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', false)
+s:drop()
+box.schema.user.revoke('guest', 'super')
diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
index 9eccf9e19..3a0a8649f 100644
--- a/test/replication/suite.cfg
+++ b/test/replication/suite.cfg
@@ -45,6 +45,7 @@
     "gh-5536-wal-limit.test.lua": {},
     "gh-5566-final-join-synchro.test.lua": {},
     "gh-5613-bootstrap-prefer-booted.test.lua": {},
+    "gh-6027-applier-error-show.test.lua": {},
     "gh-6032-promote-wal-write.test.lua": {},
     "gh-6057-qsync-confirm-async-no-wal.test.lua": {},
     "gh-6094-rs-uuid-mismatch.test.lua": {},
diff --git a/test/replication/suite.ini b/test/replication/suite.ini
index 80e968d56..ccf3559df 100644
--- a/test/replication/suite.ini
+++ b/test/replication/suite.ini
@@ -3,7 +3,7 @@ core = tarantool
 script =  master.lua
 description = tarantool/box, replication
 disabled = consistent.test.lua
-release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua
+release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua
 config = suite.cfg
 lua_libs = lua/fast_replica.lua lua/rlimit.lua
 use_unix_sockets = True
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 02/13] test: remove replica-applier-rollback.lua
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (4 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 03/13] journal: make journal_write() set diag on error Vladislav Shpilevoy via Tarantool-patches
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

The script name was too long. It was also used as a name for the
unix socket file on which the replica listens. As a result, the
test couldn't start, at least on my machine.

Besides, the script was not any different from the existing
replica.lua, except a couple of not important settings.

The patch drops it and makes gh-4730-applier-rollback.test.lua use
replica.lua. Now it can run on my machine.

Done as a preparation for #6027, which is slightly related to the
test - it is also about errors in applier and their display.
---
 test/replication/gh-4730-applier-rollback.result |  2 +-
 .../gh-4730-applier-rollback.test.lua            |  2 +-
 test/replication/replica-applier-rollback.lua    | 16 ----------------
 3 files changed, 2 insertions(+), 18 deletions(-)
 delete mode 100644 test/replication/replica-applier-rollback.lua

diff --git a/test/replication/gh-4730-applier-rollback.result b/test/replication/gh-4730-applier-rollback.result
index 26a0eb6fa..bd9530074 100644
--- a/test/replication/gh-4730-applier-rollback.result
+++ b/test/replication/gh-4730-applier-rollback.result
@@ -19,7 +19,7 @@ box.schema.user.grant('guest', 'replication')
 --
 -- Create replica instance, we're the master and
 -- start it, no data to sync yet though
-test_run:cmd("create server replica_slave with rpl_master=default, script='replication/replica-applier-rollback.lua'")
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/replica.lua'")
  | ---
  | - true
  | ...
diff --git a/test/replication/gh-4730-applier-rollback.test.lua b/test/replication/gh-4730-applier-rollback.test.lua
index de7a740de..a010debb4 100644
--- a/test/replication/gh-4730-applier-rollback.test.lua
+++ b/test/replication/gh-4730-applier-rollback.test.lua
@@ -12,7 +12,7 @@ box.schema.user.grant('guest', 'replication')
 --
 -- Create replica instance, we're the master and
 -- start it, no data to sync yet though
-test_run:cmd("create server replica_slave with rpl_master=default, script='replication/replica-applier-rollback.lua'")
+test_run:cmd("create server replica_slave with rpl_master=default, script='replication/replica.lua'")
 test_run:cmd("start server replica_slave")
 
 --
diff --git a/test/replication/replica-applier-rollback.lua b/test/replication/replica-applier-rollback.lua
deleted file mode 100644
index 26fb10055..000000000
--- a/test/replication/replica-applier-rollback.lua
+++ /dev/null
@@ -1,16 +0,0 @@
---
--- vim: ts=4 sw=4 et
---
-
-print('arg', arg)
-
-box.cfg({
-    replication                 = os.getenv("MASTER"),
-    listen                      = os.getenv("LISTEN"),
-    memtx_memory                = 107374182,
-    replication_timeout         = 0.1,
-    replication_connect_timeout = 0.5,
-    read_only                   = true,
-})
-
-require('console').listen(os.getenv('ADMIN'))
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 03/13] journal: make journal_write() set diag on error
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (5 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 02/13] test: remove replica-applier-rollback.lua Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk() Vladislav Shpilevoy via Tarantool-patches
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

It used to simply return -1 and set a diag only when OOM happened
inside.

The caller was forced either to ignore the result or set its own
diag regardless of what really happened.

The patch makes journal_write() set a correct diag error when it
returns -1. The only implementation to change was
wal_write_async(). The other implementations always return 0.

Part of #6027
---
 src/box/applier.cc  |  4 +++-
 src/box/raft.c      |  8 +++++---
 src/box/txn.c       | 14 ++++++++------
 src/box/txn_limbo.c | 30 +++++++++++++++++-------------
 src/box/wal.c       |  2 ++
 5 files changed, 35 insertions(+), 23 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 33181fdbf..60d648795 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -835,7 +835,9 @@ apply_synchro_row(struct xrow_header *row)
 	 * before trying to commit. But that requires extra steps from the
 	 * transactions side, including the async ones.
 	 */
-	if (journal_write(&entry.base) != 0 || entry.base.res < 0) {
+	if (journal_write(&entry.base) != 0)
+		goto err;
+	if (entry.base.res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
 		goto err;
 	}
diff --git a/src/box/raft.c b/src/box/raft.c
index 6b52c9876..55dee4cb1 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -307,17 +307,19 @@ box_raft_write(struct raft *raft, const struct raft_msg *msg)
 	 * follows this pattern of 'protection'.
 	 */
 	bool cancellable = fiber_set_cancellable(false);
-	bool ok = (journal_write(entry) == 0 && entry->res >= 0);
+	bool is_err = journal_write(entry) != 0;
 	fiber_set_cancellable(cancellable);
-	if (!ok) {
+	if (is_err)
+		goto fail;
+	if (entry->res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
-		diag_log();
 		goto fail;
 	}
 
 	region_truncate(region, svp);
 	return;
 fail:
+	diag_log();
 	/*
 	 * XXX: the stub is supposed to be removed once it is defined what to do
 	 * when a raft request WAL write fails.
diff --git a/src/box/txn.c b/src/box/txn.c
index 966dfafdf..761630939 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -847,7 +847,6 @@ txn_commit_try_async(struct txn *txn)
 	fiber_set_txn(fiber(), NULL);
 	if (journal_write_try_async(req) != 0) {
 		fiber_set_txn(fiber(), txn);
-		diag_set(ClientError, ER_WAL_IO);
 		diag_log();
 		goto rollback;
 	}
@@ -904,12 +903,11 @@ txn_commit(struct txn *txn)
 	}
 
 	fiber_set_txn(fiber(), NULL);
-	if (journal_write(req) != 0 || req->res < 0) {
-		if (txn_has_flag(txn, TXN_WAIT_SYNC))
-			txn_limbo_abort(&txn_limbo, limbo_entry);
+	if (journal_write(req) != 0)
+		goto rollback_io;
+	if (req->res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
-		diag_log();
-		goto rollback;
+		goto rollback_io;
 	}
 	if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
 		if (txn_has_flag(txn, TXN_WAIT_ACK)) {
@@ -934,6 +932,10 @@ txn_commit(struct txn *txn)
 	txn_free(txn);
 	return 0;
 
+rollback_io:
+	diag_log();
+	if (txn_has_flag(txn, TXN_WAIT_SYNC))
+		txn_limbo_abort(&txn_limbo, limbo_entry);
 rollback:
 	assert(txn->fiber != NULL);
 	if (!txn_has_flag(txn, TXN_IS_DONE)) {
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index dae6d2df4..83b86387c 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -335,21 +335,25 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
 	journal_entry_create(entry, 1, xrow_approx_len(&row),
 			     journal_entry_fiber_wakeup_cb, fiber());
 
-	if (journal_write(entry) != 0 || entry->res < 0) {
+	if (journal_write(entry) != 0)
+		goto fail;
+	if (entry->res < 0) {
 		diag_set(ClientError, ER_WAL_IO);
-		diag_log();
-		/*
-		 * XXX: the stub is supposed to be removed once it is defined
-		 * what to do when a synchro request WAL write fails. One of
-		 * the possible solutions: log the error, keep the limbo
-		 * queue as is and probably put in rollback mode. Then
-		 * provide a hook to call manually when WAL problems are fixed.
-		 * Or retry automatically with some period.
-		 */
-		panic("Could not write a synchro request to WAL: "
-		      "lsn = %lld, type = %s\n", (long long)lsn,
-		      iproto_type_name(type));
+		goto fail;
 	}
+	return;
+fail:
+	diag_log();
+	/*
+	 * XXX: the stub is supposed to be removed once it is defined what to do
+	 * when a synchro request WAL write fails. One of the possible
+	 * solutions: log the error, keep the limbo queue as is and probably put
+	 * in rollback mode. Then provide a hook to call manually when WAL
+	 * problems are fixed. Or retry automatically with some period.
+	 */
+	panic("Could not write a synchro request to WAL: lsn = %lld, "
+	      "type = %s\n", (long long)lsn, iproto_type_name(type));
+
 }
 
 /**
diff --git a/src/box/wal.c b/src/box/wal.c
index 5c52142ef..25edbace6 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1244,6 +1244,7 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
 	struct wal_writer *writer = (struct wal_writer *) journal;
 
 	ERROR_INJECT(ERRINJ_WAL_IO, {
+		diag_set(ClientError, ER_WAL_IO);
 		goto fail;
 	});
 
@@ -1258,6 +1259,7 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
 		say_error("Aborting transaction %lld during "
 			  "cascading rollback",
 			  (long long)vclock_sum(&writer->vclock));
+		diag_set(ClientError, ER_CASCADE_ROLLBACK);
 		goto fail;
 	}
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk()
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (6 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 03/13] journal: make journal_write() set diag on error Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15 20:46   ` Cyrill Gorcunov via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 05/13] diag: introduce diag_set_detailed() Vladislav Shpilevoy via Tarantool-patches
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

It didn't have a single fail path. That led to some amount of code
duplication, and it complicated future patches where the journal
entries are going to get a proper error reason instead of default
-1 without any details.

The patch is a preparation for #6027 where it is wanted to have
more detailed errors on journal entry/transaction fail instead
of ER_WAL_IO for everything. Sometimes it can override a real
error like a cascade rollback, or a transaction conflict.

Part of #6027
---
 src/box/wal.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/src/box/wal.c b/src/box/wal.c
index 25edbace6..40382e791 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -863,10 +863,8 @@ wal_opt_rotate(struct wal_writer *writer)
 		return 0;
 
 	if (xdir_create_xlog(&writer->wal_dir, &writer->current_wal,
-			     &writer->vclock) != 0) {
-		diag_log();
+			     &writer->vclock) != 0)
 		return -1;
-	}
 	/*
 	 * Keep track of the new WAL vclock. Required for garbage
 	 * collection, see wal_collect_garbage().
@@ -1038,7 +1036,10 @@ wal_write_to_disk(struct cmsg *msg)
 {
 	struct wal_writer *writer = &wal_writer_singleton;
 	struct wal_msg *wal_msg = (struct wal_msg *) msg;
+	struct stailq_entry *last_committed = NULL;
+	struct journal_entry *entry;
 	struct error *error;
+	assert(!stailq_empty(&wal_msg->commit));
 
 	/*
 	 * Track all vclock changes made by this batch into
@@ -1058,23 +1059,17 @@ wal_write_to_disk(struct cmsg *msg)
 
 	if (writer->is_in_rollback) {
 		/* We're rolling back a failed write. */
-		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
-		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return;
+		goto done;
 	}
 
 	/* Xlog is only rotated between queue processing  */
 	if (wal_opt_rotate(writer) != 0) {
-		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
-		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_begin_rollback();
+		goto done;
 	}
 
 	/* Ensure there's enough disk space before writing anything. */
 	if (wal_fallocate(writer, wal_msg->approx_len) != 0) {
-		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
-		vclock_copy(&wal_msg->vclock, &writer->vclock);
-		return wal_begin_rollback();
+		goto done;
 	}
 
 	/*
@@ -1104,8 +1099,6 @@ wal_write_to_disk(struct cmsg *msg)
 	 * Iterate over requests (transactions)
 	 */
 	int rc;
-	struct journal_entry *entry;
-	struct stailq_entry *last_committed = NULL;
 	stailq_foreach_entry(entry, &wal_msg->commit, fifo) {
 		wal_assign_lsn(&vclock_diff, &writer->vclock, entry);
 		entry->res = vclock_sum(&vclock_diff) +
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 05/13] diag: introduce diag_set_detailed()
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (7 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk() Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 06/13] wal: encapsulate ER_WAL_IO Vladislav Shpilevoy via Tarantool-patches
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

diag_set() uses the current file and line as built-in macros.
In the future patches there are going to appear a couple of new
diag_set-like helpers which also would want to preserve the
original file and line.

For that they must be macros at least partially, like diag_set(),
and pass their own file and line. Because they are going not to be
very trivial and won't be implemented in the header.

The patch introduces diag_set_detailed() which allows to pass
custom file and line.

Needed for #6027
---
 src/lib/core/diag.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/lib/core/diag.h b/src/lib/core/diag.h
index b07eea838..fa85326ad 100644
--- a/src/lib/core/diag.h
+++ b/src/lib/core/diag.h
@@ -347,17 +347,20 @@ struct error *
 BuildSocketError(const char *file, unsigned line, const char *socketname,
 		 const char *format, ...);
 
-#define diag_set(class, ...) do {					\
+#define diag_set_detailed(file, line, class, ...) do {			\
 	/* Preserve the original errno. */                              \
 	int save_errno = errno;                                         \
-	say_debug("%s at %s:%i", #class, __FILE__, __LINE__);		\
+	say_debug("%s at %s:%i", #class, file, line);			\
 	struct error *e;						\
-	e = Build##class(__FILE__, __LINE__, ##__VA_ARGS__);		\
+	e = Build##class(file, line, ##__VA_ARGS__);			\
 	diag_set_error(diag_get(), e);					\
 	/* Restore the errno which might have been reset.  */           \
 	errno = save_errno;                                             \
 } while (0)
 
+#define diag_set(...)							\
+	diag_set_detailed(__FILE__, __LINE__, __VA_ARGS__)
+
 #define diag_add(class, ...) do {					\
 	int save_errno = errno;						\
 	say_debug("%s at %s:%i", #class, __FILE__, __LINE__);		\
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 06/13] wal: encapsulate ER_WAL_IO
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (8 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 05/13] diag: introduce diag_set_detailed() Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 07/13] txn: change limbo rollback check in the trigger Vladislav Shpilevoy via Tarantool-patches
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

ER_WAL_IO is set on any WAL error if it was after journal_write()
success. It is not correct, because there can be plenty of
reasons.

In WAL it could be an actual IO error or a cascading rollback in
progress. When used for transactions, it could be an error related
to synchronous transactions like a timeout, or a persistent
ROLLBACK.

These errors are overridden by ER_WAL_IO. The patch encapsulates
the diag installation for bad journal write and for transaction
rollback.

The next patches are going to introduce more error codes and use
proper ones to install a diag.

Part of #6027
---
 src/box/applier.cc  | 10 +++++-----
 src/box/journal.c   |  8 ++++++++
 src/box/journal.h   | 10 ++++++++++
 src/box/raft.c      |  2 +-
 src/box/txn.c       |  8 +++++++-
 src/box/txn.h       | 10 ++++++++++
 src/box/txn_limbo.c |  2 +-
 7 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/src/box/applier.cc b/src/box/applier.cc
index 60d648795..3fd71393d 100644
--- a/src/box/applier.cc
+++ b/src/box/applier.cc
@@ -711,7 +711,7 @@ applier_read_tx(struct applier *applier, struct stailq *rows, double timeout)
 }
 
 static void
-applier_rollback_by_wal_io(void)
+applier_rollback_by_wal_io(int64_t signature)
 {
 	/*
 	 * Setup shared applier diagnostic area.
@@ -725,7 +725,7 @@ applier_rollback_by_wal_io(void)
 	 * rollback may happen a way later after it was passed to
 	 * the journal engine.
 	 */
-	diag_set(ClientError, ER_WAL_IO);
+	diag_set_txn_sign(signature);
 	diag_set_error(&replicaset.applier.diag,
 		       diag_last_error(diag_get()));
 
@@ -747,7 +747,7 @@ applier_txn_rollback_cb(struct trigger *trigger, void *event)
 	 * special handling.
 	 */
 	if (txn->signature != TXN_SIGNATURE_SYNC_ROLLBACK)
-		applier_rollback_by_wal_io();
+		applier_rollback_by_wal_io(txn->signature);
 	return 0;
 }
 
@@ -787,7 +787,7 @@ apply_synchro_row_cb(struct journal_entry *entry)
 	struct synchro_entry *synchro_entry =
 		(struct synchro_entry *)entry->complete_data;
 	if (entry->res < 0) {
-		applier_rollback_by_wal_io();
+		applier_rollback_by_wal_io(entry->res);
 	} else {
 		txn_limbo_process(&txn_limbo, synchro_entry->req);
 		trigger_run(&replicaset.applier.on_wal_write, NULL);
@@ -838,7 +838,7 @@ apply_synchro_row(struct xrow_header *row)
 	if (journal_write(&entry.base) != 0)
 		goto err;
 	if (entry.base.res < 0) {
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set_journal_res(entry.base.res);
 		goto err;
 	}
 	return 0;
diff --git a/src/box/journal.c b/src/box/journal.c
index df491610a..0a1e9932a 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -31,6 +31,7 @@
 #include "journal.h"
 #include <small/region.h>
 #include <diag.h>
+#include "error.h"
 
 struct journal *current_journal = NULL;
 
@@ -41,6 +42,13 @@ struct journal_queue journal_queue = {
 	.waiter_count = 0,
 };
 
+void
+diag_set_journal_res_detailed(const char *file, unsigned line, int64_t res)
+{
+	(void)res;
+	diag_set_detailed(file, line, ClientError, ER_WAL_IO);
+}
+
 struct journal_entry *
 journal_entry_new(size_t n_rows, struct region *region,
 		  journal_write_async_f write_async_cb,
diff --git a/src/box/journal.h b/src/box/journal.h
index 4ab7e8afb..01ea60f72 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -44,6 +44,16 @@ struct journal_entry;
 
 typedef void (*journal_write_async_f)(struct journal_entry *entry);
 
+/**
+ * Convert a result of a journal entry write to an error installed into the
+ * current diag.
+ */
+void
+diag_set_journal_res_detailed(const char *file, unsigned line, int64_t res);
+
+#define diag_set_journal_res(res)						\
+	diag_set_journal_res_detailed(__FILE__, __LINE__, res)
+
 /**
  * An entry for an abstract journal.
  * Simply put, a write ahead log request.
diff --git a/src/box/raft.c b/src/box/raft.c
index 55dee4cb1..7f787c0c5 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -312,7 +312,7 @@ box_raft_write(struct raft *raft, const struct raft_msg *msg)
 	if (is_err)
 		goto fail;
 	if (entry->res < 0) {
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set_journal_res(entry->res);
 		goto fail;
 	}
 
diff --git a/src/box/txn.c b/src/box/txn.c
index 761630939..ac11127d3 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -248,6 +248,12 @@ txn_free(struct txn *txn)
 	stailq_add(&txn_cache, &txn->in_txn_cache);
 }
 
+void
+diag_set_txn_sign_detailed(const char *file, unsigned line, int64_t signature)
+{
+	return diag_set_journal_res_detailed(file, line, signature);
+}
+
 struct txn *
 txn_begin(void)
 {
@@ -906,7 +912,7 @@ txn_commit(struct txn *txn)
 	if (journal_write(req) != 0)
 		goto rollback_io;
 	if (req->res < 0) {
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set_journal_res(req->res);
 		goto rollback_io;
 	}
 	if (txn_has_flag(txn, TXN_WAIT_SYNC)) {
diff --git a/src/box/txn.h b/src/box/txn.h
index a06aaea23..d51761bc9 100644
--- a/src/box/txn.h
+++ b/src/box/txn.h
@@ -123,6 +123,16 @@ enum {
 	TXN_SIGNATURE_SYNC_ROLLBACK = -3,
 };
 
+/**
+ * Convert a result of a transaction execution to an error installed into the
+ * current diag.
+ */
+void
+diag_set_txn_sign_detailed(const char *file, unsigned line, int64_t signature);
+
+#define diag_set_txn_sign(signature)						\
+	diag_set_txn_sign_detailed(__FILE__, __LINE__, signature)
+
 /**
  * Status of a transaction.
  */
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index 83b86387c..b03c71514 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -338,7 +338,7 @@ txn_limbo_write_synchro(struct txn_limbo *limbo, uint16_t type, int64_t lsn,
 	if (journal_write(entry) != 0)
 		goto fail;
 	if (entry->res < 0) {
-		diag_set(ClientError, ER_WAL_IO);
+		diag_set_journal_res(entry->res);
 		goto fail;
 	}
 	return;
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 07/13] txn: change limbo rollback check in the trigger
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (9 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 06/13] wal: encapsulate ER_WAL_IO Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 08/13] journal: introduce proper error codes Vladislav Shpilevoy via Tarantool-patches
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

A transaction on rollback used to check if it was already rolled
back inside of the limbo by looking at its signature as

    signature != TXN_SIGNATURE_ROLLBACK

It meant the transaction is already completed.
TXN_SIGNATURE_ROLLBACK was used as a default value of the
signature. Therefore if it is not default, it is completed.

This is going to break if normal (not synchronous) transactions
would have more rollback codes except just TXN_SIGNATURE_ROLLBACK.
Also treatment of TXN_SIGNATURE_ROLLBACK as a default value looks
confusing.

Next patches are going to rework the codes and render the
assumptions above incorrect.

This patch makes the transaction use a correct way to check
whether it is in the limbo still - look at TXN_WAIT_SYNC flag. It
is set for all txns in the limbo and is not set for all the
others.

Part of #6027
---
 src/box/txn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/box/txn.c b/src/box/txn.c
index ac11127d3..2c889a31a 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -773,7 +773,7 @@ txn_limbo_on_rollback(struct trigger *trig, void *event)
 	(void) event;
 	struct txn *txn = (struct txn *) event;
 	/* Check whether limbo has performed the cleanup. */
-	if (txn->signature != TXN_SIGNATURE_ROLLBACK)
+	if (!txn_has_flag(txn, TXN_WAIT_SYNC))
 		return 0;
 	struct txn_limbo_entry *entry = (struct txn_limbo_entry *) trig->data;
 	txn_limbo_abort(&txn_limbo, entry);
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 08/13] journal: introduce proper error codes
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (10 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 07/13] txn: change limbo rollback check in the trigger Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 09/13] txn: assert after WAL write that txn is not done Vladislav Shpilevoy via Tarantool-patches
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Journal used to have only one error code in journal_entry.res:
-1. It had at least 2 problems:

- There was an assumption that TXN_SIGNATURE_ROLLBACK is the same
  as journal_entry error = -1;

- It wasn't possible to tell if the entry tried to be written and
  failed, or it didn't try yet. Both looked as -1.

The patch introduces a new error code JOURNAL_ENTRY_ERR_UNKNOWN.
The IO error now has its own value: JOURNAL_ENTRY_ERR_IO.

This helps to ensure that a not finished journal entry or a
transaction won't try to obtain a diag error for their result.

Part of #6027
---
 src/box/journal.c   |  1 +
 src/box/journal.h   | 15 ++++++++++++++-
 src/box/txn.c       |  6 ++++--
 src/box/txn.h       | 13 ++++++++++---
 src/box/txn_limbo.c |  5 +++--
 src/box/wal.c       |  4 ++--
 6 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/src/box/journal.c b/src/box/journal.c
index 0a1e9932a..35c99c1c6 100644
--- a/src/box/journal.c
+++ b/src/box/journal.c
@@ -45,6 +45,7 @@ struct journal_queue journal_queue = {
 void
 diag_set_journal_res_detailed(const char *file, unsigned line, int64_t res)
 {
+	assert(res < 0 && res != JOURNAL_ENTRY_ERR_UNKNOWN);
 	(void)res;
 	diag_set_detailed(file, line, ClientError, ER_WAL_IO);
 }
diff --git a/src/box/journal.h b/src/box/journal.h
index 01ea60f72..18767176e 100644
--- a/src/box/journal.h
+++ b/src/box/journal.h
@@ -44,6 +44,19 @@ struct journal_entry;
 
 typedef void (*journal_write_async_f)(struct journal_entry *entry);
 
+enum {
+	/** Entry didn't attempt a journal write. */
+	JOURNAL_ENTRY_ERR_UNKNOWN = -1,
+	/** Tried to be written, but something happened related to IO. */
+	JOURNAL_ENTRY_ERR_IO = -2,
+	/**
+	 * Anchor for the structs built on top of journal entry so as they
+	 * could introduce their own unique errors. Set to a big value in
+	 * advance.
+	 */
+	JOURNAL_ENTRY_ERR_MIN = -100,
+};
+
 /**
  * Convert a result of a journal entry write to an error installed into the
  * current diag.
@@ -108,7 +121,7 @@ journal_entry_create(struct journal_entry *entry, size_t n_rows,
 	entry->complete_data	= complete_data;
 	entry->approx_len	= approx_len;
 	entry->n_rows		= n_rows;
-	entry->res		= -1;
+	entry->res		= JOURNAL_ENTRY_ERR_UNKNOWN;
 	entry->flags		= 0;
 }
 
diff --git a/src/box/txn.c b/src/box/txn.c
index 2c889a31a..c9c2e93ff 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -276,7 +276,7 @@ txn_begin(void)
 	txn->psn = 0;
 	txn->rv_psn = 0;
 	txn->status = TXN_INPROGRESS;
-	txn->signature = TXN_SIGNATURE_ROLLBACK;
+	txn->signature = TXN_SIGNATURE_UNKNOWN;
 	txn->engine = NULL;
 	txn->engine_tx = NULL;
 	txn->fk_deferred_count = 0;
@@ -479,6 +479,7 @@ txn_complete_fail(struct txn *txn)
 {
 	assert(!txn_has_flag(txn, TXN_IS_DONE));
 	assert(txn->signature < 0);
+	assert(txn->signature != TXN_SIGNATURE_UNKNOWN);
 	txn->status = TXN_ABORTED;
 	struct txn_stmt *stmt;
 	stailq_reverse(&txn->stmts);
@@ -535,7 +536,8 @@ txn_on_journal_write(struct journal_entry *entry)
 	 * txn_limbo has already rolled the tx back, so we just
 	 * have to free it.
 	 */
-	if (txn->signature < TXN_SIGNATURE_ROLLBACK) {
+	if (txn->signature != TXN_SIGNATURE_UNKNOWN) {
+		assert(txn->signature < 0);
 		txn_free(txn);
 		return;
 	}
diff --git a/src/box/txn.h b/src/box/txn.h
index d51761bc9..037865ac6 100644
--- a/src/box/txn.h
+++ b/src/box/txn.h
@@ -36,6 +36,7 @@
 #include "trigger.h"
 #include "fiber.h"
 #include "space.h"
+#include "journal.h"
 
 #if defined(__cplusplus)
 extern "C" {
@@ -103,24 +104,30 @@ enum {
 enum {
 	/** Signature set for empty transactions. */
 	TXN_SIGNATURE_NOP = 0,
+	/**
+	 * Aliases for journal errors to make all signature codes have the same
+	 * prefix.
+	 */
+	TXN_SIGNATURE_UNKNOWN = JOURNAL_ENTRY_ERR_UNKNOWN,
+	TXN_SIGNATURE_IO = JOURNAL_ENTRY_ERR_IO,
 	/**
 	 * The default signature value for failed transactions.
 	 * Indicates either write failure or any other failure
 	 * not caused by synchronous transaction processing.
 	 */
-	TXN_SIGNATURE_ROLLBACK = -1,
+	TXN_SIGNATURE_ROLLBACK = JOURNAL_ENTRY_ERR_MIN - 1,
 	/**
 	 * A value set for failed synchronous transactions
 	 * on master, when not enough acks were collected.
 	 */
-	TXN_SIGNATURE_QUORUM_TIMEOUT = -2,
+	TXN_SIGNATURE_QUORUM_TIMEOUT = JOURNAL_ENTRY_ERR_MIN - 2,
 	/**
 	 * A value set for failed synchronous transactions
 	 * on replica (or any instance during recovery), when a
 	 * transaction is rolled back because ROLLBACK message was
 	 * read.
 	 */
-	TXN_SIGNATURE_SYNC_ROLLBACK = -3,
+	TXN_SIGNATURE_SYNC_ROLLBACK = JOURNAL_ENTRY_ERR_MIN - 3,
 };
 
 /**
diff --git a/src/box/txn_limbo.c b/src/box/txn_limbo.c
index b03c71514..51dc2a186 100644
--- a/src/box/txn_limbo.c
+++ b/src/box/txn_limbo.c
@@ -79,7 +79,8 @@ txn_limbo_append(struct txn_limbo *limbo, uint32_t id, struct txn *txn)
 	 * needs that to be able rollback transactions, whose WAL write is in
 	 * progress.
 	 */
-	assert(txn->signature < 0);
+	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
+	assert(txn->status == TXN_PREPARED);
 	if (limbo->is_in_rollback) {
 		/*
 		 * Cascading rollback. It is impossible to commit the
@@ -393,7 +394,7 @@ txn_limbo_read_confirm(struct txn_limbo *limbo, int64_t lsn)
 			 */
 			if (e->lsn == -1)
 				break;
-		} else if (e->txn->signature < 0) {
+		} else if (e->txn->signature == TXN_SIGNATURE_UNKNOWN) {
 			/*
 			 * A transaction might be covered by the CONFIRM even if
 			 * it is not written to WAL yet when it is an async
diff --git a/src/box/wal.c b/src/box/wal.c
index 40382e791..f59cc9113 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1169,7 +1169,7 @@ done:
 	if (!stailq_empty(&rollback)) {
 		/* Update status of the successfully committed requests. */
 		stailq_foreach_entry(entry, &rollback, fifo)
-			entry->res = -1;
+			entry->res = JOURNAL_ENTRY_ERR_IO;
 		/* Rollback unprocessed requests */
 		stailq_concat(&wal_msg->rollback, &rollback);
 		wal_begin_rollback();
@@ -1293,7 +1293,7 @@ wal_write_async(struct journal *journal, struct journal_entry *entry)
 	return 0;
 
 fail:
-	entry->res = -1;
+	assert(entry->res == JOURNAL_ENTRY_ERR_UNKNOWN);
 	return -1;
 }
 
-- 
2.24.3 (Apple Git-128)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [Tarantool-patches] [PATCH 09/13] txn: assert after WAL write that txn is not done
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (11 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 08/13] journal: introduce proper error codes Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-11 21:56 ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-15 13:43 ` [Tarantool-patches] [PATCH 00/13] Applier rollback reason Serge Petrenko via Tarantool-patches
  2021-06-16 23:32 ` Vladislav Shpilevoy via Tarantool-patches
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-11 21:56 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

In the journal write trigger the transaction assumed it might be
already rolled back and completed, hence does not need to do
anything except free itself.

But it can't happen. The only imaginable reason why a transaction
might be rolled back before it completed its WAL write is a
ROLLBACK entry issued after the transaction.

But ROLLBACK applies its effects only after it is written. Hence
only after all the other pending txns are written too. Therefore
it is not possible for a transaction to get ROLLBACK before it
finishes its own WAL write.

Probably it was possible in the time when applier used to execute
ROLLBACK before writing it to WAL. But that was fixed in
b259e9306f717883026520c975f2610c9fbc7d43 ("applier: process
synchro rows after WAL write"). Can't happen now.

This became easier to realize when not finished transaction
signature got its own value TXN_SIGNATURE_UNKNOWN.
---
 src/box/txn.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/src/box/txn.c b/src/box/txn.c
index c9c2e93ff..1c1eb15bc 100644
--- a/src/box/txn.c
+++ b/src/box/txn.c
@@ -532,15 +532,7 @@ static void
 txn_on_journal_write(struct journal_entry *entry)
 {
 	struct txn *txn = entry->complete_data;
-	/*
-	 * txn_limbo has already rolled the tx back, so we just
-	 * have to free it.
-	 */
-	if (txn->signature != TXN_SIGNATURE_UNKNOWN) {
-		assert(txn->signature < 0);
-		txn_free(txn);
-		return;
-	}
+	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
 	txn->signature = entry->res;
 	/*
 	 * Some commit/rollback triggers require for in_txn fiber
-- 
2.24.3 (Apple Git-128)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 00/13] Applier rollback reason
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (12 preceding siblings ...)
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 09/13] txn: assert after WAL write that txn is not done Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-15 13:43 ` Serge Petrenko via Tarantool-patches
  2021-06-16 23:32 ` Vladislav Shpilevoy via Tarantool-patches
  14 siblings, 0 replies; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-15 13:43 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



12.06.2021 00:56, Vladislav Shpilevoy пишет:
> Branch: http://github.com/tarantool/tarantool/tree/gerold103/gh-6027-rollback-reason
> Issue: https://github.com/tarantool/tarantool/issues/6027

Hi! Good job on the patchset!

LGTM with one small comment for the last commit.

> Vladislav Shpilevoy (13):
>    error: introduce ER_CASCADE_ROLLBACK
>    test: remove replica-applier-rollback.lua
>    journal: make journal_write() set diag on error
>    wal: refactor wal_write_to_disk()
>    diag: introduce diag_set_detailed()
>    wal: encapsulate ER_WAL_IO
>    txn: change limbo rollback check in the trigger
>    journal: introduce proper error codes
>    txn: assert after WAL write that txn is not done
>    txn: install proper diag errors on txn fail
>    wal: introduce JOURNAL_ENTRY_ERR_CASCADE
>    txn: introduce TXN_SIGNATURE_ABORT
>    txn: stop TXN_SIGNATURE_ABORT override
>
>   .../unreleased/gh-6027-applier-lost-error.md  |  7 ++
>   src/box/applier.cc                            | 18 ++--
>   src/box/box.cc                                |  8 +-
>   src/box/errcode.h                             |  3 +-
>   src/box/journal.c                             | 16 ++++
>   src/box/journal.h                             | 30 ++++++-
>   src/box/memtx_engine.c                        |  2 +-
>   src/box/raft.c                                | 10 ++-
>   src/box/txn.c                                 | 87 +++++++++++++------
>   src/box/txn.h                                 | 38 +++++++-
>   src/box/txn_limbo.c                           | 37 ++++----
>   src/box/vy_scheduler.c                        |  2 +-
>   src/box/wal.c                                 | 50 ++++++-----
>   src/lib/core/diag.h                           |  9 +-
>   test/box/errinj.result                        |  2 +-
>   test/box/error.result                         |  3 +-
>   .../gh-4730-applier-rollback.result           |  2 +-
>   .../gh-4730-applier-rollback.test.lua         |  2 +-
>   .../gh-6027-applier-error-show.result         | 82 +++++++++++++++++
>   .../gh-6027-applier-error-show.test.lua       | 31 +++++++
>   test/replication/replica-applier-rollback.lua | 16 ----
>   test/replication/suite.cfg                    |  1 +
>   test/replication/suite.ini                    |  2 +-
>   23 files changed, 348 insertions(+), 110 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-6027-applier-lost-error.md
>   create mode 100644 test/replication/gh-6027-applier-error-show.result
>   create mode 100644 test/replication/gh-6027-applier-error-show.test.lua
>   delete mode 100644 test/replication/replica-applier-rollback.lua
>

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-15 13:44   ` Serge Petrenko via Tarantool-patches
  2021-06-15 19:34     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 22+ messages in thread
From: Serge Petrenko via Tarantool-patches @ 2021-06-15 13:44 UTC (permalink / raw)
  To: Vladislav Shpilevoy, tarantool-patches, gorcunov



12.06.2021 00:56, Vladislav Shpilevoy пишет:
> When txn_commit/try_async() failed before going to WAL thread,
> they installed TXN_SIGNATURE_ABORT signature meaning that the
> caller and the rollback triggers must look at the global diag.
>
> But they called txn_rollback() before doing return and calling the
> triggers, which overrode the signature with TXN_SIGNATURE_ROLLBACK
> leading to the original error loss.
>
> The patch makes TXN_SIGNATURE_ROLLBACK installed only when a real
> rollback happens (via box_txn_rollback()).
>
> This makes the original commit errors like a conflict in the
> transaction manager and OOM not lost.
>
> Besides, ERRINJ_TXN_COMMIT_ASYNC does not need its own diag_log()
> anymore. Because since this commit the applier logs the correct
> error instead of ER_WAL_IO/ER_TXN_ROLLBACK.
>
> Closes #6027
> ---
>   .../unreleased/gh-6027-applier-lost-error.md  |  7 ++
>   src/box/txn.c                                 | 10 ++-
>   .../gh-6027-applier-error-show.result         | 82 +++++++++++++++++++
>   .../gh-6027-applier-error-show.test.lua       | 31 +++++++
>   test/replication/suite.cfg                    |  1 +
>   test/replication/suite.ini                    |  2 +-
>   6 files changed, 129 insertions(+), 4 deletions(-)
>   create mode 100644 changelogs/unreleased/gh-6027-applier-lost-error.md
>   create mode 100644 test/replication/gh-6027-applier-error-show.result
>   create mode 100644 test/replication/gh-6027-applier-error-show.test.lua
>
> diff --git a/changelogs/unreleased/gh-6027-applier-lost-error.md b/changelogs/unreleased/gh-6027-applier-lost-error.md
> new file mode 100644
> index 000000000..9c765b8e2
> --- /dev/null
> +++ b/changelogs/unreleased/gh-6027-applier-lost-error.md
> @@ -0,0 +1,7 @@
> +## bugfix/replication
> +
> +* When an error happened during appliance of a transaction received from a
> +  remote instance via replication, it was always reported as "Failed to write
> +  to disk" regardless of what really happened. Now the correct error is shown.
> +  For example, "Out of memory", or "Transaction has been aborted by conflict",
> +  and so on (gh-6027).
> diff --git a/src/box/txn.c b/src/box/txn.c
> index 5cae7b41d..c2734c237 100644
> --- a/src/box/txn.c
> +++ b/src/box/txn.c
> @@ -805,7 +805,6 @@ txn_commit_try_async(struct txn *txn)
>   		 * Log it for the testing sake: we grep
>   		 * output to mark this event.
>   		 */

The comment became rather misleading now.
We log the error in applier, not here.

I would drop it altogether, but up to you.

> -		diag_log();
>   		goto rollback;
>   	});
>   
> @@ -983,11 +982,11 @@ void
>   txn_rollback(struct txn *txn)
>   {
>   	assert(txn == in_txn());
> +	assert(txn->signature != TXN_SIGNATURE_UNKNOWN);
>   	txn->status = TXN_ABORTED;
>   	trigger_clear(&txn->fiber_on_stop);
>   	if (!txn_has_flag(txn, TXN_CAN_YIELD))
>   		trigger_clear(&txn->fiber_on_yield);
> -	txn->signature = TXN_SIGNATURE_ROLLBACK;
>   	txn_complete_fail(txn);
>   	fiber_set_txn(fiber(), NULL);
>   }
> @@ -1086,6 +1085,8 @@ box_txn_rollback(void)
>   		diag_set(ClientError, ER_ROLLBACK_IN_SUB_STMT);
>   		return -1;
>   	}
> +	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
> +	txn->signature = TXN_SIGNATURE_ROLLBACK;
>   	txn_rollback(txn); /* doesn't throw */
>   	fiber_gc();
>   	return 0;
> @@ -1221,7 +1222,10 @@ txn_on_stop(struct trigger *trigger, void *event)
>   {
>   	(void) trigger;
>   	(void) event;
> -	txn_rollback(in_txn());                 /* doesn't yield or fail */
> +	struct txn *txn = in_txn();
> +	assert(txn->signature == TXN_SIGNATURE_UNKNOWN);
> +	txn->signature = TXN_SIGNATURE_ROLLBACK;
> +	txn_rollback(txn);
>   	fiber_gc();
>   	return 0;
>   }
> diff --git a/test/replication/gh-6027-applier-error-show.result b/test/replication/gh-6027-applier-error-show.result
> new file mode 100644
> index 000000000..c3a01ab50
> --- /dev/null
> +++ b/test/replication/gh-6027-applier-error-show.result
> @@ -0,0 +1,82 @@
> +-- test-run result file version 2
> +test_run = require('test_run').new()
> + | ---
> + | ...
> +
> +--
> +-- gh-6027: on attempt to a commit transaction its original error was lost.
> +--
> +
> +box.schema.user.grant('guest', 'super')
> + | ---
> + | ...
> +s = box.schema.create_space('test')
> + | ---
> + | ...
> +_ = s:create_index('pk')
> + | ---
> + | ...
> +
> +test_run:cmd('create server replica with rpl_master=default, '..                \
> +             'script="replication/replica.lua"')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('start server replica')
> + | ---
> + | - true
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', true)
> + | ---
> + | - ok
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +_ = s:replace{1}
> + | ---
> + | ...
> +
> +test_run:switch('replica')
> + | ---
> + | - true
> + | ...
> +test_run:wait_upstream(1, {status = 'stopped'})
> + | ---
> + | - true
> + | ...
> +-- Should be something about error injection.
> +box.info.replication[1].upstream.message
> + | ---
> + | - Error injection 'txn commit async injection'
> + | ...
> +
> +test_run:switch('default')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('stop server replica')
> + | ---
> + | - true
> + | ...
> +test_run:cmd('delete server replica')
> + | ---
> + | - true
> + | ...
> +box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', false)
> + | ---
> + | - ok
> + | ...
> +s:drop()
> + | ---
> + | ...
> +box.schema.user.revoke('guest', 'super')
> + | ---
> + | ...
> diff --git a/test/replication/gh-6027-applier-error-show.test.lua b/test/replication/gh-6027-applier-error-show.test.lua
> new file mode 100644
> index 000000000..8e17cdfa9
> --- /dev/null
> +++ b/test/replication/gh-6027-applier-error-show.test.lua
> @@ -0,0 +1,31 @@
> +test_run = require('test_run').new()
> +
> +--
> +-- gh-6027: on attempt to a commit transaction its original error was lost.
> +--
> +
> +box.schema.user.grant('guest', 'super')
> +s = box.schema.create_space('test')
> +_ = s:create_index('pk')
> +
> +test_run:cmd('create server replica with rpl_master=default, '..                \
> +             'script="replication/replica.lua"')
> +test_run:cmd('start server replica')
> +
> +test_run:switch('replica')
> +box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', true)
> +
> +test_run:switch('default')
> +_ = s:replace{1}
> +
> +test_run:switch('replica')
> +test_run:wait_upstream(1, {status = 'stopped'})
> +-- Should be something about error injection.
> +box.info.replication[1].upstream.message
> +
> +test_run:switch('default')
> +test_run:cmd('stop server replica')
> +test_run:cmd('delete server replica')
> +box.error.injection.set('ERRINJ_TXN_COMMIT_ASYNC', false)
> +s:drop()
> +box.schema.user.revoke('guest', 'super')
> diff --git a/test/replication/suite.cfg b/test/replication/suite.cfg
> index 9eccf9e19..3a0a8649f 100644
> --- a/test/replication/suite.cfg
> +++ b/test/replication/suite.cfg
> @@ -45,6 +45,7 @@
>       "gh-5536-wal-limit.test.lua": {},
>       "gh-5566-final-join-synchro.test.lua": {},
>       "gh-5613-bootstrap-prefer-booted.test.lua": {},
> +    "gh-6027-applier-error-show.test.lua": {},
>       "gh-6032-promote-wal-write.test.lua": {},
>       "gh-6057-qsync-confirm-async-no-wal.test.lua": {},
>       "gh-6094-rs-uuid-mismatch.test.lua": {},
> diff --git a/test/replication/suite.ini b/test/replication/suite.ini
> index 80e968d56..ccf3559df 100644
> --- a/test/replication/suite.ini
> +++ b/test/replication/suite.ini
> @@ -3,7 +3,7 @@ core = tarantool
>   script =  master.lua
>   description = tarantool/box, replication
>   disabled = consistent.test.lua
> -release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua
> +release_disabled = catch.test.lua errinj.test.lua gc.test.lua gc_no_space.test.lua before_replace.test.lua qsync_advanced.test.lua qsync_errinj.test.lua quorum.test.lua recover_missing_xlog.test.lua sync.test.lua long_row_timeout.test.lua gh-4739-vclock-assert.test.lua gh-4730-applier-rollback.test.lua gh-5140-qsync-casc-rollback.test.lua gh-5144-qsync-dup-confirm.test.lua gh-5167-qsync-rollback-snap.test.lua gh-5506-election-on-off.test.lua gh-5536-wal-limit.test.lua hang_on_synchro_fail.test.lua anon_register_gap.test.lua gh-5213-qsync-applier-order.test.lua gh-5213-qsync-applier-order-3.test.lua gh-6027-applier-error-show.test.lua gh-6032-promote-wal-write.test.lua gh-6057-qsync-confirm-async-no-wal.test.lua
>   config = suite.cfg
>   lua_libs = lua/fast_replica.lua lua/rlimit.lua
>   use_unix_sockets = True

-- 
Serge Petrenko


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override
  2021-06-15 13:44   ` Serge Petrenko via Tarantool-patches
@ 2021-06-15 19:34     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-15 19:34 UTC (permalink / raw)
  To: Serge Petrenko, tarantool-patches, gorcunov

Hi! Thanks for the review!

>> diff --git a/src/box/txn.c b/src/box/txn.c
>> index 5cae7b41d..c2734c237 100644
>> --- a/src/box/txn.c
>> +++ b/src/box/txn.c
>> @@ -805,7 +805,6 @@ txn_commit_try_async(struct txn *txn)
>>            * Log it for the testing sake: we grep
>>            * output to mark this event.
>>            */
> 
> The comment became rather misleading now.
> We log the error in applier, not here.
> 
> I would drop it altogether, but up to you.

I forgot about this comment. Dropped on the branch.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk()
  2021-06-11 21:56 ` [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk() Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-15 20:46   ` Cyrill Gorcunov via Tarantool-patches
  2021-06-16  6:22     ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-15 20:46 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Fri, Jun 11, 2021 at 11:56:12PM +0200, Vladislav Shpilevoy wrote:
> It didn't have a single fail path. That led to some amount of code
> duplication, and it complicated future patches where the journal
> entries are going to get a proper error reason instead of default
> -1 without any details.
> 
> The patch is a preparation for #6027 where it is wanted to have
> more detailed errors on journal entry/transaction fail instead
> of ER_WAL_IO for everything. Sometimes it can override a real
> error like a cascade rollback, or a transaction conflict.
> 
> Part of #6027
> ---
> @@ -1038,7 +1036,10 @@ wal_write_to_disk(struct cmsg *msg)
>  {
>  	struct wal_writer *writer = &wal_writer_singleton;
>  	struct wal_msg *wal_msg = (struct wal_msg *) msg;
>  	struct error *error;
> +	assert(!stailq_empty(&wal_msg->commit));

Hi Vlad, you know I don't understand why we need this assert...

>  	/*
>  	 * Track all vclock changes made by this batch into
> @@ -1058,23 +1059,17 @@ wal_write_to_disk(struct cmsg *msg)
>  
>  	if (writer->is_in_rollback) {
>  		/* We're rolling back a failed write. */
> -		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
> -		vclock_copy(&wal_msg->vclock, &writer->vclock);
> -		return;
> +		goto done;

Jumps to "done" label change the code logic. Before the patch if we
reached the write and say wal_opt_rotate failed we set up is_in_rollback
sign and exit early, after the patch we start notifying watchers that
there "write" happened which means relay code will be woken up while there
no new data on disk level at all, which means watchers wanna be notified
for no reason, no? Or I miss something obvious?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk()
  2021-06-15 20:46   ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-16  6:22     ` Vladislav Shpilevoy via Tarantool-patches
  2021-06-16  8:02       ` Cyrill Gorcunov via Tarantool-patches
  0 siblings, 1 reply; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-16  6:22 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tarantool-patches



On 15.06.2021 22:46, Cyrill Gorcunov wrote:
> On Fri, Jun 11, 2021 at 11:56:12PM +0200, Vladislav Shpilevoy wrote:
>> It didn't have a single fail path. That led to some amount of code
>> duplication, and it complicated future patches where the journal
>> entries are going to get a proper error reason instead of default
>> -1 without any details.
>>
>> The patch is a preparation for #6027 where it is wanted to have
>> more detailed errors on journal entry/transaction fail instead
>> of ER_WAL_IO for everything. Sometimes it can override a real
>> error like a cascade rollback, or a transaction conflict.
>>
>> Part of #6027
>> ---
>> @@ -1038,7 +1036,10 @@ wal_write_to_disk(struct cmsg *msg)
>>  {
>>  	struct wal_writer *writer = &wal_writer_singleton;
>>  	struct wal_msg *wal_msg = (struct wal_msg *) msg;
>>  	struct error *error;
>> +	assert(!stailq_empty(&wal_msg->commit));
> 
> Hi Vlad, you know I don't understand why we need this assert...

Otherwise in case of, for instance, rotate fail, the rollback won't
start.

>>  	/*
>>  	 * Track all vclock changes made by this batch into
>> @@ -1058,23 +1059,17 @@ wal_write_to_disk(struct cmsg *msg)
>>  
>>  	if (writer->is_in_rollback) {
>>  		/* We're rolling back a failed write. */
>> -		stailq_concat(&wal_msg->rollback, &wal_msg->commit);
>> -		vclock_copy(&wal_msg->vclock, &writer->vclock);
>> -		return;
>> +		goto done;
> 
> Jumps to "done" label change the code logic. Before the patch if we
> reached the write and say wal_opt_rotate failed we set up is_in_rollback
> sign and exit early, after the patch we start notifying watchers that
> there "write" happened which means relay code will be woken up while there
> no new data on disk level at all, which means watchers wanna be notified
> for no reason, no? Or I miss something obvious?

You didn't miss anything. But I see no harm in that. WAL write fail is
extremely rare, so a rare spurious wakeup won't do anything bad. I
decided the code reusability and simplicity is more important here.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk()
  2021-06-16  6:22     ` Vladislav Shpilevoy via Tarantool-patches
@ 2021-06-16  8:02       ` Cyrill Gorcunov via Tarantool-patches
  2021-06-16 23:32         ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 1 reply; 22+ messages in thread
From: Cyrill Gorcunov via Tarantool-patches @ 2021-06-16  8:02 UTC (permalink / raw)
  To: Vladislav Shpilevoy; +Cc: tarantool-patches

On Wed, Jun 16, 2021 at 08:22:15AM +0200, Vladislav Shpilevoy wrote:
> 
> 
> On 15.06.2021 22:46, Cyrill Gorcunov wrote:
> > On Fri, Jun 11, 2021 at 11:56:12PM +0200, Vladislav Shpilevoy wrote:
> >> It didn't have a single fail path. That led to some amount of code
> >> duplication, and it complicated future patches where the journal
> >> entries are going to get a proper error reason instead of default
> >> -1 without any details.
> >>
> >> The patch is a preparation for #6027 where it is wanted to have
> >> more detailed errors on journal entry/transaction fail instead
> >> of ER_WAL_IO for everything. Sometimes it can override a real
> >> error like a cascade rollback, or a transaction conflict.
> >>
> >> Part of #6027
> >> ---
> >> @@ -1038,7 +1036,10 @@ wal_write_to_disk(struct cmsg *msg)
> >>  {
> >>  	struct wal_writer *writer = &wal_writer_singleton;
> >>  	struct wal_msg *wal_msg = (struct wal_msg *) msg;
> >>  	struct error *error;
> >> +	assert(!stailq_empty(&wal_msg->commit));
> > 
> > Hi Vlad, you know I don't understand why we need this assert...
> 
> Otherwise in case of, for instance, rotate fail, the rollback won't
> start.

But the current whole code logic is assuming that if commit chain
is empty then there was no data to write or rollback, iow codeflow
is using list emptiness as a sign of data to procceed. If we really
want to prohibit calling wal_write_to_disk with no entries then
maybe better put panic here?

The wal_msg->commit entry is accessed later in code anyway via bpu,
which means that if we add

	if (stailq_empty(&wal_msg->commit))
		panic("wal: attempt to update vclock without data");

this won't cause perf degradation since after a few lines you gonna be
testing @commit again where bpu entry is already filled and this will
allow us to catch problems on release builds as well.

Don't get me wrong please I simply don't like assert() with a passion
because it hides problems which might happen on release builds.

Anyway, just a proposal, if you still prefer calling assert here, ok
let it be.

> > 
> > Jumps to "done" label change the code logic. Before the patch if we
> > reached the write and say wal_opt_rotate failed we set up is_in_rollback
> > sign and exit early, after the patch we start notifying watchers that
> > there "write" happened which means relay code will be woken up while there
> > no new data on disk level at all, which means watchers wanna be notified
> > for no reason, no? Or I miss something obvious?
> 
> You didn't miss anything. But I see no harm in that. WAL write fail is
> extremely rare, so a rare spurious wakeup won't do anything bad. I
> decided the code reusability and simplicity is more important here.

OK, thanks for explanation!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk()
  2021-06-16  8:02       ` Cyrill Gorcunov via Tarantool-patches
@ 2021-06-16 23:32         ` Vladislav Shpilevoy via Tarantool-patches
  0 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-16 23:32 UTC (permalink / raw)
  To: Cyrill Gorcunov; +Cc: tarantool-patches

Hi! Thanks for the review!

>>>> @@ -1038,7 +1036,10 @@ wal_write_to_disk(struct cmsg *msg)
>>>>  {
>>>>  	struct wal_writer *writer = &wal_writer_singleton;
>>>>  	struct wal_msg *wal_msg = (struct wal_msg *) msg;
>>>>  	struct error *error;
>>>> +	assert(!stailq_empty(&wal_msg->commit));
>>>
>>> Hi Vlad, you know I don't understand why we need this assert...
>>
>> Otherwise in case of, for instance, rotate fail, the rollback won't
>> start.
> 
> But the current whole code logic is assuming that if commit chain
> is empty then there was no data to write or rollback, iow codeflow
> is using list emptiness as a sign of data to procceed. If we really
> want to prohibit calling wal_write_to_disk with no entries then
> maybe better put panic here?
> 
> The wal_msg->commit entry is accessed later in code anyway via bpu,
> which means that if we add
> 
> 	if (stailq_empty(&wal_msg->commit))
> 		panic("wal: attempt to update vclock without data");
> 
> this won't cause perf degradation since after a few lines you gonna be
> testing @commit again where bpu entry is already filled and this will
> allow us to catch problems on release builds as well.
> 
> Don't get me wrong please I simply don't like assert() with a passion
> because it hides problems which might happen on release builds.
> 
> Anyway, just a proposal, if you still prefer calling assert here, ok
> let it be.

I added the panic:

====================
diff --git a/src/box/wal.c b/src/box/wal.c
index 40382e791..cb8b2eac1 100644
--- a/src/box/wal.c
+++ b/src/box/wal.c
@@ -1039,7 +1039,8 @@ wal_write_to_disk(struct cmsg *msg)
 	struct stailq_entry *last_committed = NULL;
 	struct journal_entry *entry;
 	struct error *error;
-	assert(!stailq_empty(&wal_msg->commit));
+	if (stailq_empty(&wal_msg->commit))
+		panic("Attempted to write an empty batch to WAL");
 
 	/*
 	 * Track all vclock changes made by this batch into

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [Tarantool-patches] [PATCH 00/13] Applier rollback reason
  2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
                   ` (13 preceding siblings ...)
  2021-06-15 13:43 ` [Tarantool-patches] [PATCH 00/13] Applier rollback reason Serge Petrenko via Tarantool-patches
@ 2021-06-16 23:32 ` Vladislav Shpilevoy via Tarantool-patches
  14 siblings, 0 replies; 22+ messages in thread
From: Vladislav Shpilevoy via Tarantool-patches @ 2021-06-16 23:32 UTC (permalink / raw)
  To: tarantool-patches, gorcunov, sergepetrenko

Pushed to master, 2.8, 2.7.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-06-16 23:33 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-11 21:56 [Tarantool-patches] [PATCH 00/13] Applier rollback reason Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 01/13] error: introduce ER_CASCADE_ROLLBACK Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 10/13] txn: install proper diag errors on txn fail Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 11/13] wal: introduce JOURNAL_ENTRY_ERR_CASCADE Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 12/13] txn: introduce TXN_SIGNATURE_ABORT Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 13/13] txn: stop TXN_SIGNATURE_ABORT override Vladislav Shpilevoy via Tarantool-patches
2021-06-15 13:44   ` Serge Petrenko via Tarantool-patches
2021-06-15 19:34     ` Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 02/13] test: remove replica-applier-rollback.lua Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 03/13] journal: make journal_write() set diag on error Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 04/13] wal: refactor wal_write_to_disk() Vladislav Shpilevoy via Tarantool-patches
2021-06-15 20:46   ` Cyrill Gorcunov via Tarantool-patches
2021-06-16  6:22     ` Vladislav Shpilevoy via Tarantool-patches
2021-06-16  8:02       ` Cyrill Gorcunov via Tarantool-patches
2021-06-16 23:32         ` Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 05/13] diag: introduce diag_set_detailed() Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 06/13] wal: encapsulate ER_WAL_IO Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 07/13] txn: change limbo rollback check in the trigger Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 08/13] journal: introduce proper error codes Vladislav Shpilevoy via Tarantool-patches
2021-06-11 21:56 ` [Tarantool-patches] [PATCH 09/13] txn: assert after WAL write that txn is not done Vladislav Shpilevoy via Tarantool-patches
2021-06-15 13:43 ` [Tarantool-patches] [PATCH 00/13] Applier rollback reason Serge Petrenko via Tarantool-patches
2021-06-16 23:32 ` Vladislav Shpilevoy via Tarantool-patches

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox